Chapter 3 Data transformation
The dataset is available as an `.RData` file, which R can load directly, so reading the data into R takes no extra work. However, we change the class of several columns to better fit our study.
In the original data, every column that contains only numeric values is read as numeric. For example, the column `year` is numeric and needs no further transformation. However, `type_of_violence` is also read as numeric, although it should be categorical. We therefore recode the values 1, 2, and 3 in `type_of_violence` as the categories “state-based conflict”, “non-state conflict”, and “one-sided violence”. We also convert `date_start` and `date_end` to the DateTime class.
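The recoding described above can be sketched as follows. This is a minimal illustration on a made-up toy data frame, not the project's actual code: the sample values, the `"%Y-%m-%d"` date format, the `"UTC"` time zone, and the choice of `POSIXct` as the date-time class are all assumptions.

```r
# Toy data frame standing in for the real dataset (values are made up).
df <- data.frame(
  year = c(1994, 2001, 2010),
  type_of_violence = c(1, 2, 3),
  date_start = c("1994-04-07", "2001-09-11", "2010-01-15"),
  date_end = c("1994-04-08", "2001-09-11", "2010-01-20"),
  stringsAsFactors = FALSE
)

# Recode the numeric codes 1, 2, 3 as a labelled factor.
df$type_of_violence <- factor(
  df$type_of_violence,
  levels = c(1, 2, 3),
  labels = c("state-based conflict", "non-state conflict", "one-sided violence")
)

# Convert the date strings to POSIXct.
# The input format and time zone are assumptions for this sketch.
df$date_start <- as.POSIXct(df$date_start, format = "%Y-%m-%d", tz = "UTC")
df$date_end <- as.POSIXct(df$date_end, format = "%Y-%m-%d", tz = "UTC")

# Inspect the resulting column classes.
str(df)
```

Using `factor()` with explicit `levels` and `labels` keeps the mapping from code to category visible in one place, and `year` is left untouched as a numeric column.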
The dataset we chose is very large. To save space, we keep only the columns we might be interested in (as described in Chapter 2). The transformed dataset is stored in “data/clean/df_process.RData”. In certain cases, however, we may still want to use the original dataset (for example, for missing value analysis).
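A minimal sketch of the column reduction and save step might look like the following. The columns `id` and `extra_col`, the particular selection in `keep_cols`, and the use of a temporary directory instead of the project path are assumptions made for illustration only.

```r
# Toy data frame with one column we do not need (all values are made up).
df <- data.frame(
  id = 1:3,
  year = c(1994, 2001, 2010),
  type_of_violence = c(1, 2, 3),
  extra_col = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical selection of columns to keep; the real list follows Chapter 2.
keep_cols <- c("id", "year", "type_of_violence")
df_process <- df[, keep_cols]

# Save to a temporary location for this sketch.
# The project itself stores the file at "data/clean/df_process.RData".
out_path <- file.path(tempdir(), "df_process.RData")
save(df_process, file = out_path)

# Reload to confirm that the object round-trips under the same name.
rm(df_process)
load(out_path)
names(df_process)
```

Because `save()` stores the object together with its name, `load()` restores it as `df_process` without any assignment.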
In our study, we find that the most extreme values in this dataset are related to the Rwanda genocide in 1994, which we explain in detail in the following chapters. Some analyses need to include these records and others need to exclude them. We therefore keep a separate data frame that records the unique conflict event IDs of the records related to the Rwanda genocide, so that we can quickly include or filter out these records in later code. The data frame is stored as “data/clean/df_rwa.RData”.
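Filtering by a stored set of event IDs can be sketched as below. The column names `id` and `deaths`, the ID values, and the sample rows are hypothetical; only the idea of keeping the IDs in a separate data frame comes from the text.

```r
# Toy event data (made-up values); `id` plays the role of the unique event ID.
df <- data.frame(
  id = c(101, 102, 103, 104),
  year = c(1994, 1994, 2003, 2011),
  deaths = c(50000, 30000, 12, 40)
)

# Hypothetical data frame holding the IDs of the Rwanda-related records.
df_rwa <- data.frame(id = c(101, 102))

# Exclude the Rwanda-related records.
df_without_rwa <- df[!(df$id %in% df_rwa$id), ]

# Keep only the Rwanda-related records.
df_only_rwa <- df[df$id %in% df_rwa$id, ]
```

Storing only the IDs keeps the helper file small and lets each analysis decide whether to include or exclude the records with a single `%in%` test.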
For other transformations made for analysis and visualization, please refer to the code in our GitHub repository.
In our project, we also investigate the geographic patterns in the data with an interactive D3 plot. However, some country and region names in D3’s geo-map (listed in “data/raw/global-country.tsv”) differ from those in the dataset. For the data used in D3, we rename these countries so that they match the names in D3’s geo-map. The modified data file is stored as “data/clean/df_d3.csv”. The following countries/regions have different names:
in_dataset | in_d3_geo |
---|---|
Russia (Soviet Union) | Russia |
Yemen (North Yemen) | Yemen |
Myanmar (Burma) | Myanmar |
Cambodia (Kampuchea) | Cambodia |
Ivory Coast | Cote d’Ivoire |
Zimbabwe (Rhodesia) | Zimbabwe |
Serbia (Yugoslavia) | Serbia |
Macedonia, FYR | Macedonia |
Kingdom of eSwatini (Swaziland) | Swaziland |
Rumania | Romania |
United States of America | United States |
Madagascar (Malagasy) | Madagascar |
DR Congo (Zaire) | DR Congo |
China | China, Mainland |
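One way to apply the renaming in the table is a named lookup vector, sketched below. The column name `country` and the sample rows are assumptions; the mapping itself is copied from the table, and the apostrophe in “Cote d’Ivoire” should be checked against the exact spelling in the geo-map file.

```r
# Lookup vector: names are the dataset spellings, values the D3 geo-map spellings.
name_map <- c(
  "Russia (Soviet Union)" = "Russia",
  "Yemen (North Yemen)" = "Yemen",
  "Myanmar (Burma)" = "Myanmar",
  "Cambodia (Kampuchea)" = "Cambodia",
  "Ivory Coast" = "Cote d’Ivoire",  # apostrophe must match the geo-map file
  "Zimbabwe (Rhodesia)" = "Zimbabwe",
  "Serbia (Yugoslavia)" = "Serbia",
  "Macedonia, FYR" = "Macedonia",
  "Kingdom of eSwatini (Swaziland)" = "Swaziland",
  "Rumania" = "Romania",
  "United States of America" = "United States",
  "Madagascar (Malagasy)" = "Madagascar",
  "DR Congo (Zaire)" = "DR Congo",
  "China" = "China, Mainland"
)

# Toy data; `country` is a hypothetical column name.
df_d3 <- data.frame(
  country = c("Russia (Soviet Union)", "France", "China", "DR Congo (Zaire)"),
  stringsAsFactors = FALSE
)

# Replace only the names that appear in the lookup; leave the rest unchanged.
needs_change <- df_d3$country %in% names(name_map)
df_d3$country[needs_change] <- unname(name_map[df_d3$country[needs_change]])
```

Names that are absent from the lookup, such as "France" in the toy data, pass through unchanged, so the same code can run on the full dataset.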