Chapter 3 Data transformation

The dataset has a .RData version, which we can load directly in R, so we do not have to worry much about reading the data in. However, we would like to change the class of several columns to better fit our study.

In the original data, every column that contains only numeric values is read as “numeric.” For example, the column year is numeric, and we don’t need to transform it further. However, the type_of_violence column is also read as numeric, although it should be categorical. Therefore, we change the values 1, 2, and 3 in type_of_violence to the categories “state-based conflict”, “non-state conflict”, and “one-sided violence.” Further, we would like to convert date_start and date_end to a date-time class.
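As a minimal sketch of these conversions, the snippet below uses a small made-up data frame in place of the real data. The rows and the assumption that the dates are stored as "YYYY-MM-DD" strings are ours for illustration; the actual code in the repo may differ.

```r
# Toy stand-in for the raw data; the values are illustrative only.
df <- data.frame(
  year = c(1994, 2001, 2010),
  type_of_violence = c(3, 1, 2),
  date_start = c("1994-04-07", "2001-10-07", "2010-03-01"),
  date_end = c("1994-04-08", "2001-10-09", "2010-03-01"),
  stringsAsFactors = FALSE
)

# Map the numeric codes 1, 2, 3 to the three categories.
df$type_of_violence <- factor(
  df$type_of_violence,
  levels = c(1, 2, 3),
  labels = c("state-based conflict", "non-state conflict", "one-sided violence")
)

# Parse the date strings into a date-time class (assumes "YYYY-MM-DD" input).
df$date_start <- as.POSIXct(df$date_start, tz = "UTC")
df$date_end <- as.POSIXct(df$date_end, tz = "UTC")

str(df)
```

The year column is left untouched, so it stays numeric.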

The dataset we chose is very large. To save space, we reduce it to only the columns we might be interested in (as described in Chapter 2). The transformed dataset is stored in “data/clean/df_process.RData”. In certain instances, however, we may still want to use the original dataset (for example, for missing value analysis).
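The reduction step can be sketched as follows. The toy data frame and the column names in keep_cols are hypothetical, and the file is written to a temporary directory here rather than to “data/clean/”, so the snippet can be run without touching the project files.

```r
# Toy stand-in for the original dataset; column names are hypothetical.
df_raw <- data.frame(
  id = 1:3,
  year = c(1994, 2001, 2010),
  country = c("Rwanda", "Afghanistan", "Somalia"),
  best = c(100, 5, 12),
  source_article = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest (illustrative selection).
keep_cols <- c("id", "year", "country", "best")
df_process <- df_raw[, keep_cols]

# Save the reduced data frame; a temporary path is used in this sketch.
out_path <- file.path(tempdir(), "df_process.RData")
save(df_process, file = out_path)

# Load it back into a separate environment to confirm the round trip.
env <- new.env()
load(out_path, envir = env)
names(env$df_process)
```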

In our study, we find that the most extreme values in this dataset are related to the Rwandan genocide in 1994, which we explain in detail in the following chapters. Sometimes we need to include these records in our analysis, and sometimes we don’t. Therefore, we keep a data frame that records the unique conflict event IDs of the records related to the Rwandan genocide. With it, we can quickly include or filter out these records in later code. The data frame is stored as “data/clean/df_rwa.RData”.
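A minimal sketch of how such an ID data frame can be used is shown below. The toy rows, the column names, and the rule used to pick the Rwanda-related records (country equal to "Rwanda" and year equal to 1994) are assumptions made for illustration only; they are not necessarily the criteria used to build df_rwa.

```r
# Toy stand-in for the processed data; values are illustrative only.
df <- data.frame(
  id = c(101, 102, 103, 104),
  country = c("Rwanda", "Rwanda", "Somalia", "Rwanda"),
  year = c(1994, 1994, 1994, 2001),
  stringsAsFactors = FALSE
)

# Record the event IDs of the records of interest (assumed selection rule).
df_rwa <- data.frame(id = df$id[df$country == "Rwanda" & df$year == 1994])

# Exclude those records from an analysis...
df_without <- df[!(df$id %in% df_rwa$id), ]

# ...or keep only those records.
df_only <- df[df$id %in% df_rwa$id, ]
```

Storing only the IDs keeps the helper file small, and the same two lines work on any version of the dataset that still carries the event ID column.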

For other transformations made for analysis and visualization, please refer to the code in our GitHub repo.

In our project, we would also like to investigate the geographic or spatial pattern of the data via a D3 interactive plot. However, the names of some countries/regions in D3’s geo-map (the list of countries is in “data/raw/global-country.tsv”) differ from those in the dataset. For the data used in D3, we therefore change the country names to make them identical to those in D3’s geo-map. The modified data file is stored as “data/clean/df_d3.csv”. The following countries/regions have different names.

in_dataset                       in_d3_geo
Russia (Soviet Union)            Russia
Yemen (North Yemen)              Yemen
Myanmar (Burma)                  Myanmar
Cambodia (Kampuchea)             Cambodia
Ivory Coast                      Cote d’Ivoire
Zimbabwe (Rhodesia)              Zimbabwe
Serbia (Yugoslavia)              Serbia
Macedonia, FYR                   Macedonia
Kingdom of eSwatini (Swaziland)  Swaziland
Rumania                          Romania
United States of America         United States
Madagascar (Malagasy)            Madagascar
DR Congo (Zaire)                 DR Congo
China                            China, Mainland
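
The renaming can be sketched with a named lookup vector, as below. Only a few of the pairs listed above are included, and the toy data frame and its country column name are assumptions for illustration. The output is written to a temporary directory rather than to “data/clean/”.

```r
# Lookup from dataset names to D3 geo-map names (subset of the pairs above).
name_map <- c(
  "Russia (Soviet Union)" = "Russia",
  "Myanmar (Burma)" = "Myanmar",
  "DR Congo (Zaire)" = "DR Congo",
  "United States of America" = "United States"
)

# Toy stand-in for the data; the column name is hypothetical.
df <- data.frame(
  country = c("Russia (Soviet Union)", "Rwanda", "DR Congo (Zaire)"),
  stringsAsFactors = FALSE
)

# Replace only the names that appear in the lookup; leave the rest unchanged.
hit <- df$country %in% names(name_map)
df$country[hit] <- unname(name_map[df$country[hit]])

# Write the result as CSV for D3; a temporary path is used in this sketch.
out_path <- file.path(tempdir(), "df_d3.csv")
write.csv(df, out_path, row.names = FALSE)
```

Names that are already identical in both sources, such as "Rwanda" here, pass through unchanged.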