Author: Emma Cazou
Have you ever collected data and needed to build one final dataset for your analysis, only to find that the data come from different levels – a mixture of national, regional, household, and village-level information? Your aim is then to create a final dataset that combines all this information while taking the different levels of disaggregation into account. In this blog, I will share several tips, based on EDI Global's experience, on merging datasets at different levels using the statistical software Stata.
In order to merge datasets at different levels, you can use two commands: merge or append. Which command you use depends on what information you want to add to your master dataset. You will:
– merge if you want to add variables to an existing dataset, and
– append if you want to add observations to a dataset.
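As a minimal sketch of the two commands (the filenames and variable names here are illustrative, not from a real project):

```stata
* merge: add region-level variables to a household-level master dataset,
* matching one region to many households
use households.dta, clear
merge m:1 region_id using regions.dta

* append: stack two rounds of the same household survey on top of each other
use households_round1.dta, clear
append using households_round2.dta
```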
In the sections below, I will provide some tips on the different steps to follow to avoid mistakes and keep all the necessary information.
1. Determine an intuitive system of identifiers
For each observation, a unique respondent identifier should allow the user of the dataset to identify the observation. An observation could be an individual, a household, a school, an enterprise, a health facility, a doctor, or a student – or, if you are working with macro data, a district, a region, or a country. Different techniques can be used to trace a unique observation.
We recommend using an intuitive system of identifiers, allowing data users to easily navigate through the data and observations. An effective way of achieving this is to create a system that begins from a larger scale and zooms in on each stratum of the respondent, so that each respondent can be identified in a unique way. The ultimate unique respondent identifier is then built by concatenating all the higher-level identifiers. Table 1 presents an example of how this can be done.
Region | Region ID | District | District ID | Household | Household ID |
A | 1 | X | 11 | AXA | 111 |
A | 1 | Y | 12 | AYA | 121 |
B | 2 | Z | 21 | BZA | 211 |
Table 1. Example of identifiers in a dataset
In this example, the unique respondent identifier is HouseholdID, which is obtained by concatenating higher-level identifiers like RegionID and DistrictID. In this way, the data user will intuitively understand that the respondent with HouseholdID 121 belongs to the region with RegionID 1 and the district with DistrictID 12. With this technique, mapping between the different levels becomes easier, and merging becomes more efficient and intuitive. Once you have created unique identifiers in a similar way in all the datasets, you can proceed to the merging process.
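A hypothetical sketch of how this concatenation could be done in Stata, following Table 1 (the variable names district_id and hh_num are assumptions; in the table, the district ID already embeds the region ID, e.g. 12 = region 1, district 2):

```stata
* Build the household identifier by concatenating the district ID
* with a within-district household number, as in Table 1
egen household_id = concat(district_id hh_num)
destring household_id, replace

* Confirm that the new identifier uniquely identifies observations
isid household_id
```

Running isid at this point fails loudly if two households share an identifier, so the problem is caught before any merge.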
2. Avoiding common mistakes when preparing datasets
When merging or appending your datasets, the following three options are often misused and should therefore be avoided, as misusing them can have significant consequences for your final dataset.
– First, when merging, you must match on all the key variables common to both datasets. If the disaggregated identifiers are present in both datasets, use an explicit match specification such as 1:1 (or 1:m / m:1) rather than m:m (many to many), otherwise all the duplicates will be merged against each other. Using m:m is a tempting option as it always runs, but you will not be able to spot any duplicates or missing information in a dataset. Hence, as a golden rule, m:m should never be used when merging.
– The second most common mistake with merge and append is using the option force. This option is convenient because the merge or append will always run: a string variable can be combined with a numeric one without any error message being displayed. However, by doing so you take the risk of losing information, as identical variables with different formats across datasets will generate missing values.
– The third common mistake, when preparing a dataset before merging or during the cleaning phase, is using capture in a loop that edits a group of variables (i.e. rename, replace, generate, etc.). Like force, capture suppresses error messages, so your do-file will continue to run even when a command fails. Consequently, some of the necessary corrections to the dataset may never be applied – without you noticing.
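Putting these warnings together, a safer pattern is to state the match level explicitly and let Stata raise errors instead of silencing them (filenames and variable names are illustrative):

```stata
use households.dta, clear

* Fails loudly if the key is not unique - fix duplicates before merging,
* rather than hiding them behind m:m, force, or capture
isid household_id

* Explicit 1:1 match on the unique key; no force, so a type mismatch
* in a shared variable stops the do-file instead of creating missings
merge 1:1 household_id using household_assets.dta

* Inspect the match results before deciding what to keep
tabulate _merge
```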
3. Final checks on the dataset
Once you have appended or merged the datasets, take the following three final steps:
– First, especially when using merge, you have to decide which information to keep. If some observations are unmatched, you have to decide what is relevant for your study. For example, if Region A is in the master dataset but not in the using dataset, you must decide whether to keep only the matched data (using keep if _merge==3), hence dropping Region A from the final sample, or to keep all the data.
– Second, beyond deciding which information to keep, the previous step is also an opportunity to check that you are not missing any information. Following the previous example, suppose Region A should have been in the using dataset: by carefully reading the merge output, you would spot the missing information and could then investigate the error. The same applies when appending, by checking each variable for the presence of missing data (using the command codebook [variable name], for example).
– Third, you have to ensure that you do not have duplicated observations. Duplicates can affect the analysis by overrepresenting some information, so the results no longer reflect reality. To check this, use the command duplicates report [variable list] or duplicates list [variable list].
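The three final checks above can be sketched as follows, directly after a merge (the variable names are illustrative):

```stata
* 1. Inspect matched and unmatched observations, then decide what to keep
tabulate _merge
list region_id if _merge == 1   // observations found only in the master data
keep if _merge == 3             // or keep everything, depending on the study
drop _merge

* 2. Check the key variables for missing values
codebook region_id district_id household_id

* 3. Check for duplicated observations
duplicates report household_id
duplicates list household_id
```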
After completing each step carefully, it is always a good idea to document it with clear comments in your do-file. By doing this, you can verify that all commands related to merging or appending were applied correctly. Furthermore, a clear set of instructions will give other data users a basis for following and reproducing your merged dataset.