by Dr. Johanna Choumert Nkolo and Callum Taylor
EDI recently participated in the 9th International Conference on Social Science Methodology. The conference was organized by the International Sociological Association (ISA) Research Committee RC33 on Logic and Methodology and was held at the University of Leicester from 11 to 16 September 2016. There were more than 50 sessions, with more than 300 presentations on social science methodology, covering topics such as administrative data quality, web surveys, measurement error in surveys, qualitative methods, spatial data, paradata, and big data. The full program can be accessed here, along with the book of abstracts.
EDI presented 2 papers at the conference:
- “Using CAPI to Improve and Evaluate the Quality of Socio-Economic Surveys” was presented in the session “Emerging Methods for Evaluating Survey Measurement Quality”. This paper provided insights into how Computer-Assisted Personal Interviewing (CAPI) technology can be used by researchers and survey practitioners to evaluate survey measurement quality. It built largely on EDI's experience in developing our CAPI software, Surveybe, in implementing dozens of large-scale surveys in East Africa, and in using paradata to monitor fieldwork and evaluate data quality.
- “Experiences and Challenges in Data Collection Monitoring: From Bukoba (Tanzania) to High Wycombe (United Kingdom)” was presented in the session “Monitoring Data Collection in International Settings”. The goal of this contribution was to add to current discussions on monitoring data collection in international settings, with a focus on developing countries. Typically, an international firm provides the technical expertise and partners with a local organization for local knowledge. Our own experience shows that combining both within a single organization is possible and sustainable. Our contribution was twofold: we showed (i) how the tools we use enable us to produce high-quality data, and (ii) how our organization's set-up promotes constant improvement of current international best practices in data collection.
We were also able to attend sessions on a wide range of topics, many of which contained useful lessons relating to our own work. These included:
DUPLICATED OBSERVATIONS. In their paper “Estimation bias due to duplicated observations: a Monte Carlo simulation”, Sarracino & Mikucka investigate the statistical consequences of non-unique observations, i.e. duplicated data. Using Monte Carlo simulations, they show how duplicate records affect regression estimates. Several other papers have been published on the topic, such as “The Large Number of Duplicate Records in International Survey Projects: The Need for Data Quality Control” where Slomczynski et al. (2015) examine 1,721 national surveys from 142 countries, covering 2.3 million respondents.
They consider that “a record is erroneous, or suspicious, if it is not unique — that is, when the set of all answers of a given respondent is identical to that of another respondent.” Although non-unique observations make up a small percentage of the total data, they find 5,893 duplicate records in 162 of the national surveys. In another paper, “Don’t Get Duped: Fraud through Duplication in Public Opinion Surveys”, Kuriakose & Robbins (2015) analyse 1,000 publicly available public opinion survey datasets and find that almost one in five have exact or near duplicates in excess of 5% of observations. They also provide a Stata module, percentmatch, to calculate the highest percentage match between observations. In “Evaluating a New Proposal for Detecting Data Falsification in Surveys. The underlying causes of “high matches” between survey respondents”, Simmons et al. (2016) from the Pew Research Center provide an interesting response to Kuriakose & Robbins. They stress the importance of taking into account the number of questions, the number of response options, the number of respondents, and other characteristics and parameters of a survey when investigating the percentage of high matches in a dataset.
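To make the idea behind percentmatch concrete, here is a minimal Python sketch of our own (not the Stata module itself, and not the exact algorithm of Kuriakose & Robbins): for each respondent it computes the highest share of identical answers with any other respondent. The column names and the 85% review threshold are purely illustrative assumptions.

```python
import pandas as pd

def max_percent_match(df: pd.DataFrame, id_col: str = "respondent_id") -> pd.Series:
    """For each respondent, the highest share of identical answers
    shared with any other respondent (ID column excluded)."""
    answers = df.drop(columns=[id_col]).to_numpy()
    n_rows, n_cols = answers.shape
    best = pd.Series(0.0, index=df[id_col])
    for i in range(n_rows):
        for j in range(i + 1, n_rows):
            share = (answers[i] == answers[j]).sum() / n_cols
            best.iloc[i] = max(best.iloc[i], share)
            best.iloc[j] = max(best.iloc[j], share)
    return best

# Hypothetical usage: flag respondents whose best match exceeds 85% for manual review.
# data = pd.read_csv("survey_responses.csv")
# suspicious = max_percent_match(data)
# print(suspicious[suspicious > 0.85])
```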
The existence of duplicates in datasets reduces data quality, and the key lesson from this presentation and the associated papers is that care during data collection is the first necessary step to reduce duplicates. At EDI, to avoid duplicates we provide extensive training to our field teams on the importance of data quality and accuracy. Our field supervisors, data processing officers and project coordinators closely supervise interviews to ensure field protocols are rigorously followed. Furthermore, our data teams investigate paradata and data quality from day one of data collection, as explained in our paper “Using CAPI to Improve and Evaluate the Quality of Socio-Economic Surveys”.
EMERGING METHODS FOR IMPROVING SURVEY QUALITY. Many presentations discussed methods and tools for improving data quality during fieldwork and for measuring the quality of both primary and secondary data. When collecting data, errors can arise for a wide variety of reasons: recall bias, poor understanding of questions, anchoring bias, respondent education level, selection of respondents, and so on. This has led many commentators to conclude that there is no such thing as a ‘perfect’ dataset. However, there are methods which can be used to assess the quality of data; see for example “Assessing the Quality of Survey Data” (Blasius and Thiessen, 2012). During the session “Emerging Methods for Evaluating Survey Measurement Quality”, P. Beatty presented a new program developed at the U.S. Census Bureau that builds on these methods, combining the analysis of paradata, administrative data, social media data and other sources. Several presentations stressed the importance of paradata for monitoring fieldwork and evaluating data quality. This is without doubt a topic that offers great opportunities for both researchers and data users, especially with the rise of CAPI and web surveys.
SOCIALLY DESIRABLE RESPONSE BEHAVIOUR AND PARADATA. The presentation “Analyzing socially desirable response behaviour using paradata” by Andersen & Mayerl provided interesting insights into how paradata can be used to examine the behaviour of respondents facing sensitive questions or questions with ‘socially desirable’ answers. They used response latencies, response changes and patterns of non-response obtained from web surveys to measure survey quality.
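As a toy illustration of screening such paradata (the field names, the two-second cut-off and the list of sensitive items are all our own assumptions, not the authors' method), one could flag unusually fast, never-revised answers to sensitive questions for review:

```python
import pandas as pd

# Hypothetical paradata: one row per (respondent, question), with the
# response latency in seconds and the number of times the answer was changed.
paradata = pd.DataFrame({
    "respondent": ["r1", "r1", "r2", "r2"],
    "question":   ["income", "alcohol_use", "income", "alcohol_use"],
    "latency_s":  [12.4, 1.3, 9.8, 8.1],
    "changes":    [0, 0, 2, 1],
})

SENSITIVE = {"income", "alcohol_use"}  # assumed list of sensitive items

# Very fast answers to sensitive items with no revisions can be one signal
# of stereotyped, socially desirable responding worth a second look.
flags = paradata[
    paradata["question"].isin(SENSITIVE)
    & (paradata["latency_s"] < 2.0)
    & (paradata["changes"] == 0)
]
print(flags)
```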
OTHER USES OF PARADATA. One of the most common forms of paradata is the measurement of time throughout the survey: the overall interview length, but also the timing of sections and questions and how interviewers and respondents move through the survey. Unusual section times could indicate difficult sections, or misunderstanding or error on the part of the interviewer. The use of timestamps to analyse data quality was explored in a presentation by researchers from the Pew Research Center, based on the paper “Using Timestamps for the Evaluation of Data Quality” by van Houten et al. Their focus was on a multidimensional approach, evaluating the relationships between time measures and other data quality indicators such as the frequency of non-response answers and the uniformity and variance of responses.
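A small sketch of this kind of timestamp analysis (the column names, sample data and outlier rule are illustrative assumptions on our part, not the Pew approach) could turn per-question timestamps into section durations and flag unusual ones:

```python
import pandas as pd

# Assumed timestamp paradata: one row per answered question, with the time
# the answer was recorded and the section the question belongs to.
stamps = pd.DataFrame({
    "interview": ["a", "a", "a", "b", "b", "b"],
    "section":   ["roster", "roster", "assets", "roster", "assets", "assets"],
    "timestamp": pd.to_datetime([
        "2016-09-12 09:00:05", "2016-09-12 09:06:40", "2016-09-12 09:15:10",
        "2016-09-12 10:00:00", "2016-09-12 10:01:30", "2016-09-12 10:02:10",
    ]),
})

# Duration of each section = last answer time minus first answer time.
durations = (
    stamps.groupby(["interview", "section"])["timestamp"]
    .agg(lambda t: (t.max() - t.min()).total_seconds())
    .rename("seconds")
    .reset_index()
)

# Flag sections more than 3 median absolute deviations from the median
# duration for that section -- an arbitrary, illustrative rule.
def flag_outliers(group: pd.DataFrame) -> pd.DataFrame:
    med = group["seconds"].median()
    mad = (group["seconds"] - med).abs().median() or 1.0
    return group[(group["seconds"] - med).abs() > 3 * mad]

suspect = durations.groupby("section", group_keys=False).apply(flag_outliers)
print(suspect)
```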
WEB-BASED SURVEYS. There was much discussion during the conference about the relative merits of standard CAPI (an interviewer conducting the interview with the respondent using a computer/tablet/phone) and web-based surveys (the respondent answering the questionnaire themselves using their own equipment). Web-based surveys were seen as an exciting opportunity to reach large numbers of respondents around the world relatively cheaply and efficiently, whereas CAPI is generally thought to be more reliable and accurate. The lack of an interviewer to motivate respondents to complete the questionnaire and to optimize the responses they give is seen as a major disadvantage of web-based surveys. However, Metzler & Fuchs presented a paper on “Using response time to predict survey break-off in Web surveys”, which aimed to use paradata to predict when respondents are more likely to abandon an interview. Studies such as this could present interesting conclusions about the behaviour of respondents during interviews.
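The general approach can be sketched as follows (a minimal toy model of our own, with simulated data, not the authors' model): fit a logistic regression that relates a respondent's average response time to the probability of breaking off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed per-respondent paradata: mean response time per question (seconds)
# and whether the respondent eventually broke off (1) or completed (0).
rng = np.random.default_rng(0)
mean_time = rng.gamma(shape=2.0, scale=10.0, size=500)
# Toy data-generating assumption: slower respondents break off more often.
break_off = (rng.random(500) < 1 / (1 + np.exp(-(mean_time - 25) / 5))).astype(int)

model = LogisticRegression().fit(mean_time.reshape(-1, 1), break_off)

# Predicted break-off probability for a respondent averaging 40 s per question.
print(model.predict_proba([[40.0]])[0, 1])
```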
GEOGRAPHICAL INFORMATION SYSTEMS FOR SAMPLING. Sampling is often very challenging in developing countries. Depending on the country and survey context, the available strategies (full household listing, listing with a community leader, random walk, etc.) involve different trade-offs in terms of financial cost and time. The increasing accuracy and availability of Geographic Information Systems (GIS) offers great opportunities for sampling and fieldwork monitoring (for example, through passive collection of GPS coordinates). See the presentation “Use of GIS for Sampling and Fieldwork Monitoring in Developing Countries” by Letterman & Cajka for more information on this topic.
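For example, passively recorded GPS coordinates can be compared against the assigned cluster location. The sketch below uses a standard haversine distance; the coordinates and the 500 m tolerance are purely illustrative assumptions, not a prescribed protocol.

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative check: was this interview recorded within 500 m of the
# centroid of the cluster the household was sampled from?
cluster_centroid = (-1.3310, 31.8120)   # hypothetical cluster centroid
interview_gps = (-1.3352, 31.8170)      # hypothetical captured GPS point
distance = haversine_m(*cluster_centroid, *interview_gps)
print(f"{distance:.0f} m from centroid -> {'OK' if distance <= 500 else 'review'}")
```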
NON-RESPONSE SURVEYS (NRS). NRS involve conducting follow-up surveys of non-respondents in order to investigate non-response bias, which has well-known consequences for sample representativeness. In the paper “Promises and failures of nonresponse surveys”, presented in the session “Assessing the Quality of Survey Data”, Pollien & Stähli propose an analysis of non-response follow-up surveys. Further insights on NRS and the reliability of the information obtained can be found here. NRS present several challenges: (i) individuals belonging to the group of non-respondents may not wish to participate in a NRS, so the data obtained in a NRS do not provide information for the whole sample; (ii) initial respondents, respondents to the NRS and remaining non-respondents may have different characteristics; and (iii) questions asked in the NRS should not be sensitive to survey design effects.
This conference reflects the increasing importance given by researchers, national statistics bureaus, international organizations and others to survey methodology and data quality. From our perspective, the take-home message is that collecting relevant paradata, and pushing the frontiers of the types of paradata that can be collected, will open new horizons to data collectors and data users in their quest for the perfect dataset.