The prominence of surveys is based on the assumption that survey results are accurate. Although a great deal of research has sought to identify best practices for survey data collection, no comprehensive review exists of the empirical evidence evaluating the accuracy of survey measurements. This paper provides a thorough review of survey measures of objective phenomena, which can be validated against benchmarks not based on self-reports. This review spans a wide spectrum of content areas, including crime, demographics, economic indicators, healthcare, labor force statistics, market research, philanthropy, politics, psychology, substance abuse, media, and much more. The vast majority of measurements were found to be highly accurate according to four assessment methods:
- Most respondents provided self-reports that matched objective measures or records of those same attributes for each individual.
- Small differences were observed between survey sample estimates of means and proportions vs. aggregate estimates of means and proportions derived from objective records.
- Respondents' self-reports correlated strongly with measures of the same attributes of those individuals derived from secondary data not based on self-reports.
- There was strong correspondence in trends over time documented by survey measurement and by non-survey measurement of the same phenomena.
Taken together, the evidence suggests that survey measurements of objective phenomena are generally highly valid. Our literature review also revealed that objective records are often flawed and incomplete, and thus not necessarily more accurate than survey data.
Background
The prominence of surveys is based on the assumption that survey results are accurate. A central repository of evidence that speaks to the accuracy of survey estimates can go a long way towards achieving several objectives. Perhaps most importantly, such a collection of evidence is needed because professionals in the field of survey research routinely need to demonstrate the value of surveys to audiences who are skeptical of the reliability and validity of survey measures. Furthermore, a comprehensive review of this evidence could help researchers compare different ways of conducting surveys in order to identify methods that yield the most accurate results. It could also document problems and pitfalls, and suggest ways to improve survey measurement where accuracy falls short.
A number of authoritative books and papers have provided thorough instruction on the various sources of survey error, including measurement errors (when respondents provide incorrect information either intentionally or inadvertently, when interviewers or survey mode unconsciously affect how respondents answer questions, or when survey items fail to capture the underlying construct effectively), nonresponse errors at both unit and item levels, sampling frame errors including noncoverage errors, sampling errors including sampling scheme and sample size deficiencies, and much more (Biemer 2010; Biemer, Groves, Lyberg, Mathiowetz, and Sudman 1991; Biemer and Lyberg 2003; Deming 1944; Groves 1989; Groves, Fowler, Couper, Lepkowski, Singer, and Tourangeau 2009; Groves and Lyberg 2010; Lessler and Kalsbeek 1992; Weisberg 2005). In contrast to this plethora of literature on survey error, there is no comprehensive collection of positive evidence on survey accuracy. This imbalance in the literature could convey an unduly negative view of survey validity and reliability to the average consumer of research results.
To this end, we provide a thorough review of survey accuracy for objective phenomena across a wide spectrum of content areas. We focus only on self-reports of events, behaviors, acts, and physical attributes that can be objectively verified – survey measures with no objective validation were excluded. For example, we excluded papers that validated respondents' smoking behavior by test-retest measures that were themselves based on self-reports, and included only papers that validated respondents' smoking behavior against an alternative biometric indicator not based on self-reports. We included only peer-reviewed research publications that provided original empirical evidence on survey accuracy, and used literature reviews only as a reference to locate papers based on primary research – that is, we did not extract accuracy numbers directly from literature reviews. Further, we included only survey measures about past or present behaviors or attributes, and excluded survey items that asked respondents to predict their behavior at some point in the future, e.g. who they will vote for in an upcoming election. Because respondents are entitled to change their minds anytime after the survey, it is not entirely fair to evaluate survey accuracy at one point in time against an objective benchmark taken at some point in the future.
We studied over 1,000 instances of survey accuracy assessment from hundreds of papers documenting original empirical evidence. Some of these papers focused on assessing survey accuracy; others reported survey accuracy as a tangential or incidental finding. We found papers from diverse areas of research, including alcohol use, crime & deviance, demographics, dental care, economic indicators, education, estimated frequency of events, existing medical conditions, home value appreciation, healthcare utilization, height & weight, illegal substance use, labor force statistics, market research, media, medical costs, medical diagnosis, medical screening tests, past medical conditions, philanthropy, prescription drug use and compliance, smoking, social services utilization, tax evasion, and voter turnout.
Methods to Assess Survey Accuracy
We used four different methods to assess accuracy. Alternative methods were needed because different papers presented their empirical evidence using vastly different formats and indices. Accordingly, we collated the accuracy statistics into the following four categories:
- Match each respondent's self-report with objective individual records of the same phenomena. In this set of papers would be comparisons between individual respondents' self-reported hospital stays vs. hospital records on those same patients, comparisons between individual respondents' self-reported tax evasion vs. government tax records on those corresponding individuals, or comparisons between individual respondents' self-reported home ownership vs. county records on the presumed homeowner.
- Match one-time aggregate survey estimates with available benchmarks from non-survey data on the same set of individuals. In this set of papers would be comparisons between the overall mean height or weight computed from the entire survey sample vs. the overall mean height or weight computed from actual physical measurements, aggregate estimates of self-reported employment status vs. aggregate estimates from government unemployment benefits register records for the same sample, or average self-reported alcohol consumption vs. blood alcohol lab tests conducted on the same sample of individuals.
- Correlate self-reports on surveys with secondary objective data from the same set of individuals. In this set of papers would be correlations or other associative measures between self-reported weight vs. actual measured weight, self-reported height vs. actual measured height, self-reported number of doctors' visits vs. number of visits in HMO administrative records, self-reported number of days of sick leave vs. employer's sick leave records, etc.
- Correlate trends over time between longitudinal aggregate survey estimates and available longitudinal benchmarks. These comparisons were limited to samples that were drawn with known probability of selection and weighted to best represent the target population, because the longitudinal benchmarks were based not on the survey sample per se but on the full population or other external indicators. Examples include comparisons between NCVS (National Crime Victimization Survey) crime rates vs. FBI records, NES (National Election Studies) voter turnout rates vs. FEC records, and SCA (Survey of Consumer Attitudes) consumer perceptions vs. real GDP.
Our literature review also revealed that objective records are mostly imperfect. For example, hospital records are incomplete for some patients due to administrative filing errors, because clinicians are too busy to fill out all information or submit forms on time (e.g. King, Rimer, Trock, Balshem, and Engstrom 1990; Marshall, Deapen, Allen, Anton-Culver, Bernstein, Horn-Ross, Peel, Pinder, Reynolds, Ross, West, and Ziogas 2003; Schmitz, Russell, and Cutrona 2002), or because receipt of services outside a specific healthcare network (Wallihan, Stump, and Callahan 1999) or geographical region (Desai, Bruce, Desai, and Druss 2001) was not recorded. In some healthcare networks, clinicians would review a patient's case history or test results without the patient's awareness, such as lab or radiologic services, or peer consulting among clinicians who had only peripheral or no direct contact with patients; hence care by a clinician may show up in HMO records but not be reported by patients. Such omissions cannot be attributed to respondent error because the patients were not aware of the involvement of that clinician (see Cronan and Walen 2002; Rozario, Morrow-Howell, and Proctor 2004).
Similarly, government tax audit records have been found to be incomplete and to contain errors for some taxpayers (e.g. Hessing, Elffers, and Weigel 1988). Objective tests such as measuring the concentration of carbon monoxide in respondents' breath can fail to detect tobacco use if the respondent did not smoke heavily or recently (e.g. Petitti, Friedman, and Kahn 1981), and urine analysis can similarly fail to detect traces of marijuana in some respondents (e.g. Murphy, Durako, Muenz, and Wilson 2000). Further, not all crimes are reported to the police, so we cannot expect police and FBI records to capture all crimes committed. When we use aggregate estimates of crime incidence from FBI records to validate the aggregate estimates from the NCVS, we should therefore focus on whether the two estimates consistently track each other over time, rather than expect perfect or close agreement between the exact incidence numbers. In short, because the objective benchmarks are not free of errors, it is possible that survey accuracy is underestimated in some of the evidence reviewed here.
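To make this last point concrete, here is a minimal simulation in Python (our own illustration, not from the paper) assuming a binary attribute and record errors that occur independently of respondent errors: even a modest error rate in the benchmark pushes the observed agreement rate below respondents' true accuracy.

```python
import random

# Toy simulation: a respondent reports a binary attribute correctly with
# probability TRUE_ACCURACY, and the "objective" record is itself wrong with
# probability RECORD_ERROR, independently. The observed agreement rate then
# understates true respondent accuracy. All parameter values are assumptions.
TRUE_ACCURACY = 0.90   # assumed true rate of correct self-reports
RECORD_ERROR = 0.05    # assumed error rate in the objective benchmark
N = 100_000

random.seed(42)
agree = 0
for _ in range(N):
    truth = random.random() < 0.5                                  # true binary attribute
    report = truth if random.random() < TRUE_ACCURACY else not truth
    record = truth if random.random() >= RECORD_ERROR else not truth
    agree += (report == record)

print(f"True respondent accuracy:         {TRUE_ACCURACY:.2%}")
print(f"Observed report-record agreement: {agree / N:.2%}")
# Expected agreement = a(1-b) + (1-a)b = 0.90*0.95 + 0.10*0.05 = 0.86
```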
Matching Self-reports and Objective Individual Records
Our review of the literature uncovered a total of 555 survey measures where individual respondents' self-reports were matched against objective records of the same phenomena for the same individual. Taken together, these comparisons matched over half a million individual self-reports against over half a million individual objective records across multiple research domains. Across these 555 survey items, the mean overall accuracy was 78% and the median overall accuracy was 84%, meaning that around 4 in 5 respondents' self-reports matched their objective records perfectly.
Moreover, not all 555 survey items were matched against robust objective benchmarks. Some researchers noted serious errors in their objective records, such as conflicting information between different records and databases, errors in linking records to identifying information, and databases missing outpatient procedures and reasons for visits (Desai et al. 2001; Marshall et al. 2003; Roberts et al. 1996). In some papers, clinicians reported by respondents did not work at the same hospital where the clinical trial records were kept, hence objective records were incomplete for those respondents (Bergmann, Byers, Freedman, & Mokdad 1998; Bergmann, Calle, Mervis, Miracle-McMahill, Thun, & Heath 1998; Petrou et al. 2002). In other studies, clinicians were known to review patient files without patient awareness, so it is not respondent error if respondents did not report a doctor with whom they had never had personal contact (Ritter et al. 2001). Although objective tests are never error-free, there were some instances when an objective test contained too many errors to be considered a credible benchmark, e.g. liver function tests that failed to detect drinking even when respondents who were alcoholics reported alcohol consumption (Babor et al. 2000). In the domain of real estate valuation, the "objective benchmark" was based on professional appraisals, but the authors noted that such appraisals were sometimes erroneous in how appraisers selected properties deemed comparable, decided which parts of a property to include or exclude in valuation, or estimated commercial rather than residential value (Kain & Quigley 1972; Kish & Lansing 1954).
To retain only the comparisons with defensible objective validation, we excluded 105 survey measures where the researchers had noted serious errors in their objective records. Among the remaining 450 survey items where respondents’ self-reports were matched against valid objective records, the mean overall accuracy was 85% and the median overall accuracy was 88%, which means that almost 9 in 10 respondents’ self-reports matched their objective records perfectly.
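As an illustration of how these summary figures are computed, here is a minimal sketch in Python; the item names and counts are hypothetical, invented purely for illustration.

```python
from statistics import mean, median

# Each entry is one validated survey item:
# (label, number of self-reports matching the objective record, number compared).
# These values are made up for illustration, not taken from the paper.
items = [
    ("smoking status",          465,   490),
    ("home ownership",          880,   940),
    ("hospital stay last year", 1_010, 1_180),
    # ... one entry per validated survey item ...
]

match_rates = [100 * matched / n for _, matched, n in items]
print(f"Mean accuracy across items:   {mean(match_rates):.0f}%")
print(f"Median accuracy across items: {median(match_rates):.0f}%")
```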
The accuracy of respondents' self-reports varied by research topic. As shown below, accuracy was much higher in self-reports of smoking behaviors and deviant behaviors (minor transgressions that may or may not have resulted in being ticketed by police) than in self-reports of educational achievement test scores and grades. Nonetheless, overall accuracy was remarkably high in most domains; the percentage of perfect matches between individual survey answers and objective data was:
- Smoking - 95% (n=9755)
- Crime & Deviance - 93% (n=1242)
- Demographics - 92% (n=7148)
- Chronic Ailments - 90% (n=282716)
- Healthcare Utilization - 89% (n=63531)
- Dental Care - 88% (n=29138)
- Labor Force Statistics - 88% (n=5051)
- Recent Medical Diagnosis - 86% (n=34268)
- Alcohol Use - 86% (n=561)
- Illegal Substance Use - 85% (n=31785)
- Consumer Purchases - 83% (n=2448)
- Prescription Medication Use - 78% (n=5057)
- Medical Screening Tests - 77% (n=6241)
- Medical Costs/Payments - 75% (n=534)
- Height & Weight - 75% (n=3290)
- Voter registration/turnout - 72% (n=5520)
- Philanthropy - 66% (n=920)
- School grades & test scores - 54% (n=30926)
Accuracy was sometimes lower on specific survey items even within the same survey of the same sample. For example, respondents over-reported whether they made charitable donations even though they were remarkably accurate on other validated measures in the same survey (Parry and Crossley 1950). Similarly, whether testicular self-examination was discussed or taught during a medical examination was under-reported by a third of a sample that was much more accurate when reporting on other tests and procedures in the same survey (Brown and Adams 1992). Likewise, self-reported circumcision rates were much higher than medical records indicated, with 46% of patients with primary prostate cancer reporting they had been circumcised when their medical records did not indicate so; yet medical records matched self-reports on many other measures from the same sample (Zhu et al. 1999). Interestingly, self-reports of whether one had cheated on taxes were inaccurate in two directions - tax evaders under-reported, and presumed non-tax-evaders reported having done so - bringing to light that government audit records were incomplete in documenting tax evasion (Hessing, Elffers, and Weigel 1988).
The occasional comparisons yielding lower accuracy were also more likely to emerge among some subgroups of the population. For example, respondents with impaired functionality, such as people diagnosed with alcohol abuse and dependency (Killeen, Brady, Gold, Tyson, and Simpson 2004), tended to produce more errors when matched against objective records. Respondents in lower socio-economic segments of the population may not know the correct medical jargon for procedures or tests conducted on them, and hence tended to make more errors when reporting their past or present healthcare utilization (e.g. McKenna, Speers, Mallin, and Warnecke 1992; Michielutte, Dignan, Wells, Bahnson, Smith, Wooten, and Hale 1991). Sometimes, low accuracy may have been due to item wording that was difficult to understand. For example, in one study, 71% of the 139 patients with gingival disease said they never had gum disease when they actually did (Helöe 1972). Yet the same respondents produced accurate matches on other dental care items in the same study, so it is possible that errors on this single item were due to respondents' lack of understanding of what constitutes gum disease.
Finally, certain items seem to have been under-recorded rather than over-reported. For example, almost twice as many pregnant and postpartum respondents reported taking vitamins or supplements during pregnancy than was indicated in their medical records (Bryant et al. 1989). This gap should not be fully attributed to respondent error; it is more likely due to patients not reporting their supplement intake to clinicians, or to clinicians failing to record this information.
Matching Survey Estimates and Objective Aggregate Records
Some papers reported agreement between aggregate statistics instead of the proportion of individual record matches. Our review of the literature uncovered a total of 399 survey measures where aggregate sample mean estimates were matched against aggregate mean estimates based on objective records of the same phenomena from the same group of individuals. Of these 399 aggregate comparisons, 381 compared the difference between the survey sample mean and the objective records mean, while 18 computed some version of a ratio of the survey sample mean to the objective records mean. Focusing on the 381 instances with information on absolute differences between means, we found that 8% produced perfect matches between the survey sample mean and the objective records mean, 38% produced differences of 1 unit or less on the measurement unit of interest (e.g. cm, kg, dollars, days, hours, percentage points, number of hospital visits), and 73% produced differences of 5 measurement units or less. In short, the majority of survey sample estimates closely matched objective estimates.
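A minimal sketch of this tabulation follows, with hypothetical survey/record mean pairs invented for illustration.

```python
# For each survey item, take the absolute difference between the survey
# sample mean and the objective-records mean, then tabulate how many
# comparisons fall within 1 and 5 measurement units. Values are made up.
pairs = [
    (171.2, 171.2),   # mean height, cm: survey vs. physical measurement
    (76.4, 75.1),     # mean weight, kg: survey vs. physical measurement
    (3.2, 4.9),       # mean doctor visits per year: survey vs. HMO records
    # ... one pair per aggregate comparison ...
]

diffs = [abs(survey - record) for survey, record in pairs]
n = len(diffs)
print(f"Perfect matches: {sum(d == 0 for d in diffs) / n:.0%}")
print(f"Within 1 unit:   {sum(d <= 1 for d in diffs) / n:.0%}")
print(f"Within 5 units:  {sum(d <= 5 for d in diffs) / n:.0%}")
```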
It was striking that the largest gaps between survey sample estimates and objective estimates consistently emerged in the domain of homeowners' estimates of their house values (Benítez-Silva et al. 2008; DiPasquale and Somerville 1995; Kain and Quigley 1972; Robins and West 1977). Across these 4 independent studies conducted in different decades, the same pattern emerged - homeowners tend to overestimate their home values relative to alternative benchmarks. The largest gap was documented by DiPasquale and Somerville (1995), who found that recent movers sampled by the American Housing Survey reported an aggregate mean value of $109,854 for their homes, while actual home sales transaction records for the same people revealed an aggregate mean of only $102,408 - a survey estimate more than $7,000 above the benchmark. It is possible that these homeowners inflated their home values by including other closing costs and expenses incurred during home purchase. Nonetheless, this level of mismatch between survey sample estimates and objective estimates was atypical, and not found in any other research domain in this literature review.
Correlate Self-reports with Objective Data
Some papers reported the strength of association between self-reports on surveys and secondary objective data from the same set of individuals. Our review of the literature uncovered a total of 168 survey measures where the sample data points were correlated with data points from objective records. These papers reported different statistics reflecting strength of association: the Pearson product-moment coefficient was used to validate 47 survey items, the Spearman-Brown split-half reliability coefficient was reported for 16 items, the Spearman rank correlation coefficient was used for 9 items, the intraclass correlation coefficient for 21 items, the Yule coefficient of association for 10 items, and Cohen's kappa coefficient for 65 items. The summary statistics below demonstrate that although the associations were never perfect, self-reports on surveys were significantly associated with objective records on the same individuals (a sketch of how such coefficients are computed follows the list):
- mean = 0.77 Yule's coefficient of association (number of measures = 10)
- mean = 0.70 Intraclass Correlation (ICC) (number of measures = 21)
- mean = 0.69 Pearson’s product-moment coefficient (number of measures = 47)
- mean = 0.60 Spearman’s rank correlation coefficient (number of measures = 9)
- mean = 0.57 Cohen’s kappa coefficient (number of measures = 65)
- mean = 0.32 Spearman-Brown split-half reliability coefficient (number of measures = 16)
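The sketch below shows how three of these coefficients are typically computed, using SciPy and scikit-learn; all data values are hypothetical and chosen purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Continuous measures: self-reported vs. measured weight in kg (made-up values).
self_report = np.array([62, 70, 81, 95, 58, 73, 88, 67])
measured    = np.array([63, 72, 84, 99, 58, 75, 91, 69])
print("Pearson's r:    %.2f" % pearsonr(self_report, measured)[0])
print("Spearman's rho: %.2f" % spearmanr(self_report, measured)[0])

# Categorical measures: self-reported vs. recorded smoking status (0/1),
# where chance-corrected agreement is summarized by Cohen's kappa.
reported = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
recorded = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print("Cohen's kappa:  %.2f" % cohen_kappa_score(reported, recorded))
```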
Correlate Longitudinal Survey with Objective Trend Data
Finally, the last method was to correlate trends over time between longitudinal aggregate survey estimates and available longitudinal benchmarks. These correlations could only be conducted with longstanding large-scale surveys that publish their sample estimates over time, and where some alternative objective benchmark could also be obtained in the public domain. Due to these requirements, we found only 6 instances where such comparisons could be conducted. As mentioned earlier, in the three previous methods we included only items measuring respondents' perceptions or behaviors in the past or present, and excluded their expectations or intents about the future. For this method, however, we allowed correlations of objective benchmarks with survey measures that captured people's expectations or intents in the near future, as long as the reference time period was the same. As shown below, the correlations between aggregate longitudinal survey estimates and available longitudinal benchmarks are remarkably high (a sketch of this computation follows the list):
- Pearson's r = 0.96 between 17 years of data on Teen Drinking & Driving from the Monitoring the Future (MTF) Youth Survey and NHTSA counts of teenage drivers in alcohol-related fatal crashes
- Pearson's r = 0.94 between 40 years of data on Voter Turnout from the American National Election Studies (ANES) and FEC records
- Pearson's r = 0.91 between 27 years of data on Crime Victimization from the National Crime Victimization Survey (NCVS) and FBI crime records
- Pearson's r = 0.90 between 35 years of data on Consumer Assessment of the National Economy from the Survey of Consumer Attitudes (SCA) and Real Gross Domestic Product (GDP)
- Pearson's r = 0.77 between 25 years of data on Home Purchase Intent from the Survey of Consumer Attitudes (SCA) and Actual Home Sales from the JEC report to the US Congress
- Pearson's r = 0.73 between 25 years of data on Automobile Purchase Intent from the Survey of Consumer Attitudes (SCA) and Actual Vehicle Sales from the JEC report to the US Congress
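As an illustration of this method, the sketch below correlates a hypothetical longitudinal survey series with a hypothetical benchmark series; the yearly values are invented and are not the actual ANES or FEC data.

```python
import numpy as np
from scipy.stats import pearsonr

# Each array holds one value per year, aligned on the same reference years.
survey_turnout = np.array([73, 70, 76, 71, 69, 74, 77, 72])  # % reporting voting (made up)
fec_turnout    = np.array([55, 53, 58, 54, 51, 57, 60, 55])  # % actual turnout (made up)

r, p = pearsonr(survey_turnout, fec_turnout)
print(f"Pearson's r = {r:.2f} (p = {p:.3f})")
# A high r indicates the two series track each other over time, even though
# the survey series here sits consistently above the benchmark.
```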
Despite the high correlations, there were some systematic gaps in absolute values that were within expectations. For example, in the comparisons of drunk driving among young people, the survey estimate was consistently higher than actual fatal crashes over time. That should not be surprising, as teen drinking and driving does not always result in a fatal crash. Similarly, survey estimates of voter turnout were consistently higher than actual turnout. This gap is in line with the wealth of research showing that voters are more likely to complete political surveys than non-voters, but could also be due in part to the fact that government records of turnout and registration are themselves incomplete (Berent and Krosnick 2010). Lastly, survey estimates of serious violent crimes were consistently higher than police reports of the same crime category. This gap can be attributed to the fact that FBI records contain only crimes reported to the police; not all serious violent crimes are reported, and even when reported, they may not be fully documented due to other considerations such as police workload.
Surveys are Accurate
The evidence we have collected thus far shows that the vast majority of survey measures of objective phenomena are highly accurate. Some skeptical readers may postulate that published papers are biased in favor of those showing survey accuracy, but our review spanned all substantive areas, not just methodology journals. In fact, most papers from these substantive areas were not at all invested in proving that surveys are accurate.
Our literature review also suggests some conditions that enhance accuracy and others that reduce it. Specifically, measurements of respondents' current behaviors or attributes tend to be more accurate than measurements of behaviors or attributes in the past. Among the 450 survey items where respondents' self-reports were matched against valid objective records, the mean overall accuracy was 89% for survey items about current phenomena vs. 84% for survey items about past phenomena (t = 3.21, p < .001). Not surprisingly, sensitive questions tend to be less accurate; for example, survey responses on testicular self-examination yielded lower matches to objective records than survey responses on one's height and weight. Similarly, questions that invoke social desirability bias (e.g. charitable donations) were sometimes less accurate, but not consistently so. Surveys with larger samples tend to be more accurate than surveys with smaller samples: among the 450 survey items where respondents' self-reports were matched against valid objective records, overall accuracy was significantly correlated with sample size (Pearson's r = .13, p < .01). However, the magnitude of this association was not dramatic, and it is unclear whether the association was due to sample size per se or to other factors associated with sample size, such as a bigger budget or more experienced researchers who can ensure data quality.
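To illustrate the two analyses just described, here is a minimal sketch with hypothetical per-item values (not the paper's actual item-level data): an independent-samples t-test comparing items about current vs. past phenomena, and a correlation of per-item accuracy with sample size.

```python
import numpy as np
from scipy.stats import ttest_ind, pearsonr

# Hypothetical per-item accuracy rates, split by whether the item asked
# about a current or a past behavior/attribute.
acc_current = np.array([0.93, 0.88, 0.91, 0.86, 0.90, 0.89])
acc_past    = np.array([0.85, 0.82, 0.87, 0.80, 0.84, 0.86])
t, p = ttest_ind(acc_current, acc_past)
print(f"Current vs. past: t = {t:.2f}, p = {p:.4f}")

# Correlate per-item accuracy with each item's (hypothetical) sample size.
accuracy    = np.concatenate([acc_current, acc_past])
sample_size = np.array([950, 480, 1200, 310, 760, 640,
                        520, 290, 880, 150, 430, 700])
r, p = pearsonr(sample_size, accuracy)
print(f"Accuracy vs. sample size: Pearson's r = {r:.2f}, p = {p:.4f}")
```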
As noted earlier, when survey accuracy was lower than average, it could sometimes be attributed to errors in the objective benchmark rather than errors in survey measurement. These benchmark errors included incomplete administrative records, errors in billing records, insensitive or overly sensitive medical tests, and more. Nonetheless, for the most part, these objective records and measures still afforded us an invaluable opportunity to evaluate the accuracy of survey responses.
Source: Chang, LinChiat, Jon Krosnick, Elaine Albertson, and Elizabeth Quinlan. 2011. How Accurate Are Survey Measurements of Objective Phenomena? Paper presented at the annual meeting of the American Association for Public Opinion Research.