Investigating Concerns about Lucid Theorem Data Quality
The blossoming use of survey research in political science heightens the need for rigorous investigation into data quality. To obtain samples, academics often rely on survey vendors such as SurveyMonkey, Amazon’s Mechanical Turk (MTurk), and Lucid Theorem. Data from Lucid has become increasingly prevalent and, in turn, the focus of growing data-quality concerns.
Recently, Josh Kalla posted a thread on Twitter noting that when he ran a near-replication of a March 2021 survey in June 2021, the median response time had dropped by over five minutes. This result is particularly troubling: a dramatic decrease in median response time could indicate extensive satisficing by respondents, colloquially referred to as “speeders,” and could introduce bias into the results.
Still, while these findings certainly raise cause for concern, any response-time differences between these two particular surveys may reflect nuanced differences in the question battery or result from pure chance. A survey with a static battery of questions, fielded consistently and repeatedly via Lucid over a wide time period, is the ideal way to clarify whether there have been recent and notable decreases in response time among Lucid’s respondents.
Fortunately, I have access to such a survey: the Data For Progress Covid-19 tracking poll, fielded via Lucid by Brian Schaffner. The poll ran for 25 waves between April 2020 and May 2021 and, apart from small modifications, remained almost identical over that period. I used data from the tracking poll to investigate whether our data from Lucid saw any changes in response time over those thirteen months, and to check whether the Lucid-provided partisanship measure correlated strongly with the ideology question included in the poll.
Median response time was generally stable
The median response time was generally stable until October, when it dropped through the fall and winter before returning to previous levels in the spring. Between April 2020 and May 2021, the median response time to the DFP Covid-19 tracking poll ranged from 12.35 to 14.43 minutes. In the poll’s first eight iterations, response times mostly increased, reaching the poll’s greatest median response time in Wave Eight on June 9, 2020. Over the next eight waves, between June 23 and September 22, the median response time was fairly static, staying between 13 and 14 minutes. Response times dropped to roughly twelve minutes in October and stayed there over the next six waves into February 2021. The February 9, 2021 wave saw a slight increase in median response time, and the March, April, and May waves all had median response times near the previous mark in the mid-13-minute range. For the most part, the first and third quartiles followed similar trends, suggesting that there were no dramatic changes in the distribution of response times among our Lucid Theorem respondents.
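As a rough illustration of how these per-wave summaries can be computed, here is a minimal sketch in R. The data frame `dfp` and its columns `wave_date` and `duration_min` are hypothetical stand-ins for the tracking-poll data, not the actual variable names.

```r
library(dplyr)

# Hypothetical data: one row per respondent, `wave_date` identifying the
# wave and `duration_min` giving completion time in minutes.
wave_summary <- dfp %>%
  group_by(wave_date) %>%
  summarize(
    q1  = quantile(duration_min, 0.25),
    med = median(duration_min),
    q3  = quantile(duration_min, 0.75),
    .groups = "drop"
  )
```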
Still, while tracking the median response time over these 13 months is a convenient way to identify whether the proportion of speeders among Lucid respondents has been increasing, we might prefer to examine the full distribution of response times over all 25 waves of the Covid-19 tracking poll to glean more information. I visualized this distribution for each wave below using the {gganimate} package. Plotting the median, as well as the first and third quartiles, helps identify any notable shifts in the distribution of response times.
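A sketch of how such an animation can be built with {gganimate} is below, reusing the hypothetical `dfp` and `wave_summary` objects from above; it illustrates the approach, not the exact code behind the published figure.

```r
library(ggplot2)
library(gganimate)

# One frame per wave: the density of response times, with that wave's
# median (solid) and first/third quartiles (dashed) as vertical lines.
p <- ggplot(dfp, aes(x = duration_min)) +
  geom_density(fill = "grey85") +
  geom_vline(data = wave_summary, aes(xintercept = med)) +
  geom_vline(data = wave_summary, aes(xintercept = q1), linetype = "dashed") +
  geom_vline(data = wave_summary, aes(xintercept = q3), linetype = "dashed") +
  labs(x = "Response time (minutes)", y = "Density",
       title = "Wave fielded: {closest_state}") +
  transition_states(wave_date)

animate(p)
```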
Using correlations to check data quality
We can also attempt to measure data quality by testing whether variables that theory says should be correlated actually are correlated in our data. Lucid provides respondent demographics across several variables, including age, education, ethnicity, gender, household income, party identification, and region. I also investigated whether the party identification Lucid provides for each respondent correlates as expected with respondent-reported political ideology. To provide a baseline, I coded identical party identification and ideology measures using the 2020 CES, where the two had a correlation coefficient of 0.668.
We see that the correlation between party identification and ideology was higher than 0.5 in a majority of waves, but remained lower than the CES baseline. There were two notable shifts. The correlation fell, rose, and then dropped again over the five waves fielded between May 5, 2020 and June 9, 2020. There was also a sharp drop in the correlation between Lucid-provided party identification and respondent ideology, from 0.594 on September 22, 2020 to 0.391 on October 6, 2020. The correlation has risen steadily since then and stood at 0.624 for the wave fielded on May 15, 2021.
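Computing this check is straightforward; here is a minimal sketch, again using the hypothetical `dfp`, with `party_id` and `ideology` as assumed column names.

```r
# Per-wave Pearson correlation between Lucid-provided party ID and
# self-reported ideology, treating both as numeric scales.
cor_by_wave <- dfp %>%
  group_by(wave_date) %>%
  summarize(r = cor(party_id, ideology, use = "pairwise.complete.obs"),
            .groups = "drop")
```

Comparing each wave’s coefficient against the 0.668 CES baseline flags waves where the vendor-provided demographics look suspect.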
Conclusion
Overall, as the median and quartile measures for response times remained fairly stable over time and followed similar trends, I find limited evidence that the proportion of speeders has recently been increasing among respondents recruited via Lucid Theorem. Additionally, research finds that speeders have a limited effect on data quality, potentially assuaging concerns about external validity even if speeders were becoming more common in data gathered using Lucid Theorem.
Still, these findings come with important caveats. The data from the DFP tracking poll includes only respondents who pass attention checks. Yet an increasing number of survey respondents are failing attention checks, and those who fail are markedly different from attentive respondents. Speeders may be increasing among the population of Lucid respondents but end up filtered from the data after failing attention checks, so attention checks remain critical to maintaining data quality. To guard against inattentive respondents and to continue monitoring for potential declines in response quality among survey vendors, researchers must keep including attention checks while building in questions and methods to independently audit their data.