The Limits of Simulation in Public Opinion Research

Can computer programs supplant public opinion data collection? That is the promise of “synthetic respondents” or “silicon samples.” We are not so sure. 

The dream of replacing human survey respondents with computer-generated ones is far from new. As historian Jill Lepore documents in her book If/Then, in the 1960s the Simulmatics Corporation sold predictions about human behavior and opinion from its “People Machine” computer simulation to political campaigners, journalists, consumer marketers, and even the Pentagon. Ultimately, the corporation failed when it could not deliver what it sold. While today’s technology is very different, the central, if unfulfilled, promise – that you can extrapolate from previously collected data to reliably predict future opinion or behavior not present in the training data, avoiding having to go talk to real people – continues to the present in the idea of “synthetic respondents” generated by Large Language Models (LLMs).

Synthetic respondents are commercially available right now, and artificial intelligence was even the official theme of the most recent conference of the American Association for Public Opinion Research. How can public opinion researchers cut through the hype and distinguish between helpful new tools and what computer scientists Arvind Narayanan and Sayash Kapoor have called “AI Snake Oil”? As a research company that works to stay on the cutting edge of research tools, we take the uses and usefulness of LLMs seriously.


In this essay we review the scholarship – some of it promising, some much less so – on synthetic respondents produced by LLMs like OpenAI’s GPT, Google’s Gemini, and Meta’s Llama, among others, as well as their applicability to various aspects of public opinion research.


In short, that research shows that LLMs can produce samples with superficially similar estimates to some high-quality surveys on some questions in some circumstances, but closer evaluation reveals that they differ in important ways that render them unreliable as replacements for actual survey research: they fail to produce accurate estimates of variation in opinion, and they fail to produce reliable estimates for previously unasked questions. Together these systematic problems lead not only to unreliability in top-line estimates of public opinion, but also to inconsistent replication of observed patterns of association across variables, inaccurate representation in qualitative datasets, and incorrect estimates of the sample sizes needed if used for power calculations in advance of fielding an actual survey. Put another way, these models can sometimes predict opinions (especially the most predictable ones), but they should not be used to research opinions. While there are good uses for LLMs in public opinion research (e.g., translation of survey instruments), this review of the research makes clear that “silicon samples” are not among them.

Our choice of the term “large language model”, instead of “artificial intelligence”, is an intentional one. LLMs are models – simplified abstractions of the world – even if highly complex ones. As Henry Farrell and colleagues have argued, it is not only inaccurate but counterproductive to think of an LLM as an external intelligence. Instead, they argue, we should conceptualize an LLM as a “social and cultural technology” that helps users organize and transform information. To call this technology a “model” instead of an intelligence is not a criticism. After all, surveys are also models of a sort: abstracted representations of the world. In fact, government statistics (often based on surveys) are another example of what Farrell and co-authors call cultural technologies. And many survey researchers, ourselves included, already use machine-learning models in our work, such as probabilistic assessments of voter turnout propensity in likely voter models.

But LLMs differ from other models in important ways. The problem of “hallucinations” in LLM output is well documented, and LLMs are by their nature a probabilistic technology; together these features (appropriately) raise questions of accuracy and reliability. Moreover, these models remain black boxes to outside users. That is true even of so-called “open source” models, which disclose model weights but not the training data used to create them. There is even a growing field of LLM “interpretability” research, in which researchers try (with only partial success) to figure out how the models they built actually work.

So, how do LLM-generated samples, sometimes called “silicon samples”, work? The emerging approach to synthetic respondents (at least as described in the academic literature; we cannot be as clear about proprietary products that do not disclose their approach in detail) is to prompt an LLM with a persona – defined by demographic and political traits, occupation, or even first names – and then to prompt the model to answer a question or complete a statement as that persona. This process is performed repeatedly, both for a given set of characteristics and while varying the characteristics provided to the model.
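To make the mechanics concrete, here is a minimal sketch of that persona-prompting loop in Python. The `query_llm` function is a placeholder (here it simply returns canned answers so the sketch runs end to end), and the persona fields and prompt wording are our own illustrative assumptions, not the exact protocol of any study or vendor discussed here.

```python
import itertools
import random

def query_llm(prompt: str) -> str:
    """Stand-in for a call to an actual LLM API.

    Here it returns a random canned answer so the sketch runs end to end;
    a real implementation would send `prompt` to a model and return its text.
    """
    return random.choice(
        ["Joe Biden", "Donald Trump", "someone else", "I did not vote"]
    )

# Illustrative persona attributes; published studies vary these and often
# add more (occupation, first names, issue positions, etc.).
PARTIES = ["Democrat", "Republican", "independent"]
AGES = [25, 45, 65]
GENDERS = ["man", "woman"]

QUESTION = "Who did you vote for in the 2020 presidential election?"

def silicon_sample(n_per_persona: int = 10) -> list[dict]:
    """Query the model repeatedly for every combination of persona traits."""
    responses = []
    for party, age, gender in itertools.product(PARTIES, AGES, GENDERS):
        persona = f"You are a {age}-year-old {gender} who identifies as a {party}."
        for _ in range(n_per_persona):
            # Repetition with the same persona relies on the model's sampling
            # temperature to produce variation across "respondents."
            answer = query_llm(f"{persona}\n{QUESTION}\nAnswer briefly:")
            responses.append(
                {"party": party, "age": age, "gender": gender, "answer": answer}
            )
    return responses

if __name__ == "__main__":
    sample = silicon_sample(n_per_persona=5)
    print(len(sample), "synthetic responses generated")
```

In practice, approaches differ in how rich the personas are, how many repetitions are run, and how the raw completions are coded into closed-ended response categories.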

Evaluations of these models have compared the estimates produced in this way on a variety of measures, including sentiment towards partisan outgroups measured in previous studies, past presidential vote choice (overall and within particular demographically defined populations of interest), and experimental treatment effects. Argyle et al. report generally high correlations, overall and within subgroups, for presidential voting from 2016 to 2020, using American National Election Studies (ANES) data.

Perhaps even more impressively, Hewitt et al. find very high correlations (up to 0.9) between LLM predictions of experimental treatment effects and estimates from actual experiments conducted as part of the Time-Sharing Experiments for the Social Sciences (TESS) project. Notably, these correlations are stronger with more advanced models, and unpublished studies show correlations as high as, if not higher than, the published studies. This last point is particularly important because one key limitation of “predicting the past”, as with comparisons to past ANES results, is the concern that the outcomes being predicted are actually part of the training data. LLM researchers call this “leakage”, and it is a potential concern when assessing validation studies that compare estimates to previously recorded data, because the black-box nature of LLMs (even “open source” models) means we do not know what is in the training data.

So if LLMs are providing estimates of candidate support and treatment effects with relatively high correlations to those found in high quality studies, what is the problem? There are at least two.

First, even if they have similar means to real data, LLM-produced data have lower variation than responses from real people. Bisbee et al. conduct a study similar to Argyle and colleagues, though focusing on feeling thermometer scores instead of vote choice. They also find a high correlation between ANES-derived estimates and LLM-produced predictions. But they show that the variance of the estimates is far too low relative to ANES benchmarks. The LLM produces estimates that are more certain than they should be, which can lead to inaccurate hypothesis tests and conclusions.
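A stylized illustration of why this matters, using made-up numbers rather than actual ANES or Bisbee et al. figures: if a synthetic sample reproduces a gap in group means but understates the spread of responses, standard errors shrink and a difference the real data would treat as equivocal can look decisively significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical feeling-thermometer data: identical 3-point gap in group means,
# but the "synthetic" world has far less response variation than the real one.
n = 500
mean_a, mean_b = 55.0, 52.0          # assumed group means on a 0-100 scale
sd_real, sd_synthetic = 25.0, 5.0    # assumed spreads; the gap is the point

def t_test_with_sd(sd: float) -> tuple[float, float]:
    group_a = rng.normal(mean_a, sd, n)
    group_b = rng.normal(mean_b, sd, n)
    t, p = stats.ttest_ind(group_a, group_b)
    return t, p

for label, sd in [("human-like variance", sd_real), ("deflated variance", sd_synthetic)]:
    t, p = t_test_with_sd(sd)
    print(f"{label:>20}: t = {t:6.2f}, p = {p:.4f}")

# With the deflated variance, the same small gap looks overwhelmingly
# "significant": an illustration of how overconfident synthetic data can
# distort hypothesis tests.
```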

Similarly artificially low variance in LLM-produced estimates has been found in other contexts, including psychological batteries, moral judgment tasks, policy-preference items from the General Social Survey, responses to the American Community Survey questionnaire, and textual responses to open-ended questions. Wang et al. show that such LLM-produced samples flatten the representation of identity groups and ignore variation within those groups.

Second, and more fundamentally, LLMs are machines that interpolate within existing data rather than extrapolate from it. If they are to be used to replace polling – that is, the collection of new data – extrapolation is exactly the task that would be asked of them. It is far from clear that this is possible, at least with anything approaching the reliability of fresh data collection. Taking the case of American politics, we have seen the correlates of partisan support change over the last several elections, with more-educated voters who once leaned Republican now leaning towards the Democrats, and the reverse for less-educated voters. Models trained on old data may not reflect the patterns in new data. And that is before we get to the problem of unexpected campaign events, such as a mid-election candidate switch or an attempted assassination, as we saw in 2024.

Empirically, Kim and Lee have shown that even models fine-tuned with data from high-quality nationally representative samples like the General Social Survey – an approach that performs well at missing-data imputation – succeed only modestly at predicting the answers to unasked questions. They report, “For predicting the population proportion, our model with the 3% margin of error can predict only about 12% of true survey responses in the unasked opinion prediction task.” (p. 15) Furthermore, this error is not random: they find that predictive accuracy varies substantially by respondent characteristics like socioeconomic status and race.

Of course, there are different ways of approaching this problem. For example, these models can use web search to provide context that refines their predictions, rather than relying only on previous survey data seen in pre-training. The current generation of LLMs can search the web for news content about candidates or current events, information that could inform their predictions. Indeed, one study by Chu et al. did something very similar, pre-training models on media content to predict future public opinion about the topics covered in the news. While the correlations in that study, on the order of 0.4, are impressive for social science, they are much smaller than the correlations of 0.9 reported in other studies. Put another way, such media-augmented approaches to forecasting public opinion leave a great deal of room for error.

In short, research on LLMs has shown that their predictions can reasonably approximate estimates from high-quality survey samples, but they systematically underestimate the variance around those predictions. Furthermore, it is far from clear that they can produce reliable estimates of future opinion, or of opinion on unasked questions, which is what they would need to do to replace original data collection.

But as statistician George Box famously said, “All models are wrong, but some are useful.” And as social scientists, we are always asking “compared to what?” Fundamentally, the question to ask is not “Are LLM-produced samples good?” but rather “Are LLM-produced samples better than the next best method for a specific task?” Let us consider several different tasks in turn.

First, what about researchers who want estimates of the public’s views on issues or candidates? The research above showed that for some attitudes – support in a presidential election – this approach might do an okay job. However, research by Sun et al. comparing silicon sampling to ANES data finds that attitudes on questions other than presidential vote choice are less predictable, both overall and particularly within subgroups of interest like Democrats, Independents, and Black voters. And Boelaert et al. look at four items (political ideology, social trust, religious attendance, and happiness) in each of five countries from the World Values Survey. They conclude, “Over half of all estimates are at a substantial distance from the ground truth, and no LLM performs significantly better than random guesses.” Not every survey question is about national American candidates.

To our knowledge, only one attempt has been made to provide LLM-simulated poll results for future elections. Before the 2024 election, one startup published predictions of opinion in seven presidential battleground states and three US Senate contests, feeding the LLMs news content. The mean absolute error was in line with that of actual surveys, but the model both failed to provide estimates for third-party candidates and predicted the wrong winner in five of the ten modeled races.

However, because of the black-box nature of the model, the fact that it accessed current news information about the candidates, and the fact that these contests were all heavily polled races, we cannot rule out leakage as an explanation for the model’s performance. That is, was the performance improved by accessing the results of other surveys in the news? We really cannot tell. For those reasons, we should treat this performance as a high-water mark, not a floor. How these models would perform in a race for, say, state legislature is still unknown. That said, there are 180 state legislative seats up for election in 2025 alone, plus many more low-salience (and unpolled) contests at the municipal, county, and even statewide level. If advocates of this method want to show that they can produce reliable model estimates of candidate support without relying on access to existing polling, these contests are where they should start.

But even if these models can generate accurate topline estimates, that does not mean they can answer the research questions to which pollsters need answers. Researchers generally care not just about averages, but also about the relationships within the data, and on that count LLM-produced samples appear weaker. Bisbee et al. find that the relationships in the LLM-generated data do not correspond well to the real-world benchmark data. (Fundamentally, relationships in data are about covariance, so getting the variance right matters.) They regressed feeling thermometer scores, real and simulated, on respondent trait data. Compared to the ANES, some estimates of relationships looked relatively reliable, but the authors note, “we document far worse performance among other covariates, in several instances leading to substantively different conclusions—sometimes with opposite signs—than what we would learn from the actual ANES.” The fact that the relationships between some variables in the synthetic data look like those in the real data does not mean that is true for other relationships, and without real data to serve as a benchmark, we would not be able to distinguish the reliable from the unreliable.
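As a stylized illustration of the covariance point, again with entirely made-up numbers: two datasets can share nearly identical average thermometer scores while implying very different, even opposite-signed, regression relationships, which is exactly the kind of failure a topline comparison cannot detect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical standardized respondent trait and a 0-100 feeling thermometer.
# Both datasets are constructed to have roughly the same mean outcome but
# opposite relationships with the trait.
trait = rng.normal(0, 1, n)
real_thermo = 50 + 8 * trait + rng.normal(0, 20, n)       # real-world pattern
synthetic_thermo = 50 - 2 * trait + rng.normal(0, 5, n)   # flattened, sign-flipped

def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """OLS slope of y on x, with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coefs[1])

print("means (real, synthetic):", round(real_thermo.mean(), 1), round(synthetic_thermo.mean(), 1))
print("slope on trait (real):     ", round(ols_slope(trait, real_thermo), 2))
print("slope on trait (synthetic):", round(ols_slope(trait, synthetic_thermo), 2))
# Similar toplines, opposite-signed relationships: the kind of divergence a
# comparison of averages alone would never reveal.
```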

If the quantitative use cases are questionable, what about qualitative ones, like responses to open-ended questions or in-depth interviews? Research shows that LLMs (even when prompted with personas as above) respond quite differently from human respondents. Their open-ended responses are also longer, with more diverse vocabularies. Qualitative researchers who have tried working with LLM-based simulations of participants report that the simulated interactions lack nuance and context, as well as the spontaneity and dynamism that occur in interviews with real people. Furthermore, the flattening of variation, particularly within identity groups, is seen in open-ended responses much as it is in responses to closed-ended questions. Wang et al. show that these open-ended responses can also be more stereotypical and essentializing than representative. They note, “Llama-2 for Black women starts most responses with ‘Oh, girl,’ and uses phrases like ‘I’m like, YAASSSSS’” (p. 10).

It is worth noting that even the authors of the papers providing the most credible reasons for optimism about “silicon sampling”, Argyle and Hewitt and their respective co-authors, explicitly disclaim the idea of using LLMs as substitutes for human respondents. Instead they suggest another use for their methods: performing power calculations to help design studies. But power calculations require researchers to specify not only effect sizes (or differences in predicted means) but also variances. And because of the artificially low variance in the data, the artificial data would produce inaccurately high power estimates, or suggest sample sizes that would, in fact, be inadequate for the task.
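To see why, consider the standard sample-size approximation for a two-group comparison, in which the required n per group scales with the outcome’s variance divided by the squared effect size. Plugging a deflated variance from synthetic data into that formula shrinks the recommended sample in proportion, as the sketch below illustrates with made-up numbers.

```python
from scipy.stats import norm

def n_per_arm(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-arm sample size for detecting a difference in means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(round(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2))

delta = 3.0            # assumed effect size: a 3-point difference in means
sigma_real = 25.0      # assumed spread of real survey responses
sigma_synthetic = 5.0  # deflated spread of the kind seen in synthetic samples

print("n per arm, real-world variance: ", n_per_arm(delta, sigma_real))
print("n per arm, synthetic variance:  ", n_per_arm(delta, sigma_synthetic))
# The synthetic-variance calculation recommends a far smaller sample, one that
# would be badly underpowered against the true variability of human responses.
```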

Advocates of silicon sampling might push back on this dim view of the method by pointing out, accurately, that none of the studies mentioned here used the cutting-edge models of today. The speed of model development has outpaced researchers’ ability to evaluate the models’ applicability for many uses. Hewitt et al. showed that somewhat newer models outperformed older ones in their study (e.g., GPT-4 vs. GPT-3), and there may similarly be advances in pre-training, fine-tuning, and prompting strategies.

But research shows that such improvements do not always make LLMs work better for these specific use cases. Models tuned with human feedback have in some cases been less representative. In particular, alignment for safety has been shown to make models less willing to express negative or harmful sentiments (something real human respondents do all the time) and therefore less accurate. Progress on the models in general does not necessarily translate to better silicon sampling for pseudo-survey research. More fundamentally, to accurately represent survey respondents, models would need to behave in ways generally at odds with how they appear to be trained. The silicon respondents created with these models would sometimes need to not know the answers to factual questions, or give wrong answers, or show motivated reasoning. They would sometimes need to be undecided on opinion questions. They would need to be negative, and even offensive at times, rather than tuned to please the reader with attempts at accurate information. These are all well-documented features of public opinion, but not ones that researchers working to advance LLMs are generally seeking to replicate.

Furthermore, while the research cited here speaks to the quality of the synthetic data these models output, given the issues – both empirical and fundamental – with this approach, any cost-benefit calculation for their use also needs to factor in the cost of validating the output. This is where LLM-produced samples compare especially poorly: unlike other uses (translation, coding, etc.), researchers cannot simply spot-check the work of the model. Rather, to validate a synthetic sample, an entirely new, actual survey needs to be run, in which case the LLM-produced sample adds no additional value, just additional cost.

None of this is to say there are no uses for LLMs within the public opinion research community. At Survey 160 we have found them to be very effective for translations. There is also emerging interest in using LLMs to code responses to open-ended questions, though concerns are being raised about bias and performance for such uses as well, at least without fine-tuning the model.

But what differentiates these uses from synthetic respondents is that they can all be spot-checked by human researchers, while checking the results of a silicon sample would require conducting an entire separate survey. Ultimately, research on an LLM is not research on a population of interest, but research on a model of a population of interest. And while it is certainly less expensive, in our view the non-monetary costs in terms of accuracy and reliability outweigh the benefits.


While LLMs are new, one does not have to go all the way back to the Simulmatics Corporation to find controversies over the use of synthetic data in public opinion research. In 2014, political science graduate student Michael LaCour published in Science (with co-author Don Green) an article purporting to report the results of an experiment showing large and durable effects of deep canvassing conversations on attitudes towards gay rights. Trying to replicate the method, David Broockman, Josh Kalla, and Peter Aronow uncovered irregularities suggesting the data in the paper were likely fraudulent, leading to its retraction (including at the insistence of co-author Green). Broockman and Kalla went on to conduct a new, genuine study, also published in Science, that also showed large and lasting effects of the deep canvassing method.

If the topline conclusion – that deep canvassing works – was the same, does it matter that the data in the first study were likely fraudulent? Yes, because the patterns in the data differ in important ways. Broockman and Kalla found that trans and cis canvassers were roughly equally effective at shifting trans-rights attitudes, while LaCour purported to find that only conversations with gay canvassers had lasting effects on gay-rights attitudes. Those differences point not only to different political strategies, but also to different underlying mechanisms of effectiveness.

What cases of likely fraud show us is that simulated data, whether created to deceive researchers or created to assist us, can look quite similar to real data at first glance. But those superficial similarities do not make the simulated data useful in and of themselves, because they are reflections of our existing models of public opinion and human behavior. There is a fundamental difference between simulating data from a model based on assumptions about how public opinion works and conducting research to inform those assumptions. To conflate the two is to make a basic category error, and if we are to enrich models of public opinion – whether the models are formal, computational, or just the mental models in our heads – in ways that can surprise us and contradict our expectations, only true research will suffice.
