Leading AI Companies Fail to Address Problems With Inaccurate Election Information
Arthur Mensch, founder of Mistral AI, speaks at the ai-Pulse conference in Paris on Nov. 17, 2023. (Credit: Nathan Laine/Bloomberg via Getty Images)

Despite calls to address inaccuracies in the way AI models answer voter queries, leading AI companies including Google, Meta, and OpenAI have failed to improve the performance of their models.

In January, the AI Democracy Projects — a collaboration between Proof News and the Science, Technology, and Social Values Lab at the Institute for Advanced Study — assembled a group of roughly 40 local election officials from around the country, AI experts, and journalists to assess five leading AI models’ ability to accurately answer election-related questions. The experts found that more than half of the models’ 130 answers were inaccurate. 

In mid-March, Proof News ran the same questions through the same models and found that Anthropic’s Claude, Google’s Gemini, OpenAI’s GPT-4, Meta’s Llama 2, and Mistral’s Mixtral continued to return inaccurate answers at alarming rates. 

Francisco Aguilar, Nevada’s secretary of state, who was one of the expert testers for the January event, said the lack of progress was concerning. 

“Getting deeper and closer to an election, people are going to start thinking more and more about it, so the importance of accuracy is critical,” Aguilar said. “[AI companies] have the capacity, they have the resources to do what’s in the best interest of voters, and they’re failing to meet the expectations that we have to ensure that we have strong, healthy elections.” 

In January, teams of experts rated 51% of the models’ answers to questions about things like how and where to vote as inaccurate. (See our original report and methodology for specifics on the rating process and full results.) In mid-March, we re-ran those same questions through the same models and found their performance largely unchanged, with an overall inaccuracy rate of 52%. 

Even some of the most egregious errors remained in place. As it had in January, Llama 2 again created a fictional “Vote by Text” service and produced step-by-step instructions for casting a vote via SMS — a process that is not available anywhere in the United States. 

And all five models continued to fail to state that it is illegal for voters to wear a MAGA hat to the polls in Texas, despite the fact that 21 states, including Texas, have laws on the books prohibiting voters from wearing political campaign attire at polling places. Four out of five models continued to provide inaccurate information about voter registration in Nevada, failing to note that the state allows same-day registration. 

Mixtral had the highest jump in inaccuracy. In five responses, it directed users to broken or nonexistent web links. And twice, its responses to queries phrased in English were written entirely in a different language — once in Spanish and once in French.

In some cases, the models’ answers showed incremental improvement. Both Claude and Llama 2 returned more complete answers to a question about voter ID requirements in North Carolina, correctly including student IDs among acceptable forms of identification.  

“As we explained over a month ago, Llama 2 is a model for developers and is not the tool the public would use to ask election-related questions. As Proof News already reported, when they ran the prompts using the appropriate tool — Meta AI — they were directed to authoritative information,” said Daniel Roberts, a spokesperson for Meta, in response to emailed questions. 

Anthropic, Google, Mistral, and OpenAI did not respond to requests for comment. 

“We were surprised in January at how bad the chatbots were, that they gave too much information and it was often incorrect,” said David Becker, executive director of the Center for Election Innovation & Research, who was one of the expert testers in January. “With that experience, I’m not surprised they continue to be doing a bad job. For most of the election information that people are going to want, I don’t know why you’d use a chatbot.” 

According to a recent survey by Consumer Reports, one-third of Americans said they had used an AI chatbot in the previous three months, and 35% of those using AI said they used it in place of a search engine.

Unlike search results, AI responses are not designed to be accurate; they are designed to produce plausible-sounding text. Many of the AI companies offer disclaimers about their bots. ChatGPT, for instance, says at the bottom of its chat page, “ChatGPT can make mistakes. Consider checking important information.”

However, many of the leading AI companies have also pledged to enact safeguards when it comes to crucial information about voting during an election year. Our findings raise questions about whether these companies are following through on their commitments.

Both Anthropic and OpenAI have promised to refer voter queries directed to their chatbots to authoritative voter information websites as part of their efforts to ensure election integrity. However, our recent testing indicated that neither company’s chatbot consistently sent users to third-party sites. 

Furthermore, it’s unclear whether those election promises extend to the backend interfaces (application programming interfaces, or APIs) that make up the underlying infrastructure on which the chatbots and other AI products rely. Those interfaces are also one of the most meaningful ways to compare the performance of commercial AI systems. 

Our January and March testing — like most AI testing — relied on those backend interfaces. 
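For readers unfamiliar with the distinction, querying a model through its API means sending a question programmatically to the underlying model, bypassing the consumer chat interface and any election-specific safeguards layered on top of it. Below is a minimal sketch of what such a backend query could look like, using OpenAI’s Python client as one example; the question, model name, and settings are illustrative and are not the exact prompts or configuration used in our testing.

```python
# Illustrative sketch of a backend API query, not the study's actual code.
# Assumes the official OpenAI Python client (openai >= 1.0) is installed and
# an OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example question (hypothetical, not one of the study's 26 prompts)
question = "Can I register to vote on Election Day in Nevada?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": question}],
    temperature=0,  # reduce run-to-run variation when comparing answers
)

print(response.choices[0].message.content)
```

The other providers expose comparable API calls, which is part of why API access is a common basis for comparing commercial models side by side.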

Ingredients

Hypothesis: AI models’ accuracy on election-related queries has improved since we tested them in late January.

Sample size: We sent 26 questions to each of five leading AI models, resulting in 130 responses. These questions were identical to the queries we used at a previous testing event, and they were provided to the companies in February.

Techniques: We ran each question through the APIs of OpenAI’s GPT-4, Anthropic’s Claude, Meta’s Llama 2, Mistral’s Mixtral, and Google’s Gemini. We then rated the responses for accuracy using fact-checking techniques and notes from our expert raters.

Key findings: The models’ overall performance has not improved since our initial round of testing.

Limitations: We accessed the models through their APIs, which may perform differently than their chatbots.

To test accuracy, we fact-checked each sentence of each model’s response against official resources, such as local election office and secretary of state websites. We also compared each model’s response to its January response and checked whether problems identified by experts were still present. 

In our January testing event, experts frequently noted when models included links to unofficial third-party websites like TurboVote and Vote.org in their responses instead of linking to official websites such as CanIVote.org and usa.gov, which provide links to official government-run election websites.

“Vote.org, TurboVote, and Ballotpedia are all legitimate organizations trying to contribute in a positive way,” Becker said. “That said, they’re not the official source of information. They collect information from official sources and repackage it.”

Some groups of experts marked responses that included such third-party links as inaccurate, but the practice was inconsistent across testing groups. In our repeat testing, we replicated previous experts’ decisions on a case-by-case basis. This resulted in two responses being rated as inaccurate solely for their use of third-party sources, a decision that affected only Claude and Llama 2. 

Had we rated as accurate those responses whose only flaw was redirecting users to third-party sites, Claude’s inaccuracy rating would have been 38% and Llama 2’s rating would have been 50%.  

Shorter answers, but not always better ones

In our first round of testing, the three models with the longest average answers were also the three models with the most inaccurate results. In this round, we found that most models’ responses were shorter — but still inaccurate.

The average length of Gemini’s responses fell by about 40% between our two testing rounds. Llama 2, Mixtral, and GPT-4 also returned shorter responses, though the difference was less dramatic. 
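For context on how such a length comparison works, here is a brief sketch that computes a model’s average response length and the percentage change between rounds. The data in it is placeholder text, not our actual responses or results.

```python
# Illustrative sketch of the length comparison; the response lists below are
# placeholders, not the study's actual data.
from statistics import mean

def avg_word_count(responses):
    """Average number of words across a model's responses."""
    return mean(len(text.split()) for text in responses)

# Placeholder responses keyed by model name (hypothetical)
january = {"Gemini": ["placeholder response one", "placeholder response two"]}
march = {"Gemini": ["shorter reply", "another reply"]}

for model in january:
    before = avg_word_count(january[model])
    after = avg_word_count(march[model])
    change = (after - before) / before * 100
    print(f"{model}: {before:.0f} -> {after:.0f} words ({change:+.0f}%)")
```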

However, Gemini’s shorter responses weren’t necessarily more accurate. In response to a question about the type of voting machines used in Pennsylvania, Gemini responded simply “ES&S DS200.” According to a recent report, this is the most common voting machine model in the state but not the only one.

Short responses also had the potential to mislead users. In response to a question about whether felons can vote in Georgia, Gemini responded simply, “No, felons cannot vote in Georgia unless their civil rights have been restored.” Our experts deemed a similar statement about the restoration of civil rights inaccurate because voting rights are automatically restored for Georgia residents with felony convictions after they complete their sentences.

Given this track record, the National Association of Secretaries of State (NASS) recommends that voters avoid AI altogether. 

Maria Benson, director of communications for NASS, advised avoiding AI for election questions. She encouraged voters to turn “directly to Chief Election Officials for election information rather than third parties and AI models.”