Google’s new tool lets large language models fact-check their responses

It could assist the company in its efforts to embed AI in more and more of its products.

James O'Donnell

September 12, 2024

Stephanie Arnett/MIT Technology Review | Getty, rawpixel

As long as chatbots have been around, they have made things up. Such “hallucinations” are an inherent part of how AI models work. However, they’re a big problem for companies betting big on AI, like Google, because they make the responses the models generate unreliable.

Google is releasing a tool today to address the issue. Called DataGemma, it uses two methods to help large language models fact-check their responses against reliable data and cite their sources more transparently to users.

The first of the two methods is called Retrieval-Interleaved Generation (RIG), which acts as a sort of fact-checker. If a user prompts the model with a question—like “Has the use of renewable energy sources increased in the world?”—the model will come up with a “first draft” answer. Then RIG identifies what portions of the draft answer could be checked against Google’s Data Commons, a massive repository of data and statistics from reliable sources like the United Nations or the Centers for Disease Control and Prevention. Next, it runs those checks and replaces any incorrect original guesses with correct facts. It also cites its sources to the user.
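To make the draft-then-check idea concrete, here is a minimal sketch in Python. It is not Google's implementation: the in-memory table stands in for Data Commons, the [[lookup -> guess]] marker format is an assumption made for this example, and the figures and source name are invented placeholders.

```python
import re

# Hypothetical stand-in for a trusted statistics repository.
# The figure and the source name are made up for illustration.
TRUSTED_STATS = {
    "renewable share 2010": ("20%", "Example Statistics Agency"),
}

# Assumed draft format: each checkable figure is written as
# [[what to look up -> the model's own guess]].
CLAIM = re.compile(r"\[\[(.+?)\s*->\s*(.+?)\]\]")


def rig_check(draft):
    """Replace each marked guess with the stored figure and a citation."""

    def substitute(match):
        query = match.group(1).strip().lower()
        guess = match.group(2).strip()
        hit = TRUSTED_STATS.get(query)
        if hit is None:
            # No usable data in the store: keep the guess but flag it.
            return f"{guess} (unverified)"
        value, source = hit
        return f"{value} (source: {source})"

    return CLAIM.sub(substitute, draft)


draft = (
    "Renewables supplied [[renewable share 2010 -> 15%]] of power, "
    "and [[solar jobs 2010 -> 1 million]] people worked in solar."
)
print(rig_check(draft))
```

In this toy version, the first guess is swapped for the stored figure with its source attached, while the second has no match in the table and is passed through marked as unverified.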

The second method, which is commonly used in other large language models, is called Retrieval-Augmented Generation (RAG). Consider a prompt like “What progress has Pakistan made against global health goals?” In response, the model examines which data in the Data Commons could help it answer the question, such as information about access to safe drinking water, hepatitis B immunizations, and life expectancies. With those figures in hand, the model then builds its answer on top of the data and cites its sources.
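The retrieve-first flow can be sketched the same way. This is an illustrative toy rather than DataGemma's actual pipeline: the small list stands in for Data Commons, the keyword-overlap ranking is a simplifying assumption, the values are placeholders, and the final call to a language model is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stat:
    topic: str
    value: str
    source: str


# Hypothetical stand-in for a statistics repository; values are placeholders.
STORE = [
    Stat("pakistan life expectancy", "(placeholder value A)", "Example Health Agency"),
    Stat("pakistan safe drinking water access", "(placeholder value B)", "Example Water Survey"),
    Stat("iran gdp", "(placeholder value C)", "Example Economic Bureau"),
]


def retrieve(question, store, k=2):
    """Rank stored statistics by how many words they share with the question."""
    words = set(question.lower().replace("?", "").split())
    scored = [(len(words & set(stat.topic.split())), stat) for stat in store]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [stat for _, stat in relevant[:k]]


def build_prompt(question, facts):
    """Put the retrieved figures, with sources, ahead of the question."""
    lines = [
        f"[{i}] {fact.topic}: {fact.value} (source: {fact.source})"
        for i, fact in enumerate(facts, start=1)
    ]
    header = "Answer using only the numbered data below and cite it."
    return "\n".join([header, *lines, f"Question: {question}"])


question = "What is life expectancy in Pakistan?"
print(build_prompt(question, retrieve(question, STORE)))
```

The difference from the fact-checking approach is the order of operations: here the data is fetched before the model writes anything, so the answer is built on the retrieved figures rather than corrected afterward.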

“Our goal here was to use Data Commons to enhance the reasoning of LLMs by grounding them in real-world statistical data that you could source back to where you got it from,” says Prem Ramaswami, head of Data Commons at Google. Doing so, he says, will “create more trustable, reliable AI.”

DataGemma is available only to researchers for now, but Ramaswami says access could widen after more testing. If it works as hoped, it could be a real boon for Google’s plan to embed AI deeper into its search engine.

However, it comes with a host of caveats. First, the usefulness of the methods is limited by whether the relevant data is in the Data Commons, which is more of a data repository than an encyclopedia. It can tell you the GDP of Iran, but it’s unable to confirm the date of the First Battle of Fallujah or when Taylor Swift released her most recent single. In fact, Google’s researchers found that with about 75% of the test questions, the RIG method was unable to obtain any usable data from the Data Commons. And even if helpful data is indeed housed in the Data Commons, the model doesn’t always formulate the right questions to find it.

Second, there is the question of accuracy. When testing the RAG method, researchers found that the model gave incorrect answers 6% to 20% of the time. Meanwhile, the RIG method pulled the correct stat from Data Commons only about 58% of the time (though that’s a big improvement over the 5% to 17% accuracy rate of Google’s large language models when they’re not pinging Data Commons).

Ramaswami says DataGemma’s accuracy will improve as it gets trained on more and more data. The initial version has been trained on only about 700 questions, and fine-tuning the model required his team to manually check each individual fact it generated. To further improve the model, the team plans to increase that data set from hundreds of questions to millions.
