DataGemma: Using real-world data to address AI hallucinations

Sep 12, 2024

DataGemma are the world’s first open models designed to help address the challenges of hallucination by grounding LLMs in the vast, real-world statistical data of Google's Data Commons.

Prem Ramaswami

Head of Data Commons

James Manyika

SVP, Technology & Society

Large language models (LLMs) powering today’s AI innovations are becoming increasingly sophisticated. These models can comb through vast amounts of text and generate summaries, suggest new creative directions and even draft code. However, as impressive as these capabilities are, LLMs sometimes confidently present information that is inaccurate. This phenomenon, known as "hallucination," is a key challenge in generative AI.

Today we're sharing promising research advancements that tackle this challenge directly, helping reduce hallucination by anchoring LLMs in real-world statistical information. Alongside these research advancements, we are excited to announce DataGemma, the first open models designed to connect LLMs with extensive real-world data drawn from Google's Data Commons.

Data Commons: A vast repository of publicly available, trustworthy data

Data Commons is a publicly available knowledge graph containing over 240 billion rich data points across hundreds of thousands of statistical variables. It sources this public information from trusted organizations like the United Nations (UN), the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC) and census bureaus. Combining these datasets into one unified set of tools and AI models empowers policymakers, researchers and organizations seeking accurate insights.

Think of Data Commons as a vast, constantly expanding database filled with reliable, public information on a wide range of topics, from health and economics to demographics and the environment, which you can interact with in your own words using our AI-powered natural language interface. For example, you can explore which countries in Africa have had the greatest increase in electricity access, how income correlates with diabetes in US counties or your own data-curious query.

How Data Commons can help tackle hallucination

As generative AI adoption increases, we’re aiming to ground those experiences by integrating Data Commons within Gemma, our family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. These DataGemma models are available to researchers and developers starting now.

DataGemma will expand the capabilities of Gemma models by harnessing the knowledge of Data Commons to enhance LLM factuality and reasoning using two distinct approaches:

1. RIG (Retrieval-Interleaved Generation) enhances the capabilities of our language model, Gemma 2, by proactively querying trusted sources and fact-checking against information in Data Commons. When DataGemma is prompted to generate a response, the model is programmed to identify instances of statistical data and retrieve the answer from Data Commons. While the RIG methodology is not new, its specific application within the DataGemma framework is unique.

Example query: "Has the use of renewables increased in the world?" Applying the DataGemma RIG methodology leverages Data Commons (DC) for authoritative data.

2. RAG (Retrieval-Augmented Generation) enables language models to incorporate relevant information beyond their training data, absorb more context, and produce more comprehensive and informative outputs. With DataGemma, this was made possible by leveraging Gemini 1.5 Pro’s long context window. DataGemma retrieves relevant contextual information from Data Commons before the model initiates response generation, thereby minimizing the risk of hallucinations and enhancing the accuracy of responses.

Example query: "Has the use of renewables increased in the world?" Applying the DataGemma RAG methodology showcases greater reasoning and inclusion of footnotes.
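To make the difference between the two approaches concrete, here is a minimal Python sketch, not the DataGemma implementation. It assumes a toy in-memory dictionary in place of Data Commons, placeholder figures that are not real statistics, and a hypothetical inline marker format for RIG. The names TOY_STORE, MARKER, rig_postprocess and rag_build_prompt are invented for illustration. RIG checks statistics after the model has drafted them, while RAG retrieves data before generation and places it in the prompt.

```python
import re

# Toy stand-in for Data Commons. Keys and values are made-up placeholders,
# not real statistics and not the real Data Commons schema.
TOY_STORE = {
    "renewable energy share, World, 2010": "10.0%",
    "renewable energy share, World, 2020": "12.0%",
}

# Hypothetical inline marker a RIG-style model might emit next to a statistic:
#   [DC("<lookup key>") -> "<model's own guess>"]
MARKER = re.compile(r'\[DC\("([^"]+)"\) -> "([^"]+)"\]')


def rig_postprocess(draft: str, store: dict[str, str]) -> str:
    """RIG sketch: swap each guessed number for the retrieved one, if any."""
    def replace(match: re.Match) -> str:
        key, guess = match.group(1), match.group(2)
        retrieved = store.get(key)
        if retrieved is None:
            # Nothing to check against, so keep the guess but flag it.
            return f"{guess} (unverified)"
        return f"{retrieved} (source: toy store)"
    return MARKER.sub(replace, draft)


def rag_build_prompt(question: str, store: dict[str, str]) -> str:
    """RAG sketch: retrieve matching rows first, then prepend them to the prompt."""
    words = {w.lower().strip("?.,") for w in question.split()}
    # Naive keyword overlap stands in for a real retrieval step.
    rows = [
        f"{key}: {value}"
        for key, value in store.items()
        if words & {k.lower().strip(",") for k in key.split()}
    ]
    context = "\n".join(rows) if rows else "(no matching data)"
    return (
        f"Data:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the data above."
    )


if __name__ == "__main__":
    draft = 'Renewables were [DC("renewable energy share, World, 2020") -> "15%"] of use.'
    print(rig_postprocess(draft, TOY_STORE))
    print(rag_build_prompt("Has the use of renewables increased in the world?", TOY_STORE))
```

In this sketch the RIG path leaves the model's wording intact and only replaces the flagged figures, whereas the RAG path produces a longer prompt, which is consistent with the post's note that a long context window is what makes RAG practical.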

Promising results and future directions

Our findings using RIG and RAG are preliminary, but encouraging. We've observed notable enhancements in the accuracy of our language models when handling numerical facts. This suggests that users will experience fewer hallucinations for use cases across research, decision-making or simply satisfying curiosity. Explore these results in our research paper.

Illustration of a RAG query and response for the question "What progress has Pakistan made against health goals?" Supporting ground truth statistics are referenced as tables served from Data Commons. *Partial response shown for brevity.

Our research is ongoing, and we’re committed to refining these methodologies further as we scale up this work, subject it to rigorous testing, and ultimately integrate this enhanced functionality into both Gemma and Gemini models, initially through a phased, limited-access approach.

By sharing our research and making this latest Gemma model variant an “open” model once again, we aspire to facilitate the broader adoption of these Data Commons-led techniques for grounding LLMs in factual data. Making LLMs more reliable and trustworthy is key to ensuring they are indispensable tools for everyone, and to building a future where AI empowers people with accurate information, fostering informed decisions and a deeper understanding of the world around us.

Researchers and developers can also get started with DataGemma using these quickstart notebooks for both the RIG and RAG approaches. To learn more about how Data Commons and Gemma work together, read our Research post.
