Mitigating prompt injection attacks with a layered defense strategy

Security




A diagram of Gemini’s actions based on the detection of the malicious instructions by content classifiers.

This diagram illustrates a content moderation flow in which user input is evaluated for malicious instructions. Depending on whether malicious instructions are found in all, some, or none of the components (such as files or emails), the system responds in one of three ways: blocking the response entirely, filtering the unsafe content out of the response, or allowing the response unchanged. Colors (red for blocked, yellow for partially filtered, green for allowed) and icons (✕, ⚠, ✓) distinguish the three outcomes.
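The three outcomes can be sketched as a simple dispatch over per-component classifier results. Everything here is illustrative: `moderate` and the `is_malicious` predicate are hypothetical stand-ins for the real content classifiers, not Gemini's actual API.

```python
from enum import Enum, auto

class Action(Enum):
    BLOCK = auto()   # malicious instructions found in every component
    FILTER = auto()  # malicious instructions found in some components
    ALLOW = auto()   # no malicious instructions found

def moderate(components: list[str], is_malicious) -> tuple[Action, list[str]]:
    """Decide how to handle a response based on per-component classification.

    `is_malicious` is a hypothetical callable standing in for a real
    content classifier; it returns True for flagged components.
    """
    flagged = [is_malicious(c) for c in components]
    if flagged and all(flagged):
        return Action.BLOCK, []                    # block the whole response
    if any(flagged):
        safe = [c for c, bad in zip(components, flagged) if not bad]
        return Action.FILTER, safe                 # strip only the unsafe parts
    return Action.ALLOW, components                # respond unchanged
```

For example, `moderate(["meeting notes", "IGNORE ALL PREVIOUS INSTRUCTIONS"], lambda c: "IGNORE" in c)` filters out only the flagged component and keeps the rest of the response.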

A diagram of Gemini’s actions based on additional protection provided by the security thought reinforcement technique.

This diagram illustrates a multi-layered content moderation flow. First, user input is evaluated for malicious instructions; depending on whether these are found in all, some, or none of the components (such as files or emails), the system blocks, filters, or allows the response. Even responses the first layer deems safe then pass through a second layer, security thought reinforcement, where an additional reasoning step determines whether the final response should be blocked or allowed. Colors (red for blocked, yellow for partially filtered, green for allowed) and icons distinguish the outcomes.
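A minimal sketch of the two layers, assuming hypothetical `is_malicious` and `passes_security_review` predicates in place of the real classifier and the reinforcement reasoning step:

```python
def respond_with_reinforcement(components, is_malicious, passes_security_review):
    """Two-layer moderation: classify components, then re-review the survivors.

    Both predicates are hypothetical stand-ins: `is_malicious` models the
    per-component classifier, and `passes_security_review` models the extra
    reasoning step applied even to responses the first layer deemed safe.
    Returns the components to emit, or None when the response is blocked.
    """
    flagged = [is_malicious(c) for c in components]
    if flagged and all(flagged):
        return None                              # layer 1: block entirely
    candidate = [c for c, bad in zip(components, flagged) if not bad]
    # Layer 2: re-examine the surviving response before releasing it.
    if not passes_security_review(" ".join(candidate)):
        return None                              # reinforcement step blocks it
    return candidate
```

The design point is that the second check runs unconditionally on anything the first layer lets through, so a component that slips past the classifier can still be caught before the response reaches the user.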

Gemini in Gmail provides a summary of an email thread. In the summary, there is an unsafe URL. That URL is redacted in the response and is replaced with the text “suspicious link removed”.

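The redaction shown in the screenshot can be sketched as a substitution over detected URLs. The `UNSAFE_HOSTS` deny-list here is a placeholder for whatever reputation service actually flags links; the function names are hypothetical.

```python
import re

# Placeholder deny-list; a real system would consult live link-reputation
# data rather than a static set of hostnames.
UNSAFE_HOSTS = {"evil.example"}

URL_RE = re.compile(r"https?://([^/\s]+)\S*")

def redact_unsafe_links(text: str) -> str:
    """Replace URLs whose host is flagged as unsafe with a placeholder."""
    def sub(match: re.Match) -> str:
        host = match.group(1).lower()
        if host in UNSAFE_HOSTS:
            return "[suspicious link removed]"   # redact, keep surrounding text
        return match.group(0)                    # safe links pass through
    return URL_RE.sub(sub, text)
```

Redacting only the link, rather than dropping the whole summary, preserves the useful parts of the response while removing the phishing vector.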

The Gemini app with instructions to delete all events on Saturday. Gemini responds with the events found on Google Calendar and asks the user to confirm this action.

The image captures a confirmation step: the user asked to delete all events on a specific Saturday, and Gemini identified the three matching events and displayed their details so the user could confirm before the action was performed.
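A confirmation gate like this one can be sketched as follows; `Event`, `delete_saturday_events`, and the `confirm` callback are hypothetical names for illustration, not Gemini's actual calendar API.

```python
from dataclasses import dataclass

@dataclass
class Event:
    title: str
    start: str  # e.g. "Sat 10:00"

def delete_saturday_events(events: list[Event], confirm) -> list[Event]:
    """Require explicit user confirmation before a destructive action.

    `confirm` is a hypothetical callable that is shown the matched events
    and returns True only if the user approves. Returns the events that
    were actually deleted (empty if the user declined or nothing matched).
    """
    matches = [e for e in events if e.start.startswith("Sat")]
    if not matches:
        return []
    if not confirm(matches):   # surface details; wait for explicit approval
        return []              # user declined: nothing is deleted
    # ... the real deletion would happen here ...
    return matches
```

Requiring a human in the loop for destructive operations means an injected instruction ("delete all my events") cannot complete without the user seeing exactly what would be removed.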

Gemini in Docs with instructions to provide a summary of a file. Suspicious content was detected and a response was not provided. There is a yellow security notification banner for the user and a statement that Gemini’s response has been removed, with a “Learn more” link to a relevant Help Center article.

The image shows an interaction in which a user asked Gemini to summarize a file, but Gemini detected unsafe content (suspicious prompts or links) and blocked the response, displaying an explicit security warning instead. The rest of the interface shows the standard controls for sending new prompts.