Ask a techspert: What is inference?

In April, we introduced Ironwood, our seventh-generation Tensor Processing Unit (TPU) designed to power the age of generative AI inference. TPUs — which are chips that power AI systems — are not new to Google, but this latest generation is different: It’s meant to take AI systems beyond being responsive and help them be proactive instead. And it will accomplish this thanks to inference, the process that lets an AI system use a trained model to turn what it has learned into useful outputs.

To better understand this next era of AI computing, I asked senior product manager Niranjan Hira and distinguished engineer Fenghui Zhang to give me a crash course in inference.
I know what the word inference means — that, based on the information you’re given, you can come to some sort of conclusion. Is that in any way what it means when we’re talking about AI?
Niranjan: Kind of, yes. It’s an oversimplification, but I think it's easiest to understand inference as pattern matching. In the broadest sense when we’re talking about generative AI and inference, what we’re asking is: Can AI models match patterns to predict what you want? For example, if I said “peanut butter and ____” and asked an American audience to fill in the blank, they’d probably say “jelly.” That's a good example of inference for speech patterns, and that’s something that AI inference can do, but it goes way beyond that.
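To picture Niranjan’s fill-in-the-blank example in code, here’s a minimal, purely illustrative sketch: the “model” is just a table of how often each word followed the phrase in some hypothetical training text, and inference is simply picking the most likely continuation.

```python
# A toy illustration of inference as pattern matching. The counts are made up;
# a real language model learns far richer patterns than a frequency table.
from collections import Counter

# Hypothetical "learned" statistics: how often each word followed
# "peanut butter and" in some training text.
learned_continuations = Counter({"jelly": 920, "honey": 61, "bananas": 19})

def infer_next_word(counts: Counter) -> str:
    """The inference step: return the most probable continuation."""
    word, _ = counts.most_common(1)[0]
    return word

print("peanut butter and", infer_next_word(learned_continuations))
# -> peanut butter and jelly
```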
Fenghui: Inference in general is how we actually use the model to do something useful. First, we have to train the model: An AI model contains the model parameters, plus the model architecture and configuration, which is the code it needs to execute tasks, and these things combine to carry out its functionality. So inference is what lets us take all of that and actually use it.
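A toy way to see that split between training and inference (again, purely illustrative, not how any Google model works): training fits the parameters from example data, and inference runs the model’s code with those stored parameters on new input.

```python
# Training learns parameters; inference reuses them to make new predictions.
import numpy as np

# --- Training: learn parameters from example data ---
x = np.array([1.0, 2.0, 3.0, 4.0])          # inputs
y = np.array([2.1, 3.9, 6.2, 8.1])          # targets (roughly y = 2x)
slope, intercept = np.polyfit(x, y, deg=1)  # the "model parameters"

# --- Inference: run the model's code with the stored parameters ---
def predict(new_input: float) -> float:
    # The "architecture" here is just multiply-and-add; inference executes
    # that code using the parameters learned during training.
    return slope * new_input + intercept

print(predict(5.0))  # roughly 10: a prediction for an input the model never saw
```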
What kinds of AI models use inference?
Fenghui: Deep learning AI like language models, image generation models and audio models all use inference because they’re making predictions for what’s going to “happen” based on what they’ve learned from past data patterns.
Niranjan: Recommendation models use inference, too.
What’s an example of a recommendation model?
Fenghui: Most ads models are recommendation models, and so is the model that recommends YouTube videos to you. These are “traditional” (sometimes called “classical”) AI, not generative AI such as LLMs and image or video generation models, and they have been using inference for ages.
So inference isn’t new to AI, it’s just gotten better as AI has gotten more capable?
Fenghui: Yes. And inference isn’t just what allows AI models to predict. It’s what allows them to classify, too. The model can label things based on what it’s learned. Here’s a famous example: Many years ago, we gave an AI model a picture and asked if it could identify a cat in the image. Using data — and inference — it taught itself what a cat is and what it looks like, and it correctly identified the cat.
I remember that!
Fenghui: That was an example of a model using inference.
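That cat example is classification. As a hedged sketch of what classification inference looks like: a trained model turns an input into a score per label, and inference converts those scores into a decision. The scores below are made up; a real vision model would compute them from the image’s pixels.

```python
# Illustrative only: pretend these scores came from running an image through
# a trained model; inference turns them into a label.
import numpy as np

labels = ["cat", "dog", "toaster"]

def classify(scores: np.ndarray) -> str:
    """Inference step: convert raw scores to probabilities, return the top label."""
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax
    return labels[int(np.argmax(probs))]

print(classify(np.array([4.2, 1.1, -3.0])))  # -> "cat"
```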
Niranjan: More recently, do you remember a couple of years ago when people were talking about AI-created images that basically just ignored the laws of physics? People’s hands, for example, were often not depicted correctly. Models today do a much better job of that. They’re better at physics and texture, among other things. And the same thing goes for text translation. For instance: Language translation used to be statistical. It was usable but it wasn’t exactly right, and it certainly wasn’t conversational. But statistical translation led us to generative AI translation, which, today, lots of people feel comfortable using, even in their customer-facing products. We’re still using the process called inference, but the underlying AI and our computation capacity have improved dramatically.

Can you measure how well inference works?
Fenghui: We can when we measure how well a model performs at certain tasks. We also use inference to evaluate models and make them better: while we train a model, we keep running inference to check and improve its quality as we go.
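Here’s a rough sketch of what that can look like, with a deliberately tiny model (plain gradient descent on made-up data, nothing Gemini-specific): after each training step, we run inference on held-out examples to see whether quality is improving.

```python
# Train a tiny linear model and run inference on held-out data at every step.
import numpy as np

rng = np.random.default_rng(0)
x_train, x_eval = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_train, y_eval = x_train @ true_w, x_eval @ true_w

w = np.zeros(3)  # the parameters we are learning
for step in range(5):
    grad = x_train.T @ (x_train @ w - y_train) / len(x_train)
    w -= 0.1 * grad                                   # one training step
    eval_error = np.mean((x_eval @ w - y_eval) ** 2)  # inference on held-out data
    print(f"step {step}: eval error {eval_error:.3f}")  # should shrink each step
```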
And because of training setups like this, you’re seeing inference levels get better and better by industry benchmarks, I would think.
Niranjan: Yes. But there's also the question of human perception — how much have we all noticed these things getting better? And in general, it’s quite a lot. Something else we really care about at Google when we work on inference is privacy: We are careful about what we need to store for these experiences to work.
What are some examples of Google AI where we can see improved inference?
Fenghui: One of the best inference use cases we have at Google is AI Overviews. You type a query into Search and a very complex system farms it out to a bunch of models to try and get results back. It’s using inference to understand your query and to know what answer you want, and in the end, it summarizes what it learns into something very useful. Inference is also critical to a lot of the agentic work we’re doing. With agents, in addition to asking an AI model to deliver information based on its inference, you can make it do things for you. This is sort of an extension of inference as we formerly understood it.
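To make that fan-out idea concrete, here’s a purely hypothetical sketch; none of these functions are real Google APIs, and AI Overviews is far more complex. One query goes out to several models, each runs its own inference, and a final inference step summarizes the results.

```python
# A hypothetical fan-out-and-summarize pattern. The "models" are toy stand-ins.
def answer_query(query: str, models, summarizer) -> str:
    # Inference with each model (a real system would run these in parallel).
    partial_results = [model(query) for model in models]
    # One more inference pass to summarize everything into a useful answer.
    return summarizer(query, partial_results)

models = [lambda q: f"a fact about {q}", lambda q: f"another angle on {q}"]
summarizer = lambda q, parts: f"Summary for '{q}': " + "; ".join(parts)
print(answer_query("how TPUs work", models, summarizer))
```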
So inference is getting better at using data, or knowledge, to offer answers and even take action. How else is it changing?
Fenghui: Well, one thing that’s super important is the cost. We're trying to make inference as affordable as possible. Let's say we’re trying to make a smaller, more affordable version of Gemini available to people. We would work on the model’s inference to find ways to change the computation paradigm (the code that makes up the model) without changing the semantics (the principal task it’s supposed to do), in order to reduce the cost. It’s basically making a smaller, more efficient version of the model so more people can access its capabilities.
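One concrete, hedged example of changing how the computation is done without changing what the model does is quantization: store the same weights in a cheaper numeric format so inference moves less data around while giving nearly the same answers. This is just an illustration of the general idea, not a description of how any particular Gemini model is slimmed down.

```python
# Quantization sketch: the same weights, stored as 8-bit integers plus one
# scale factor instead of 32-bit floats, reconstruct to nearly the same values.
import numpy as np

weights = np.array([0.12, -0.7, 1.43, 0.05], dtype=np.float32)  # 4 bytes each

scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)           # 1 byte each

# At inference time, dequantize on the fly and compute as before.
restored = quantized.astype(np.float32) * scale
print(weights)   # original values
print(restored)  # nearly identical, at roughly a quarter of the storage
```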
How do we bring the cost down?
Fenghui: One way is to optimize the hardware. That’s why we have Ironwood coming out this year: it was designed inference-first, with more compute power and memory and optimization for certain numeric types. On the software side, we’re improving our compilers and our frameworks. Over time, we want AI inference to be more efficient: better quality, but with a smaller footprint that costs less, so it can be as helpful as possible.
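As a small illustration of why support for certain numeric types matters for cost (TPUs support formats such as bfloat16; standard 16-bit floats are used below only because they ship with NumPy): storing the same parameters at lower precision halves the memory an inference system has to hold and move.

```python
# The same one million parameters take half the space at 16-bit precision.
import numpy as np

params_fp32 = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)
params_fp16 = params_fp32.astype(np.float16)

print(params_fp32.nbytes / 1e6, "MB at 32-bit")  # ~4.0 MB
print(params_fp16.nbytes / 1e6, "MB at 16-bit")  # ~2.0 MB
```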