Picture of a man using audio dictation

Multimodal search: Why it should be part of your 2026 SEO and GEO strategy

Search engines and AI models no longer need you to type exact phrases. They can process a spoken description, an uploaded photo and even a short video clip alongside text or audio to understand intent.

Shayna Burns

12 November 2025

7 minute read

Recently in Australia, Google has been running ads demonstrating how the new Pixel 10 smartphone (with Gemini Live) lets people ask a question by sharing a photo or video alongside a spoken or written prompt:

Picture of a Google AI ad

Screenshot of a Google Pixel 10 phone with a mock search experience using multimodal search. In this experience, a user has shared a photo of a chalet-style building with the voice prompt, “What kind of wedding style would fit this venue best?”. Google Gemini Live has responded “A mountain or forest theme would be perfect. Is your plan to go vintage or modern?”, signalling the opportunity for a continued conversational search experience.

Another example of a Google ad you may have seen on free-to-air TV involves a woman holding her phone up to a display case of sunglasses, stating that she has a heart-shaped face and asking which of the sunglasses would look best.

Google has also been promoting how its AI Mode will allow users to conduct a search using a combination of video and voice:

A voice and visual search using Google Lens

An example of how a user has opened Google Lens to show Gemini the books on their bookcase, asking “If I enjoyed these, what are some similar books that are highly rated?”, demonstrating the combination of voice and visual search.

The process of using multiple search inputs (text, voice, video, photo) is called multimodal search, and it’s one of the most natural ways we query and look for information.

This article explains more about what multimodal search is, why it’s an integral part of SEO and GEO strategies (and something we encourage for 2026), and how you can optimise your assets in preparation for adoption.

What is multimodal search?

Multimodal search is the ability to query using a combination of text, voice, images and video instead of relying only on keywords.

Search engines and AI models no longer need you to type exact phrases. They can process a spoken description, an uploaded photo and even a short video clip alongside text to understand intent.

Examples you may already know:

  • Google Lens: Upload an image to identify a plant, landmark or product.
  • Voice search: “Hey Google, where’s the nearest late-night chemist?”
  • AI models: GPT-4o and Gemini can accept text, images and voice in a single conversation.

Behind the scenes, AI systems answer these multi-input queries using techniques like retrieval-augmented generation (RAG) – combining user input with external data sources – to ground their answers. (Readers looking for a more technical breakdown might enjoy this YouTube video from Google on how to build multimodal RAG.)
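
The video above goes much deeper, but the retrieval step of a multimodal RAG pipeline can be sketched as a nearest-neighbour lookup over a shared embedding space. The vectors and file names below are toy illustrations – in a real system, a multimodal encoder (such as CLIP or Gemini embeddings) would map text, images and audio into the same space:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings standing in for a real multimodal encoder.
# Note that images and HTML pages live in the SAME vector space.
documents = {
    "chalet_venue_photo.jpg": [0.9, 0.1, 0.0, 0.2],
    "beach_wedding_guide.html": [0.2, 0.8, 0.1, 0.1],
    "mountain_wedding_styles.html": [0.8, 0.2, 0.1, 0.3],
}

def retrieve(query_embedding, k=1):
    """Return the k documents closest to the combined (image + voice) query."""
    ranked = sorted(
        documents,
        key=lambda d: cosine(query_embedding, documents[d]),
        reverse=True,
    )
    return ranked[:k]

# A query combining a photo and a spoken question is encoded into the same
# space; the retrieved documents then "ground" the generated answer.
query = [0.85, 0.15, 0.05, 0.25]
print(retrieve(query, k=2))
```

The key design point is that because every modality is embedded into one space, a spoken question can retrieve an image, and a photo can retrieve a web page – which is exactly why diverse content assets become retrievable answers.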

In short, multimodal search is the machine world catching up to how humans naturally ask questions.

Replicating our natural way of inquiring

This new search experience allows people to transition from keywords to context-based searching, modelling our natural behaviour and reducing overall friction:

  • From keywords: “cheap flights Melbourne Tokyo”
  • To conversational queries: “What’s the cheapest way to fly from Melbourne to Tokyo in the next three months?”
  • To multimodal queries: [holds up photo of passport] “What visa do I need to visit Japan with this document?”

Ultimately, this progression lowers the barrier for individuals asking a question. You don’t need the right words – you just need to show or describe what you mean to get a suitable answer.

A search strategy for all sectors

It’s easy to dismiss multimodal search as a retail gimmick – after all, the things we photograph or record are often physical, commercial products. But the implications – and opportunities – go far wider.

Here are use cases across industries:

Travel: Upload a photo of a beach and ask, “Find me somewhere like this in Asia.”
This shortens the customer decision cycle by removing friction from inspiration to booking.

Higher education: A prospective student takes a photo of a course brochure and asks, “What are the career pathways from this program?”
This allows the university to surface official program details, testimonials and alumni outcomes directly in search, improving recruitment and reducing the drop-off between interest and application.

Healthcare: Take a photo of a rash and ask, “Is this rash serious enough to go to the hospital?”
This supports early triage and improves patient experience, though regulation is critical.

Customer support: Point your camera at a bill and ask, “What’s this fee?”
This increases customer trust and reduces call centre volume.

Public sector: Snap a broken street sign and report it directly to the local council.
This lowers reporting friction and improves citizen engagement.

The thread running through all of these examples is that multimodal search makes discovery and problem-solving more human.


Why this shift matters 

According to an article by DemandSage, 20.5 percent of people globally already use voice search, and there are an estimated 8.4 billion voice-enabled assistants in use – more than the global population itself. Meanwhile, the visual search market was valued at about US$41.7 billion in 2024 and is projected to hit US$151.6 billion by 2032.

Behind this, four forces are colliding:

1. Consumer behaviour: People are comfortable using voice (thanks to Siri, Alexa, Google Assistant) and visuals (thanks to TikTok, Instagram and Pinterest). Asking with a photo or a spoken question is already normal.

2. Technology capability: Generative AI models are natively multimodal. OpenAI’s GPT-4o, Google’s Gemini, and Anthropic’s Claude can all parse images, video and voice. Search engines are aligning with that reality. 

3. Accessibility:

  • People with low literacy, who can speak a query instead of typing it
  • People with visual impairments, who benefit from voice input and audio content over images
  • People with language barriers, who can show an image instead of describing it in unfamiliar vocabulary

4. SEO benefits: According to BrightEdge, web pages with videos are 53 percent more likely to rank on the first page of Google.

Multimodal isn’t a niche experiment. It’s becoming a common and popular way people discover information and therefore is becoming more central to the generative engine optimisation (GEO) strategies we create.

That’s why businesses need to act now. The richer your content library is across text, images, video and audio, the more likely it is to be pulled into an answer when someone searches multimodally.

If your content is only optimised for text queries, you risk becoming less visible in the next wave of search.

Why having diverse content is critical 

Optimisation is wasted if your brand doesn’t produce images, video and audio. Diverse content creation is the entry ticket to multimodal search – without it, you won’t even be in the game.

To succeed, you need to create images, video and audio assets that AI systems can recognise and surface – based on what your target audience is most likely to search for, of course.

  • Images: Product photography, diagrams, infographics and contextual lifestyle shots all feed visual search engines like Google Lens.
  • Video: Walkthroughs, demonstrations and explainers often answer “show me” and “how to” queries better than text ever can.
  • Audio: Podcasts, interviews and recorded snippets open doors to voice-led discovery, especially when paired with transcripts.

Text is still critical for structure and context, but in a multimodal world, text is the scaffolding – images, video and audio are the assets that get surfaced.

How to optimise for multimodal search

The good news is that making your digital assets friendly for multimodal search doesn’t mean rebuilding your digital presence. Rather, it’s about finally implementing the best practices that SEO, content and UX specialists have already been recommending:

1. Make your visuals machine-readable

  • Use descriptive alt text and clear file names
  • Provide structured data (e.g. Product, HowTo) to explain what’s in your images
  • Avoid uploading graphics without context (e.g. text baked into an image with no accompanying HTML)
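
As a sketch of the structured data point above, product markup is usually published as JSON-LD. The product details below are invented for illustration; the generated JSON would sit in the page inside a `<script type="application/ld+json">` tag:

```python
import json

# Minimal schema.org Product markup (hypothetical product and URLs).
product_schema = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Heart-Frame Sunglasses",
    "image": "https://example.com/images/heart-frame-sunglasses.jpg",
    "description": "Oversized sunglasses with frames suited to heart-shaped faces.",
}

print(json.dumps(product_schema, indent=2))
```

Pairing markup like this with the actual image file (with a descriptive file name and alt text) gives visual search engines both the pixels and the context.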

2. Make your audio and video searchable

  • Always provide transcripts and captions
  • Add schema markup (e.g. VideoObject, PodcastEpisode)
  • Ensure key takeaways are present in both the audio/video and the text description
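
A minimal VideoObject example, again with illustrative details, shows how a transcript can travel with the video so the spoken content is machine-readable:

```python
import json

# Minimal schema.org VideoObject markup (video details are illustrative).
video_schema = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "How to choose a wedding venue",
    "description": "A walkthrough of five questions to ask when touring a venue.",
    "thumbnailUrl": "https://example.com/thumbs/venue-walkthrough.jpg",
    "uploadDate": "2025-11-12",
    "duration": "PT4M30S",  # ISO 8601 duration: 4 minutes 30 seconds
    "transcript": "Welcome. Today we walk through five questions to ask on a venue tour.",
}

print(json.dumps(video_schema, indent=2))
```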

3. Optimise for voice and natural language

  • Include FAQs and conversational content on your site
  • Think about the kinds of ‘show and tell’ queries people might make in relation to your business
  • Write in a way that could be cleanly quoted by a search engine or LLM
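
FAQ content can also be marked up so a conversational question maps to a concise, quotable answer. The question and answer below are illustrative:

```python
import json

# Minimal schema.org FAQPage markup pairing a natural-language question
# with a short answer a search engine or LLM could quote cleanly.
faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Which sunglasses suit a heart-shaped face?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Frames that are wider at the bottom help balance a heart-shaped face.",
            },
        }
    ],
}

print(json.dumps(faq_schema, indent=2))
```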

Bonus: OpenAI’s Cookbook on Multimodal is a good resource for developers interested in building multimodal experiences.

Multimodal search in summary

Google’s recent ads are more than hype – they’re a signal that search no longer needs to be just about typing text. 

The businesses that build multimodal search strategies now will be more discoverable, relevant and trusted in an AI-driven world.

The question isn’t whether your customers will use voice, image or video to search; it’s whether your brand will be ready when they do.

Key takeaways

  • Multimodal search is here: Voice, image and video queries (and their combinations) are already normalised, and multimodal’s adoption will continue to grow in 2026.
  • It’s cross-industry: Don’t dismiss it as being for e-commerce only. It can apply to travel, higher education, healthcare, finance and beyond.
  • Strategy, not gimmick: Treat it as a shift in search behaviour, not a feature.
  • Get discoverable: Optimise your visuals, transcripts and conversational content so machines – and people – can find you.


An abridged version of this article originally appeared in Marketing mag

Want to know how well your current content is optimised for LLMs and multimodal search?

We'd be happy to discuss a GEO audit for your website. 

Get in touch
