Unlock multimodal search at scale: Combine text & image power with Vertex AI

The way users search is evolving. When searching for a product, users might type in natural-sounding language or search with images. In return, they want tailored results that are specific to their query. To meet these demands, developers need robust multimodal search systems.

In this blog post, we’ll share a powerful approach to building a multimodal search engine using Google Cloud’s Vertex AI platform. We’ll combine the strengths of Vertex AI Search and vector search, using an ensemble method based on weighted Reciprocal Rank Fusion (RRF). This approach allows for:

  • Improved user experience: Searching becomes more intuitive and less reliant on finding the “perfect” keywords.
  • Enhanced product discovery: Users can uncover items they might not have found with text alone.
  • Higher conversion rates: More relevant and engaging search results lead to happier customers and increased sales.

Why a combined approach matters

Think about how you search for products online. Suppose you search for “homes with a large backyard” or “white marble countertops”. Some of this information might be stored in text, while some might only be available in images. When you search for a product, you want the system to look through both modalities.

One approach might be to ask a large language model (LLM) to generate a text description of each image. But this can be cumbersome to manage over time and adds latency for your users. Instead, we can use image embeddings and combine the image search results with the text search results from Vertex AI Search. Together, this multimodal approach delivers:

  • Richer visual understanding: Multimodal embeddings capture the complex visual features and relationships within images, going beyond simpler text annotations.

  • Image-based queries: Users can directly search using an image, allowing for more intuitive discovery based on visual inspiration.

  • Precise filtering: Filtering by detailed attributes like size, layout, materials, and features becomes possible, leading to highly accurate search and curated results.

Google Cloud’s Vertex AI platform provides a comprehensive set of tools for building and deploying machine learning solutions, including powerful search capabilities:

  • Vertex AI Search: A highly scalable and feature-rich engine for many types of search. It supports advanced features like faceting, filtering, synonyms, and custom relevance ranking. It also enables advanced document parsing, including unstructured documents (PDFs) and even those with embedded graphics (for example, tables and infographics).

  • Vertex AI multimodal embeddings API: This is used to generate image embeddings (numerical representations of images).

  • Vertex AI Vector Search: This is used as the vector database to store the embeddings, along with metadata, for searching. It can store both sparse embeddings (for example, keyword-based representations of text descriptions) and dense embeddings (for example, image embeddings).
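
To make the embedding idea concrete, here is a minimal sketch in plain Python of what a nearest-neighbor lookup over image embeddings does. It is a brute-force stand-in for illustration only, not the Vector Search API. The product IDs and the three-dimensional vectors are made up; real multimodal embeddings have far more dimensions, and a managed vector database uses approximate search rather than scanning every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, index, top_k=2):
    """Return the top_k (item_id, similarity) pairs, most similar first."""
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy "image embeddings" keyed by product ID.
toy_index = {
    "marble-countertop": [0.9, 0.1, 0.0],
    "oak-countertop": [0.6, 0.6, 0.1],
    "garden-bench": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the image a user uploaded.
query_embedding = [1.0, 0.0, 0.0]
neighbors = nearest_neighbors(query_embedding, toy_index)
print(neighbors)
```

The product whose vector points in nearly the same direction as the query comes back first, which is the behavior the vector database provides at scale.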

Our ensemble approach: Text + image power

To create our multimodal search engine, we’ll use an ensemble approach that combines the strengths of Vertex AI Search and vector search for images:

  1. Text search with Vertex AI Search:

    • Index your product catalog data (names, descriptions, attributes) into a data store using Vertex AI Agent Builder.

    • When a user enters a text query, Vertex AI Search returns relevant products based on keyword matching, semantic understanding, and any custom ranking rules you’ve defined.

    • Vertex AI Search can also return facets, which can be used for further filtering.

    • You can even visualize how unstructured or complex documents are parsed and chunked.

  2. Image search with vector embeddings:

    • Generate image embeddings for your products using the multimodal embeddings API.

    • Store these embeddings in Vector Search.

    • When a user uploads an image or text, convert it to an embedding and query the vector database to find visually similar product images.

  3. Combining results with weighted RRF:

    • Reciprocal Rank Fusion (RRF): This method merges several ranked lists into one. Each item receives a score equal to the sum, over the lists it appears in, of 1 / (k + rank), where rank is the item’s position in that list and k is a smoothing constant. Items that rank near the top of multiple lists score highest.

    • Weighted RRF: Assign a weight to each result list, one for the text results (from Vertex AI Search) and one for the image results (from Vector Search), and multiply each list’s contribution by its weight. This allows you to adjust the importance of each modality (Vertex AI Search or Vector Search) in the final ranking.

    • Ensemble: Combine the text and image search results, re-rank them using the weighted RRF score, and present the blended list to the user.
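
The fusion step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact scoring function used in any particular deployment: the function name `weighted_rrf`, the product IDs, and the weights 0.6 and 0.4 are hypothetical, and k=60 is a commonly used default for the smoothing constant that you would tune for your own data.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked lists of item IDs into a single ranking.

    score(item) = sum over lists of weight / (k + rank),
    where rank starts at 1 for the top result of each list.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += weight / (k + rank)
    # Highest score first; ties broken by item ID so output is deterministic.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical product IDs, best match first in each list.
text_results = ["p1", "p2", "p3"]    # e.g. from Vertex AI Search
image_results = ["p3", "p1", "p4"]   # e.g. from Vector Search

fused = weighted_rrf([text_results, image_results], weights=[0.6, 0.4])
for item_id, score in fused:
    print(f"{item_id}: {score:.5f}")
```

Because RRF works on ranks rather than raw scores, the two systems do not need comparable score scales. Items that appear in only one list still receive a score, so products found by a single modality remain in the blended results.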

To enhance the search experience, use the faceting capabilities of Vertex AI Search (part of Vertex AI Agent Builder):

  • Define facets: Based on your product data, create facets for categories, attributes (color, size, material), price ranges, etc.

  • Dynamic filtering: Allow users to interactively refine their searches using these facets, narrowing down the results to the most relevant products. The filters are “dynamic” because they adjust automatically based on the returned results.

  • Natural language query understanding: If your textual data is structured, you can enable natural language query understanding in Vertex AI Search to improve query results. You can then parse the filters from the response and apply the same filters to Vector Search using namespaces.
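
The faceting and filtering ideas above can be sketched in plain Python. This is an in-memory stand-in for illustration, not the Vertex AI Search or Vector Search API: the helper names `facet_counts` and `matches_restricts`, the product records, and the parsed filter are all hypothetical. In Vector Search itself, filtering is expressed as namespace restricts on datapoints and queries and is applied by the service.

```python
from collections import Counter

def facet_counts(results, facet_keys):
    """Count the values of each facet key across the current result set."""
    counts = {key: Counter() for key in facet_keys}
    for product in results:
        for key in facet_keys:
            if key in product:
                counts[key][product[key]] += 1
    return counts

def matches_restricts(product, restricts):
    """True if the product has an allowed value in every namespace."""
    return all(product.get(namespace) in allowed
               for namespace, allowed in restricts.items())

# Hypothetical product metadata stored alongside the embeddings.
products = [
    {"id": "p1", "color": "white", "material": "marble"},
    {"id": "p2", "color": "white", "material": "quartz"},
    {"id": "p3", "color": "black", "material": "marble"},
]

# Hypothetical filter parsed from a query such as "white countertops",
# written as namespace -> allowed values.
restricts = {"color": ["white"]}

filtered = [p for p in products if matches_restricts(p, restricts)]
facets = facet_counts(filtered, ["color", "material"])
print([p["id"] for p in filtered])
print({key: dict(counter) for key, counter in facets.items()})
```

Recomputing the facet counts from the filtered results is what makes the filters dynamic: after the color filter is applied, only the materials that still occur in the remaining products are offered to the user.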

Why this approach works

This approach gives developers the best of both worlds by combining the rich features of Vertex AI Search (for example, the parsing pipeline) with the ability to use images directly as queries. It’s also flexible and customizable: you can adjust the weights in your RRF ensemble and tailor facets to your specific needs.

Above all, this approach gives your users what they need – the ability to search intuitively using text, images, or both, while offering dynamic filtering options for refined results.

Get started with multimodal search

By leveraging the power of Vertex AI and combining text and image search with a robust ensemble method, you can build a highly effective and engaging search experience for your users. Get started: 

  1. Explore Vertex AI: Dive into the documentation and explore the capabilities of Vertex AI Search and embedding generation.

  2. Experiment with embeddings: Test different image embedding models and fine-tune them on your data if needed.

  3. Implement weighted RRF: Design your scoring function and experiment with different weights to optimize your search results.

  4. Natural language query understanding: Use the built-in capabilities of Vertex AI Search to generate filters on structured data, then apply the same filters to Vector Search.

  5. Filters in vector search: Apply filters to your image embeddings to give users more control over the results.