
Multimodal RAG: when summary-based stops being enough

A founder asked why their AI assistant kept saying 'the chart shows a positive trend' instead of reading the actual numbers. The pipeline was doing exactly what it was designed to do, and that was the problem. Here is how Sapota decides between summary-based and native multimodal RAG.


A SaaS founder pinged us last quarter with a complaint that sounded familiar. Their AI assistant, built on top of a research-paper knowledge base, kept giving answers like "the chart shows a positive trend in Q3 revenue" instead of saying "Q3 revenue was 4.2 million, up 18% from Q2."

The retrieval was finding the right page. The vision LLM was rendering a coherent response. The pipeline did exactly what it was designed to do. The problem was that the design itself was lossy by construction.

This is the moment most teams discover that their multimodal RAG pattern is the bottleneck, not their model.

What was actually happening

Their pipeline used the standard summary-based approach, the one most tutorials and managed services default to. Call it Pattern A; a minimal sketch of the indexing flow follows the steps below:

  1. Parse the PDF, extract text + tables + images.
  2. Pass each image to a vision LLM with a prompt like "describe this image in detail."
  3. Embed the resulting text summary alongside the regular text chunks.
  4. At query time, retrieve text summaries and feed them to a text LLM.
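
The sketch below is illustrative rather than their production code: describe_image is a hypothetical wrapper around whichever vision LLM you call, the embedding model is a placeholder, and the unstructured parameter names drift between library versions.

```python
# Pattern A indexing: images become text summaries before anything is embedded.
from unstructured.partition.pdf import partition_pdf
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text embedding model


def describe_image(image_path: str) -> str:
    """Hypothetical vision LLM call. This is the lossy step: whatever the
    prompt fails to ask for never makes it into the index."""
    raise NotImplementedError("call your vision LLM provider here")


def index_pdf(path: str, store: list) -> None:
    # Step 1: parse text, tables, and images out of the PDF.
    elements = partition_pdf(filename=path, extract_images_in_pdf=True)
    for el in elements:
        if el.category == "Image":
            text = describe_image(el.metadata.image_path)  # step 2: summarize the image
        else:
            text = el.text                                  # text and table elements
        # Step 3: embed the summary alongside the regular text chunks.
        store.append({"text": text, "vector": embedder.encode(text), "source": path})
```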

Every step is reasonable in isolation. The compounding loss is not visible in a 50-question demo. It shows up when a CFO or analyst tries to interrogate the data and the assistant can only paraphrase what the indexer wrote down weeks ago.

The vision LLM in step 2 had been told to "describe in detail." It produced fluent prose: "The chart shows quarterly revenue with a positive growth trajectory across Q1 through Q3." It did not transcribe the axis labels, the data points, or the legend. By the time the user query arrived, the actual numbers no longer existed in the index.

Why the obvious fix is not always the right fix

The instinct most teams have at this point is to switch to "native multimodal" retrieval with CLIP or SigLIP, the approach we will call Pattern B: embed images directly into a shared vector space with text, retrieve the raw image at query time, and let a vision LLM read it on the fly.
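
An index-time sketch of that shared space, using the public clip-ViT-B-32 checkpoint from sentence-transformers as a stand-in for whichever image-text encoder you actually pick (SigLIP is used the same way conceptually):

```python
# Pattern B indexing: text chunks and raw images embedded into one vector space.
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # example checkpoint, not a recommendation


def embed_text(chunks: list[str]):
    return clip.encode(chunks, normalize_embeddings=True)


def embed_images(paths: list[str]):
    return clip.encode([Image.open(p) for p in paths], normalize_embeddings=True)

# At query time the question (or a query image) is encoded with the same model,
# the nearest raw images are retrieved, and a vision LLM reads them on the fly.
```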

This works. It also costs roughly 5 to 10 times more per query, requires a vector database that supports multi-vector collections, and doubles the operational surface area (two embedding models, image storage, vision LLM at query time instead of just at index time).

For a B2B SaaS founder running on a $200k seed cheque, that swap is not a free upgrade. The right question is whether the loss is structural or fixable inside the existing pattern.

How Sapota decides

We treat the choice as three questions, in order:

1. What fraction of the answer-bearing content lives in images, charts, or tables?

If it is under 10%, summary-based is almost always the right call. Fix the prompt, not the architecture. A prompt that says "transcribe all visible numbers, axis labels, and legends before describing the chart" recovers most of the lost fidelity for the cost of a slightly longer summary.
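
In code that is a one-string change; the wording below is an illustration, not a tested incantation:

```python
# Before: lossy by construction.
LOSSY_PROMPT = "Describe this image in detail."

# After: transcription first, interpretation last. Exact wording is illustrative.
FAITHFUL_PROMPT = (
    "Transcribe everything readable in this figure first: title, axis labels, "
    "legend entries, and every visible data point or table cell. "
    "Only after the transcription, add one sentence describing the trend."
)
```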

If it is 10% to 30%, summary-based with a stricter image extraction prompt plus table-as-HTML preservation usually clears the bar. We also recommend caching the original image URL in the chunk metadata, so the front end can render the source figure inline next to the answer. Users perceive that as "the AI showed me the chart" even though the model never read it directly.
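
Concretely, the summary stays the searchable text while the metadata keeps everything needed to render or re-read the source later; the field names and paths below are illustrative:

```python
# Illustrative chunk layout for the 10-30% band: the summary gets embedded,
# the metadata preserves the table HTML and the original figure location.
chunk = {
    "text": "Bar chart of quarterly revenue. Q3: $4.2M, up 18% vs Q2. ...",
    "metadata": {
        "source_pdf": "board-deck.pdf",                              # hypothetical file
        "page": 12,
        "table_as_html": "<table><tr><th>Quarter</th>...</table>",   # preserved, not discarded
        "image_url": "s3://corpus-bucket/figures/p12_fig1.png",      # rendered by the front end
    },
}
```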

If it is above 30%, the summary-based pipeline starts to cost you in a different way: every change to the extraction prompt template forces you to re-summarize and re-index the entire image corpus. This is where native multimodal becomes worth the operational complexity.

2. Does the user query benefit from "show me similar visual" search?

Pattern A cannot do image-to-image retrieval. If the product genuinely needs "find slides that look like this one" or "match this product photo to our catalog," that capability does not exist in summary-based pipelines and never will. Adding CLIP is the only path.
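
If that requirement is real, query time looks roughly like the sketch below, reusing the same encoder from the Pattern B index; semantic_search here stands in for whatever nearest-neighbour search your vector store exposes:

```python
# Image-to-image retrieval: only possible when images were embedded directly.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # must match the index-time encoder


def similar_figures(query_image_path: str, corpus_image_vectors, top_k: int = 5):
    query_vec = clip.encode(Image.open(query_image_path), normalize_embeddings=True)
    # corpus_image_vectors: the image embeddings stored at index time
    return util.semantic_search(query_vec, corpus_image_vectors, top_k=top_k)[0]
```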

3. What is the per-query cost ceiling?

Vision LLM calls at query time run roughly $0.005 to $0.03 per image, depending on provider. A pipeline that pulls the top-5 images into a Claude or GPT-4V call therefore lands somewhere around $0.025 to $0.15 per question, before counting the input text tokens. At 10,000 queries a month that is $250 to $1,500. For a consumer-facing assistant that is fine. For an internal compliance tool that runs once a week, the math changes.
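
The back-of-envelope math, parameterised so you can plug in your own provider's pricing; the defaults are the ballpark figures above, not a quote:

```python
# Rough query-time vision cost model. Check your provider's current pricing.
def monthly_vision_cost(per_image_usd: float, top_k: int, queries_per_month: int) -> float:
    return per_image_usd * top_k * queries_per_month


low = monthly_vision_cost(0.005, top_k=5, queries_per_month=10_000)   # $250
high = monthly_vision_cost(0.03, top_k=5, queries_per_month=10_000)   # $1,500
```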

What we shipped for the founder

The diagnosis was that roughly 20% of their corpus was charts and tables, the user query distribution was almost entirely "what was the value of X" rather than "find me a similar diagram," and they were already cost-sensitive. A full Pattern B migration would have been over-engineering.

The fix was three changes inside Pattern A:

  1. Image extraction prompt rewrite. Replaced "describe this image" with a structured template asking for chart type, axis labels, transcribed data points, legend entries, and only then a one-sentence interpretation. Summary length doubled but precision recovered.
  2. Table preservation as HTML in chunk metadata. The original unstructured parser was already producing text_as_html. Their indexer was discarding it. We started storing it, then included it inline in the LLM prompt at answer time.
  3. Image URL passthrough. Every chart summary now carries the original S3 URL of the source image. The front end renders the image alongside the answer, as the assembly sketch below shows. Users self-correct when the assistant misreads, because they can see the source.
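
The answer-time assembly, roughly; build_prompt and the chunk fields reuse the illustrative names from the sketches above and are not their production code:

```python
# Answer-time assembly for the revised Pattern A pipeline (illustrative sketch).
def build_prompt(question: str, chunks: list[dict]) -> tuple[str, list[str]]:
    context_parts, figure_urls = [], []
    for c in chunks:
        context_parts.append(c["text"])
        html = c["metadata"].get("table_as_html")
        if html:                              # change 2: table HTML inline in the prompt
            context_parts.append(html)
        url = c["metadata"].get("image_url")
        if url:                               # change 3: URL passed through to the front end
            figure_urls.append(url)
    prompt = (
        "Answer using only the context below. Quote exact figures.\n\n"
        + "\n\n---\n\n".join(context_parts)
        + f"\n\nQuestion: {question}"
    )
    return prompt, figure_urls  # the front end renders figure_urls next to the answer
```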

Accuracy on numerical questions went from "frequently wrong" to "matches a manual analyst within rounding" on their internal eval set of 80 questions. No vector database migration. No vision LLM at query time. Cost stayed flat.

When you do need to switch

Pattern B is the right call when:

  • The corpus is structurally visual: product catalogs, scientific diagrams, slide decks, infographics.
  • Image-to-image retrieval is a real product requirement, not a nice-to-have.
  • The legal or medical context demands the model see the actual artifact, not a paraphrase. Compliance reviewers will reject "AI summarized the X-ray" but accept "AI read the X-ray."
  • You have the operational maturity for a multi-vector vector DB (Qdrant, Weaviate) and the budget for a vision LLM at query time; a minimal collection sketch follows this list.
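
What a multi-vector collection means in practice, using Qdrant's named vectors as the example; the sizes depend on the encoders you choose (384 for a MiniLM-style text model, 512 for CLIP ViT-B/32):

```python
# One Qdrant collection holding both a text and an image vector per point.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance
client.create_collection(
    collection_name="multimodal_docs",
    vectors_config={
        "text": VectorParams(size=384, distance=Distance.COSINE),
        "image": VectorParams(size=512, distance=Distance.COSINE),
    },
)
```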

The mistake we see most often is teams jumping to Pattern B because the summary-based version "felt limited," when actually the prompt was lazy and the post-retrieval rendering was missing.

A note on ColPali

For document-heavy use cases (research papers, financial filings, slide decks), there is a third option worth knowing about: ColPali skips the OCR/extraction pipeline entirely and treats each page as a single image, using a vision-language model to compute patch-level embeddings.

It outperforms both Pattern A and Pattern B on visually complex documents in published benchmarks, but it carries 32x the storage cost of a bi-encoder and requires GPU inference. Worth a separate conversation, which we are writing up next.

If you are facing this trade-off

Sapota offers an architecture review for teams shipping multimodal RAG. We take a working pipeline, a representative eval set, and a sample of your question distribution, and return a concrete recommendation on which pattern fits, with the trade-offs spelled out against your specific cost and recall constraints.

If your AI assistant is paraphrasing charts when users want numbers, or quoting numbers that turn out to be hallucinated, that is the diagnostic conversation we run. Reach out via the AI engineering page and tell us what you are seeing.
