
RAG-Anything: multi-modal PDF+image RAG in 20 min (2026)
RAG-Anything builds multi-modal RAG over PDFs, images, and tables in about 20 minutes — gpt-4o-mini as the text brain, gpt-4o for vision, roughly $0.02–$0.06 in API calls per 10-page document.
What you will ship
By the end of this tutorial you will have a Python script that ingests any PDF containing text, embedded images, data tables, and mathematical equations, then answers natural language questions against all four modalities simultaneously. RAG-Anything (built on top of LightRAG, arXiv 2510.12323) wraps a multimodal knowledge-graph pipeline — you supply an OpenAI key, a file path, and three callback functions. It handles MinerU-based PDF parsing, per-modality processors, and knowledge-graph construction automatically. Prerequisites: Python 3.10+, a valid OPENAI_API_KEY, and poppler-utils installed for PDF-to-image rendering. Budget roughly $0.02–$0.06 in OpenAI API calls per 10-page document at gpt-4o-mini + gpt-4o rates.
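Before wiring anything up, it is worth failing fast on a missing API key. This small helper is not part of RAG-Anything — just a hypothetical convenience you can drop at the top of the script:

```python
import os
import sys

def require_env(name: str) -> str:
    """Exit with an actionable message if a required environment variable is unset."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"Set {name} first, e.g. export {name}=sk-...")
    return value
```

Calling `require_env("OPENAI_API_KEY")` once at startup turns a cryptic mid-pipeline auth error into an immediate, readable failure.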
Step-by-step build
Step 1 — Install the package and system dependency
# Install RAG-Anything with all optional processors
pip install "raganything[all]"
# macOS
brew install poppler
# Debian/Ubuntu
sudo apt install poppler-utils
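Since a missing poppler binary only surfaces at index time (see the gotchas section), you may want to check for it up front. A minimal sketch using only the standard library:

```python
import shutil

def poppler_available() -> bool:
    """True when poppler's pdftoppm binary is on PATH (what pdf2image shells out to)."""
    return shutil.which("pdftoppm") is not None
```

Run this once before indexing and print a pointer to the install commands above if it returns False.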
Step 2 — Create rag_pipeline.py and configure the pipeline
import asyncio
import os
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc
OPENAI_KEY = os.environ["OPENAI_API_KEY"]
config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="mineru",       # alternatives: docling, paddleocr
    parse_method="auto",   # auto | ocr | txt
    enable_image_processing=True,
    enable_table_processing=True,
    enable_equation_processing=True,
)
Step 3 — Wire the LLM, vision, and embedding callbacks
RAG-Anything separates text inference (cheap, gpt-4o-mini) from vision inference (needs gpt-4o for chart accuracy). Passing the wrong model to vision_func is the most common setup mistake — more on this in the gotchas section.
def llm_func(prompt, system_prompt=None, history_messages=[], **kwargs):
    return openai_complete_if_cache(
        "gpt-4o-mini", prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key=OPENAI_KEY, **kwargs,
    )
def vision_func(prompt, system_prompt=None, history_messages=[],
                image_data=None, messages=None, **kwargs):
    # A pre-built multimodal message list takes priority.
    if messages:
        return openai_complete_if_cache(
            "gpt-4o", "", messages=messages, api_key=OPENAI_KEY, **kwargs
        )
    if image_data:
        return openai_complete_if_cache(
            "gpt-4o", "",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }},
                ]},
            ],
            api_key=OPENAI_KEY, **kwargs,
        )
    # Text-only fallback so the callback never silently returns None.
    return llm_func(prompt, system_prompt=system_prompt,
                    history_messages=history_messages, **kwargs)
embedding_func = EmbeddingFunc(
    embedding_dim=1536,
    max_token_size=8192,
    # text-embedding-3-small produces 1536-dim vectors, matching embedding_dim above
    func=lambda texts: openai_embed(
        texts, model="text-embedding-3-small", api_key=OPENAI_KEY
    ),
)
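Note that `vision_func` expects `image_data` as a raw base64 string — it prepends the `data:image/jpeg;base64,` prefix itself. If you ever call the vision path manually, a small encoding helper (an assumption, not part of the library) keeps that contract straight:

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its raw base64 string, without a data-URI prefix."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```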
Step 4 — Instantiate, process a document, and run a text query
async def main():
    rag = RAGAnything(
        config=config,
        llm_model_func=llm_func,
        vision_model_func=vision_func,
        embedding_func=embedding_func,
    )
    await rag.process_document_complete(
        file_path="./annual_report.pdf",
        output_dir="./output",
        parse_method="auto",
    )
    answer = await rag.aquery(
        "Summarize the key findings shown in the tables and describe the diagrams.",
        mode="hybrid",
    )
    print(answer)

asyncio.run(main())
Step 5 — Query with inline multimodal context
When you already have a specific equation, table cell, or image you want to ask about, pass it directly as multimodal_content — the retriever weights results by modality relevance to that artifact.
mm_answer = await rag.aquery_with_multimodal(
    "What does this relevance formula represent in the document?",
    multimodal_content=[{
        "type": "equation",
        "latex": "P(d|q) = \\frac{P(q|d)\\cdot P(d)}{P(q)}",
        "equation_caption": "Document relevance probability",
    }],
    mode="hybrid",
)
print(mm_answer)
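Equations are not the only inline modality. A table item can be passed the same way — the shape below mirrors the equation item above, but the exact field names are an assumption; check the RAG-Anything docs for your installed version:

```python
# Assumed shape for an inline table item (field names unverified).
table_item = {
    "type": "table",
    "table_data": "quarter,revenue\nQ1,1.2M\nQ2,1.5M",  # hypothetical data
    "table_caption": "Quarterly revenue",
}
```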
Step 6 — Index an entire docs folder in one call
await rag.process_folder_complete(
    folder_path="./docs",
    output_dir="./output",
)
result = await rag.aquery(
    "Which quarterly report shows the highest revenue growth rate?",
    mode="global",
)
print(result)
Test it works
After running Step 4, verify the knowledge graph file was created and run a quick sync query:
ls ./rag_storage/graph_chunk_entity_relation.graphml
# should be non-empty — a typical 10-page PDF produces 50–200 entity nodes
result = rag.query("List the main section headings.", mode="naive")
print(result)
# Expected: numbered list of headings from the document
# Empty string means the graph file is missing — re-run process_document_complete
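If you want the node count programmatically rather than eyeballing file size, GraphML is plain XML, so the standard library is enough. A minimal sketch:

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

def count_graph_nodes(path: str) -> int:
    """Count <node> elements in a GraphML file using only the standard library."""
    root = ET.parse(path).getroot()
    return len(root.findall(".//g:node", GRAPHML_NS))
```

A result of zero (or a parse error) is the same signal as an empty query answer: the graph was never built, so re-run `process_document_complete`.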
Common gotchas
1. MinerU requires poppler — no fallback. MinerU renders PDF pages to images before extracting layout. Without poppler binaries in PATH you get pdf2image.exceptions.PDFInfoNotInstalledError at index time. Fix it at the OS level (brew install poppler / apt install poppler-utils). If you cannot install system packages, set parser="docling" in RAGAnythingConfig — pure Python, no system deps, but it misses some complex figure captions.
2. gpt-4o-mini in the vision callback silently degrades accuracy by 30–40%. gpt-4o-mini accepts multimodal message payloads and returns a response without error, producing hallucinated chart descriptions with no warning. Reserve gpt-4o-mini exclusively for llm_func (text-only graph queries) and keep vision_func on gpt-4o. The extra cost is small because only image and equation nodes ever hit that path.
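Because the failure is silent, a cheap insurance policy is to validate the model name before it ever reaches the vision path. This guard is a hypothetical helper, not part of RAG-Anything:

```python
VISION_MODEL = "gpt-4o"  # keep vision on a full multimodal model

def checked_vision_model(model: str) -> str:
    """Refuse 'mini' models on the vision path rather than accept hallucinated
    chart descriptions without warning."""
    if "mini" in model:
        raise ValueError(f"{model} degrades vision accuracy; use {VISION_MODEL}")
    return model
```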
3. Re-processing the same file doubles your LLM spend. RAG-Anything persists its knowledge graph across runs in working_dir but does not track which files have been indexed. Calling process_document_complete() on a previously ingested PDF re-runs the full parsing pipeline and bills you again. Track processed files by SHA-256 in a local SQLite table and skip before calling the ingestion method.
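A minimal sketch of that dedup layer, using only the standard library (table name and schema are my own choices, not anything RAG-Anything provides):

```python
import hashlib
import sqlite3

def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 without loading it fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def seen_before(db: sqlite3.Connection, digest: str) -> bool:
    """Record the digest; return True if this exact file content was already indexed."""
    db.execute("CREATE TABLE IF NOT EXISTS indexed_files (sha256 TEXT PRIMARY KEY)")
    hit = db.execute(
        "SELECT 1 FROM indexed_files WHERE sha256 = ?", (digest,)
    ).fetchone()
    if hit is None:
        db.execute("INSERT INTO indexed_files (sha256) VALUES (?)", (digest,))
        db.commit()
    return hit is not None
```

Check `seen_before(db, file_sha256(path))` before each `process_document_complete()` call and skip when it returns True; hashing content rather than paths also catches the same PDF ingested under two names.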
Ship it this week
RAG-Anything delivers production-grade multi-modal retrieval in under 50 lines of Python — no custom parsers, no separate image pipeline, no vector-store configuration beyond the default LightRAG storage. Pair it with AI-generated images (see our tutorial on gpt-image-2 API for 2K AI images) to make visual output fully searchable. Already running autonomous agents? Wire a FastAPI endpoint around rag.query() and call it as a retrieval tool from your OpenAI Agents SDK workflow.