Here’s a truth most AI engineers learn the hard way: your RAG system is only as good as the data you feed it.
You can have the perfect embedding model, the most sophisticated retrieval strategy, and a state-of-the-art LLM, but if your document extraction is garbage, your answers will be garbage too. Worse, they’ll be confidently garbage, wrapped in hallucinations that sound plausible.
This becomes painfully obvious when dealing with financial documents: investment reports, 10-Ks, earnings statements, portfolio summaries. These aren’t clean blog posts. They’re PDFs with:
- Complex multi-column layouts
- Financial tables with nested headers
- Footnotes and disclaimers scattered across pages
- Charts, graphs, and embedded images
- Scanned documents with varying quality
In this guide, we’ll build an intelligent financial document chatbot using agentic RAG with LangGraph, but with a critical addition: production-grade document extraction. We’ll show you exactly how extraction quality impacts retrieval accuracy and how to prevent hallucinations at the source.
A Brief Note on RAG Extraction Quality
Most RAG tutorials skip document extraction entirely. They assume you already have clean text and structured data. But in production systems, extraction is where 80% of quality issues originate.

In reality, when extraction goes wrong, everything downstream quietly breaks: messy text loses structure and meaning, chunking embeds corrupted facts, retrieval confidently surfaces the wrong information, and the LLM faithfully repeats it. This is the classic garbage in → garbage out problem but with a dangerous twist: each stage adds confidence, not accuracy, turning a small extraction error into a polished, believable hallucination the user trusts.
Why Financial Documents Are Especially Brutal
Financial PDFs break standard extraction tools because:
- Precision Matters Absolutely: 1.5M vs 15M isn’t “close enough”
- Table Structure is Critical: Row-column relationships encode meaning
- Context Spans Pages: Footnotes on page 47 explain tables on page 12
- Layout Complexity: Multi-column, side-by-side tables, embedded charts
- Scanned Documents: Many older financial reports are scanned PDFs with quality issues
The solution isn’t better embeddings or smarter retrieval; it’s better extraction at the source.
Architecture: Agentic RAG for Financial Documents
Our system combines four key components: Kudra for extraction, ChromaDB for vector storage, LangGraph for agentic orchestration, and OpenAI GPT-4 for generation.

Key Components:
- Kudra Extraction Layer: Converts messy PDFs into clean, structured JSON with preserved table structure and numerical precision
- ChromaDB: Stores embeddings with rich metadata (document type, section, confidence scores) for filtered retrieval
- LangGraph Agent: Routes queries intelligently based on complexity—simple lookups go straight to RAG, complex analyses escalate to humans
- GPT-4: Generates responses grounded strictly in retrieved context, with citations back to source documents
Critical Innovation: By using Kudra’s layout-aware extraction, we maintain table structure, preserve numerical precision, and keep contextual relationships intact, before any embedding or retrieval happens. This prevents the corruption cascade we saw earlier.
Implementation: Building the System
Now let’s build it. We’ll go step-by-step, showing exactly how Kudra integrates into a production RAG pipeline. By the end, you’ll have a working financial document chatbot that extracts with precision, retrieves accurately, and minimizes hallucinations.
Before we start, make sure you have:
- Kudra Account: Sign up at kudra.ai (Free tier: 100 pages/month)
- OpenAI API Key: For embeddings and LLM generation
- Python 3.9+: With pip or conda for package management
Let’s begin building.
Step 1: Prepare Your Environment & Access
We’ll start by installing all the required packages. This includes the Kudra Cloud SDK for document extraction, LangChain and LangGraph for building our agentic RAG system, ChromaDB for vector storage, and OpenAI for embeddings and LLM access.
# Install all required packages
!pip install -q kudra-cloud langchain langgraph langchain-openai langchain-community \
chromadb openai python-dotenv pydantic requests
Step 2: Import Required Libraries
Now let’s import all the libraries we’ll need throughout this tutorial. We’re importing the Kudra SDK, LangChain components for RAG, LangGraph for agentic orchestration, and ChromaDB for vector storage.
import os
import json
import requests
from typing import List, Dict, Optional, TypedDict, Annotated
from pathlib import Path
from dotenv import load_dotenv
# Kudra SDK
from kudra_cloud import KudraCloudClient
# LangChain & LangGraph
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langgraph.graph import StateGraph, END
from pydantic import BaseModel, Field
Step 3: Set Up API Keys
You’ll need three credentials to run this notebook:
OpenAI API Key: Get this from platform.openai.com. We’ll use it for embeddings (text-embedding-3-small) and LLM generation (GPT-4).
Kudra API Token: Get this from your Kudra dashboard under Settings → API Keys. This authenticates your requests to Kudra’s extraction API.
Kudra Project Run ID: After you create a workflow in Kudra (we’ll cover this in Step 4), you’ll create a project and get a unique run ID. This links your API calls to your specific extraction workflow.
Let’s set these up securely using getpass (so your credentials aren’t exposed in the notebook).
from getpass import getpass
# Set up OpenAI API key
OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# Set up Kudra credentials
KUDRA_API_TOKEN = getpass("Enter your Kudra API Token: ")
KUDRA_PROJECT_RUN_ID = input("Enter your Kudra Project Run ID: ")
print("✅ API credentials configured")
Step 4: Create an Extraction Workflow
Before we can extract documents via API, we need to create a specialized extraction workflow in the Kudra dashboard. This is where we configure exactly how Kudra should process our financial documents.
Why a custom workflow? Generic OCR treats all PDFs the same. Financial documents need specialized handling: table structure preservation, numerical precision, contextual extraction. Kudra’s workflow builder lets you compose a pipeline tailored to your documents.
Here’s how to create your workflow in Kudra’s dashboard:
1. Create New Workflow
- Click “+ New Workflow” in your dashboard
- Name it: “Financial Document Extraction – RAG Pipeline”

2. Add Components (drag and drop in order):
a. OCR Component
- Enable multi-language support (for international financial docs)
- Enable image enhancement (critical for scanned documents)
- Enable layout detection (preserves reading order in complex layouts)
b. Table Extraction Component
- Set table detection to Automatic
- Enable “Preserve structure” (maintains row-column relationships)
- Output format: Structured JSON
c. VLM Component – Configure the extraction fields you want pulled from each document

The VLM (Vision Language Model) understands semantic meaning, so it can extract “Q3 Revenue” even if the document says “Third Quarter Revenue” or “3Q Rev.”
3. Configure Export Options
- Export format: JSON
- Include: Raw text, Tables, Entities, Validation results
4. Save Workflow
5. Create Project
- Click “Create Project”
- Link your workflow
- Copy the Project Run ID (looks like: “username/FinancialDocExtraction-code==”)
This Project Run ID is what connects your API calls to this specific workflow configuration.
Step 5: Extract Financial Documents with Kudra
Now comes the critical part: extracting structured data from financial PDFs using Kudra’s API. This is where we prevent the extraction quality problems that lead to hallucinations later.
We’ll initialize the Kudra client with our API token, then send documents through our custom workflow. Kudra will return structured JSON with preserved table layouts, precise numerical values, and confidence scores.
Let’s start by initializing the Kudra client:
# Initialize Kudra client with our API token
kudra_client = KudraCloudClient(api_token=KUDRA_API_TOKEN)
print("✅ Kudra client initialized")
print(f"📋 Project Run ID: {KUDRA_PROJECT_RUN_ID}")
Extract Documents Using Kudra API
Here’s where the magic happens. We’ll create a function that sends each PDF to Kudra’s extraction pipeline. The workflow we configured earlier will:
- OCR the document – Extract all text with layout awareness
- Detect and extract tables – Preserve structure and relationships
- Run VLM extraction – Identify key fields semantically
- Validate the output – Check data types and confidence
The function returns structured JSON that’s already RAG-ready, no more wrestling with corrupted tables or lost decimal points.
def extract_document_with_kudra(file_path: Path, project_run_id: str) -> Dict:
    """
    Extract structured data from a financial PDF using Kudra's API.

    This function uploads a document to Kudra and processes it through
    our custom extraction workflow (OCR → Tables → VLM → Validation).

    Args:
        file_path: Path to the PDF file
        project_run_id: Kudra project run ID (links to our workflow)

    Returns:
        Structured JSON with extracted text, tables, entities, and validation results
    """
    ...  # Full implementation is in the companion notebook (see below)
Want the Full Notebook?
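If you want to process a whole folder of reports, a minimal driver loop might look like the sketch below. The folder name and the extractions dictionary are assumptions for illustration; adjust them to your setup.
# Minimal sketch: run extraction over a folder of PDFs.
# "financial_docs" is an assumed folder name; adjust to your setup.
DOCS_DIR = Path("financial_docs")

extractions = {}
for pdf_path in sorted(DOCS_DIR.glob("*.pdf")):
    print(f"📄 Extracting {pdf_path.name}...")
    extractions[pdf_path.name] = extract_document_with_kudra(
        pdf_path, KUDRA_PROJECT_RUN_ID
    )

print(f"✅ Extracted {len(extractions)} documents")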
Output:
You should see something like:

Notice the key metrics: Kudra extracted every table element from every page and found 846 entities. The tables keep their structure (rows, columns, headers), and all numerical values retain their precision: no decimal points lost, no layout corruption.
This is the foundation of hallucination-free RAG. Clean data in = accurate answers out.
The Kudra API returns rich, structured JSON with multiple components: raw text, tables, entities, metadata, and validation results. Now we need to parse it into RAG-ready chunks.
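Below is a minimal sketch of that parsing step. The JSON field names used here ("text", "tables", "title", "confidence") are assumptions for illustration; map them to whatever your Kudra workflow actually exports.
# Minimal sketch: convert Kudra's extraction JSON into RAG-ready chunks.
# Field names ("text", "tables", "confidence") are assumed; adjust them to
# match the JSON your workflow exports.
def json_to_chunks(extraction: Dict, source_name: str) -> List[Document]:
    chunks = []
    # Each text section becomes its own chunk, tagged with section metadata
    for section in extraction.get("text", []):
        chunks.append(Document(
            page_content=section.get("content", ""),
            metadata={
                "source": source_name,
                "section": section.get("title", "body"),
                "confidence": section.get("confidence", 1.0),
            },
        ))
    # Tables are serialized as JSON strings so row/column structure survives
    for i, table in enumerate(extraction.get("tables", [])):
        chunks.append(Document(
            page_content=json.dumps(table, ensure_ascii=False),
            metadata={"source": source_name, "section": f"table_{i}", "confidence": 1.0},
        ))
    return chunks

# Build the chunk list used in the next step
all_chunks: List[Document] = []
for name, extraction in extractions.items():
    all_chunks.extend(json_to_chunks(extraction, source_name=name))
print(f"✅ Built {len(all_chunks)} RAG-ready chunks")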
Step 6: Set Up ChromaDB Vector Store
Now we’ll embed our high-quality chunks and store them in ChromaDB. We’re using OpenAI’s text-embedding-3-small model; it’s cost-effective and performs well for financial text.
Why ChromaDB? It’s open-source, easy to set up, supports metadata filtering (critical for our use case), and persists to disk so we don’t re-embed on every run.
The key here is that we’re embedding already-clean data. Because Kudra extracted accurately, our embeddings will be semantically meaningful. Garbage in, garbage out still applies; this time, we’re putting clean data in.
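The setup_vector_store helper isn’t shown in full in this post; a minimal sketch might look like the following (the persist directory and collection name are assumed values):
# Minimal sketch of setup_vector_store. The directory and collection name
# below are assumed values; change them to fit your project.
CHROMA_PERSIST_DIRECTORY = "./chroma_financial_db"
CHROMA_COLLECTION_NAME = "financial_documents"

def setup_vector_store(chunks: List[Document], persist_directory: str,
                       collection_name: str) -> Chroma:
    # Embed the already-clean chunks with OpenAI's small embedding model
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
        collection_name=collection_name,
    )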
if all_chunks:
    vector_store = setup_vector_store(
        all_chunks,
        CHROMA_PERSIST_DIRECTORY,
        CHROMA_COLLECTION_NAME
    )
    print("\n🎉 Vector store ready for retrieval!")
else:
    print("⚠️ No chunks available. Please ensure documents are extracted first.")
Our vector store is now ready. All 201 chunks (with preserved table structure and precise numbers) are embedded and stored with metadata. When we retrieve, we can filter by document type, section, or confidence score.
Step 7: Define Document Types and Query Categories
Before building our agent, we need to define the categories it will use for routing. Financial queries vary in complexity: “What was Q4 revenue?” is a simple lookup, while “Analyze YoY growth trends across all quarters” requires multi-step reasoning.
Our agent will classify queries by:
- Document type – Which kind of financial document is relevant?
- Complexity – How many steps of reasoning are needed?
This classification determines retrieval strategy (how many chunks to fetch) and routing logic (direct RAG vs. human escalation).
# Define financial document categories for classification
DOCUMENT_TYPES = [
    "10-K Annual Report",
    "10-Q Quarterly Report",
    "Investment Portfolio Summary",
    "Financial Statement",
    "Earnings Report",
]
UNKNOWN_TYPE = "Unknown/General Financial Document"

# Query complexity levels determine retrieval strategy
QUERY_COMPLEXITY = {
    "simple": "Single fact lookup (e.g., 'What was Q3 revenue?') - Retrieve 3 chunks",
    "moderate": "Comparison or calculation (e.g., 'How did revenue grow YoY?') - Retrieve 5 chunks",
    "complex": "Multi-step reasoning (e.g., 'Analyze profitability trends') - Retrieve 8 chunks or escalate",
}

print("✅ Document types and query categories defined")
print("\n📋 Complexity levels:")
for complexity, description in QUERY_COMPLEXITY.items():
    print(f"  - {complexity}: {description}")
Step 8: Build the LangGraph Agentic Layer
Now for the intelligent part: an agent that routes queries based on complexity and retrieves with precision. This is where LangGraph shines; it lets us build a stateful, multi-step reasoning system with conditional routing.
Our agent workflow:
- Classify the query → Determine complexity and document type
- Route the query → Simple queries go to RAG, complex ones escalate to humans
- Retrieve context → Fetch relevant chunks with metadata filtering
- Generate response → LLM answers based strictly on retrieved context
Let’s define the state and models first:

# Define Agent State - tracks information as it flows through the graph
class AgentState(TypedDict):
    query: str              # User's question
    complexity: str         # simple, moderate, or complex
    document_type: str      # Which type of financial document
    context: str            # Retrieved context from vector store
    response: str           # Final response to user
    confidence: float       # Confidence score from Kudra extraction
    error: Optional[str]    # Error tracking

# Pydantic model for structured LLM output during classification
class QueryClassification(BaseModel):
    """Structured output for query classification."""
    complexity: str = Field(
        description=f"Query complexity level: {', '.join(QUERY_COMPLEXITY.keys())}"
    )
    document_type: str = Field(
        description=f"Most relevant document type from: {', '.join(DOCUMENT_TYPES + [UNKNOWN_TYPE])}"
    )
    reasoning: str = Field(
        description="Brief explanation of the classification decision"
    )
print("✅ Agent state and classification models defined")
Define Agent Nodes
Now we’ll define the node functions that make up our agent. Each node performs one step of the workflow:
1. classify_query_node – Uses an LLM to classify the query for complexity and document type. This determines how many chunks to retrieve and whether to route to RAG or escalate to a human.
2. retrieve_context_node – Retrieves relevant chunks from ChromaDB with metadata filtering. Adjusts the number of chunks based on query complexity (3 for simple, 5 for moderate, 8 for complex).
3. generate_response_node – Uses GPT-4 to generate a response strictly grounded in the retrieved context. Includes source citations and confidence warnings.
4. human_escalation_node – Handles queries that are too complex or have low extraction confidence. Routes to human review instead of risking hallucinations.
The full node implementations are too long to include here, so join our Slack for access to the complete notebook (minimal sketches of the first two nodes follow below):
Want the Full Notebook?
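To give a sense of their shape, here are minimal sketches of the first two nodes. They assume the vector_store built in Step 6; the notebook versions add error handling, confidence checks, and source citations.
# Minimal sketches of the first two agent nodes (assumed shapes; the full
# versions with error handling and citations are in the notebook).
llm = ChatOpenAI(model="gpt-4", temperature=0)

def classify_query_node(state: AgentState) -> AgentState:
    # Ask the LLM for a structured QueryClassification
    classifier = llm.with_structured_output(QueryClassification)
    result = classifier.invoke([
        SystemMessage(content="Classify this financial question by complexity and document type."),
        HumanMessage(content=state["query"]),
    ])
    state["complexity"] = result.complexity
    state["document_type"] = result.document_type
    return state

def retrieve_context_node(state: AgentState) -> AgentState:
    # More complex queries pull more chunks (3 / 5 / 8)
    k = {"simple": 3, "moderate": 5, "complex": 8}.get(state["complexity"], 3)
    docs = vector_store.similarity_search(state["query"], k=k)
    state["context"] = "\n\n".join(doc.page_content for doc in docs)
    return state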
Define Routing Logic
Now we need a routing function that decides the next step based on query classification and extraction confidence. This is the “intelligence” of our agentic system.
Routing rules:
- Complex queries → Human escalation (too many reasoning steps)
- Low confidence (< 0.5) → Human escalation (data quality concerns)
- Simple/Moderate queries with good confidence → RAG retrieval
Better to escalate than to hallucinate.
def route_query(state: AgentState) -> str:
    """
    Routing function: Determines next step based on classification.

    Decision logic:
    - Complex queries OR low confidence → Escalate to human
    - Simple/Moderate queries with good confidence → Proceed with RAG

    Returns:
        Name of the next node to execute
    """
    print("\n🔀 Routing Decision")
    complexity = state.get("complexity", "simple")
    confidence = state.get("confidence", 1.0)

    # Route complex queries or low-confidence data to human review
    if complexity == "complex" or confidence < 0.5:
        print("  → Routing to HUMAN ESCALATION")
        print(f"    Reason: complexity={complexity}, confidence={confidence:.2f}")
        return "human_escalation"
    else:
        print("  → Routing to RETRIEVAL")
        print(f"    Proceeding with RAG (complexity={complexity}, confidence={confidence:.2f})")
        return "retrieve_context"
Step 9: Test the Agent
Time to see our agent in action! We’ll test it with financial queries of varying complexity to verify:
- Classification accuracy – Does it correctly identify query complexity?
- Routing logic – Does it route simple queries to RAG and complex ones to humans?
- Retrieval quality – Are the retrieved chunks relevant?
- Response accuracy – Are the answers grounded in the context?
- Hallucination prevention – Does it cite sources and avoid making up facts?
We’ll run four test queries covering different complexity levels and document types.
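A minimal test loop over the compiled graph might look like the sketch below; the sample queries are illustrative, so swap in questions that match your own documents.
# Minimal sketch: run sample queries through the compiled agent.
test_queries = [
    "What was Q3 revenue?",                              # simple lookup
    "How did operating expenses change YoY?",            # moderate comparison
    "What were the top 3 investments by value?",         # table query
    "Analyze profitability trends across all quarters",  # complex -> escalation
]

for query in test_queries:
    result = agent.invoke({"query": query, "confidence": 1.0})
    print(f"\n❓ {query}")
    print(f"💬 {result.get('response', '(escalated to human review)')}")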

Measuring Impact: Extraction Quality vs. Retrieval Accuracy
Let’s quantify what we’ve built. We ran an experiment comparing our Kudra-based system against a baseline using generic PyPDF + pytesseract extraction. Both systems used an identical RAG architecture (same embeddings, same LLM, same prompts); the only difference was extraction quality.
Experiment Setup:
Documents: 15 financial PDFs (investment reports, 10-Ks, earnings statements) totaling 487 pages
Test queries: 50 financial questions covering:
- Simple fact lookups (“What was Q3 revenue?”)
- Numerical comparisons (“Did expenses increase YoY?”)
- Table queries (“What were the top 3 investments by value?”)
Evaluation:
- Ground truth: Manual extraction by financial analysts
- Metrics: Extraction accuracy, retrieval precision/recall, answer accuracy, hallucination rate
Results
| Metric | Generic OCR (PyPDF + tesseract) | Kudra Extraction | Improvement |
|---|---|---|---|
| Extraction Accuracy | 72% | 96% | +33% |
| Table Structure Preserved | 15% | 94% | +527% |
| Numerical Precision | 68% | 98% | +44% |
| Retrieval Precision @5 | 0.42 | 0.89 | +112% |
| Retrieval Recall @5 | 0.38 | 0.86 | +126% |
| Answer Accuracy | 61% | 93% | +52% |
| Hallucination Rate | 28% | 4% | -86% |
| User Trust Score (1-5) | 2.1 | 4.6 | +119% |
Conclusion
No amount of prompt engineering, embedding fine-tuning, or retrieval optimization can fix data that was corrupted at the source. Financial documents, with their tables, nested structures, and precision requirements, make this painfully obvious.
If you’re building RAG systems for complex documents (financial reports, legal contracts, medical records, engineering specs), start with extraction quality.
Your embeddings don’t matter if your source data is wrong.
Your retrieval strategy doesn’t matter if your chunks are corrupted.
Your LLM doesn’t matter if the context you give it is garbage.
Get extraction right. Everything else follows.
