RAG in Practice: Implementing & Tuning a Production Pipeline



Continuing from Part 1: RAG 101: Designing Retrieval-Augmented Generation Pipelines. Now we're going from notebook to production.
1. Embedding at Scale
# nightly batch
python embed_corpus.py --input s3://docs --output s3://vectors
# webhook for fresh docs
python embed_doc.py new_doc.pdf
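What `embed_corpus.py` might look like at its core, as a minimal sketch: it assumes sentence-transformers, a local JSONL input, and a `.npy` output (the S3 paths above are placeholders for your object store).

```python
# embed_corpus.py -- nightly batch sketch (assumes sentence-transformers;
# local JSONL in / .npy out stand in for the S3 paths above)
import json
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_corpus(input_path: str, output_path: str, batch_size: int = 64) -> None:
    model = SentenceTransformer("intfloat/e5-small-v2")  # small, cheap embedder
    with open(input_path) as f:
        chunks = [json.loads(line)["text"] for line in f]
    # Batch-encode; normalizing lets us use dot product as cosine similarity
    vectors = model.encode(chunks, batch_size=batch_size, normalize_embeddings=True)
    np.save(output_path, vectors)

if __name__ == "__main__":
    embed_corpus("chunks.jsonl", "vectors.npy")
```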
Storage Options

| Option | Pros | Cons |
|---|---|---|
| FAISS (flat) | Blazing local speed | Single-box only |
| PGVector | SQL joins + ACID | Network latency |
| Qdrant | gRPC, HNSW, multi-node | New ops surface |
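For the single-box FAISS option, a flat inner-product index is only a few lines. A minimal sketch, assuming the normalized vectors produced by the batch job above:

```python
# FAISS flat index: exact search, very fast locally, single box only
import faiss
import numpy as np

vectors = np.load("vectors.npy").astype("float32")
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

query_vec = vectors[:1]                      # stand-in for an embedded user query
scores, ids = index.search(query_vec, 5)     # top-5 nearest chunk ids
print(ids[0], scores[0])
```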
2. Retrieval Tuning
Key Parameters

| Parameter | Typical setting | Impact |
|---|---|---|
| k | 3–8 | Higher recall but higher cost |
| Diversity | MMR, score-sum (sketch below) | Better topic spread |
| Chunk size | 256–512 tokens | Context fit vs. completeness |
Store `metadata={source_id, chunk_id}` with every chunk for traceability.
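The diversity lever (MMR) is easy to hand-roll if your vector store doesn't expose it. A sketch over normalized embeddings, where `lambda_mult` trades relevance against redundancy (both names are illustrative):

```python
# Maximal Marginal Relevance re-ranking: greedily pick chunks that are
# relevant to the query but dissimilar to chunks already selected
import numpy as np

def mmr(query_vec: np.ndarray, cand_vecs: np.ndarray, k: int = 5,
        lambda_mult: float = 0.7) -> list[int]:
    sim_to_query = cand_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(cand_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((float(cand_vecs[i] @ cand_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * float(sim_to_query[i]) - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```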
3. Prompt Patterns
- Stuff – Dump all retrieved chunks
- Refine – Iterative summarization
- Map-Reduce – Parallel processing + merge
"You are an expert assistant.
Using ONLY the CONTEXT below,
answer: {{question}}"
4. FastAPI Reference
from fastapi import FastAPI

app = FastAPI()

@app.post("/ask")
async def ask(payload: dict):
    """Handle RAG queries."""
    # `rag` is your pipeline object, built once at startup
    return await rag.run(payload["query"])
Deploy behind Ray Serve or SageMaker; autoscale on P95 latency.
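One way the Ray Serve option can look, as a sketch; note Ray Serve autoscales on concurrent requests per replica rather than on P95 directly, so you tune the replica bounds against your latency target (the bounds below are illustrative):

```python
# Ray Serve wrapper around the FastAPI app above
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 8})
@serve.ingress(app)
class RAGService:
    @app.post("/ask")
    async def ask(self, payload: dict):
        """Handle RAG queries."""
        return await rag.run(payload["query"])  # `rag` as in the reference above

serve.run(RAGService.bind())
```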
5. Evaluation Suite
Key Metrics

| Metric | Method |
|---|---|
| Exact Match / ROUGE-L | Synthetic Q&A pairs |
| Faithfulness | LLM-as-judge, 100-sample audit |
| Retrieval Recall | (gt ∩ retrieved) / gt |
Track in Weights & Biases, Evidently-AI, or Prometheus + Grafana panels.
6. Cost & Latency Levers
- Use smaller embedding models (e.g. e5-small-v2)
- Put a Redis cache ahead of the pipeline (aim for a > 70% hit rate; see the sketch below)
- Batch LLM calls (≤ 8 queries per batch)
- Two-tier RAG: cheap retriever → expensive reranker → LLM
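The cache lever in sketch form, assuming redis-py, a local Redis, and a hypothetical `run_rag_pipeline()` call; the key is a hash of the normalized query so repeated questions hit the cache:

```python
# Look-aside Redis cache in front of the RAG pipeline
import hashlib
import redis

r = redis.Redis()  # local Redis; point at your instance in prod

def cached_answer(query: str, ttl_s: int = 3600) -> str:
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:                    # cache hit: skip retrieval + LLM entirely
        return hit.decode()
    answer = run_rag_pipeline(query)       # hypothetical end-to-end pipeline call
    r.set(key, answer, ex=ttl_s)           # expire so answers can refresh
    return answer
```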
7. Pre-Prod Checklist
- [ ] SLA doc signed
- [ ] CI embedding + lint
- [ ] Canary rollout / auto-rollback
- [ ] P0 alert on empty retrieval
- [ ] Data-retention & PII audit
8. Common Pitfalls
- Stale embeddings → schedule incremental embedding jobs
- Chunk bleed → use overlap tokens or heading-aware chunks (sketch below)
- Prompt overflow → drop low-score chunks or switch to map-reduce
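For the chunk-bleed pitfall, overlap is a one-function fix. A sketch where consecutive chunks share `overlap` tokens so a sentence split at a boundary is still fully present in at least one chunk (sizes are illustrative):

```python
# Sliding-window chunking with overlap to reduce chunk bleed
def chunk_tokens(tokens: list[str], size: int = 384, overlap: int = 64) -> list[list[str]]:
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):   # last window already covers the tail
            break
    return chunks
```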
RAG turns static knowledge bases into living, conversational systems—without costly fine-tuning cycles. Nail retrieval, observe everything, and keep prompts lean.
💬 Connect on LinkedIn or get future walkthroughs via Substack.