RAG in Practice: Implementing & Tuning a Production Pipeline

Ram Sathyavageeswaran

Continuing from Part 1: RAG 101: Designing Retrieval-Augmented Generation Pipelines.
Now we're going from notebook → prod.


1. Embedding at Scale

# nightly batch
python embed_corpus.py --input s3://docs --output s3://vectors

# webhook for fresh docs
python embed_doc.py new_doc.pdf
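
A minimal sketch of what the batch job might do, assuming sentence-transformers and the small E5 model mentioned later in this post (the script names above are placeholders; the S3 read/write plumbing is omitted):

import numpy as np
from sentence_transformers import SentenceTransformer

def embed_corpus(texts: list[str], batch_size: int = 64) -> np.ndarray:
    """Embed a list of document chunks with a small, cheap model."""
    model = SentenceTransformer("intfloat/e5-small-v2")
    # E5 models expect a "passage: " prefix for documents ("query: " for queries).
    return model.encode(
        [f"passage: {t}" for t in texts],
        batch_size=batch_size,
        normalize_embeddings=True,  # unit vectors, so inner product == cosine similarity
    )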

Storage Options

  • FAISS flat
    Pros: Blazing local speed (see the sketch after this list)
    Cons: Single-box only

  • PGVector
    Pros: SQL joins + ACID
    Cons: Network latency

  • Qdrant
    Pros: gRPC, HNSW, multi-node
    Cons: New ops surface
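
For the FAISS flat option, a minimal local index sketch (it assumes the normalized vectors from the batch job above, so inner product equals cosine similarity):

import faiss
import numpy as np

def build_flat_index(vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Exact (brute-force) inner-product index: blazing locally, but it lives on one box."""
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors.astype(np.float32))
    return index

# scores, ids = index.search(query_vecs, k=5)  # query_vecs: float32 array, shape (n_queries, dim)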


2. Retrieval Tuning

Key Parameters

  • k: 3–8
    Impact: Higher recall but higher cost

  • Diversity: MMR, score-sum
    Impact: Better topic spread (MMR sketch below)

  • Chunk Size: 256–512 tokens
    Impact: Context fit vs. completeness

Store metadata={source_id, chunk_id} for traceability.
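
A minimal MMR re-ranking sketch over unit-normalized embeddings (the lambda weight is an assumption, not a tuned value):

import numpy as np

def mmr(query_vec: np.ndarray, cand_vecs: np.ndarray, k: int = 5, lam: float = 0.7) -> list[int]:
    """Maximal Marginal Relevance: trade off similarity to the query against
    redundancy with chunks already selected. Returns indices into cand_vecs."""
    selected: list[int] = []
    remaining = list(range(len(cand_vecs)))
    sim_to_query = cand_vecs @ query_vec
    while remaining and len(selected) < k:
        if selected:
            redundancy = np.max(cand_vecs[remaining] @ cand_vecs[selected].T, axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * sim_to_query[remaining] - (1 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected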


3. Prompt Patterns

  • Stuff – Dump all retrieved chunks
  • Refine – Iterative summarization
  • Map-Reduce – Parallel processing + merge
"You are an expert assistant.
Using ONLY the CONTEXT below,
answer: {{question}}"
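
A minimal sketch of how the stuff pattern might assemble that prompt while dropping low-score chunks to stay within a budget (the chunk dict shape and the character budget are assumptions):

def build_stuff_prompt(question: str, chunks: list[dict], max_chars: int = 6000) -> str:
    """Concatenate retrieved chunks, highest score first, until the budget is spent."""
    context_parts, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + len(chunk["text"]) > max_chars:
            break  # drop low-score chunks rather than overflow the prompt
        context_parts.append(f"[{chunk['source_id']}#{chunk['chunk_id']}] {chunk['text']}")
        used += len(chunk["text"])
    context = "\n\n".join(context_parts)
    return (
        "You are an expert assistant.\n"
        f"Using ONLY the CONTEXT below,\nanswer: {question}\n\nCONTEXT:\n{context}"
    )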

4. FastAPI Reference

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    query: str

@app.post("/ask")
async def ask(payload: AskRequest):
    """Handle RAG queries; `rag` is the pipeline object initialized at startup."""
    return await rag.run(payload.query)

Deploy behind Ray Serve or SageMaker; autoscale on P95 latency.


5. Evaluation Suite

Key Metrics

  • Exact Match / ROUGE-L
    Method: Synthetic Q&A pairs

  • Faithfulness
    Method: LLM-as-judge, 100-sample audit

  • Retrieval Recall
    Method: |gt ∩ retrieved| / |gt| (sketch below)

Track in Weights & Biases, Evidently-AI, or Prometheus + Grafana panels.
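
As a quick illustration of the recall formula above, a per-query sketch (the IDs are the source_id/chunk_id pairs stored with each chunk):

def retrieval_recall(ground_truth: set[str], retrieved: set[str]) -> float:
    """Fraction of ground-truth chunk IDs that made it into the retrieved set."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & retrieved) / len(ground_truth)

# retrieval_recall({"doc1#3", "doc2#1"}, {"doc1#3", "doc9#0"})  -> 0.5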


6. Cost & Latency Levers

  • Use smaller embed models (e5-small-v2)
  • Put a Redis cache in front of the pipeline (aim for a > 70% hit rate; sketch below)
  • Batch LLM calls (≤ 8 queries)
  • Two-tier RAG:
    Cheap retriever → Expensive reranker → LLM
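
A minimal look-aside cache sketch for the Redis lever, assuming redis-py's asyncio client and the same rag.run coroutine as in the FastAPI example (the key scheme and TTL are assumptions):

import hashlib
import redis.asyncio as redis

cache = redis.Redis(host="localhost", port=6379)

async def cached_answer(query: str, ttl: int = 3600) -> str:
    """Return a cached answer when present; otherwise run the pipeline and store the result."""
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = await cache.get(key)
    if hit is not None:
        return hit.decode()
    answer = await rag.run(query)          # cache miss: fall through to the full pipeline
    await cache.set(key, answer, ex=ttl)   # expire stale answers after `ttl` seconds
    return answer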

7. Pre-Prod Checklist

  • [ ] SLA doc signed
  • [ ] CI embedding + lint
  • [ ] Canary rollout / auto-rollback
  • [ ] P0 alert on empty retrieval
  • [ ] Data-retention & PII audit

8. Common Pitfalls

  1. Stale embeddings
    → Schedule incremental jobs

  2. Chunk bleed
    → Use overlap tokens or heading-aware chunks (see the sketch after this list)

  3. Prompt overflow
    → Drop low-score chunks or switch to map-reduce
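
For pitfall 2, a minimal token-overlap chunker sketch (the window size and overlap are assumptions; in practice you'd tokenize with the same tokenizer as the embed model and could also split on headings):

def chunk_with_overlap(tokens: list[str], size: int = 384, overlap: int = 64) -> list[list[str]]:
    """Slide a fixed-size window over the token stream so adjacent chunks share
    `overlap` tokens, which reduces bleed at chunk boundaries."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]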


RAG turns static knowledge bases into living, conversational systems—without costly fine-tuning cycles. Nail retrieval, observe everything, and keep prompts lean.

💬 Connect on LinkedIn or get future walkthroughs via Substack.