RAG in Practice: Implementing & Tuning a Production Pipeline



Continuing from Part 1: RAG 101: Designing Retrieval-Augmented Generation Pipelines. Now we're going from notebook to production.
1. Embedding at Scale
# nightly batch
python embed_corpus.py --input s3://docs --output s3://vectors
# webhook for fresh docs
python embed_doc.py new_doc.pdf
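What `embed_corpus.py` might look like at its core, as a minimal sketch: it assumes sentence-transformers, a local JSONL input, and a `.npy` output (the S3 paths above are placeholders for your object store).

```python
# embed_corpus.py -- nightly batch sketch (assumes sentence-transformers;
# local JSONL in / .npy out stand in for the S3 paths above)
import json
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_corpus(input_path: str, output_path: str, batch_size: int = 64) -> None:
    model = SentenceTransformer("intfloat/e5-small-v2")  # small, cheap embedder
    with open(input_path) as f:
        chunks = [json.loads(line)["text"] for line in f]
    # Batch-encode; normalizing lets us use dot product as cosine similarity
    vectors = model.encode(chunks, batch_size=batch_size, normalize_embeddings=True)
    np.save(output_path, vectors)

if __name__ == "__main__":
    embed_corpus("chunks.jsonl", "vectors.npy")
```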
Storage Options

| Option | Pros | Cons |
|---|---|---|
| FAISS (flat) | Blazing local speed | Single-box only |
| PGVector | SQL joins + ACID | Network latency |
| Qdrant | gRPC, HNSW, multi-node | New ops surface |
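For the single-box FAISS option, a flat inner-product index is only a few lines. A minimal sketch, assuming the normalized vectors produced by the batch job above:

```python
# FAISS flat index: exact search, very fast locally, single box only
import faiss
import numpy as np

vectors = np.load("vectors.npy").astype("float32")
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

query_vec = vectors[:1]                      # stand-in for an embedded user query
scores, ids = index.search(query_vec, 5)     # top-5 nearest chunk ids
print(ids[0], scores[0])
```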
2. Retrieval Tuning
Key Parameters

| Parameter | Typical setting | Impact |
|---|---|---|
| k | 3–8 | Higher recall but higher cost |
| Diversity | MMR, score-sum (sketch below) | Better topic spread |
| Chunk size | 256–512 tokens | Context fit vs. completeness |
Store `metadata={source_id, chunk_id}` with every chunk for traceability.
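The diversity lever (MMR) is easy to hand-roll if your vector store doesn't expose it. A sketch over normalized embeddings, where `lambda_mult` trades relevance against redundancy (both names are illustrative):

```python
# Maximal Marginal Relevance re-ranking: greedily pick chunks that are
# relevant to the query but dissimilar to chunks already selected
import numpy as np

def mmr(query_vec: np.ndarray, cand_vecs: np.ndarray, k: int = 5,
        lambda_mult: float = 0.7) -> list[int]:
    sim_to_query = cand_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(cand_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((float(cand_vecs[i] @ cand_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * float(sim_to_query[i]) - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```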
3. Prompt Patterns
- Stuff – Dump all retrieved chunks
- Refine – Iterative summarization
- Map-Reduce – Parallel processing + merge
"You are an expert assistant.
Using ONLY the CONTEXT below,
answer: {{question}}"
4. FastAPI Reference
from fastapi import FastAPI

app = FastAPI()

@app.post("/ask")
async def ask(payload: dict):
    """Handle RAG queries."""
    # `rag` is your pipeline object, built once at startup
    return await rag.run(payload["query"])
Deploy behind Ray Serve or SageMaker; autoscale on P95 latency.
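One way the Ray Serve option can look, as a sketch; note Ray Serve autoscales on concurrent requests per replica rather than on P95 directly, so you tune the replica bounds against your latency target (the bounds below are illustrative):

```python
# Ray Serve wrapper around the FastAPI app above
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 8})
@serve.ingress(app)
class RAGService:
    @app.post("/ask")
    async def ask(self, payload: dict):
        """Handle RAG queries."""
        return await rag.run(payload["query"])  # `rag` as in the reference above

serve.run(RAGService.bind())
```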
5. Evaluation Suite
Key Metrics

| Metric | Method |
|---|---|
| Exact Match / ROUGE-L | Synthetic Q&A pairs |
| Faithfulness | LLM-as-judge, 100-sample audit |
| Retrieval Recall | (gt ∩ retrieved) / gt |
Track in Weights & Biases, Evidently-AI, or Prometheus + Grafana panels.
6. Cost & Latency Levers
- Use smaller embedding models (e.g. e5-small-v2)
- Put a Redis cache ahead of the pipeline (aim for a > 70% hit rate; see the sketch below)
- Batch LLM calls (≤ 8 queries per batch)
- Two-tier RAG: cheap retriever → expensive reranker → LLM
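The cache lever in sketch form, assuming redis-py, a local Redis, and a hypothetical `run_rag_pipeline()` call; the key is a hash of the normalized query so repeated questions hit the cache:

```python
# Look-aside Redis cache in front of the RAG pipeline
import hashlib
import redis

r = redis.Redis()  # local Redis; point at your instance in prod

def cached_answer(query: str, ttl_s: int = 3600) -> str:
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:                    # cache hit: skip retrieval + LLM entirely
        return hit.decode()
    answer = run_rag_pipeline(query)       # hypothetical end-to-end pipeline call
    r.set(key, answer, ex=ttl_s)           # expire so answers can refresh
    return answer
```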
7. Pre-Prod Checklist
- [ ] SLA doc signed
- [ ] CI embedding + lint
- [ ] Canary rollout / auto-rollback
- [ ] P0 alert on empty retrieval
- [ ] Data-retention & PII audit
8. Common Pitfalls
- Stale embeddings → schedule incremental embedding jobs
- Chunk bleed → use overlap tokens or heading-aware chunks (sketch below)
- Prompt overflow → drop low-score chunks or switch to map-reduce
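For the chunk-bleed pitfall, overlap is a one-function fix. A sketch where consecutive chunks share `overlap` tokens so a sentence split at a boundary is still fully present in at least one chunk (sizes are illustrative):

```python
# Sliding-window chunking with overlap to reduce chunk bleed
def chunk_tokens(tokens: list[str], size: int = 384, overlap: int = 64) -> list[list[str]]:
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):   # last window already covers the tail
            break
    return chunks
```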
RAG turns static knowledge bases into living, conversational systems—without costly fine-tuning cycles. Nail retrieval, observe everything, and keep prompts lean.
💬 Connect on LinkedIn or get future walkthroughs via Substack.