RAG Tutorial | Build a Retrieval-Augmented Generation System with LLM APIs in 2026
Stop AI from Making Things Up -- RAG Is the Most Practical Solution Right Now
Have you ever experienced this:
You ask the AI "What's our company's vacation policy?" and it gives an answer that sounds perfectly reasonable -- but is completely wrong about your company's rules?
The AI isn't lying to you. It genuinely doesn't know your company's policy -- because your internal data wasn't in its training set.
RAG (Retrieval-Augmented Generation) is the technique that makes AI "look up the data first" before answering.
This tutorial takes you through building a RAG system from scratch. No AI PhD required -- basic Python skills are enough.
Want to build a RAG system? CloudInsight helps you choose the best LLM API, with enterprise plan discounts.

TL;DR
RAG system core workflow: Document chunking -> Embedding vectorization -> Store in vector database -> User asks question -> Vector search finds relevant data -> Inject data into prompt -> LLM generates answer. Recommended stack: OpenAI Embedding + Qdrant/Pinecone + Claude Sonnet/GPT-4o. You can build a prototype in one day.
RAG Architecture Design & Component Selection
Answer-First: A RAG system consists of three major components -- an Embedding model (text to vectors), a vector database (store and search vectors), and an LLM (generate answers). Each component has multiple options, and the right combination is key to maximizing effectiveness.
Complete RAG Architecture Diagram
Offline Pipeline (Build Knowledge Base)
══════════════════════════════════════
Documents/Data -> Chunking    -> Embedding   -> Vector Database
                  (Splitting)    (Vectorize)    (Storage)

Online Pipeline (Answer Questions)
══════════════════════════════════
User Question -> Embedding -> Vector Search -> Get Relevant Docs
                     |                                 |
               Query Vector                     Relevant Chunks
                                                       |
                                                Compose Prompt -> LLM -> Answer
Core Component Selection
| Component | Recommended | Alternatives |
|---|---|---|
| Embedding | OpenAI text-embedding-3-small | Cohere embed-v3, BGE-M3 |
| Vector DB | Qdrant (self-hosted) or Pinecone (managed) | Weaviate, Chroma, pgvector |
| LLM | Claude Sonnet or GPT-4o | GPT-4o-mini, Gemini Pro |
| Framework | LangChain or LlamaIndex | Custom-built (more flexible) |
Document Chunking Strategy
Chunking strategy is critical to RAG quality. Chunks that are too large dilute search precision, while chunks that are too small lose the context needed to answer the question.
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed size | 500 tokens | 50 tokens | General use |
| Paragraph split | By paragraph | 1-2 sentences | Structured docs |
| Semantic split | Dynamic | Automatic | Mixed content |
| Recursive split | 200-1000 tokens | 10-20% | Code files |
Our recommendation: Start with fixed size (500 tokens, 50 tokens overlap), then adjust based on retrieval quality.
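The fixed-size strategy can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes whitespace-separated words are an acceptable stand-in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into windows of at most chunk_size tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate model tokens
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk begins with the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.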
For core LLM technology principles and model differences, refer to What Is an LLM? Beginner's Guide.
Choosing the Best LLM API for RAG
Answer-First: In a RAG system, the LLM's job is to "generate answers based on retrieved data." What's needed is instruction-following ability and citation accuracy -- not necessarily the most powerful model.
LLM Performance in RAG Scenarios
| Capability | GPT-4o | Claude Sonnet | Gemini Pro | GPT-4o-mini |
|---|---|---|---|---|
| Instruction following | 9/10 | 10/10 | 8/10 | 8/10 |
| Citation accuracy | 8/10 | 9/10 | 8/10 | 7/10 |
| Refusal ability | 7/10 | 9/10 | 7/10 | 6/10 |
| Context window (tokens) | 128K | 200K | 1M | 128K |
| Cost efficiency | Medium | Medium | Good | Excellent |
Why does "refusal ability" matter?
In a RAG system, if retrieved data isn't sufficient to answer the user's question, the AI should say "I'm not sure" rather than fabricating an answer. Claude excels at this -- it honestly tells you "Based on the provided data, I cannot answer this question."
Cost Estimates
Assuming your RAG system receives 100 questions per day, with about 2,000 tokens of context per question:
| Model | Daily Cost | Monthly Cost |
|---|---|---|
| GPT-4o | ~$0.50 | ~$15 |
| Claude Sonnet | ~$0.60 | ~$18 |
| Gemini Pro | ~$0.25 | ~$7.50 |
| GPT-4o-mini | ~$0.03 | ~$1 |
Embedding costs are separate (typically very low, around $1-5/month).
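The arithmetic behind the table can be reproduced with a short sketch. The per-million-token prices below are assumptions back-calculated from the table's own daily figures (for example, $0.50 for 200,000 tokens implies $2.50 per million); they are not vendor quotes, and the calculation covers only the context tokens sent in, not the tokens the model generates.

```python
def estimate_cost(questions_per_day, tokens_per_question, usd_per_million_tokens, days=30):
    """Return (daily_cost, monthly_cost) in USD for the context tokens alone."""
    daily_tokens = questions_per_day * tokens_per_question
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost, daily_cost * days

# Assumed input prices (USD per million tokens), derived from the table above.
ASSUMED_PRICES = {
    "GPT-4o": 2.50,
    "Claude Sonnet": 3.00,
    "Gemini Pro": 1.25,
    "GPT-4o-mini": 0.15,
}

for model, price in ASSUMED_PRICES.items():
    daily, monthly = estimate_cost(100, 2000, price)
    print(f"{model}: ${daily:.2f}/day, ${monthly:.2f}/month")
```

Swap in your own traffic numbers and current list prices to get an estimate for your workload.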
For more cost analysis, refer to AI API Pricing Comparison.
CloudInsight offers LLM API enterprise purchasing with discounts and technical support.
Embedding & Vector Database Setup
Answer-First: Embedding is the process of converting text into numerical vectors. A vector database is a specialized database for storing and searching these vectors. Together, they enable your RAG system to quickly find the most relevant data.
How Embedding Works
In simple terms:
"The capital of Taiwan is Taipei"
-> Embedding model
-> [0.023, -0.156, 0.891, ...] (1536 numbers)
Semantically similar sentences produce vectors that are close together. So searching just requires comparing "distances" between vectors.
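To make "distance" concrete, here is a minimal sketch of cosine similarity, the measure that the `Distance.COSINE` setting in this tutorial's Qdrant collection refers to. The three-number vectors are made up for illustration; real embeddings from text-embedding-3-small have 1536 numbers.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example only.
taipei = [0.9, 0.1, 0.0]
capital = [0.8, 0.2, 0.1]
weather = [0.0, 0.2, 0.9]

print(cosine_similarity(taipei, capital))  # close to 1: similar direction
print(cosine_similarity(taipei, weather))  # close to 0: unrelated direction
```

A vector database performs this kind of comparison at scale, using indexes so it does not have to compare the query against every stored vector.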
Implementation: Building a RAG Pipeline
# Step 1: Install required packages
# pip install openai qdrant-client

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Step 2: Initialize
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(":memory:")  # In-memory mode for development

# Step 3: Create vector collection
# size=1536 matches the output dimension of text-embedding-3-small
qdrant.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Step 4: Embed documents and store
def embed_and_store(documents):
    """Embed each {"content": ..., "source": ...} dict and upsert it into Qdrant."""
    for i, doc in enumerate(documents):
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["content"],
        )
        qdrant.upsert(
            collection_name="knowledge_base",
            points=[PointStruct(
                id=i,  # list position as ID; re-running overwrites the same IDs
                vector=response.data[0].embedding,
                payload={"content": doc["content"], "source": doc["source"]},
            )],
        )

# Step 5: Search for relevant documents
def search(query, top_k=3):
    """Embed the query and return the top_k most similar stored points."""
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding
    # query_points replaces the deprecated search() in recent qdrant-client releases
    response = qdrant.query_points(
        collection_name="knowledge_base",
        query=query_vector,
        limit=top_k,
    )
    return response.points

# Step 6: RAG answer generation
def rag_answer(question):
    """Retrieve relevant chunks and ask the LLM to answer from them only."""
    results = search(question)
    context = "\n".join([r.payload["content"] for r in results])
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer the question based on the following data. If the data doesn't contain relevant information, say 'Based on available data, I cannot answer this question.'\n\nData:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
Vector Database Comparison
| Database | Deployment | Free Tier | Scale | Highlights |
|---|---|---|---|---|
| Pinecone | Managed | 1M vectors | Medium-large | Simplest |
| Qdrant | Self-hosted/Managed | Open source free | Any | Most features |
| Weaviate | Self-hosted/Managed | Open source free | Medium-large | GraphQL interface |
| Chroma | Self-hosted | Open source free | Small | Most lightweight |
| pgvector | Self-hosted (PostgreSQL) | Open source free | Small-medium | Integrates with existing DB |

RAG Performance Optimization Tips
Answer-First: RAG quality depends on three stages -- retrieval quality, context assembly, and generation quality. Most issues stem from insufficient retrieval quality.
Common Problems & Solutions
Problem 1: Irrelevant search results
Solutions:
- Try different chunk sizes
- Use Hybrid Search (vector search + keyword search)
- Add metadata filtering (e.g., document type, date)
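One common way to combine vector and keyword results for hybrid search is Reciprocal Rank Fusion (RRF). The sketch below is an assumption about how you might merge two result lists yourself; it is not the built-in hybrid mode of any particular vector database. The document IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked high in any list earn a larger share of the score.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search ranking
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only needs rank positions, so it works even when the two search methods produce scores on incompatible scales.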
Problem 2: Incomplete answers
Solutions:
- Increase top_k (number of search results)
- Use multi-step retrieval (broad search first, then refine)
- Increase chunk overlap
Problem 3: Hallucinations in answers
Solutions:
- Emphasize "answer only based on provided data" in the System Prompt
- Require the LLM to cite sources
- Use a model with stronger instruction following (such as Claude Sonnet)
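The first two solutions can be combined in the system prompt. The sketch below numbers each retrieved chunk so the model has something concrete to cite. It assumes chunks are dicts with the same `content` and `source` keys used in this tutorial's payloads; the function name and the sample policy text are hypothetical.

```python
def build_cited_prompt(chunks):
    """Build a system prompt that numbers each chunk so the model can cite it."""
    lines = [
        "Answer only from the numbered sources below.",
        "Cite the source number in brackets after each claim, e.g. [1].",
        "If the sources do not contain the answer, reply: "
        "'Based on available data, I cannot answer this question.'",
        "",
        "Sources:",
    ]
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] ({chunk['source']}) {chunk['content']}")
    return "\n".join(lines)

# Made-up chunks for illustration only.
prompt = build_cited_prompt([
    {"content": "Staff get 15 vacation days.", "source": "hr.md"},
    {"content": "Requests need manager approval.", "source": "policy.md"},
])
print(prompt)
```

Numbered citations also make answers auditable: a reviewer can check each bracketed number against the chunk it points to.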
Problem 4: Too slow
Solutions:
- Use a smaller embedding model (text-embedding-3-small)
- Reduce top_k
- Use streaming responses
- Cache common Q&A pairs
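Caching can be as simple as an in-process dictionary keyed by a normalized question. The sketch below is one possible approach, assuming exact-match caching after lowercasing and whitespace cleanup is good enough for your traffic; the class name is hypothetical, and `compute` stands for whatever function produces an answer (for example, this tutorial's `rag_answer`).

```python
from collections import OrderedDict

class AnswerCache:
    """Small least-recently-used cache for question -> answer pairs."""

    def __init__(self, max_entries=1000):
        self._entries = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def _key(question):
        # Normalize case and whitespace so trivial variations share one entry.
        return " ".join(question.lower().split())

    def get_or_compute(self, question, compute):
        key = self._key(question)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        answer = compute(question)
        self._entries[key] = answer
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return answer
```

Remember to clear or expire cached answers whenever the knowledge base is updated, otherwise users may receive stale responses.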
Advanced Optimization: Reranking
After retrieval, use a reranker model to re-sort results:
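# Using Cohere Rerank
import cohere

co = cohere.Client()  # assumes the API key is set in the environment (CO_API_KEY)
results = co.rerank(
    query="What is the return policy",
    documents=["search result 1", "search result 2", "search result 3"],
    model="rerank-v3.5",
)
# Each result points back to the original document by index, best match first
for item in results.results:
    print(item.index, item.relevance_score)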
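Reranking can noticeably improve retrieval accuracy (for example, from around 70% to 85% or more), though the actual gain depends on your data and on how good the initial retrieval is.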
For more technical tutorials, check out API Tutorial Beginner's Guide and AI Code Generation Guide.

FAQ: RAG Application Common Questions
How many document formats can RAG handle?
RAG itself isn't format-limited. As long as you can convert a document to text, it can go into a RAG system. Commonly supported: PDF, Word, Excel, HTML, Markdown, plain text. Images and tables require additional OCR or multimodal processing.
How big of a server does a RAG system need?
It depends on the architecture. If you use cloud services (OpenAI API + Pinecone), no dedicated server is needed. If you self-host everything (open-source LLM + Qdrant), we recommend at least 16GB RAM and a 4-core CPU for the retrieval side; a GPU is only needed for running the self-hosted LLM itself.
Can RAG be used for real-time chatbots?
Yes, but speed optimization is needed. We recommend streaming responses, caching common Q&A pairs, and choosing fast models (GPT-4o-mini or Gemini Flash). Typical latency: 2-5 seconds.
How often should the knowledge base be updated?
It depends on how frequently your data changes. We recommend setting up an automated pipeline: document update -> re-embedding -> write to vector database. This process can be automated with CI/CD tools.
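An automated update pipeline does not need to re-embed everything on every run. The sketch below shows one way to detect which documents changed, assuming you keep a content hash per source from the previous indexing run; the function names and the dict layout are hypothetical.

```python
import hashlib

def content_hash(text):
    """Return a stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_update(documents, stored_hashes):
    """Decide which sources to re-embed and which to delete.

    documents: dict mapping source name -> current text
    stored_hashes: dict mapping source name -> hash saved at the last indexing run
    """
    to_embed = [
        source for source, text in documents.items()
        if stored_hashes.get(source) != content_hash(text)  # new or changed
    ]
    to_delete = [source for source in stored_hashes if source not in documents]
    return to_embed, to_delete
```

Only the sources in `to_embed` need new embedding calls, which keeps both update time and embedding cost proportional to what actually changed.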
LangChain vs LlamaIndex -- which is better?
LangChain is more comprehensive, suited for complex AI applications. LlamaIndex focuses on RAG with simpler setup. If you're only doing RAG, go with LlamaIndex. If you also need Agent functionality and more, choose LangChain.
For complete LLM and RAG strategy guides, return to LLM & RAG Application Guide.
Conclusion: RAG Is the Most Pragmatic Enterprise AI Solution
No fine-tuning needed. No training your own model.
RAG lets you use off-the-shelf LLM APIs + your own data to build reliable AI applications.
Next steps:
- Prepare your knowledge base documents
- Choose your Embedding and LLM APIs
- Build a prototype using the code above
- Test, optimize, deploy
You can have a working prototype in one day. Don't over-engineer: build it first, then iterate.
Get a Quote for Enterprise Plans
CloudInsight offers LLM API enterprise purchasing for RAG systems:
- Unified purchasing of OpenAI Embedding + Claude/GPT generation APIs
- Enterprise-exclusive discounts to reduce RAG system operating costs
- Invoices included, technical support available