
RAG Tutorial | Build a Retrieval-Augmented Generation System with LLM APIs in 2026

10 min read
Tags: RAG, LLM, Embedding, Vector Database, Pinecone, Qdrant, LangChain, Retrieval-Augmented, AI Application Development, Enterprise AI


Stop AI from Making Things Up -- RAG Is the Most Practical Solution Right Now

Have you ever experienced this:

You ask the AI "What's our company's vacation policy?" and it gives an answer that sounds perfectly reasonable -- but is completely wrong about your company's rules?

The AI isn't lying to you. It genuinely doesn't know your company's policy -- because your internal data wasn't in its training set.

RAG (Retrieval-Augmented Generation) is the technique that makes AI "look up the data first" before answering.

This tutorial takes you through building a RAG system from scratch. No AI PhD required -- basic Python skills are enough.

Want to build a RAG system? CloudInsight helps you choose the best LLM API, with enterprise plan discounts.

[Image: Developer building a RAG pipeline in a Jupyter Notebook on a laptop]

TL;DR

RAG system core workflow: Document chunking -> Embedding vectorization -> Store in vector database -> User asks question -> Vector search finds relevant data -> Inject data into prompt -> LLM generates answer. Recommended stack: OpenAI Embedding + Qdrant/Pinecone + Claude Sonnet/GPT-4o. You can build a prototype in one day.


RAG Architecture Design & Component Selection

Answer-First: A RAG system consists of three major components -- an Embedding model (text to vectors), a vector database (store and search vectors), and an LLM (generate answers). Each component has multiple options, and the right combination is key to maximizing effectiveness.

Complete RAG Architecture Diagram

                    Offline Pipeline (Build Knowledge Base)
                    ══════════════════════════════════════
Documents/Data -> Chunking -> Embedding -> Vector Database
                 (Splitting) (Vectorize)   (Storage)

                    Online Pipeline (Answer Questions)
                    ══════════════════════════════════
User Question -> Embedding -> Vector Search -> Get Relevant Docs
                    |                              |
              Query Vector                   Relevant Chunks
                                                   |
                                     Compose Prompt -> LLM -> Answer

Core Component Selection

Component  | Recommended                                | Alternatives
Embedding  | OpenAI text-embedding-3-small              | Cohere embed-v3, BGE-M3
Vector DB  | Qdrant (self-hosted) or Pinecone (managed) | Weaviate, Chroma, pgvector
LLM        | Claude Sonnet or GPT-4o                    | GPT-4o-mini, Gemini Pro
Framework  | LangChain or LlamaIndex                    | Custom-built (more flexible)

Document Chunking Strategy

Chunking is critical to RAG quality: chunks that are too large reduce retrieval precision, while chunks that are too small lack the context needed to answer well.

Strategy        | Chunk Size       | Overlap       | Best For
Fixed size      | 500 tokens       | 50 tokens     | General use
Paragraph split | By paragraph     | 1-2 sentences | Structured docs
Semantic split  | Dynamic          | Automatic     | Mixed content
Recursive split | 200-1,000 tokens | 10-20%        | Code files

Our recommendation: Start with fixed size (500 tokens, 50 tokens overlap), then adjust based on retrieval quality.
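Here's a minimal sketch of that fixed-size strategy, using tiktoken (OpenAI's tokenizer library) to count tokens. The chunk_text helper and its parameters are illustrative, not part of any framework:

# pip install tiktoken
import tiktoken

def chunk_text(text, chunk_size=500, overlap=50):
    """Illustrative fixed-size chunker: splits text into token windows with overlap."""
    encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by OpenAI's embedding models
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(encoding.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - overlap  # step back by the overlap so neighboring chunks share context
    return chunks

# Example: a long policy document becomes a list of ~500-token strings
# chunks = chunk_text(open("hr_handbook.md").read())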

For core LLM technology principles and model differences, refer to What Is an LLM? Beginner's Guide.


Choosing the Best LLM API for RAG

Answer-First: In a RAG system, the LLM's job is to "generate answers based on retrieved data." What's needed is instruction-following ability and citation accuracy -- not necessarily the most powerful model.

LLM Performance in RAG Scenarios

Capability            | GPT-4o | Claude Sonnet | Gemini Pro | GPT-4o-mini
Instruction following | 9/10   | 10/10         | 8/10       | 8/10
Citation accuracy     | 8/10   | 9/10          | 8/10       | 7/10
Refusal ability       | 7/10   | 9/10          | 7/10       | 6/10
Long context          | 128K   | 200K          | 1M         | 128K
Cost efficiency       | Medium | Medium        | Good       | Excellent

Why does "refusal ability" matter?

In a RAG system, if retrieved data isn't sufficient to answer the user's question, the AI should say "I'm not sure" rather than fabricating an answer. Claude excels at this -- it honestly tells you "Based on the provided data, I cannot answer this question."

Cost Estimates

Assuming your RAG system receives 100 questions per day, with about 2,000 tokens of context per question:

Model         | Daily Cost | Monthly Cost
GPT-4o        | ~$0.50     | ~$15
Claude Sonnet | ~$0.60     | ~$18
Gemini Pro    | ~$0.25     | ~$7.50
GPT-4o-mini   | ~$0.03     | ~$1
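If you want to sanity-check these numbers yourself, a back-of-the-envelope calculation looks like this. The per-million-token prices below are assumptions (always check current pricing), and output tokens add a little on top:

# Rough input-token cost estimate; prices are assumptions, check current pricing
questions_per_day = 100
tokens_per_question = 2_000  # retrieved context + question

price_per_million_input = {
    "gpt-4o": 2.50,       # assumed USD per 1M input tokens
    "gpt-4o-mini": 0.15,  # assumed USD per 1M input tokens
}

daily_input_tokens = questions_per_day * tokens_per_question  # 200,000 tokens/day
for model, price in price_per_million_input.items():
    daily_cost = daily_input_tokens / 1_000_000 * price
    print(f"{model}: ~${daily_cost:.2f}/day, ~${daily_cost * 30:.2f}/month (input only)")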

Embedding costs are separate (typically very low, around $1-5/month).

For more cost analysis, refer to AI API Pricing Comparison.

CloudInsight offers LLM API enterprise purchasing with discounts and technical support. Get a quote for RAG system API purchasing ->


Embedding & Vector Database Setup

Answer-First: Embedding is the process of converting text into numerical vectors. A vector database is a specialized database for storing and searching these vectors. Together, they enable your RAG system to quickly find the most relevant data.

How Embedding Works

In simple terms:

"The capital of Taiwan is Taipei"
  -> Embedding model
    -> [0.023, -0.156, 0.891, ...] (1536 numbers)

Semantically similar sentences produce vectors that are close together. So searching just requires comparing "distances" between vectors.
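As a rough illustration (plain Python, no vector database yet), you can compute that similarity yourself with cosine similarity. The example sentences are made up:

from openai import OpenAI

client = OpenAI()

def embed(text):
    # Returns a 1536-dimensional vector for text-embedding-3-small
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

v1 = embed("The capital of Taiwan is Taipei")
v2 = embed("Which city is Taiwan's capital?")
v3 = embed("How do I bake sourdough bread?")
print(cosine_similarity(v1, v2))  # closer to 1.0: related meaning
print(cosine_similarity(v1, v3))  # noticeably lower: unrelated topic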

Implementation: Building a RAG Pipeline

# Step 1: Install required packages
# pip install openai qdrant-client langchain

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Step 2: Initialize
openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # In-memory mode for development

# Step 3: Create vector collection
qdrant.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Step 4: Embed documents and store
def embed_and_store(documents):
    for i, doc in enumerate(documents):
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["content"]
        )
        qdrant.upsert(
            collection_name="knowledge_base",
            points=[PointStruct(
                id=i,
                vector=response.data[0].embedding,
                payload={"content": doc["content"], "source": doc["source"]}
            )]
        )

# Step 5: Search for relevant documents
def search(query, top_k=3):
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    results = qdrant.search(
        collection_name="knowledge_base",
        query_vector=query_vector,
        limit=top_k
    )
    return results

# Step 6: RAG answer generation
def rag_answer(question):
    results = search(question)
    context = "\n".join([r.payload["content"] for r in results])

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer the question based on the following data. If the data doesn't contain relevant information, say 'Based on available data, I cannot answer this question.'\n\nData:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
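A quick way to exercise the pipeline end to end. The two sample documents are placeholders; in practice you'd feed in your chunked knowledge base:

documents = [
    {"content": "Full-time employees accrue 15 days of paid vacation per year.", "source": "hr_handbook.md"},
    {"content": "Unused vacation days may be carried over for up to 12 months.", "source": "hr_handbook.md"},
]

embed_and_store(documents)
print(rag_answer("How many vacation days do full-time employees get?"))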

Vector Database Comparison

Database | Deployment               | Free Tier        | Scale        | Highlights
Pinecone | Managed                  | 1M vectors       | Medium-large | Simplest
Qdrant   | Self-hosted/Managed      | Open source free | Any          | Most features
Weaviate | Self-hosted/Managed      | Open source free | Medium-large | GraphQL interface
Chroma   | Self-hosted              | Open source free | Small        | Most lightweight
pgvector | Self-hosted (PostgreSQL) | Open source free | Small-medium | Integrates with existing DB

[Image: Vector database management interface showing document embedding status]


RAG Performance Optimization Tips

Answer-First: RAG quality depends on three stages -- retrieval quality, context assembly, and generation quality. Most issues stem from insufficient retrieval quality.

Common Problems & Solutions

Problem 1: Irrelevant search results

Solutions:

  • Try different chunk sizes
  • Use Hybrid Search (vector search + keyword search)
  • Add metadata filtering (e.g., document type, date) -- see the sketch after this list
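With the Qdrant setup from the pipeline above, metadata filtering is a small change to the search call. A sketch that filters on the source field stored in the payload (the file name is just an example):

from qdrant_client.models import Filter, FieldCondition, MatchValue

def search_filtered(query, source, top_k=3):
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    return qdrant.search(
        collection_name="knowledge_base",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="source", match=MatchValue(value=source))]
        ),
        limit=top_k
    )

# results = search_filtered("vacation policy", source="hr_handbook.md")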

Problem 2: Incomplete answers

Solutions:

  • Increase top_k (number of search results)
  • Use multi-step retrieval (broad search first, then refine)
  • Increase chunk overlap

Problem 3: Hallucinations in answers

Solutions:

  • Emphasize "answer only based on provided data" in the System Prompt
  • Require the LLM to cite sources (see the sketch after this list)
  • Use a more instruction-following model (Claude Sonnet)
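One way to combine the first two points: number each retrieved chunk and require the model to cite those numbers. A sketch that adapts the rag_answer function from the pipeline above (the prompt wording is just a starting point):

def rag_answer_with_citations(question):
    results = search(question)
    # Number each chunk so the model can point back to it
    context = "\n\n".join(
        f"[{i + 1}] (source: {r.payload['source']})\n{r.payload['content']}"
        for i, r in enumerate(results)
    )
    system = (
        "Answer ONLY using the numbered data below. Cite the chunk numbers you used, "
        "e.g. [1][3]. If the data does not contain the answer, reply: "
        "'Based on available data, I cannot answer this question.'\n\nData:\n" + context
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content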

Problem 4: Too slow

Solutions:

  • Use a smaller embedding model (text-embedding-3-small)
  • Reduce top_k
  • Use streaming responses
  • Cache common Q&A pairs (a sketch combining streaming and caching follows this list)
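A minimal sketch covering the last two points, reusing the search helper from the pipeline above. The in-memory dict is illustrative; a production system would typically use Redis or similar:

answer_cache = {}  # illustrative in-memory cache; swap for Redis or similar in production

def rag_answer_streaming(question):
    key = question.strip().lower()
    if key in answer_cache:
        return answer_cache[key]  # skip retrieval and generation entirely

    results = search(question)
    context = "\n".join(r.payload["content"] for r in results)

    stream = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer the question based on the following data.\n\nData:\n{context}"},
            {"role": "user", "content": question}
        ],
        stream=True,  # tokens arrive as they are generated
    )
    answer = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        answer += delta
        print(delta, end="", flush=True)  # show partial output immediately

    answer_cache[key] = answer
    return answer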

Advanced Optimization: Reranking

After retrieval, use a reranker model to re-sort results:

# Using Cohere Rerank
# pip install cohere
import cohere

co = cohere.Client()  # expects your Cohere API key (e.g., via environment variable)

results = co.rerank(
    query="What is the return policy",
    documents=["search result 1", "search result 2", "search result 3"],
    model="rerank-v3.5"
)
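The response holds the reranked order; reading it back looks roughly like this (attribute names follow Cohere's Python SDK, and the threshold is just an example):

# Each result points back to the original document's index, with a relevance score
for r in results.results:
    if r.relevance_score > 0.5:  # example threshold; tune for your data
        print(r.index, round(r.relevance_score, 3))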

Reranking can boost search accuracy from 70% to 85%+.

For more technical tutorials, check out API Tutorial Beginner's Guide and AI Code Generation Guide.

[Image: Side-by-side comparison showing RAG answer quality before and after optimization]


FAQ: RAG Application Common Questions

What document formats can RAG handle?

RAG itself isn't format-limited. As long as you can convert a document to text, it can go into a RAG system. Commonly supported: PDF, Word, Excel, HTML, Markdown, plain text. Images and tables require additional OCR or multimodal processing.

How big of a server does a RAG system need?

It depends on the architecture. If using cloud services (OpenAI API + Pinecone), no server needed at all. If fully self-hosted (open-source LLM + Qdrant), we recommend at least 16GB RAM and a 4-core CPU. GPU isn't necessarily required (unless self-hosting an LLM).

Can RAG be used for real-time chatbots?

Yes, but speed optimization is needed. We recommend streaming responses, caching common Q&A pairs, and choosing fast models (GPT-4o-mini or Gemini Flash). Typical latency: 2-5 seconds.

How often should the knowledge base be updated?

It depends on how frequently your data changes. We recommend setting up an automated pipeline: document update -> re-embedding -> write to vector database. This process can be automated with CI/CD tools.

LangChain vs LlamaIndex -- which is better?

LangChain is more comprehensive, suited for complex AI applications. LlamaIndex focuses on RAG with simpler setup. If you're only doing RAG, go with LlamaIndex. If you also need Agent functionality and more, choose LangChain.

For complete LLM and RAG strategy guides, return to LLM & RAG Application Guide.


Conclusion: RAG Is the Most Pragmatic Enterprise AI Solution

No fine-tuning needed. No training your own model.

RAG lets you use off-the-shelf LLM APIs + your own data to build reliable AI applications.

Next steps:

  1. Prepare your knowledge base documents
  2. Choose your Embedding and LLM APIs
  3. Build a prototype using the code above
  4. Test, optimize, deploy

The entire process can produce a prototype in one day. Don't over-engineer -- build it first, then iterate.


Get a Quote for Enterprise Plans

CloudInsight offers LLM API enterprise purchasing for RAG systems:

  • Unified purchasing of OpenAI Embedding + Claude/GPT generation APIs
  • Enterprise-exclusive discounts to reduce RAG system operating costs
  • Invoices included, technical support available

Get a quote for enterprise plans -> | Join LINE for instant consultation ->



References

  1. OpenAI - Embedding Models Documentation (2026)
  2. Qdrant - Official Documentation (2026)
  3. Pinecone - Vector Database Guide (2026)
  4. LangChain - RAG Tutorial (2026)
  5. Cohere - Rerank API Documentation (2026)