RAG: A Practical Implementation

RAG (Retrieval-Augmented Generation) is one of the central patterns in AI-first applications. It lets an LLM answer questions using your own data: relevant documents are retrieved at query time and passed to the model as context. This tends to reduce hallucinations, and it avoids retraining or fine-tuning the model every time your data changes.

RAG Architecture

User question
        │
        ▼
┌──────────────────┐
│   1. EMBEDDING   │  Convert the question into a vector
│   of the query   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   2. RETRIEVAL   │  Find similar documents in the vector DB
│   (search)       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  3. RE-RANKING   │  Reorder by relevance (optional)
│                  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  4. GENERATION   │  The LLM answers using the docs as context
│  (LLM + context) │
└──────────────────┘
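The four stages in the diagram can be sketched as interchangeable async functions, which makes it easier to swap the vector DB, the re-ranker or the model independently. This is a minimal sketch: the `RagStages` interface and the `runRag` function are hypothetical names, not part of any library, and the caller injects each stage.

```typescript
// Hypothetical stage signatures; each one maps to a box in the diagram.
interface Doc {
  content: string;
  score: number;
}

interface RagStages {
  embed: (text: string) => Promise<number[]>;                   // 1. Embedding
  search: (vector: number[], topK: number) => Promise<Doc[]>;   // 2. Retrieval
  rerank?: (query: string, docs: Doc[]) => Promise<Doc[]>;      // 3. Re-ranking (optional)
  generate: (query: string, context: Doc[]) => Promise<string>; // 4. Generation
}

// Runs the stages in order and passes the best topK documents to the generator.
async function runRag(query: string, stages: RagStages, topK = 5): Promise<string> {
  const vector = await stages.embed(query);
  // Fetch extra candidates so an optional re-ranker has room to reorder them
  let docs = await stages.search(vector, topK * 2);
  if (stages.rerank) {
    docs = await stages.rerank(query, docs);
  }
  return stages.generate(query, docs.slice(0, topK));
}
```

Fetching `topK * 2` candidates before re-ranking is an assumption; it only gives the optional re-ranker some room before the list is cut down to `topK`.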

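Step 1: Ingestion and Chunking

Why chunk?

Embedding models limit how much text they accept (OpenAI's text-embedding-3 models take at most 8,191 tokens per input), and even below that limit a single vector for a long document blends many topics together and retrieves poorly. Documents are therefore split into chunks of manageable size. The sizes in the examples below are measured in characters, not tokens: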

interface Chunk {
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    totalChunks: number;
  };
}

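// Simple character-based chunking with overlap
function chunkText(text: string, options: {
  chunkSize?: number;
  overlap?: number;
} = {}): string[] {
  const { chunkSize = 1000, overlap = 200 } = options;
  // With overlap >= chunkSize the step below would be <= 0 and the loop would never end
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('overlap must be between 0 and chunkSize - 1');
  }
  const chunks: string[] = [];

  for (let i = 0; i < text.length; i += chunkSize - overlap) {
    chunks.push(text.slice(i, i + chunkSize));
    // Stop once a chunk reaches the end; otherwise a trailing chunk would be pure overlap
    if (i + chunkSize >= text.length) break;
  }

  return chunks;
}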
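A variant that returns character offsets instead of strings is handy if you want to store each chunk's position in its metadata. The sketch below assumes the same character-based sizes; `chunkSpans` and `Span` are hypothetical names. It rejects an `overlap` greater than or equal to `chunkSize`, because the step `chunkSize - overlap` would then never advance.

```typescript
interface Span {
  start: number; // inclusive character offset
  end: number;   // exclusive character offset
}

// Computes [start, end) windows of at most chunkSize characters,
// each one starting chunkSize - overlap characters after the previous one.
function chunkSpans(length: number, chunkSize = 1000, overlap = 200): Span[] {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('overlap must be in [0, chunkSize)');
  }
  const spans: Span[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < length; start += step) {
    const end = Math.min(start + chunkSize, length);
    spans.push({ start, end });
    if (end === length) break; // this window already reaches the end of the text
  }
  return spans;
}
```

For example, `chunkSpans(2500)` yields three windows: [0, 1000), [800, 1800) and [1600, 2500), and `text.slice(start, end)` recovers each chunk.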

Smarter chunking (by section)

function chunkByHeadings(markdown: string, source = ''): Chunk[] {
  // Split before each level-2 heading ("## "); the "## " marker itself is dropped
  const sections = markdown.split(/^##\s+/m);
  const chunks: Chunk[] = [];

  for (const section of sections) {
    if (section.trim().length < 50) continue; // Skip empty or near-empty sections

    // If the section is too long, subdivide it with overlap
    if (section.length > 2000) {
      const subChunks = chunkText(section, { chunkSize: 1000, overlap: 200 });
      chunks.push(...subChunks.map((c, i) => ({
        content: c,
        // chunkIndex is the position within this section
        metadata: { source, chunkIndex: i, totalChunks: subChunks.length },
      })));
    } else {
      chunks.push({
        content: section,
        metadata: { source, chunkIndex: 0, totalChunks: 1 },
      });
    }
  }

  return chunks;
}
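When a long section is subdivided, only its first sub-chunk still contains the heading text. One option is to capture the heading separately so it can be stored in the metadata or prepended to every sub-chunk. A minimal sketch with a hypothetical `splitSections` helper; like the regex above, it does not skip `##` lines that appear inside fenced code blocks.

```typescript
interface Section {
  heading: string; // title of the "## " heading, or '' for text before the first heading
  body: string;
}

// Splits Markdown on level-2 headings ("## Title"), keeping each title apart from its body.
function splitSections(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: '', body: '' };
  for (const line of markdown.split(/\r?\n/)) {
    const match = /^##\s+(.*)$/.exec(line); // "### " does not match: its third char is '#'
    if (match) {
      if (current.heading || current.body.trim()) sections.push(current);
      current = { heading: match[1].trim(), body: '' };
    } else {
      current.body += line + '\n';
    }
  }
  if (current.heading || current.body.trim()) sections.push(current);
  return sections;
}
```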

Recursive chunking (LangChain style)

function recursiveChunk(
  text: string,
  maxSize: number = 1000,
  separators: string[] = ['\n\n', '\n', '. ', ' ']
): string[] {
  if (text.length <= maxSize) return [text];

  const [separator, ...rest] = separators;
  const parts = text.split(separator);
  const chunks: string[] = [];
  let current = '';

  // Emits a finished chunk; if it is still too long, re-split it with the next separator
  const flush = (piece: string) => {
    if (piece.length > maxSize && rest.length > 0) {
      chunks.push(...recursiveChunk(piece, maxSize, rest));
    } else if (piece.trim()) {
      chunks.push(piece.trim());
    }
  };

  // Greedily pack parts into chunks of at most maxSize characters
  for (const part of parts) {
    const candidate = current ? current + separator + part : part;
    if (candidate.length > maxSize && current) {
      flush(current);
      current = part;
    } else {
      current = candidate;
    }
  }

  if (current) flush(current);

  return chunks;
}

Step 2: Full Ingestion Pipeline

import OpenAI from 'openai';
import pg from 'pg';
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';

// Assumes a `documents` table with content (text), embedding (pgvector `vector`)
// and metadata (jsonb) columns, plus the recursiveChunk function defined above
class RAGIngester {
  private openai: OpenAI;

  constructor(private db: pg.Pool) {
    this.openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment
  }

  async ingestDirectory(dirPath: string) {
    const files = readdirSync(dirPath).filter(f => f.endsWith('.md'));
    console.log(`Processing ${files.length} files...`);

    for (const file of files) {
      const content = readFileSync(join(dirPath, file), 'utf-8');
      const chunks = recursiveChunk(content, 1000).filter(c => c.trim().length > 0);
      if (chunks.length === 0) continue; // Empty file: nothing to embed

      // Generate all of this file's embeddings in a single batch request
      const response = await this.openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: chunks,
      });

      // Insert into the database; pgvector parses the '[x,y,...]' string
      // produced by JSON.stringify as a vector
      for (let i = 0; i < chunks.length; i++) {
        await this.db.query(
          `INSERT INTO documents (content, embedding, metadata)
           VALUES ($1, $2::vector, $3)`,
          [
            chunks[i],
            JSON.stringify(response.data[i].embedding),
            JSON.stringify({
              source: file,
              chunk_index: i,
              total_chunks: chunks.length,
            }),
          ]
        );
      }

      console.log(`  ✓ ${file}: ${chunks.length} chunks indexed`);
    }
  }
}
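The ingester sends all of a file's chunks in one embeddings request, but the API caps how many inputs and tokens a single request may contain, so a very large file can fail. A minimal sketch of splitting the work into fixed-size batches; `toBatches`, `embedAll` and the batch size of 100 are assumptions, and `embedBatch` stands in for a call to `openai.embeddings.create` that returns one vector per input, in order.

```typescript
// Splits an array into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embeds all chunks batch by batch, keeping the output aligned with the input order.
async function embedAll(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 100,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```

Batches run sequentially here, which is the simplest way to stay under rate limits; the resulting `vectors[i]` still corresponds to `chunks[i]`, so the insert loop does not change.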

Step 3: Retrieval

Basic semantic search

// Assumes module-level `openai` (OpenAI client) and `db` (pg.Pool) instances
async function retrieve(query: string, topK = 5) {
  // Embed the question with the same model used at ingestion
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const queryEmbedding = response.data[0].embedding;

  // Cosine similarity search: <=> is pgvector's cosine distance operator,
  // so 1 - distance is the similarity; matches at or below 0.5 are discarded
  const results = await db.query(`
    SELECT content, metadata,
           1 - (embedding <=> $1::vector) AS similarity
    FROM documents
    WHERE 1 - (embedding <=> $1::vector) > 0.5
    ORDER BY embedding <=> $1::vector
    LIMIT $2
  `, [JSON.stringify(queryEmbedding), topK]);

  return results.rows;
}
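To make explicit what the SQL query computes, here is the same logic in memory: pgvector's `<=>` returns cosine distance, so `1 - distance` is the cosine similarity, which is filtered by the 0.5 threshold and sorted in descending order. This is a sketch for illustration or tests, not a replacement for the database and its index; `cosineSimilarity` and `topKBySimilarity` are hypothetical names.

```typescript
// Cosine similarity between two vectors of the same dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Scored<T> {
  item: T;
  similarity: number;
}

// In-memory equivalent of: WHERE similarity > minSimilarity ORDER BY distance LIMIT topK
function topKBySimilarity<T>(
  query: number[],
  items: { item: T; embedding: number[] }[],
  topK = 5,
  minSimilarity = 0.5,
): Scored<T>[] {
  return items
    .map(({ item, embedding }) => ({ item, similarity: cosineSimilarity(query, embedding) }))
    .filter(r => r.similarity > minSimilarity)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```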

Hybrid search (semantic + keywords)

async function hybridSearch(query: string, topK = 5) {
  // Embed the question (same model as at ingestion)
  const embeddingResponse = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const embedding = embeddingResponse.data[0].embedding;

  // Combine semantic search with PostgreSQL full-text search. 'spanish' is the text
  // search configuration for a Spanish corpus; use the one matching your documents.
  // The 0.7/0.3 weights are heuristic: cosine similarity and ts_rank use different scales.
  const results = await db.query(`
    WITH semantic AS (
      SELECT id, content, metadata,
             1 - (embedding <=> $1::vector) AS semantic_score
      FROM documents
      ORDER BY embedding <=> $1::vector
      LIMIT 20
    ),
    keyword AS (
      SELECT id, content, metadata,
             ts_rank(to_tsvector('spanish', content),
                     plainto_tsquery('spanish', $2)) AS keyword_score
      FROM documents
      WHERE to_tsvector('spanish', content) @@ plainto_tsquery('spanish', $2)
      ORDER BY keyword_score DESC
      LIMIT 20
    )
    SELECT COALESCE(s.id, k.id) AS id,
           COALESCE(s.content, k.content) AS content,
           COALESCE(s.metadata, k.metadata) AS metadata,
           COALESCE(s.semantic_score, 0) * 0.7 +
           COALESCE(k.keyword_score, 0) * 0.3 AS combined_score
    FROM semantic s
    FULL OUTER JOIN keyword k ON s.id = k.id
    ORDER BY combined_score DESC
    LIMIT $3
  `, [JSON.stringify(embedding), query, topK]);

  return results.rows;
}
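Because cosine similarity and `ts_rank` live on different scales, the 0.7/0.3 weighted sum is hard to calibrate. A commonly used alternative, which is a different technique from the query above, is Reciprocal Rank Fusion (RRF): it ignores the raw scores and adds `1 / (k + rank)` for every list in which a document appears, with `k = 60` as the usual default. A minimal sketch, assuming each search returns document ids already sorted by relevance; `reciprocalRankFusion` is a hypothetical name.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) to a document,
// with rank starting at 1. Documents that rank well in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

You could feed it the ids from the semantic and keyword searches, run as two separate queries, and keep the first `topK` results.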

Step 4: Generation with context

async function ragChat(query: string): Promise<string> {
  // 1. Retrieve relevant documents
  const docs = await retrieve(query, 5);

  // 2. Build the context, labeling each document with its source
  const context = docs
    .map((doc, i) => `[Document ${i + 1}] (Source: ${doc.metadata.source})\n${doc.content}`)
    .join('\n\n---\n\n');

  // 3. Generate the answer with the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are an assistant that answers questions based on the documents provided.

RULES:
- Answer ONLY from the provided context
- If the information is not in the documents, say "I don't have information about that"
- Cite the relevant sources at the end of your answer
- Be precise and concise`,
      },
      {
        role: 'user',
        content: `CONTEXT:\n${context}\n\n---\n\nQUESTION: ${query}`,
      },
    ],
    temperature: 0.3, // Low temperature: less random, more consistent answers
  });

  return response.choices[0].message.content ?? '';
}
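`ragChat` concatenates every retrieved document, so long chunks or a large `topK` can overflow the model's context window or inflate cost. One mitigation is to cap the context size. A minimal sketch with a hypothetical `buildContext` helper that uses a character budget as a rough stand-in for tokens; since the documents arrive sorted by relevance, the least relevant ones are dropped first.

```typescript
interface ContextDoc {
  content: string;
  source: string;
}

// Joins documents into a numbered context block, stopping before the
// total length would exceed maxChars (a rough proxy for a token budget).
function buildContext(docs: ContextDoc[], maxChars = 12000): string {
  const separator = '\n\n---\n\n';
  const parts: string[] = [];
  let used = 0;
  for (let i = 0; i < docs.length; i++) {
    const part = `[Document ${i + 1}] (Source: ${docs[i].source})\n${docs[i].content}`;
    const cost = part.length + (parts.length > 0 ? separator.length : 0);
    if (used + cost > maxChars) break; // later documents are less relevant: drop them
    parts.push(part);
    used += cost;
  }
  return parts.join(separator);
}
```

If the first document alone exceeds the budget, the result is empty; truncating that document instead would be an equally reasonable choice.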

Advanced RAG

LLM-based re-ranking

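// Shape of the rows returned by retrieve() / hybridSearch(); a custom name avoids
// clashing with the DOM's global `Document` type
interface RetrievedDoc {
  content: string;
  metadata: { source: string };
  similarity?: number;     // set by retrieve()
  combined_score?: number; // set by hybridSearch()
}

async function rerankWithLLM(query: string, docs: RetrievedDoc[]): Promise<RetrievedDoc[]> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: 'Sort the documents by relevance to the question, most relevant first. Return JSON: { "ranking": [index, ...] }',
      },
      {
        role: 'user',
        content: `Question: ${query}\n\nDocuments:\n${docs.map((d, i) => `[${i}]: ${d.content.slice(0, 200)}`).join('\n')}`,
      },
    ],
  });

  const { ranking } = JSON.parse(response.choices[0].message.content ?? '{}');

  // The model may return invalid, repeated or missing indices: keep each valid index
  // once, then append any documents it left out, in their original order
  const order: number[] = [];
  for (const i of Array.isArray(ranking) ? ranking : []) {
    if (Number.isInteger(i) && i >= 0 && i < docs.length && !order.includes(i)) order.push(i);
  }
  docs.forEach((_, i) => { if (!order.includes(i)) order.push(i); });

  return order.map(i => docs[i]);
}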

Contextual chunking

// Prepend document-level context to the chunk before embedding it
async function contextualChunk(fullDoc: string, chunk: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      // Only the first 2000 characters of the document are sent, to limit cost
      content: `Given this document:\n${fullDoc.slice(0, 2000)}\n\nIn 1-2 sentences, summarize the context of this fragment within the document:\n${chunk}`,
    }],
    max_tokens: 100,
  });

  return `${response.choices[0].message.content ?? ''}\n\n${chunk}`;
}
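`contextualChunk` makes one LLM call per chunk, so enriching a whole corpus means many requests. Firing them all at once with `Promise.all` can hit rate limits, while awaiting them one by one is slow. A middle ground is to cap the number of calls in flight. A minimal sketch; `mapWithConcurrency` is a hypothetical helper, not a library function.

```typescript
// Runs `fn` over every item with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // safe to share: JavaScript runs these workers on a single thread

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With it, enriching a document's chunks could look like `mapWithConcurrency(chunks, 5, chunk => contextualChunk(fullDoc, chunk))`; the limit of 5 is an arbitrary example to tune against your rate limits.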

Query expansion

async function expandQuery(query: string): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Generate 3 rephrasings of this question to improve search:\n"${query}"\nReturn only the 3 questions, one per line.`,
    }],
  });

  const expanded = (response.choices[0].message.content ?? '')
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean);
  return [query, ...expanded]; // Search with the original question plus the rephrasings
}
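`expandQuery` only produces the queries; each one still has to go through retrieval, and the result lists have to be merged because the same document often comes back for several rephrasings. Models also tend to number their lines despite the instruction. A minimal sketch of both steps, assuming each hit has a string `id` and a `similarity` score; `cleanExpansionLines`, `mergeHits` and `Hit` are hypothetical names.

```typescript
interface Hit {
  id: string;
  content: string;
  similarity: number;
}

// Strips list markers such as "1. ", "2) " or "- " that an LLM may add to each line.
function cleanExpansionLines(raw: string): string[] {
  return raw
    .split('\n')
    .map(line => line.replace(/^\s*(?:\d+[.)]|[-*])\s*/, '').trim())
    .filter(line => line.length > 0);
}

// Merges hits from several queries, keeping each document once with its best similarity.
function mergeHits(resultSets: Hit[][], topK = 5): Hit[] {
  const best = new Map<string, Hit>();
  for (const hits of resultSets) {
    for (const hit of hits) {
      const seen = best.get(hit.id);
      if (!seen || hit.similarity > seen.similarity) best.set(hit.id, hit);
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
}
```

Keeping the best similarity per document is only one simple merge policy; it favors documents that match any single phrasing strongly.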

Complete RAG class

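// System prompt used by RAGService (same rules as in ragChat above)
const RAG_SYSTEM_PROMPT = `You are an assistant that answers questions based on the documents provided.
Answer ONLY from the provided context; if the information is not there, say "I don't have information about that".
Cite the relevant sources at the end of your answer. Be precise and concise.`;

class RAGService {
  private openai: OpenAI;

  constructor() {
    this.openai = new OpenAI();
  }

  async query(question: string): Promise<{
    answer: string;
    sources: { content: string; source: string; score: number }[];
  }> {
    // 1. Retrieval: fetch more candidates than needed so the re-ranker can choose
    const docs = await hybridSearch(question, 10);

    // 2. Re-ranking
    const reranked = await rerankWithLLM(question, docs);
    const topDocs = reranked.slice(0, 5);

    // 3. Generation
    const context = topDocs
      .map((d, i) => `[${i + 1}] ${d.content}`)
      .join('\n\n');

    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: RAG_SYSTEM_PROMPT },
        { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
      ],
      temperature: 0.2,
    });

    return {
      answer: response.choices[0].message.content ?? '',
      sources: topDocs.map(d => ({
        content: d.content.slice(0, 200),
        source: d.metadata.source,
        // hybridSearch() returns combined_score, not similarity
        score: d.combined_score ?? 0,
      })),
    };
  }
}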
