Structured JSON Extraction

Extract structured data from unstructured text with high reliability. Use Zod schemas to validate and type the model's output.

model-agnostic · extraction · json · zod · typescript · parsing · Updated 2026-04-23

System Prompt

You are a data extraction engine. Extract information from the provided text
and return it as valid JSON matching the schema described by the user.

## Rules
- Return ONLY valid JSON — no markdown, no code fences, no explanation.
- If a field is not present in the text, use null for nullable fields. Omit
  the key only if the schema marks it as optional.
- Do not invent values. If the text is ambiguous, use null.
- Dates must be in ISO 8601 format (YYYY-MM-DD).
- Numbers must be numbers, not strings.
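
The date rule above can also be checked after parsing, since a model may still return a well-formed but impossible date such as "2024-02-30". A minimal sketch, assuming a hypothetical helper named `isIsoDate` that is not part of the original recipe:

```typescript
// Hypothetical helper: true only for a real calendar date written as YYYY-MM-DD.
function isIsoDate(value: unknown): boolean {
  if (typeof value !== "string") return false;
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC normalizes overflow (Feb 30 becomes Mar 1), so a round-trip
  // comparison detects dates that do not exist.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

With Zod, a check like this could be attached to a date field through `.refine(isIsoDate)`, so that invalid dates fail validation instead of passing as plain strings.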

User Prompt template

Extract the following fields from the text below.

Schema:
{{SCHEMA_DESCRIPTION}}

Text:
{{INPUT_TEXT}}
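
One way to fill this template is sketched below. The names `USER_PROMPT_TEMPLATE` and `renderUserPrompt` are hypothetical. The sketch substitutes both placeholders in a single pass, so that placeholder-like text inside the input document is left untouched:

```typescript
const USER_PROMPT_TEMPLATE = [
  "Extract the following fields from the text below.",
  "",
  "Schema:",
  "{{SCHEMA_DESCRIPTION}}",
  "",
  "Text:",
  "{{INPUT_TEXT}}",
].join("\n");

// Hypothetical helper: fills both placeholders in one pass. A replacer function
// is used so that "$&" or "{{...}}" inside the input text is not reinterpreted.
function renderUserPrompt(schemaDescription: string, inputText: string): string {
  const values: Record<string, string> = {
    SCHEMA_DESCRIPTION: schemaDescription.trim(),
    INPUT_TEXT: inputText,
  };
  return USER_PROMPT_TEMPLATE.replace(
    /\{\{(SCHEMA_DESCRIPTION|INPUT_TEXT)\}\}/g,
    (_match, key: string) => values[key] ?? "",
  );
}
```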

Full implementation (TypeScript + Zod)

typescript
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
 
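const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// The prompt from the "System Prompt" section
const STRUCTURED_JSON_SYSTEM_PROMPT = `You are a data extraction engine. Extract information from the provided text
and return it as valid JSON matching the schema described by the user.

## Rules
- Return ONLY valid JSON: no markdown, no code fences, no explanation.
- If a field is not present in the text, use null for nullable fields. Omit
  the key only if the schema marks it as optional.
- Do not invent values. If the text is ambiguous, use null.
- Dates must be in ISO 8601 format (YYYY-MM-DD).
- Numbers must be numbers, not strings.`;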
 
async function extractStructured<T extends z.ZodTypeAny>(
  text: string,
  schema: T,
  schemaDescription: string,
): Promise<z.infer<T>> {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-latest", // Haiku is sufficient for simple extraction
    max_tokens: 2048,
    system: STRUCTURED_JSON_SYSTEM_PROMPT,
    messages: [
      {
        role: "user",
        content: `Extract the following fields from the text below.\n\nSchema:\n${schemaDescription}\n\nText:\n${text}`,
      },
    ],
  });
 
  const rawText = response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("")
    .trim();
 
  // Strip accidental markdown code fences
  const cleaned = rawText
    .replace(/^```(?:json)?\n?/, "")
    .replace(/\n?```$/, "")
    .trim();
 
  const parsed = JSON.parse(cleaned) as unknown;
  return schema.parse(parsed);
}
 
// --- Usage example ---
 
const InvoiceSchema = z.object({
  invoice_number: z.string(),
  date: z.string().nullable(),
  vendor: z.string(),
  total_amount: z.number().nullable(),
  currency: z.string().default("USD"),
  line_items: z.array(
    z.object({
      description: z.string(),
      quantity: z.number().nullable(),
      unit_price: z.number().nullable(),
    }),
  ),
});
 
const schemaDescription = `
{
  invoice_number: string,
  date: string | null,       // ISO 8601 format
  vendor: string,
  total_amount: number | null,
  currency: string,          // default "USD"
  line_items: Array<{
    description: string,
    quantity: number | null,
    unit_price: number | null
  }>
}`;
 
// rawInvoiceText is assumed to hold the invoice text, loaded elsewhere (file, OCR, email body).
const invoice = await extractStructured(
  rawInvoiceText,
  InvoiceSchema,
  schemaDescription,
);
 
console.log(invoice.total_amount); // number | null, fully typed
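
The fence-stripping step in `extractStructured` covers replies that are wrapped in a code fence, but not replies with prose before or after the JSON. A slightly more tolerant parser could look like the sketch below. `parseModelJson` is a hypothetical helper, and the fallback heuristic (take the outermost brace or bracket span) is an assumption that may not suit every output:

```typescript
// Hypothetical helper: pulls the JSON payload out of a model reply that may
// contain a code fence or surrounding prose. Returns the parsed value or throws.
function parseModelJson(raw: string): unknown {
  // Prefer the contents of the first fenced block, if there is one.
  const fenced = /```(?:json)?\s*([\s\S]*?)```/.exec(raw);
  const candidate = (fenced ? (fenced[1] ?? "") : raw).trim();
  try {
    return JSON.parse(candidate);
  } catch {
    // Fall back to the outermost {...} or [...] span.
    const start = candidate.search(/[{\[]/);
    const end = Math.max(candidate.lastIndexOf("}"), candidate.lastIndexOf("]"));
    if (start === -1 || end <= start) {
      throw new SyntaxError("No JSON object or array found in model output");
    }
    return JSON.parse(candidate.slice(start, end + 1));
  }
}
```

The result is still `unknown`, so it would be passed to `schema.parse` exactly as in the implementation above.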

Handling parsing errors

typescript
async function safeExtract<T extends z.ZodTypeAny>(
  text: string,
  schema: T,
  schemaDescription: string,
  retries = 2,
): Promise<z.infer<T> | null> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await extractStructured(text, schema, schemaDescription);
    } catch (err) {
      if (attempt === retries) {
        console.error("Extraction failed after retries:", err);
        return null;
      }
      // Retry with the same prompt; the error is not fed back to the model
      console.warn(`Extraction attempt ${attempt + 1} failed, retrying…`);
    }
  }
  return null;
}
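
`safeExtract` repeats the same request on each attempt. A variant that tells the model what went wrong could be sketched as follows. The `CallModel` type and `extractWithFeedback` function are hypothetical. The model call is injected as a plain function so the retry logic stays independent of any SDK:

```typescript
// Hypothetical signature: sends a prompt and resolves to the model's raw text reply.
type CallModel = (prompt: string) => Promise<string>;

async function extractWithFeedback<T>(
  basePrompt: string,
  callModel: CallModel,
  validate: (value: unknown) => T, // e.g. (v) => InvoiceSchema.parse(v)
  retries = 2,
): Promise<T | null> {
  let prompt = basePrompt;
  for (let attempt = 0; attempt <= retries; attempt++) {
    // Errors from the call itself (network, auth) are not caught and propagate.
    const raw = await callModel(prompt);
    try {
      return validate(JSON.parse(raw));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      // Append the rejected reply and the error so the next attempt can correct it.
      prompt =
        `${basePrompt}\n\nYour previous reply was rejected:\n${raw}\n\n` +
        `Error: ${reason}\nReturn corrected JSON only.`;
    }
  }
  return null; // every attempt failed parsing or validation
}
```

Only parsing and validation failures trigger a retry here; whether transport errors should also be retried is a separate design choice.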

Optimization with Prompt Caching

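If you process many documents with the same schema, cache the system prompt and the schema description. Caching only takes effect once the cached prefix reaches the model's minimum cacheable length, so a very short prompt and schema may not be cached at all: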

typescript
const response = await client.messages.create({
  model: "claude-3-5-haiku-latest",
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: STRUCTURED_JSON_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: `Schema:\n${schemaDescription}\n\nText:\n`,
          cache_control: { type: "ephemeral" }, // cache breakpoint: covers everything up to and including the schema description
        },
        {
          type: "text",
          text: documentText, // only this block changes per document
        },
      ],
    },
  ],
});
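
To confirm that the cache is actually being hit, the `usage` object on the response can be inspected. The sketch below assumes the usage fields `input_tokens`, `cache_creation_input_tokens` and `cache_read_input_tokens` as reported by the Messages API. The `CacheUsage` interface and the `cacheReadRatio` helper are hypothetical:

```typescript
// Subset of the response `usage` object that relates to caching.
interface CacheUsage {
  input_tokens: number; // tokens neither read from nor written to the cache
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Hypothetical helper: fraction of all input tokens that were served from cache.
function cacheReadRatio(usage: CacheUsage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}
```

Calling `cacheReadRatio(response.usage)` after each request gives a quick signal: a ratio that stays at 0 across repeated calls suggests the prefix is too short to be cached or is changing between requests.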

For batch extraction, use Claude 3.5 Haiku: it is roughly 10x faster and cheaper than Opus. Reserve Opus for complex or ambiguous documents where simple extraction fails.
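
The "cheap model first, stronger model on failure" policy can be sketched as below. `Extractor` and `extractWithEscalation` are hypothetical names. The sketch assumes each extractor is already bound to one model and resolves to `null` on failure, which the recipe's `extractStructured` does not do as written because its model is hard-coded:

```typescript
// Hypothetical type: an extractor bound to one model, resolving to null on failure.
type Extractor<T> = (text: string) => Promise<T | null>;

async function extractWithEscalation<T>(
  text: string,
  cheap: Extractor<T>,
  strong: Extractor<T>,
): Promise<{ value: T | null; escalated: boolean }> {
  const first = await cheap(text);
  if (first !== null) return { value: first, escalated: false };
  // Only documents the cheap model could not handle reach the stronger model.
  return { value: await strong(text), escalated: true };
}
```

Tracking the `escalated` flag over a batch shows how often the fallback is needed, which helps judge whether the cheaper model is a good default for a given document type.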

Common schemas

| Use case | Typical fields |
| --- | --- |
| Invoices | invoice_number, date, vendor, total, line_items |
| CVs | name, email, experience[], skills[], education[] |
| News articles | title, author, published_at, topics[], sentiment |
| Contracts | parties[], effective_date, termination_date, obligations[] |
| Support tickets | id, priority, category, customer_id, description |
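
For any of these cases, the schema description sent to the model can be generated from a single field list rather than written by hand, which keeps it easy to update. A minimal sketch for the support ticket case follows. `FieldSpec` and `describeFields` are hypothetical names, and the nullability chosen for each field is an assumption for illustration:

```typescript
// Hypothetical field spec: maps each field name to a type string and an optional note.
type FieldSpec = Record<string, { type: string; note?: string }>;

// Renders the spec in the same pseudo-TypeScript style as the invoice example.
function describeFields(fields: FieldSpec): string {
  const lines = Object.entries(fields).map(([name, { type, note }]) =>
    note ? `  ${name}: ${type}, // ${note}` : `  ${name}: ${type},`,
  );
  return ["{", ...lines, "}"].join("\n");
}

// Assumed shape for support tickets; adjust nullability to match the real schema.
const supportTicketFields: FieldSpec = {
  id: { type: "string" },
  priority: { type: "string | null", note: "null if not stated" },
  category: { type: "string | null" },
  customer_id: { type: "string | null" },
  description: { type: "string" },
};

const supportTicketDescription = describeFields(supportTicketFields);
```

The resulting string can be passed as `schemaDescription`. The matching Zod schema still has to be written separately, so the two need to be kept in sync by hand.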