--- title: Extract Structured Data subtitle: Get consistent, typed output from your tasks slug: running-automations/extract-structured-data --- export const FIELD_TYPES = [ { value: "string", label: "String" }, { value: "number", label: "Number" }, { value: "integer", label: "Integer" }, { value: "boolean", label: "Boolean" }, ] export const SchemaBuilder = () => { const [schemaType, setSchemaType] = useState("single") const [arrayName, setArrayName] = useState("items") const [fields, setFields] = useState([ { id: "1", name: "title", type: "string", description: "The title" }, ]) const [outputFormat, setOutputFormat] = useState("python") const [copied, setCopied] = useState(false) const addField = () => { setFields([...fields, { id: String(Date.now()), name: "", type: "string", description: "" }]) } const removeField = (id) => { if (fields.length > 1) setFields(fields.filter((f) => f.id !== id)) } const updateField = (id, key, value) => { setFields(fields.map((f) => (f.id === id ? { ...f, [key]: value } : f))) } const duplicateNames = useMemo(() => { const names = fields.map((f) => f.name).filter((n) => n.trim() !== "") const counts = {} for (const n of names) { counts[n] = (counts[n] || 0) + 1 } return new Set(Object.keys(counts).filter((n) => counts[n] > 1)) }, [fields]) const schema = useMemo(() => { const properties = {} fields.forEach((field) => { if (field.name) { properties[field.name] = { type: field.type, description: field.description || `The ${field.name}`, } } }) if (schemaType === "array") { return { type: "object", properties: { [arrayName]: { type: "array", description: "List of extracted items", items: { type: "object", properties }, }, }, } } return { type: "object", properties } }, [fields, schemaType, arrayName]) const formattedOutput = useMemo(() => { const jsonStr = JSON.stringify(schema, null, 2) if (outputFormat === "python") { return `data_extraction_schema=${jsonStr.replace(/: null/g, ": None").replace(/: true/g, ": True").replace(/: false/g, ": False")}` } if (outputFormat === "typescript") { return `data_extraction_schema: ${jsonStr}` } return `"data_extraction_schema": ${jsonStr}` }, [schema, outputFormat]) const copyToClipboard = async () => { await navigator.clipboard.writeText(formattedOutput) setCopied(true) setTimeout(() => setCopied(false), 2000) } return (
{[ { value: "single", label: "Single object", desc: "Extract one item with multiple fields" }, { value: "array", label: "List of items", desc: "Extract multiple items with the same structure" }, ].map((type) => ( ))}
{schemaType === "array" && (
setArrayName(e.target.value)} className="w-full p-2 border rounded-md text-sm" placeholder="items" />
)}
{fields.map((field) => (
updateField(field.id, "name", e.target.value)} placeholder="Field name" className={`w-32 p-2 border rounded-md text-sm ${duplicateNames.has(field.name) ? "border-red-500 bg-red-50" : ""}`} title={duplicateNames.has(field.name) ? "Duplicate field name - will be overwritten in schema" : ""} /> updateField(field.id, "description", e.target.value)} placeholder="Description (helps AI understand what to extract)" className="flex-1 p-2 border rounded-md text-sm" />
))}
{duplicateNames.size > 0 && (
Duplicate field names detected. Only the last field with each name will appear in the schema.
)}
{["python", "typescript", "curl"].map((format) => ( ))}
            {formattedOutput}
          
) } By default, Skyvern returns extracted data in whatever format makes sense for the task. Pass a `data_extraction_schema` to enforce a specific structure using [JSON Schema.](https://json-schema.org/) --- ## Define a schema Add `data_extraction_schema` parameter to your task with a JSON Schema object: ```python Python result = await client.run_task( prompt="Get the title of the top post", url="https://news.ycombinator.com", data_extraction_schema={ "type": "object", "properties": { "title": { "type": "string", "description": "The title of the top post" } } } ) ``` ```typescript TypeScript const result = await client.runTask({ body: { prompt: "Get the title of the top post", url: "https://news.ycombinator.com", data_extraction_schema: { type: "object", properties: { title: { type: "string", description: "The title of the top post", }, }, }, }, }); ``` ```bash cURL curl -X POST "https://api.skyvern.com/v1/run/tasks" \ -H "x-api-key: $SKYVERN_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "prompt": "Get the title of the top post", "url": "https://news.ycombinator.com", "data_extraction_schema": { "type": "object", "properties": { "title": { "type": "string", "description": "The title of the top post" } } } }' ``` The `description` field in each property helps Skyvern understand what data to extract. Be specific. `description` fields drive extraction quality. Vague descriptions like "the data" produce vague results. Be specific: "The product price in USD, without currency symbol." --- ## Schema format Skyvern uses standard JSON Schema. Common types: | Type | JSON Schema | Example value | |------|-------------|---------------| | String | `{"type": "string"}` | `"Hello world"` | | Number | `{"type": "number"}` | `19.99` | | Integer | `{"type": "integer"}` | `42` | | Boolean | `{"type": "boolean"}` | `true` | | Array | `{"type": "array", "items": {...}}` | `[1, 2, 3]` | | Object | `{"type": "object", "properties": {...}}` | `{"key": "value"}` | A schema doesn't guarantee all fields are populated. If the data isn't on the page, fields return `null`. Design your code to handle missing values. --- ## Build your schema Use the interactive builder to generate a schema, then copy it into your code. --- ## Examples ### Single value Extract one piece of information, such as the current price of Bitcoin: ```python Python result = await client.run_task( prompt="Get the current Bitcoin price in USD", url="https://coinmarketcap.com/currencies/bitcoin/", data_extraction_schema={ "type": "object", "properties": { "price": { "type": "number", "description": "Current Bitcoin price in USD" } } } ) ``` ```typescript TypeScript const result = await client.runTask({ body: { prompt: "Get the current Bitcoin price in USD", url: "https://coinmarketcap.com/currencies/bitcoin/", data_extraction_schema: { type: "object", properties: { price: { type: "number", description: "Current Bitcoin price in USD", }, }, }, }, }); ``` ```bash cURL curl -X POST "https://api.skyvern.com/v1/run/tasks" \ -H "x-api-key: $SKYVERN_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "prompt": "Get the current Bitcoin price in USD", "url": "https://coinmarketcap.com/currencies/bitcoin/", "data_extraction_schema": { "type": "object", "properties": { "price": { "type": "number", "description": "Current Bitcoin price in USD" } } } }' ``` **Output (when completed):** ```json { "price": 104521.37 } ``` --- ### List of items Extract multiple items with the same structure, such as the top posts from a news site: ```python Python result = await client.run_task( prompt="Get the top 5 posts", url="https://news.ycombinator.com", data_extraction_schema={ "type": "object", "properties": { "posts": { "type": "array", "description": "Top 5 posts from the front page", "items": { "type": "object", "properties": { "title": { "type": "string", "description": "Post title" }, "points": { "type": "integer", "description": "Number of points" }, "url": { "type": "string", "description": "Link to the post" } } } } } } ) ``` ```typescript TypeScript const result = await client.runTask({ body: { prompt: "Get the top 5 posts", url: "https://news.ycombinator.com", data_extraction_schema: { type: "object", properties: { posts: { type: "array", description: "Top 5 posts from the front page", items: { type: "object", properties: { title: { type: "string", description: "Post title", }, points: { type: "integer", description: "Number of points", }, url: { type: "string", description: "Link to the post", }, }, }, }, }, }, }, }); ``` ```bash cURL curl -X POST "https://api.skyvern.com/v1/run/tasks" \ -H "x-api-key: $SKYVERN_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "prompt": "Get the top 5 posts", "url": "https://news.ycombinator.com", "data_extraction_schema": { "type": "object", "properties": { "posts": { "type": "array", "description": "Top 5 posts from the front page", "items": { "type": "object", "properties": { "title": { "type": "string", "description": "Post title" }, "points": { "type": "integer", "description": "Number of points" }, "url": { "type": "string", "description": "Link to the post" } } } } } } }' ``` **Output (when completed):** ```json { "posts": [ { "title": "Running Claude Code dangerously (safely)", "points": 342, "url": "https://blog.emilburzo.com/2026/01/running-claude-code-dangerously-safely/" }, { "title": "Linux kernel framework for PCIe device emulation", "points": 287, "url": "https://github.com/cakehonolulu/pciem" }, { "title": "I'm addicted to being useful", "points": 256, "url": "https://www.seangoedecke.com/addicted-to-being-useful/" }, { "title": "Level S4 solar radiation event", "points": 198, "url": "https://www.swpc.noaa.gov/news/g4-severe-geomagnetic-storm" }, { "title": "WebAssembly Text Format parser performance", "points": 176, "url": "https://blog.gplane.win/posts/improve-wat-parser-perf.html" } ] } ``` Arrays without limits extract everything visible on the page. Specify limits in your prompt (e.g., "top 5 posts") or the array description to control output size. --- ### Nested objects Extract hierarchical data, such as a product with its pricing and availability: ```python Python result = await client.run_task( prompt="Get product details including pricing and availability", url="https://www.amazon.com/dp/B0EXAMPLE", data_extraction_schema={ "type": "object", "properties": { "product": { "type": "object", "description": "Product information", "properties": { "name": { "type": "string", "description": "Product name" }, "pricing": { "type": "object", "description": "Pricing details", "properties": { "current_price": { "type": "number", "description": "Current price in USD" }, "original_price": { "type": "number", "description": "Original price before discount" }, "discount_percent": { "type": "integer", "description": "Discount percentage" } } }, "availability": { "type": "object", "description": "Stock information", "properties": { "in_stock": { "type": "boolean", "description": "Whether the item is in stock" }, "delivery_estimate": { "type": "string", "description": "Estimated delivery date" } } } } } } } ) ``` ```typescript TypeScript const result = await client.runTask({ body: { prompt: "Get product details including pricing and availability", url: "https://www.amazon.com/dp/B0EXAMPLE", data_extraction_schema: { type: "object", properties: { product: { type: "object", description: "Product information", properties: { name: { type: "string", description: "Product name", }, pricing: { type: "object", description: "Pricing details", properties: { current_price: { type: "number", description: "Current price in USD", }, original_price: { type: "number", description: "Original price before discount", }, discount_percent: { type: "integer", description: "Discount percentage", }, }, }, availability: { type: "object", description: "Stock information", properties: { in_stock: { type: "boolean", description: "Whether the item is in stock", }, delivery_estimate: { type: "string", description: "Estimated delivery date", }, }, }, }, }, }, }, }, }); ``` ```bash cURL curl -X POST "https://api.skyvern.com/v1/run/tasks" \ -H "x-api-key: $SKYVERN_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "prompt": "Get product details including pricing and availability", "url": "https://www.amazon.com/dp/B0EXAMPLE", "data_extraction_schema": { "type": "object", "properties": { "product": { "type": "object", "description": "Product information", "properties": { "name": { "type": "string", "description": "Product name" }, "pricing": { "type": "object", "description": "Pricing details", "properties": { "current_price": { "type": "number", "description": "Current price in USD" }, "original_price": { "type": "number", "description": "Original price before discount" }, "discount_percent": { "type": "integer", "description": "Discount percentage" } } }, "availability": { "type": "object", "description": "Stock information", "properties": { "in_stock": { "type": "boolean", "description": "Whether the item is in stock" }, "delivery_estimate": { "type": "string", "description": "Estimated delivery date" } } } } } } } }' ``` **Output (when completed):** ```json { "product": { "name": "Wireless Bluetooth Headphones", "pricing": { "current_price": 79.99, "original_price": 129.99, "discount_percent": 38 }, "availability": { "in_stock": true, "delivery_estimate": "Tomorrow, Jan 21" } } } ``` --- ## Accessing extracted data The extracted data appears in the `output` field of the completed run. Poll until the task reaches a terminal state, then access the output. ```python Python result = await client.run_task( prompt="Get the top post", url="https://news.ycombinator.com", data_extraction_schema={ "type": "object", "properties": { "title": {"type": "string", "description": "Post title"}, "points": {"type": "integer", "description": "Points"} } } ) run_id = result.run_id while True: run = await client.get_run(run_id) if run.status in ["completed", "failed", "terminated", "timed_out", "canceled"]: break await asyncio.sleep(5) # Access the extracted data print(f"Output: {run.output}") ``` ```typescript TypeScript const result = await client.runTask({ body: { prompt: "Get the top post", url: "https://news.ycombinator.com", data_extraction_schema: { type: "object", properties: { title: { type: "string", description: "Post title" }, points: { type: "integer", description: "Points" }, }, }, }, }); const runId = result.run_id; while (true) { const run = await client.getRun(runId); if (["completed", "failed", "terminated", "timed_out", "canceled"].includes(run.status)) { console.log(`Output: ${JSON.stringify(run.output)}`); break; } await new Promise((resolve) => setTimeout(resolve, 5000)); } ``` ```bash cURL RUN_ID="your_run_id_here" while true; do RESPONSE=$(curl -s -X GET "https://api.skyvern.com/v1/runs/$RUN_ID" \ -H "x-api-key: $SKYVERN_API_KEY") STATUS=$(echo "$RESPONSE" | jq -r '.status') if [[ "$STATUS" == "completed" || "$STATUS" == "failed" || "$STATUS" == "terminated" || "$STATUS" == "timed_out" || "$STATUS" == "canceled" ]]; then echo "$RESPONSE" | jq '.output' break fi sleep 5 done ``` If using webhooks, the same `output` field appears in the webhook payload. --- ## Next steps All available parameters for run_task Execute tasks and retrieve results