# Skyvern Parsing Examples

Examples of using Skyvern to scrape various websites.

## Basic commands

### 1. Simple text extraction

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://example.com",
    "navigation_goal": "Navigate to the page and extract heading",
    "data_extraction_goal": "Extract the main h1 heading",
    "proxy_location": "NONE"
  }'
```

### 2. Structured data extraction

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://news.ycombinator.com",
    "navigation_goal": "Extract top stories from Hacker News",
    "data_extraction_goal": "Extract titles, URLs, and points of top 5 stories",
    "extracted_information_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "url": {"type": "string"},
              "points": {"type": "number"}
            }
          }
        }
      }
    },
    "proxy_location": "NONE",
    "max_steps_per_run": 10
  }'
```

### 3. Search and click

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://www.google.com/search?q=skyvern+github",
    "navigation_goal": "Click on the first GitHub result",
    "data_extraction_goal": "Extract the repository name and description",
    "proxy_location": "NONE",
    "max_steps_per_run": 15
  }'
```

### 4. Form filling

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://example.com/contact",
    "navigation_goal": "Fill out contact form with name: John Doe, email: john@example.com, message: Hello from Skyvern",
    "data_extraction_goal": "Extract confirmation message after submit",
    "navigation_payload": {
      "name": "John Doe",
      "email": "john@example.com",
      "message": "Hello from Skyvern"
    },
    "proxy_location": "NONE",
    "max_steps_per_run": 20
  }'
```

## E-commerce examples

### Scraping a product

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "navigation_goal": "Extract product information",
    "data_extraction_goal": "Get product name, price, rating, availability, and description",
    "extracted_information_schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "string"},
        "rating": {"type": "number"},
        "availability": {"type": "string"},
        "description": {"type": "string"}
      }
    },
    "proxy_location": "NONE"
  }'
```

### Product search

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://www.ebay.com",
    "navigation_goal": "Search for \"laptop\" and extract first 10 results",
    "data_extraction_goal": "Extract product titles, prices, seller ratings, and URLs",
    "navigation_payload": {
      "search_query": "laptop"
    },
    "extracted_information_schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "price": {"type": "string"},
              "seller_rating": {"type": "number"},
              "url": {"type": "string"}
            }
          }
        }
      }
    },
    "proxy_location": "NONE",
    "max_steps_per_run": 25
  }'
```

## Checking task status

```bash
# Get the task status
curl http://localhost:8000/api/v1/tasks/TASK_ID \
  -H "x-api-key: YOUR_TOKEN" | python3 -m json.tool

# Get screenshots (if available)
curl http://localhost:8000/api/v1/tasks/TASK_ID/screenshots \
  -H "x-api-key: YOUR_TOKEN"

# Get browser logs
curl http://localhost:8000/api/v1/tasks/TASK_ID/browser_logs \
  -H "x-api-key: YOUR_TOKEN"
```

## Python example

This example calls the REST API directly with `requests`.

```python
import json
import time

import requests

API_URL = "http://localhost:8000"
API_KEY = "YOUR_TOKEN_HERE"


def create_task(url, navigation_goal, extraction_goal, schema=None):
    """Create a Skyvern task."""
    headers = {
        "Content-Type": "application/json",
        "x-api-key": API_KEY,
    }
    payload = {
        "url": url,
        "navigation_goal": navigation_goal,
        "data_extraction_goal": extraction_goal,
        "proxy_location": "NONE",
    }
    if schema:
        payload["extracted_information_schema"] = schema

    response = requests.post(
        f"{API_URL}/api/v1/tasks",
        headers=headers,
        json=payload,
        timeout=30,
    )
    # Fail early on HTTP errors instead of parsing an error body as a task.
    response.raise_for_status()
    return response.json()


def get_task_status(task_id):
    """Get task status and results."""
    headers = {"x-api-key": API_KEY}
    response = requests.get(
        f"{API_URL}/api/v1/tasks/{task_id}",
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def wait_for_task(task_id, timeout=300, poll_interval=5):
    """Poll until the task completes, fails, or the timeout expires."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        status = get_task_status(task_id)
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"Task failed: {status.get('failure_reason')}")
        time.sleep(poll_interval)
    raise TimeoutError(f"Task did not complete within {timeout} seconds")


# Example usage
if __name__ == "__main__":
    # Create task
    task = create_task(
        url="https://www.python.org",
        navigation_goal="Extract Python version and features",
        extraction_goal="Get latest Python version and key features list",
        schema={
            "type": "object",
            "properties": {
                "version": {"type": "string"},
                "features": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
        },
    )
    task_id = task["task_id"]
    print(f"Created task: {task_id}")

    # Wait for completion
    result = wait_for_task(task_id)

    # Print results
    print("\nExtracted Information:")
    print(json.dumps(result["extracted_information"], indent=2))
```

## Node.js example

```javascript
const axios = require('axios');

const API_URL = 'http://localhost:8000';
const API_KEY = 'YOUR_TOKEN_HERE';

async function createTask(url, navigationGoal, extractionGoal, schema = null) {
  try {
    const response = await axios.post(
      `${API_URL}/api/v1/tasks`,
      {
        url,
        navigation_goal: navigationGoal,
        data_extraction_goal: extractionGoal,
        proxy_location: 'NONE',
        // Include the schema only when one was provided.
        ...(schema && { extracted_information_schema: schema })
      },
      {
        headers: {
          'Content-Type': 'application/json',
          'x-api-key': API_KEY
        }
      }
    );
    return response.data;
  } catch (error) {
    console.error('Error creating task:', error.response?.data || error.message);
    throw error;
  }
}

async function getTaskStatus(taskId) {
  try {
    const response = await axios.get(
      `${API_URL}/api/v1/tasks/${taskId}`,
      { headers: { 'x-api-key': API_KEY } }
    );
    return response.data;
  } catch (error) {
    console.error('Error getting task status:', error.response?.data || error.message);
    throw error;
  }
}

async function waitForTask(taskId, timeout = 300000, pollInterval = 5000) {
  const startTime = Date.now();
  while (Date.now() - startTime < timeout) {
    const status = await getTaskStatus(taskId);
    if (status.status === 'completed') {
      return status;
    } else if (status.status === 'failed') {
      throw new Error(`Task failed: ${status.failure_reason}`);
    }
    await new Promise(resolve => setTimeout(resolve, pollInterval));
  }
  throw new Error(`Task did not complete within ${timeout}ms`);
}

// Example usage
(async () => {
  try {
    // Create task
    const task = await createTask(
      'https://news.ycombinator.com',
      'Extract top stories',
      'Get titles and URLs of top 5 stories',
      {
        type: 'object',
        properties: {
          stories: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                title: { type: 'string' },
                url: { type: 'string' }
              }
            }
          }
        }
      }
    );
    console.log('Task created:', task.task_id);

    // Wait for completion
    const result = await waitForTask(task.task_id);

    // Print results
    console.log('\nExtracted Information:');
    console.log(JSON.stringify(result.extracted_information, null, 2));
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
```

## n8n integration

Create an HTTP Request node in n8n:

**Settings:**
- Method: `POST`
- URL: `http://localhost:8000/api/v1/tasks`
- Authentication: `Header Auth`
  - Name: `x-api-key`
  - Value: `YOUR_TOKEN`

**Body (JSON):**

```json
{
  "url": "{{$json.url}}",
  "navigation_goal": "{{$json.navigation_goal}}",
  "data_extraction_goal": "{{$json.extraction_goal}}",
  "proxy_location": "NONE"
}
```

Then add a Wait node and another HTTP Request node to check the task status.

## Best Practices

1. **Use `proxy_location: "NONE"`** to route traffic through the system proxy
2. **Always specify `extracted_information_schema`** for structured data
3. **Set `max_steps_per_run`** to limit the number of steps
4. **Use `complete_criterion`** for complex scenarios
5. **Add delays** between requests when scraping in bulk

## Troubleshooting

### Task fails with "Country not supported"

Check that `proxy_location: "NONE"` is set and that `HTTP_PROXY` is configured in `.env`.

### Task timeout

Increase `max_steps_per_run` or simplify `navigation_goal`.

### Extraction returns empty data

Improve `data_extraction_goal`: be more specific about what to extract.

### Auth required pages

Use `totp_verification_url` and `totp_identifier` for 2FA/TOTP.

---

**Author**: GitHub Copilot
**Project**: DOROD / Skyvern Integration
**Updated**: 2026-02-20
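
The best practice about adding delays between requests during bulk scraping can be sketched roughly as below. This is a minimal illustration, not part of Skyvern: `submit_with_delay`, its `submit` callable, and the 2-second default are hypothetical names and values chosen for the example. The `submit` argument stands in for whatever function creates a task, such as a wrapper around the `create_task` helper from the Python example.

```python
import time
from typing import Callable, Dict, Iterable, List


def submit_with_delay(
    urls: Iterable[str],
    submit: Callable[[str], Dict],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Submit one task per URL, pausing between consecutive submissions.

    `submit` is a hypothetical callable that takes a URL and returns the
    created task as a dict. `sleep` is injectable so the pacing can be
    checked without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every submission except the first one.
            sleep(delay_seconds)
        results.append(submit(url))
    return results
```

Passing the task-creation function in as an argument keeps the pacing logic independent of the HTTP client, so the same helper could wrap either the `requests`-based code or any other client.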