# Skyvern Parsing Examples

Examples of using Skyvern to parse various websites.

## Basic commands

### 1. Simple text extraction

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://example.com",
    "navigation_goal": "Navigate to the page and extract heading",
    "data_extraction_goal": "Extract the main h1 heading",
    "proxy_location": "NONE"
  }'
```

### 2. Structured data extraction

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://news.ycombinator.com",
    "navigation_goal": "Extract top stories from Hacker News",
    "data_extraction_goal": "Extract titles, URLs, and points of the top 5 stories",
    "extracted_information_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "url": {"type": "string"},
              "points": {"type": "number"}
            }
          }
        }
      }
    },
    "proxy_location": "NONE",
    "max_steps_per_run": 10
  }'
```

### 3. Search and click

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://www.google.com/search?q=skyvern+github",
    "navigation_goal": "Click on the first GitHub result",
    "data_extraction_goal": "Extract the repository name and description",
    "proxy_location": "NONE",
    "max_steps_per_run": 15
  }'
```

### 4. Filling out a form

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://example.com/contact",
    "navigation_goal": "Fill out contact form with name: John Doe, email: john@example.com, message: Hello from Skyvern",
    "data_extraction_goal": "Extract confirmation message after submit",
    "navigation_payload": {
      "name": "John Doe",
      "email": "john@example.com",
      "message": "Hello from Skyvern"
    },
    "proxy_location": "NONE",
    "max_steps_per_run": 20
  }'
```

## E-commerce examples

### Parsing a product

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://www.amazon.com/dp/PRODUCT_ID",
    "navigation_goal": "Extract product information",
    "data_extraction_goal": "Get product name, price, rating, availability, description",
    "extracted_information_schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "string"},
        "rating": {"type": "number"},
        "availability": {"type": "string"},
        "description": {"type": "string"}
      }
    },
    "proxy_location": "NONE"
  }'
```

### Product search

```bash
curl -X POST http://localhost:8000/api/v1/tasks \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_TOKEN" \
  -d '{
    "url": "https://www.ebay.com",
    "navigation_goal": "Search for \"laptop\" and extract first 10 results",
    "data_extraction_goal": "Extract product titles, prices, seller ratings, and URLs",
    "navigation_payload": {
      "search_query": "laptop"
    },
    "extracted_information_schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "price": {"type": "string"},
              "seller_rating": {"type": "number"},
              "url": {"type": "string"}
            }
          }
        }
      }
    },
    "proxy_location": "NONE",
    "max_steps_per_run": 25
  }'
```

## Checking task status

```bash
# Get the task status
curl http://localhost:8000/api/v1/tasks/TASK_ID \
  -H "x-api-key: YOUR_TOKEN" | python3 -m json.tool

# Get screenshots (if available)
curl http://localhost:8000/api/v1/tasks/TASK_ID/screenshots \
  -H "x-api-key: YOUR_TOKEN"

# Get browser logs
curl http://localhost:8000/api/v1/tasks/TASK_ID/browser_logs \
  -H "x-api-key: YOUR_TOKEN"
```

## Python SDK example

```python
import requests
import json
import time

API_URL = "http://localhost:8000"
API_KEY = "YOUR_TOKEN_HERE"

def create_task(url, navigation_goal, extraction_goal, schema=None):
    """Create a Skyvern task."""
    headers = {
        "Content-Type": "application/json",
        "x-api-key": API_KEY
    }

    payload = {
        "url": url,
        "navigation_goal": navigation_goal,
        "data_extraction_goal": extraction_goal,
        "proxy_location": "NONE"
    }

    if schema:
        payload["extracted_information_schema"] = schema

    response = requests.post(
        f"{API_URL}/api/v1/tasks",
        headers=headers,
        json=payload,
        timeout=30
    )
    # Fail early on HTTP errors instead of treating an error body as a task.
    response.raise_for_status()
    return response.json()

def get_task_status(task_id):
    """Get task status and results."""
    headers = {"x-api-key": API_KEY}
    response = requests.get(
        f"{API_URL}/api/v1/tasks/{task_id}",
        headers=headers,
        timeout=30
    )
    response.raise_for_status()
    return response.json()

def wait_for_task(task_id, timeout=300, poll_interval=5):
    """Poll the task until it completes, fails, or `timeout` seconds elapse."""
    start_time = time.time()

    while time.time() - start_time < timeout:
        status = get_task_status(task_id)

        if status["status"] == "completed":
            return status
        elif status["status"] == "failed":
            raise RuntimeError(f"Task failed: {status.get('failure_reason')}")

        time.sleep(poll_interval)

    raise TimeoutError(f"Task did not complete within {timeout} seconds")

# Example usage
if __name__ == "__main__":
    # Create task
    task = create_task(
        url="https://www.python.org",
        navigation_goal="Extract Python version and features",
        extraction_goal="Get latest Python version and key features list",
        schema={
            "type": "object",
            "properties": {
                "version": {"type": "string"},
                "features": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            }
        }
    )

    task_id = task["task_id"]
    print(f"Created task: {task_id}")

    # Wait for completion
    result = wait_for_task(task_id)

    # Print results
    print("\nExtracted Information:")
    print(json.dumps(result["extracted_information"], indent=2))
```

## Node.js example

```javascript
const axios = require('axios');

const API_URL = 'http://localhost:8000';
const API_KEY = 'YOUR_TOKEN_HERE';

async function createTask(url, navigationGoal, extractionGoal, schema = null) {
  try {
    const response = await axios.post(
      `${API_URL}/api/v1/tasks`,
      {
        url,
        navigation_goal: navigationGoal,
        data_extraction_goal: extractionGoal,
        proxy_location: 'NONE',
        ...(schema && { extracted_information_schema: schema })
      },
      {
        headers: {
          'Content-Type': 'application/json',
          'x-api-key': API_KEY
        }
      }
    );
    return response.data;
  } catch (error) {
    console.error('Error creating task:', error.response?.data || error.message);
    throw error;
  }
}

async function getTaskStatus(taskId) {
  try {
    const response = await axios.get(
      `${API_URL}/api/v1/tasks/${taskId}`,
      {
        headers: { 'x-api-key': API_KEY }
      }
    );
    return response.data;
  } catch (error) {
    console.error('Error getting task status:', error.response?.data || error.message);
    throw error;
  }
}

async function waitForTask(taskId, timeout = 300000, pollInterval = 5000) {
  const startTime = Date.now();

  while (Date.now() - startTime < timeout) {
    const status = await getTaskStatus(taskId);

    if (status.status === 'completed') {
      return status;
    } else if (status.status === 'failed') {
      throw new Error(`Task failed: ${status.failure_reason}`);
    }

    await new Promise(resolve => setTimeout(resolve, pollInterval));
  }

  throw new Error(`Task did not complete within ${timeout}ms`);
}

// Example usage
(async () => {
  try {
    // Create task
    const task = await createTask(
      'https://news.ycombinator.com',
      'Extract top stories',
      'Get titles and URLs of top 5 stories',
      {
        type: 'object',
        properties: {
          stories: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                title: { type: 'string' },
                url: { type: 'string' }
              }
            }
          }
        }
      }
    );

    console.log('Task created:', task.task_id);

    // Wait for completion
    const result = await waitForTask(task.task_id);

    // Print results
    console.log('\nExtracted Information:');
    console.log(JSON.stringify(result.extracted_information, null, 2));
  } catch (error) {
    console.error('Error:', error.message);
  }
})();
```

## n8n integration

Create an HTTP Request node in n8n:

**Settings:**
- Method: `POST`
- URL: `http://localhost:8000/api/v1/tasks`
- Authentication: `Header Auth`
  - Name: `x-api-key`
  - Value: `YOUR_TOKEN`

**Body (JSON):**
```json
{
  "url": "{{$json.url}}",
  "navigation_goal": "{{$json.navigation_goal}}",
  "data_extraction_goal": "{{$json.extraction_goal}}",
  "proxy_location": "NONE"
}
```

Then add a Wait node and another HTTP Request node to check the task status.

## Best Practices

1. **Use `proxy_location: "NONE"`** to route traffic through the system proxy
2. **Always specify `extracted_information_schema`** when you need structured data
3. **Set `max_steps_per_run`** to cap the number of steps a task may take
4. **Use `complete_criterion`** for complex scenarios
5. **Add delays** between requests when parsing in bulk
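
The practices above can be combined in a small helper. The following is a minimal sketch rather than part of Skyvern: `build_payload` and `submit_batch` are hypothetical names, `submit` stands for any caller-supplied function that posts one payload (for example a wrapper around `POST /api/v1/tasks`), and the 2-second default delay is an arbitrary assumption.

```python
import time


def build_payload(url, navigation_goal, extraction_goal,
                  schema=None, max_steps=None, complete_criterion=None):
    """Assemble a task payload; optional fields are added only when given."""
    payload = {
        "url": url,
        "navigation_goal": navigation_goal,
        "data_extraction_goal": extraction_goal,
        "proxy_location": "NONE",
    }
    if schema is not None:
        payload["extracted_information_schema"] = schema
    if max_steps is not None:
        payload["max_steps_per_run"] = max_steps
    if complete_criterion is not None:
        payload["complete_criterion"] = complete_criterion
    return payload


def submit_batch(payloads, submit, delay_seconds=2.0, sleep=time.sleep):
    """Submit payloads one by one, pausing between consecutive requests.

    `submit` is a hypothetical caller-supplied function that sends one
    payload and returns the response.
    """
    results = []
    for index, payload in enumerate(payloads):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first request
        results.append(submit(payload))
    return results
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.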

## Troubleshooting

### Task fails with "Country not supported"

Check that `proxy_location: "NONE"` is set and that `HTTP_PROXY` is configured in `.env`.
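
Both conditions can be checked before submitting a task. This is a rough sketch under stated assumptions: `parse_env` and `proxy_problems` are hypothetical helpers, the parser only understands plain `KEY=VALUE` lines, and the proxy URL in the usage comment is a placeholder.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, if any.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def proxy_problems(payload, env_values):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if payload.get("proxy_location") != "NONE":
        problems.append('proxy_location is not "NONE"')
    if not env_values.get("HTTP_PROXY"):
        problems.append("HTTP_PROXY is missing or empty in .env")
    return problems


# Usage (placeholder values):
# env_values = parse_env('HTTP_PROXY="http://proxy.internal:3128"')
# proxy_problems({"proxy_location": "NONE"}, env_values)
```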

### Task timeout

Increase `max_steps_per_run` or simplify `navigation_goal`.
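
One way to raise the step budget before retrying is sketched below. `with_more_steps` is a hypothetical helper; the doubling factor, the ceiling of 50, and the fallback of 10 steps for payloads without an explicit limit are illustrative assumptions, not documented Skyvern defaults.

```python
from copy import deepcopy


def with_more_steps(payload, factor=2, ceiling=50, assumed_default=10):
    """Return a copy of `payload` with a larger `max_steps_per_run`.

    `assumed_default` is used when the payload has no explicit limit; it is
    a placeholder, not Skyvern's documented default.
    """
    bumped = deepcopy(payload)
    current = bumped.get("max_steps_per_run", assumed_default)
    bumped["max_steps_per_run"] = min(current * factor, ceiling)
    return bumped
```

The original payload is left untouched, so the first attempt and the retry can be compared afterwards.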

### Extraction returns empty data

Refine `data_extraction_goal`: be more specific about what to extract.
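
Knowing which fields came back empty helps decide which part of the goal to sharpen. The sketch below assumes that `extracted_information` is a JSON object whose top-level keys mirror the schema's `properties`; `is_empty` and `missing_fields` are hypothetical helpers.

```python
def is_empty(value):
    """Treat None, empty strings, empty lists, and empty objects as empty."""
    return value is None or value in ("", [], {})


def missing_fields(extracted, schema):
    """List top-level schema properties that are absent or empty."""
    extracted = extracted or {}
    return [
        name
        for name in schema.get("properties", {})
        if is_empty(extracted.get(name))
    ]
```

If, say, `price` and `rating` are reported as missing, naming them explicitly in `data_extraction_goal` (for example "Get the listed price in USD and the average star rating") is a more targeted change than rewriting the whole goal.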

### Auth required pages

Use `totp_verification_url` and `totp_identifier` for 2FA/TOTP.
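
As an illustration, the two fields can be attached to an existing task payload like this. `with_totp` is a hypothetical helper, and the URL and identifier in the usage comment are placeholders; which values your deployment expects depends on how 2FA is set up.

```python
def with_totp(payload, verification_url=None, identifier=None):
    """Return a copy of `payload` with whichever TOTP fields were provided."""
    updated = dict(payload)
    if verification_url is not None:
        updated["totp_verification_url"] = verification_url
    if identifier is not None:
        updated["totp_identifier"] = identifier
    return updated


# Usage (placeholder values):
# with_totp(payload, "https://example.com/totp", "john@example.com")
```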

---

**Author**: GitHub Copilot
**Project**: DOROD / Skyvern Integration
**Updated**: 2026-02-20