Web Scraping
Ingest content from websites into your RAG application using our web scraping capabilities.
Overview
Web scraping in RAGaaS allows you to:
- Ingest content from multiple URLs
- Process entire sitemaps
- Crawl websites with configurable rules
- Extract content with specific selectors
Getting Started
To use web scraping in RAGaaS:
1. Sign up for a Firecrawl account
   - Firecrawl is our current web scraping provider
   - More providers are coming soon
   - A free tier is available for testing
2. Get your API key from Firecrawl
   - Log in to your Firecrawl account
   - Navigate to the API Keys section
   - Create a new API key
3. Update your namespace to include your webScraperConfig:
{
  "webScraperConfig": {
    "provider": "FIRECRAWL",
    "apiKey": "your-firecrawl-api-key"
  }
}
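As a rough sketch, attaching this config to an existing namespace might look like the call below; the endpoint and method shown (PATCH /v1/namespaces/{namespaceId}) are assumptions, so confirm them against the namespace API reference:

curl -X PATCH https://api.ragaas.dev/v1/namespaces/your-namespace \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "webScraperConfig": {
      "provider": "FIRECRAWL",
      "apiKey": "your-firecrawl-api-key"
    }
  }'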
Ingestion Methods
RAGaaS provides three methods for web content ingestion:
1. URL List Ingestion
Ingest content from a list of specific URLs:
curl -X POST https://api.ragaas.dev/v1/ingest/urls \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "website",
          "category": "docs"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
2. Sitemap Processing
Ingest all URLs from a sitemap:
curl -X POST https://api.ragaas.dev/v1/ingest/sitemap \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
3. Website Crawling
Crawl a website with custom rules:
curl -X POST https://api.ragaas.dev/v1/ingest/website \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog/posts"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
Configuration Options
Scraping Options
Control which parts of the HTML to extract:
- includeSelectors: CSS selectors for content to include
  { "includeSelectors": ["article", "main", ".content"] }
- excludeSelectors: CSS selectors for content to exclude
  { "excludeSelectors": [".ads", ".navigation", ".footer"] }
Website Crawling Options
Configure how the website is crawled; a combined example follows the list:
- maxDepth: How many links deep to crawl (1-10)
- maxLinks: Maximum number of URLs to process
- includePaths: URL paths to include (e.g., ["/docs", "/blog"])
- excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
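Putting these together, the crawl portion of a config might look like this (values are illustrative):

{
  "url": "https://example.com",
  "maxDepth": 2,
  "maxLinks": 50,
  "includePaths": ["/docs", "/blog"],
  "excludePaths": ["/admin", "/private"]
}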
Chunking Options
Control how content is split:
{
  "chunkConfig": {
    "chunkSize": 1000,
    "chunkOverlap": 100
  }
}
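With these settings, consecutive chunks share 100 units of content. Assuming sizes are measured in characters (check the chunking reference for the exact unit), a 2,800-character page would split roughly into chunks covering positions 0-1000, 900-1900, and 1800-2800. The overlap helps preserve context that would otherwise be cut at chunk boundaries.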
Tracking Progress
All ingestion methods are asynchronous and return an ingestJobRunId:
{
  "success": true,
  "message": "Added your urls ingestion request to the queue",
  "data": {
    "ingestJobRunId": "ijr_abc123"
  }
}
Check ingestion status:
curl "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=your-namespace" \
  -H "Authorization: Bearer $RAGAAS_API_KEY"
Response:
{
  "success": true,
  "message": "Fetched ingest job run details successfully",
  "data": {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
      "queued": [...],
      "processing": [...],
      "completed": [...],
      "failed": [...]
    }
  }
}
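Because ingestion runs asynchronously, a common pattern is to poll this endpoint until the run reaches a terminal state. A minimal polling sketch, assuming jq is installed; the exact status values (PROCESSING, QUEUED, COMPLETED, FAILED) are assumptions beyond the PROCESSING value shown above, so check the API reference:

#!/usr/bin/env bash
# Poll an ingest job run until it leaves the queued/processing states.
JOB_ID="ijr_abc123"
NAMESPACE="your-namespace"

while true; do
  STATUS=$(curl -s "https://api.ragaas.dev/v1/ingest-job-runs/$JOB_ID?namespaceId=$NAMESPACE" \
    -H "Authorization: Bearer $RAGAAS_API_KEY" | jq -r '.data.status')
  echo "status: $STATUS"
  case "$STATUS" in
    PROCESSING|QUEUED) sleep 5 ;;  # assumed non-terminal states; keep waiting
    *) break ;;                    # anything else (e.g., COMPLETED, FAILED) ends the loop
  esac
done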
Best Practices
Content Selection
- Use specific CSS selectors to target main content
- Exclude navigation, footers, and ads
- Test selectors on sample pages first, for example with a one-off scrape as sketched below
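One way to test selectors is to scrape a single page directly through Firecrawl and inspect the result before running a full ingest. A rough sketch, assuming Firecrawl's v1 scrape endpoint and its includeTags/excludeTags options (verify against Firecrawl's current API documentation):

curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Authorization: Bearer $FIRECRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page1",
    "formats": ["markdown"],
    "includeTags": ["article", "main"],
    "excludeTags": [".navigation", ".footer"]
  }'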
URL Management
- Start with small URL lists
- Use path patterns to focus crawling
- Set reasonable maxLinks limits
Resource Usage
- Monitor ingestion job status
- Use appropriate chunk sizes
- Consider rate limits and quotas
Limitations
Current capabilities and limits of the Firecrawl integration:
- JavaScript rendering: Supported
- Cookie handling: Supported
- Custom headers: Supported
- Rate limits: Based on Firecrawl plan