Web Scraping

Ingest content from websites into your RAG application using our web scraping capabilities.

Overview

Web scraping in RAGaaS allows you to:

  • Ingest content from multiple URLs
  • Process entire sitemaps
  • Crawl websites with configurable rules
  • Extract content with specific selectors

Getting Started

To use web scraping in RAGaaS:

  1. Sign up for a Firecrawl account

    • Firecrawl is our current web scraping provider
    • More providers coming soon
    • Free tier available for testing
  2. Get your API key from Firecrawl

    • Log in to your Firecrawl account
    • Navigate to API Keys section
    • Create a new API key
  3. Update your namespace to include a webScraperConfig:

{
  "webScraperConfig": {
    "provider": "FIRECRAWL",
    "apiKey": "your-firecrawl-api-key"
  }
}
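If you assemble the namespace update programmatically, the config block above can be built like this (a minimal sketch; the helper name and the `FIRECRAWL_API_KEY` environment variable are our conventions, not part of the API):

```python
import os

def build_web_scraper_config(api_key=None):
    """Build the webScraperConfig block for a namespace update.

    Falls back to the FIRECRAWL_API_KEY environment variable
    when no key is passed explicitly.
    """
    key = api_key or os.environ.get("FIRECRAWL_API_KEY")
    if not key:
        raise ValueError("A Firecrawl API key is required")
    return {
        "webScraperConfig": {
            "provider": "FIRECRAWL",
            "apiKey": key,
        }
    }
```

Keeping the key out of source control (environment variable or secret manager) is preferable to hard-coding it.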

Ingestion Methods

RAGaaS provides three methods for web content ingestion:

1. URL List Ingestion

Ingest content from a list of specific URLs:

curl -X POST https://api.ragaas.dev/v1/ingest/urls \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "website",
          "category": "docs"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
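The same request body can be assembled in Python before sending it with any HTTP client. A sketch (the helper name and defaults are ours; the payload shape mirrors the curl example above):

```python
def build_urls_ingest_payload(namespace_id, urls,
                              include=None, exclude=None,
                              metadata=None,
                              chunk_size=1000, chunk_overlap=100):
    """Assemble the JSON body for POST /v1/ingest/urls."""
    config = {"urls": list(urls)}
    if include or exclude:
        config["scrapeOptions"] = {}
        if include:
            config["scrapeOptions"]["includeSelectors"] = include
        if exclude:
            config["scrapeOptions"]["excludeSelectors"] = exclude
    if metadata:
        config["metadata"] = metadata
    return {
        "namespaceId": namespace_id,
        "ingestConfig": {
            "source": "URLS_LIST",
            "config": config,
            "chunkConfig": {
                "chunkSize": chunk_size,
                "chunkOverlap": chunk_overlap,
            },
        },
    }
```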

2. Sitemap Processing

Ingest all URLs from a sitemap:

curl -X POST https://api.ragaas.dev/v1/ingest/sitemap \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
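Before handing a sitemap to the ingestion endpoint, it can be useful to preview which URLs it contains. A small stdlib-only sketch for parsing the standard sitemaps.org format (this runs locally and is not part of the RAGaaS API):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(sitemap_xml):
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```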

3. Website Crawling

Crawl a website with custom rules:

curl -X POST https://api.ragaas.dev/v1/ingest/website \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog/posts"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'

Configuration Options

Scraping Options

Control which parts of the HTML to extract:

  • includeSelectors: CSS selectors for content to include

    {
      "includeSelectors": ["article", "main", ".content"]
    }
    
  • excludeSelectors: CSS selectors for content to exclude

    {
      "excludeSelectors": [".ads", ".navigation", ".footer"]
    }
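A simplified illustration of how the two selector lists interact (this toy matcher handles only bare tag names and `.class` selectors, and the exclusions-win precedence is our assumption; real CSS matching is far richer):

```python
def matches(selector, tag, classes=()):
    """Check one simplified selector: 'article' matches a tag
    name, '.ads' matches a class name."""
    if selector.startswith("."):
        return selector[1:] in classes
    return selector == tag

def should_extract(tag, classes, include, exclude):
    """An element is kept if it matches an include selector and
    no exclude selector (exclusions take precedence here)."""
    if any(matches(s, tag, classes) for s in exclude):
        return False
    return any(matches(s, tag, classes) for s in include)
```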
    

Website Crawling Options

Configure how the website is crawled:

  • maxDepth: How many links deep to crawl (1-10)
  • maxLinks: Maximum number of URLs to process
  • includePaths: URL paths to include (e.g., ["/docs", "/blog"])
  • excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
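The path rules above can be pictured as a filter applied to each discovered URL. A sketch of one plausible precedence (exclusions win, then the path must fall under an include prefix; verify the provider's documented behavior before relying on this):

```python
from urllib.parse import urlparse

def url_allowed(url, include_paths=None, exclude_paths=None):
    """Decide whether a crawled URL passes the path rules."""
    path = urlparse(url).path
    # Exclusions take precedence over inclusions.
    if exclude_paths and any(path.startswith(p) for p in exclude_paths):
        return False
    # With include paths set, the URL must match one of them.
    if include_paths:
        return any(path.startswith(p) for p in include_paths)
    return True
```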

Chunking Options

Control how content is split:

{
  "chunkConfig": {
    "chunkSize": 1000,
    "chunkOverlap": 100
  }
}
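To illustrate what the two parameters mean, here is a character-level sketch of overlapping windows (RAGaaS's actual splitter may be token- or sentence-aware; this is only a mental model):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows, each
    overlapping the previous one by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Larger overlaps preserve more context across chunk boundaries at the cost of storing more redundant text.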

Tracking Progress

All ingestion methods are asynchronous and return an ingestJobRunId:

{
  "success": true,
  "message": "Added your urls ingestion request to the queue",
  "data": {
    "ingestJobRunId": "ijr_abc123"
  }
}

Check ingestion status:

curl "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=your-namespace" \
  -H "Authorization: Bearer $RAGAAS_API_KEY"

Response:

{
  "success": true,
  "message": "Fetched ingest job run details successfully",
  "data": {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
      "queued": [...],
      "processing": [...],
      "completed": [...],
      "failed": [...]
    }
  }
}
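A helper like the following (ours, not part of the API) can condense that response into a progress summary for a polling loop; the field names mirror the response shape shown above:

```python
def summarize_job_run(data):
    """Condense an ingest-job-run response's data object into
    per-state document counts and a completion ratio."""
    docs = data.get("documents", {})
    counts = {state: len(docs.get(state, []))
              for state in ("queued", "processing", "completed", "failed")}
    total = sum(counts.values())
    done = counts["completed"] + counts["failed"]
    return {
        "status": data.get("status"),
        "counts": counts,
        "progress": done / total if total else 0.0,
    }
```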

Best Practices

Content Selection

  1. Use specific CSS selectors to target main content
  2. Exclude navigation, footers, and ads
  3. Test selectors on sample pages first

URL Management

  1. Start with small URL lists
  2. Use path patterns to focus crawling
  3. Set reasonable maxLinks limits

Resource Usage

  1. Monitor ingestion job status
  2. Use appropriate chunk sizes
  3. Consider rate limits and quotas

Limitations

Current capabilities and limits of the Firecrawl integration:

  • JavaScript rendering: Supported
  • Cookie handling: Supported
  • Custom headers: Supported
  • Rate limits: Determined by your Firecrawl plan