Web Scraping

Ingest content from websites into your RAG application using our web scraping capabilities.

Overview

Web scraping in RAGaaS allows you to:

  • Ingest content from multiple URLs
  • Process entire sitemaps
  • Crawl websites with configurable rules
  • Extract content with specific selectors

Getting Started

To use web scraping in RAGaaS:

  1. Sign up for a Firecrawl account

    • Firecrawl is our current web scraping provider
    • More providers coming soon
    • Free tier available for testing
  2. Get your API key from Firecrawl

    • Log in to your Firecrawl account
    • Navigate to API Keys section
    • Create a new API key
  3. Update your namespace to include a webScraperConfig:

{
  "webScraperConfig": {
    "provider": "FIRECRAWL",
    "apiKey": "your-firecrawl-api-key"
  }
}
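If you assemble the namespace update programmatically, the config block above can be built like this (a minimal sketch; the helper name and the `FIRECRAWL_API_KEY` environment variable are our conventions, not part of the API):

```python
import os

def build_web_scraper_config(api_key=None):
    """Build the webScraperConfig block for a namespace update.

    Falls back to the FIRECRAWL_API_KEY environment variable
    when no key is passed explicitly.
    """
    key = api_key or os.environ.get("FIRECRAWL_API_KEY")
    if not key:
        raise ValueError("A Firecrawl API key is required")
    return {
        "webScraperConfig": {
            "provider": "FIRECRAWL",
            "apiKey": key,
        }
    }
```

Keeping the key out of source control (environment variable or secret manager) is preferable to hard-coding it.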

Ingestion Methods

RAGaaS provides three methods for web content ingestion:

1. URL List Ingestion

Ingest content from a list of specific URLs:

curl -X POST https://api.ragaas.dev/v1/ingest/urls \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "website",
          "category": "docs"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
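The same request body can be assembled in Python before sending it with any HTTP client. A sketch (the helper name and defaults are ours; the payload shape mirrors the curl example above):

```python
def build_urls_ingest_payload(namespace_id, urls,
                              include=None, exclude=None,
                              metadata=None,
                              chunk_size=1000, chunk_overlap=100):
    """Assemble the JSON body for POST /v1/ingest/urls."""
    config = {"urls": list(urls)}
    if include or exclude:
        config["scrapeOptions"] = {}
        if include:
            config["scrapeOptions"]["includeSelectors"] = include
        if exclude:
            config["scrapeOptions"]["excludeSelectors"] = exclude
    if metadata:
        config["metadata"] = metadata
    return {
        "namespaceId": namespace_id,
        "ingestConfig": {
            "source": "URLS_LIST",
            "config": config,
            "chunkConfig": {
                "chunkSize": chunk_size,
                "chunkOverlap": chunk_overlap,
            },
        },
    }
```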

2. Sitemap Processing

Ingest all URLs from a sitemap:

curl -X POST https://api.ragaas.dev/v1/ingest/sitemap \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
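Before handing a sitemap to the ingestion endpoint, it can be useful to preview which URLs it contains. A small stdlib-only sketch for parsing the standard sitemaps.org format (this runs locally and is not part of the RAGaaS API):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(sitemap_xml):
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```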

3. Website Crawling

Crawl a website with custom rules:

curl -X POST https://api.ragaas.dev/v1/ingest/website \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog/posts"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'

Configuration Options

Scraping Options

Control which parts of the HTML to extract:

  • includeSelectors: CSS selectors for content to include

    {
      "includeSelectors": ["article", "main", ".content"]
    }
    
  • excludeSelectors: CSS selectors for content to exclude

    {
      "excludeSelectors": [".ads", ".navigation", ".footer"]
    }
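A simplified illustration of how the two selector lists interact (this toy matcher handles only bare tag names and `.class` selectors, and the exclusions-win precedence is our assumption; real CSS matching is far richer):

```python
def matches(selector, tag, classes=()):
    """Check one simplified selector: 'article' matches a tag
    name, '.ads' matches a class name."""
    if selector.startswith("."):
        return selector[1:] in classes
    return selector == tag

def should_extract(tag, classes, include, exclude):
    """An element is kept if it matches an include selector and
    no exclude selector (exclusions take precedence here)."""
    if any(matches(s, tag, classes) for s in exclude):
        return False
    return any(matches(s, tag, classes) for s in include)
```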
    

Website Crawling Options

Configure how the website is crawled:

  • maxDepth: How many links deep to crawl (1-10)
  • maxLinks: Maximum number of URLs to process
  • includePaths: URL paths to include (e.g., ["/docs", "/blog"])
  • excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
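The path rules above can be pictured as a filter applied to each discovered URL. A sketch of one plausible precedence (exclusions win, then the path must fall under an include prefix; verify the provider's documented behavior before relying on this):

```python
from urllib.parse import urlparse

def url_allowed(url, include_paths=None, exclude_paths=None):
    """Decide whether a crawled URL passes the path rules."""
    path = urlparse(url).path
    # Exclusions take precedence over inclusions.
    if exclude_paths and any(path.startswith(p) for p in exclude_paths):
        return False
    # With include paths set, the URL must match one of them.
    if include_paths:
        return any(path.startswith(p) for p in include_paths)
    return True
```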

Chunking Options

Control how content is split:

{
  "chunkConfig": {
    "chunkSize": 1000,
    "chunkOverlap": 100
  }
}
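To illustrate what the two parameters mean, here is a character-level sketch of overlapping windows (RAGaaS's actual splitter may be token- or sentence-aware; this is only a mental model):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows, each
    overlapping the previous one by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Larger overlaps preserve more context across chunk boundaries at the cost of storing more redundant text.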

Tracking Progress

All ingestion methods are asynchronous and return an ingestJobRunId:

{
  "success": true,
  "message": "Added your urls ingestion request to the queue",
  "data": {
    "ingestJobRunId": "ijr_abc123"
  }
}

Check ingestion status:

curl "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=your-namespace" \
  -H "Authorization: Bearer $RAGAAS_API_KEY"

Response:

{
  "success": true,
  "message": "Fetched ingest job run details successfully",
  "data": {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
      "queued": [...],
      "processing": [...],
      "completed": [...],
      "failed": [...]
    }
  }
}
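A helper like the following (ours, not part of the API) can condense that response into a progress summary for a polling loop; the field names mirror the response shape shown above:

```python
def summarize_job_run(data):
    """Condense an ingest-job-run response's data object into
    per-state document counts and a completion ratio."""
    docs = data.get("documents", {})
    counts = {state: len(docs.get(state, []))
              for state in ("queued", "processing", "completed", "failed")}
    total = sum(counts.values())
    done = counts["completed"] + counts["failed"]
    return {
        "status": data.get("status"),
        "counts": counts,
        "progress": done / total if total else 0.0,
    }
```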

Best Practices

Content Selection

  1. Use specific CSS selectors to target main content
  2. Exclude navigation, footers, and ads
  3. Test selectors on sample pages first

URL Management

  1. Start with small URL lists
  2. Use path patterns to focus crawling
  3. Set reasonable maxLinks limits

Resource Usage

  1. Monitor ingestion job status
  2. Use appropriate chunk sizes
  3. Consider rate limits and quotas

Limitations

Current capabilities and limits of the Firecrawl integration:

  • JavaScript rendering: Supported
  • Cookie handling: Supported
  • Custom headers: Supported
  • Rate limits: Determined by your Firecrawl plan