Web Scraping

Ingest content from websites into your RAG application using our web scraping capabilities.

Overview

Web scraping in RAGaaS allows you to:

  • Ingest content from multiple URLs
  • Process entire sitemaps
  • Crawl websites with configurable rules
  • Extract content with specific selectors

Getting Started

To use web scraping in RAGaaS, you need to configure a web scraper provider. We support three options:

  1. Firecrawl

    • Advanced web scraping provider
    • Free tier available for testing
  2. Jina Reader API

    • Free web scraping alternative
    • Simple to get started
  3. ScrapingBee

    • Reliable web scraping with JavaScript rendering
    • Free tier available for testing

Choose your preferred provider and update your namespace configuration with the appropriate API key:

# The "provider" field can be:
# - "FIRECRAWL"
# - "JINA"
# - "SCRAPINGBEE"

curl -X PATCH https://api.ragaas.dev/v1/namespaces/ns_123 \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "webScraperConfig": {
      "provider": "FIRECRAWL",
      "apiKey": "new-api-key"
    }
  }'

Ingestion Methods

RAGaaS provides three methods for web content ingestion:

1. URL List Ingestion

Ingest content from a list of specific URLs:

curl -X POST https://api.ragaas.dev/v1/ingest/urls \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "website",
          "category": "docs"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'

2. Sitemap Processing

Ingest all URLs from a sitemap:

curl -X POST https://api.ragaas.dev/v1/ingest/sitemap \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "maxLinks": 1000,
        "includePaths": ["/docs"],
        "excludePaths": ["/docs/internal"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'

3. Website Crawling

Crawl a website with custom rules:

curl -X POST https://api.ragaas.dev/v1/ingest/website \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog/posts"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'

Configuration Options

Scraping Options

Control which parts of the HTML to extract:

  • includeSelectors: CSS selectors for content to include

    {
      "includeSelectors": ["article", "main", ".content"]
    }
    
  • excludeSelectors: CSS selectors for content to exclude

    {
      "excludeSelectors": [".ads", ".navigation", ".footer"]
    }
    

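Both options can be combined in a single scrapeOptions object, as in the ingestion requests above:

{
  "scrapeOptions": {
    "includeSelectors": ["article", "main", ".content"],
    "excludeSelectors": [".ads", ".navigation", ".footer"]
  }
}
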
Sitemap Processing Options

Configure how the sitemap is processed:

  • maxLinks: Maximum number of URLs to process
  • includePaths: URL paths to include (e.g., ["/docs", "/blog"])
  • excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
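
For example, to process up to 1000 documentation URLs while skipping internal pages (values are illustrative):

{
  "url": "https://example.com/sitemap.xml",
  "maxLinks": 1000,
  "includePaths": ["/docs"],
  "excludePaths": ["/docs/internal"]
}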

Website Crawling Options

Configure how the website is crawled:

  • maxDepth: How many links deep to crawl (1-10)
  • maxLinks: Maximum number of URLs to process
  • includePaths: URL paths to include (e.g., ["/docs", "/blog"])
  • excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
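
For example, to crawl documentation and blog posts three levels deep while avoiding admin pages (values are illustrative):

{
  "url": "https://example.com",
  "maxDepth": 3,
  "maxLinks": 100,
  "includePaths": ["/docs", "/blog/posts"],
  "excludePaths": ["/admin"]
}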

Chunking Options

Control how scraped content is split into chunks. chunkSize caps the size of each chunk, while chunkOverlap sets how much content adjacent chunks share, preserving context across chunk boundaries:

{
  "chunkConfig": {
    "chunkSize": 1000,
    "chunkOverlap": 100
  }
}

Tracking Progress

All ingestion methods are asynchronous and return an ingestJobRunId that you can use to track progress:

{
  "success": true,
  "message": "Added your urls ingestion request to the queue",
  "data": {
    "ingestJobRunId": "ijr_abc123"
  }
}

Check ingestion status:

curl "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=your-namespace-identifier" \
  -H "Authorization: Bearer $RAGAAS_API_KEY"

Response:

{
  "success": true,
  "message": "Fetched ingest job run details successfully",
  "data": {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
      "queued": [...],
      "processing": [...],
      "completed": [...],
      "failed": [...]
    }
  }
}
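
Because ingestion runs asynchronously, you will typically poll this endpoint until the run reaches a terminal state. Below is a minimal polling sketch in shell; it assumes jq is installed and that COMPLETED and FAILED are the terminal status values (consult the API reference for the exact status enum):

JOB_ID="ijr_abc123"
NAMESPACE_ID="your-namespace-identifier"

# Poll every 5 seconds until the job leaves its queued/processing states
while true; do
  STATUS=$(curl -s "https://api.ragaas.dev/v1/ingest-job-runs/$JOB_ID?namespaceId=$NAMESPACE_ID" \
    -H "Authorization: Bearer $RAGAAS_API_KEY" | jq -r '.data.status')
  echo "Current status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED) break ;;  # assumed terminal statuses
  esac
  sleep 5
done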

Best Practices

Content Selection

  1. Use specific CSS selectors to target main content
  2. Exclude navigation, footers, and ads
  3. Test selectors on sample pages first

URL Management

  1. Start with small URL lists
  2. Use path patterns to focus crawling
  3. Set reasonable maxLinks limits

Resource Usage

  1. Monitor ingestion job status
  2. Use appropriate chunk sizes
  3. Consider rate limits and quotas