Data Ingestion

RAGaaS provides flexible ways to ingest your documents and content, with real-time status tracking and detailed processing feedback. All content is securely processed before being stored in your own infrastructure.

Overview

Data ingestion is the first step in building your RAG application. Before ingesting documents:

  1. Create a namespace with your infrastructure configuration (see Namespaces Guide)
  2. Choose an appropriate ingestion method for your content
  3. Prepare any metadata you want to attach to your documents

Ingestion Methods

RAGaaS offers seven main methods for ingesting content:

1. Text Ingestion

Ingest raw text content directly. Perfect for:

  • API responses
  • Chat messages
  • Generated content
  • Code snippets

Example:

curl -X POST https://api.ragaas.dev/v1/ingest/text \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "TEXT",
      "config": {
        "text": "RAGaaS is a developer-first platform...",
        "metadata": {
          "source": "api",
          "timestamp": "2024-01-15T12:00:00Z"
        }
      }
    }
  }'
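
The same request can be assembled programmatically. The following is a minimal Python sketch that assumes only the endpoint and body shape shown in the curl example above. The helper name build_text_ingest_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.ragaas.dev/v1"  # base URL taken from the curl example


def build_text_ingest_request(api_key, namespace_id, text, metadata):
    """Build (but do not send) a POST request for /ingest/text.

    Hypothetical helper: it mirrors the JSON body of the curl example.
    """
    body = {
        "namespaceId": namespace_id,
        "ingestConfig": {
            "source": "TEXT",
            "config": {"text": text, "metadata": metadata},
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/ingest/text",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_text_ingest_request(
    "YOUR_API_KEY",
    "ns_abc123",
    "RAGaaS is a developer-first platform...",
    {"source": "api", "timestamp": "2024-01-15T12:00:00Z"},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would submit it. That step is left out so the sketch runs without network access.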

2. File Ingestion

Directly upload files from your local machine or cloud storage. Supports:

  • PDF documents (.pdf)
  • Text-based formats (.csv, .json, .xml, .txt, .md, etc.)
  • Word documents (.docx, .doc)
  • Excel documents (.xlsx, .xls)
  • PowerPoint documents (.pptx, .ppt)
  • ZIP files (.zip)
  • And more common document formats

Example:

curl -X POST https://api.ragaas.dev/v1/ingest/file \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F 'namespaceId="ns_abc123"' \
  -F 'file=@"/path/to/document.pdf"' \
  -F 'metadata="{\"title\": \"Sample Document\", \"category\": \"documentation\"}"'
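
In the curl example, metadata travels as a JSON-encoded string inside a form field, which is easy to get wrong when quoting by hand. The sketch below prepares the three form fields in Python. The helper name build_file_form_fields is hypothetical, and actually sending the multipart request is left to whichever HTTP client you use.

```python
import json
from pathlib import Path


def build_file_form_fields(namespace_id, file_path, metadata):
    """Return the three multipart form fields used by the curl example.

    Hypothetical helper: 'file' is returned as a Path so that the HTTP
    client of your choice can open and stream it.
    """
    return {
        "namespaceId": namespace_id,
        "file": Path(file_path),
        # The metadata form field carries a JSON-encoded string.
        "metadata": json.dumps(metadata),
    }


fields = build_file_form_fields(
    "ns_abc123",
    "/path/to/document.pdf",
    {"title": "Sample Document", "category": "documentation"},
)
print(fields["metadata"])
```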

3. Web Content Ingestion

RAGaaS provides three powerful methods for ingesting web content:

  • URL List: Process specific web pages or documents
  • Sitemap: Automatically process all pages in a sitemap
  • Website: Intelligently crawl all pages in a website with custom rules

Example URL list ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/urls \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/docs/guide.pdf",
          "https://example.com/docs/api.html"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "public_docs",
          "lastUpdated": "2024-01-15"
        }
      }
    }
  }'
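
When the URL list is generated by code, it can help to build the body from a function and catch malformed URLs before submitting. This is a sketch under assumptions: the helper name is hypothetical, the URL check is a client-side convenience only, and treating scrapeOptions and metadata as optional is a guess rather than something this guide states.

```python
from urllib.parse import urlparse


def build_urls_ingest_body(namespace_id, urls, include_selectors=None,
                           exclude_selectors=None, metadata=None):
    """Return the JSON body for /ingest/urls as a dict (hypothetical helper)."""
    for url in urls:
        parsed = urlparse(url)
        # Client-side sanity check only; the service does its own validation.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")

    config = {"urls": list(urls)}

    # Assumption: scrapeOptions and metadata may be omitted when empty.
    scrape_options = {}
    if include_selectors:
        scrape_options["includeSelectors"] = list(include_selectors)
    if exclude_selectors:
        scrape_options["excludeSelectors"] = list(exclude_selectors)
    if scrape_options:
        config["scrapeOptions"] = scrape_options
    if metadata:
        config["metadata"] = dict(metadata)

    return {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": "URLS_LIST", "config": config},
    }


body = build_urls_ingest_body(
    "ns_abc123",
    ["https://example.com/docs/guide.pdf", "https://example.com/docs/api.html"],
    include_selectors=["article", "main"],
    exclude_selectors=[".navigation", ".footer"],
    metadata={"source": "public_docs", "lastUpdated": "2024-01-15"},
)
print(body["ingestConfig"]["config"]["urls"])
```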

4. Notion Ingestion

Connect your Notion workspace to RAGaaS and automatically ingest all your selected pages and databases.

Example Notion ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/notion \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "NOTION",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "public_docs",
          "lastUpdated": "2024-01-15"
        }
      }
    }
  }'

5. Google Drive Ingestion

Connect your Google Drive account to RAGaaS and automatically ingest all your selected files.

Example Google Drive ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/google-drive \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "GOOGLE_DRIVE",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "google_drive",
          "workspace": "My Workspace"
        }
      }
    }
  }'

6. Dropbox Ingestion

Connect your Dropbox account to RAGaaS and automatically ingest all your selected files.

Example Dropbox ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/dropbox \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "DROPBOX",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "dropbox",
          "workspace": "My Workspace"
        }
      }
    }
  }'

7. OneDrive Ingestion

Connect your OneDrive account to RAGaaS and automatically ingest all your selected files.

Example OneDrive ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/onedrive \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "ONEDRIVE",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "onedrive",
          "workspace": "My Workspace"
        }
      }
    }
  }'
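
The four connector-based examples (Notion, Google Drive, Dropbox, OneDrive) share one body shape and differ only in endpoint path and source value. The Python sketch below captures that pattern. The lookup keys and the helper name are hypothetical, while the paths and source values are copied from the curl examples above.

```python
# Endpoint path and "source" value for each connector, copied from the
# curl examples in sections 4 to 7. The dictionary keys are arbitrary labels.
CONNECTORS = {
    "notion": ("/ingest/notion", "NOTION"),
    "google-drive": ("/ingest/google-drive", "GOOGLE_DRIVE"),
    "dropbox": ("/ingest/dropbox", "DROPBOX"),
    "onedrive": ("/ingest/onedrive", "ONEDRIVE"),
}


def build_connector_ingest(connector, namespace_id, connection_id, metadata):
    """Return (path, body) for a connector-based ingestion request."""
    if connector not in CONNECTORS:
        raise ValueError(f"unknown connector: {connector!r}")
    path, source = CONNECTORS[connector]
    body = {
        "namespaceId": namespace_id,
        "ingestConfig": {
            "source": source,
            "config": {"connectionId": connection_id, "metadata": metadata},
        },
    }
    return path, body


path, body = build_connector_ingest(
    "dropbox", "ns_abc123", "conn_abc123",
    {"source": "dropbox", "workspace": "My Workspace"},
)
print(path, body["ingestConfig"]["source"])
```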

Processing & Chunking

RAGaaS processes your content intelligently using the configuration from your namespace:

Document Processing

  1. Content is extracted from the source
  2. Text is split into chunks based on your namespace's embedding model limits
  3. Chunks are embedded using your configured embedding model
  4. Embeddings are stored in your vector database
  5. Original content is stored in your file storage

Metadata

You can attach custom metadata to your documents during ingestion. This metadata can be used for:

  • Filtering during search
  • Content organization
  • Version tracking
  • Source identification

Example metadata:

{
  "department": "engineering",
  "docType": "technical",
  "language": "en",
  "version": "2.0",
  "source": "knowledge-base",
  "lastUpdated": "2024-01-15T12:00:00Z"
}

Processing Flow

All ingestion requests are processed asynchronously:

  1. Request accepted (QUEUED) - You receive an ingestJobRunId
  2. Initial setup (PRE_PROCESSING) - Validating configuration and setup
  3. Content processing (PROCESSING) - Documents are being processed
  4. Final status (COMPLETED) - All documents have been processed, successfully or with failures
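
Client code often needs to know whether a job has reached a given stage or has finished. The four stages above can be modelled as an ordered enumeration. This is an illustrative sketch: the class and function names are hypothetical, and it assumes a job only moves forward through the stages in the order listed.

```python
from enum import Enum


class JobStatus(str, Enum):
    # Members are declared in the order the stages are listed above.
    QUEUED = "QUEUED"
    PRE_PROCESSING = "PRE_PROCESSING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"


STAGE_ORDER = list(JobStatus)  # iteration follows declaration order


def is_terminal(status):
    """True once the job has reached its final status."""
    return JobStatus(status) is JobStatus.COMPLETED


def has_reached(current, target):
    """True if `current` is at or past `target` in the stage order."""
    return STAGE_ORDER.index(JobStatus(current)) >= STAGE_ORDER.index(JobStatus(target))


print(is_terminal("PROCESSING"))
print(has_reached("PROCESSING", "PRE_PROCESSING"))
```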

Content Chunking

Your content is automatically chunked based on your namespace's embedding model:

OpenAI Models

  • Default chunk size: 1000 tokens (good balance between context and specificity)
  • Default overlap: 100 tokens (10% overlap for context continuity)
  • Maximum chunk size: 8000 tokens

Cohere Models

  • Default chunk size: 350 tokens (optimized for model limits)
  • Default overlap: 50 tokens (about 14% overlap for context preservation)
  • Maximum chunk size: 450 tokens

Jina Models

  • Default chunk size: 1000 tokens
  • Default overlap: 100 tokens
  • Maximum chunk size: 8000 tokens
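
To get a feel for what these defaults mean, the sketch below estimates how a document of a given token length would be cut. It assumes a plain sliding window with a fixed stride of chunk size minus overlap. That is an illustrative simplification, not a description of the RAGaaS chunker, which may split on different boundaries. The provider keys and function name are hypothetical.

```python
# Defaults copied from the lists above; the dictionary keys are arbitrary labels.
DEFAULTS = {
    "openai": {"chunk_size": 1000, "overlap": 100, "max_chunk_size": 8000},
    "cohere": {"chunk_size": 350, "overlap": 50, "max_chunk_size": 450},
    "jina": {"chunk_size": 1000, "overlap": 100, "max_chunk_size": 8000},
}


def chunk_spans(total_tokens, chunk_size, overlap):
    """Return (start, end) token offsets for fixed-stride sliding windows.

    Assumption: each chunk starts (chunk_size - overlap) tokens after the
    previous one. A real chunker may behave differently.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        spans.append((start, end))
        if end == total_tokens:
            break  # the last window already reaches the end of the text
        start += step
    return spans


for provider, cfg in DEFAULTS.items():
    spans = chunk_spans(2500, cfg["chunk_size"], cfg["overlap"])
    print(provider, len(spans), spans[-1])
```

Under this model a 2500-token document yields 3 chunks with the OpenAI or Jina defaults and 9 chunks with the Cohere defaults.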

You can override these defaults by providing a chunkConfig in your request. Smaller chunks give more precise retrieval, and chunkOverlap must be less than chunkSize:

{
  "chunkConfig": {
    "chunkSize": 500,
    "chunkOverlap": 50
  }
}
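
A chunkConfig can be checked on the client before it is sent. The sketch below enforces the two constraints stated in this section: the overlap must be smaller than the chunk size, and the chunk size must not exceed the documented maximum for the embedding provider. The provider keys and function name are hypothetical, and the service may apply additional rules.

```python
# Maximum chunk sizes copied from the lists above; keys are arbitrary labels.
MAX_CHUNK_SIZE = {"openai": 8000, "cohere": 450, "jina": 8000}


def validate_chunk_config(provider, chunk_size, chunk_overlap):
    """Return a chunkConfig fragment, or raise ValueError if it breaks a rule."""
    max_size = MAX_CHUNK_SIZE[provider]
    if not 0 < chunk_size <= max_size:
        raise ValueError(
            f"chunkSize must be between 1 and {max_size} for {provider}"
        )
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunkOverlap must be >= 0 and less than chunkSize")
    return {"chunkConfig": {"chunkSize": chunk_size, "chunkOverlap": chunk_overlap}}


print(validate_chunk_config("openai", 500, 50))
```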

Processing Status

Because ingestion requests are processed asynchronously, monitor each job (for example a URL list, sitemap, or website crawl) using the ingestJobRunId returned when the request was accepted:

curl -X GET "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=ns_abc123" \
  -H "Authorization: Bearer $RAGAAS_API_KEY"
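
In practice this endpoint is polled until the job reports COMPLETED. The following is a minimal polling sketch in Python. The function names, polling interval, and timeout are illustrative choices rather than recommendations from RAGaaS, and the loop relies only on the top-level status field of the response. The fetcher is injected so the loop can be exercised without network access.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.ragaas.dev/v1"  # base URL taken from the curl example


def make_fetcher(api_key, namespace_id, job_run_id):
    """Return a zero-argument function that GETs the job status as a dict."""
    query = urlencode({"namespaceId": namespace_id})
    url = f"{API_BASE}/ingest-job-runs/{job_run_id}?{query}"

    def fetch():
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    return fetch


def wait_for_completion(fetch_status, interval_seconds=5.0,
                        timeout_seconds=600.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Poll fetch_status() until the job status is COMPLETED.

    Raises TimeoutError if the job has not completed within timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_status()
        if job["status"] == "COMPLETED":
            return job
        if clock() >= deadline:
            raise TimeoutError("ingestion job did not complete in time")
        sleep(interval_seconds)


# Offline demonstration with canned responses instead of real HTTP calls.
canned = iter([{"status": "QUEUED"}, {"status": "PROCESSING"}, {"status": "COMPLETED"}])
job = wait_for_completion(lambda: next(canned), sleep=lambda seconds: None)
print(job["status"])
```

Against the live API, the call would be wait_for_completion(make_fetcher(api_key, "ns_abc123", "ijr_abc123")).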

The status response includes:

  1. Overall job status:
     • QUEUED: Request is waiting to be processed
     • PRE_PROCESSING: Initial setup and validation
     • PROCESSING: Documents are being processed
     • COMPLETED: All documents have been processed (successfully or with failures)
  2. Individual document statuses:
{
  "id": "ijr_abc123",
  "status": "PROCESSING",
  "documents": {
    "queued": [{ "id": "doc_1", "status": "QUEUED", "error": null }],
    "processing": [{ "id": "doc_2", "status": "PROCESSING", "error": null }],
    "completed": [{ "id": "doc_3", "status": "SUCCESS", "error": null }],
    "failed": [
      {
        "id": "doc_4",
        "status": "FAILED",
        "error": "File format not supported"
      }
    ]
  }
}
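
Since a COMPLETED job can still contain failed documents, it is worth inspecting the per-document buckets rather than the overall status alone. The sketch below summarizes a response shaped like the example above. The function name is hypothetical, and it assumes every bucket is a list of objects with id and error fields, as shown.

```python
def summarize_job(job):
    """Return (counts per bucket, list of (document id, error) for failures)."""
    documents = job.get("documents", {})
    counts = {bucket: len(items) for bucket, items in documents.items()}
    failures = [(doc["id"], doc["error"]) for doc in documents.get("failed", [])]
    return counts, failures


# Sample response copied from the example above.
sample = {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
        "queued": [{"id": "doc_1", "status": "QUEUED", "error": None}],
        "processing": [{"id": "doc_2", "status": "PROCESSING", "error": None}],
        "completed": [{"id": "doc_3", "status": "SUCCESS", "error": None}],
        "failed": [
            {"id": "doc_4", "status": "FAILED", "error": "File format not supported"}
        ],
    },
}

counts, failures = summarize_job(sample)
print(counts)
for doc_id, error in failures:
    print(f"{doc_id}: {error}")
```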