Data Ingestion

RAGaaS provides flexible ways to ingest your documents and content, with real-time status tracking and detailed processing feedback. All content is securely processed before being stored in your own infrastructure.

For detailed API endpoints, request formats, and code examples, see the Data Ingestion API Reference.

Overview

Data ingestion is the first step in building your RAG application. Before ingesting documents:

Create a namespace with your infrastructure configuration (see Namespaces Guide)
Choose an appropriate ingestion method for your content
Prepare any metadata you want to attach to your documents

While data is temporarily processed through our secure infrastructure, it is stored exclusively in your namespace's configured storage and vector database. We do not retain any copies of your data after processing.

Ingestion Methods

RAGaaS offers seven main methods for ingesting content:

1. Text Ingestion

Ingest raw text content directly. Perfect for:

API responses
Chat messages
Generated content
Code snippets

Example:

curl -X POST https://api.ragaas.dev/v1/ingest/text \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "TEXT",
      "config": {
        "text": "RAGaaS is a developer-first platform...",
        "metadata": {
          "source": "api",
          "timestamp": "2024-01-15T12:00:00Z"
        }
      }
    }
  }'
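
The same request can be assembled from application code. The following is a minimal Python sketch using only the standard library; the helper name build_text_ingest_request is hypothetical, the URL and body shape are copied from the curl example above, and treating metadata as optional is an assumption. It builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.ragaas.dev/v1"  # base URL taken from the curl example


def build_text_ingest_request(api_key, namespace_id, text, metadata=None):
    """Build (but do not send) the POST request for /ingest/text."""
    config = {"text": text}
    if metadata:  # assumption: metadata may be omitted
        config["metadata"] = dict(metadata)
    payload = {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": "TEXT", "config": config},
    }
    return urllib.request.Request(
        f"{API_BASE}/ingest/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_text_ingest_request(
    api_key="YOUR_API_KEY",  # placeholder
    namespace_id="ns_abc123",
    text="RAGaaS is a developer-first platform...",
    metadata={"source": "api", "timestamp": "2024-01-15T12:00:00Z"},
)
# To send it: urllib.request.urlopen(request), which needs a valid key and namespace.
```

Keeping construction separate from sending makes the payload easy to inspect or log before any network call is made.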

2. File Ingestion

Directly upload files from your local machine or cloud storage. Supports:

PDF documents (.pdf)
Text-based formats (.csv, .json, .xml, .txt, .md, etc.)
Word documents (.docx, .doc)
Excel documents (.xlsx, .xls)
PowerPoint documents (.pptx, .ppt)
ZIP files (.zip)
And more common document formats

Example:

curl -X POST https://api.ragaas.dev/v1/ingest/file \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F 'namespaceId="ns_abc123"' \
  -F 'file=@"/path/to/document.pdf"' \
  -F 'metadata="{\"title\": \"Sample Document\", \"category\": \"documentation\"}"'
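
In the curl example, the -F flags build the multipart/form-data body for you. If you are not using an HTTP library that does this, the body can be encoded by hand. The sketch below is one way to do that with Python's standard library; encode_multipart is a hypothetical helper, the field names (namespaceId, file, metadata) come from the curl example, and the file bytes are a placeholder. Note that metadata travels as a JSON string inside a form field.

```python
import json
import uuid


def encode_multipart(fields, file_field, filename, file_bytes,
                     file_content_type="application/octet-stream"):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {file_content_type}\r\n\r\n'.encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


metadata = {"title": "Sample Document", "category": "documentation"}
body, content_type = encode_multipart(
    fields={"namespaceId": "ns_abc123", "metadata": json.dumps(metadata)},
    file_field="file",
    filename="document.pdf",
    file_bytes=b"%PDF-1.4 placeholder bytes",  # read the real file in practice
    file_content_type="application/pdf",
)
```

The returned content_type value, including its boundary, is what goes in the Content-Type header when the body is posted to /v1/ingest/file.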

3. Web Content Ingestion

RAGaaS provides three powerful methods for ingesting web content:

URL List: Process specific web pages or documents
Sitemap: Automatically process all pages in a sitemap
Website: Intelligently crawl all pages in a website with custom rules

Web content ingestion requires additional web scraping configuration in your namespace. See our Web Scraping guide for detailed setup instructions and best practices.

Example URL list ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/urls \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/docs/guide.pdf",
          "https://example.com/docs/api.html"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "public_docs",
          "lastUpdated": "2024-01-15"
        }
      }
    }
  }'
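
When the URL list comes from user input or a crawl queue, it can help to check it before submitting. The Python sketch below builds the same JSON body as the curl example and rejects anything that is not an absolute http(s) URL. The helper name build_urls_ingest_payload and the client-side check are assumptions for illustration; the API may apply its own validation rules.

```python
from urllib.parse import urlparse


def build_urls_ingest_payload(namespace_id, urls, include_selectors=None,
                              exclude_selectors=None, metadata=None):
    """Return the JSON body for /ingest/urls, rejecting non-HTTP(S) URLs early."""
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")

    config = {"urls": list(urls)}
    scrape_options = {}
    if include_selectors:
        scrape_options["includeSelectors"] = list(include_selectors)
    if exclude_selectors:
        scrape_options["excludeSelectors"] = list(exclude_selectors)
    if scrape_options:
        config["scrapeOptions"] = scrape_options
    if metadata:
        config["metadata"] = dict(metadata)

    return {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": "URLS_LIST", "config": config},
    }


payload = build_urls_ingest_payload(
    "ns_abc123",
    ["https://example.com/docs/guide.pdf", "https://example.com/docs/api.html"],
    include_selectors=["article", "main"],
    exclude_selectors=[".navigation", ".footer"],
    metadata={"source": "public_docs", "lastUpdated": "2024-01-15"},
)
```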

For sitemap processing and website crawling examples, see our Web Scraping guide.

4. Notion Ingestion

Connect your Notion workspace to RAGaaS and automatically ingest all your selected pages and databases.

Notion ingestion requires additional Notion configuration in your namespace. See our Notion Connector guide for detailed setup instructions and best practices.

Example Notion ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/notion \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "NOTION",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "public_docs",
          "lastUpdated": "2024-01-15"
        }
      }
    }
  }'

5. Google Drive Ingestion

Connect your Google Drive account to RAGaaS and automatically ingest all your selected files.

Google Drive ingestion requires additional Google Drive configuration in your namespace. See our Google Drive Connector guide for detailed setup instructions and best practices.

Example Google Drive ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/google-drive \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "GOOGLE_DRIVE",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "google_drive",
          "workspace": "My Workspace"
        }
      }
    }
  }'

6. Dropbox Ingestion

Connect your Dropbox account to RAGaaS and automatically ingest all your selected files.

Dropbox ingestion requires additional Dropbox configuration in your namespace. See our Dropbox Connector guide for detailed setup instructions and best practices.

Example Dropbox ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/dropbox \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "DROPBOX",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "dropbox",
          "workspace": "My Workspace"
        }
      }
    }
  }'

7. OneDrive Ingestion

Connect your OneDrive account to RAGaaS and automatically ingest all your selected files.

OneDrive ingestion requires additional OneDrive configuration in your namespace. See our OneDrive Connector guide for detailed setup instructions and best practices.

Example OneDrive ingestion:

curl -X POST https://api.ragaas.dev/v1/ingest/onedrive \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "ingestConfig": {
      "source": "ONEDRIVE",
      "config": {
        "connectionId": "conn_abc123",
        "metadata": {
          "source": "onedrive",
          "workspace": "My Workspace"
        }
      }
    }
  }'
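
The four connector examples above (Notion, Google Drive, Dropbox, OneDrive) share one request shape and differ only in the endpoint and the source value. A single helper can therefore cover all of them. This is a sketch under that observation; build_connector_ingest and the CONNECTOR_PATHS table are hypothetical names, with paths and source values copied from the curl examples.

```python
API_BASE = "https://api.ragaas.dev/v1"

# source value -> endpoint path, as shown in the curl examples
CONNECTOR_PATHS = {
    "NOTION": "/ingest/notion",
    "GOOGLE_DRIVE": "/ingest/google-drive",
    "DROPBOX": "/ingest/dropbox",
    "ONEDRIVE": "/ingest/onedrive",
}


def build_connector_ingest(source, namespace_id, connection_id, metadata=None):
    """Return (url, json_body) for a connector-based ingestion request."""
    if source not in CONNECTOR_PATHS:
        raise ValueError(f"unknown connector source: {source!r}")
    config = {"connectionId": connection_id}
    if metadata:
        config["metadata"] = dict(metadata)
    body = {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": source, "config": config},
    }
    return API_BASE + CONNECTOR_PATHS[source], body


url, body = build_connector_ingest(
    "GOOGLE_DRIVE",
    namespace_id="ns_abc123",
    connection_id="conn_abc123",
    metadata={"source": "google_drive", "workspace": "My Workspace"},
)
```

The result is sent as a POST with the same Authorization and Content-Type: application/json headers used in the curl examples.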

Processing & Chunking

RAGaaS processes your content intelligently using the configuration from your namespace:

Document Processing

Content is extracted from the source
Text is split into chunks based on your namespace's embedding model limits
Chunks are embedded using your configured embedding model
Embeddings are stored in your vector database
Original content is stored in your file storage

Metadata

You can attach custom metadata to your documents during ingestion. This metadata can be used for:

Filtering during search
Content organization
Version tracking
Source identification

Example metadata:

{
  "department": "engineering",
  "docType": "technical",
  "language": "en",
  "version": "2.0",
  "source": "knowledge-base",
  "lastUpdated": "2024-01-15T12:00:00Z"
}
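
Metadata like the example above is often generated at ingestion time. The Python sketch below stamps a UTC lastUpdated value in the same ISO 8601 form and confirms the result serializes to JSON. The helper make_metadata is hypothetical, and the field names are only the ones from the example; RAGaaS may accept other keys and value types.

```python
import json
from datetime import datetime, timezone


def make_metadata(last_updated=None, **fields):
    """Return a metadata dict with a UTC `lastUpdated` timestamp."""
    moment = last_updated or datetime.now(timezone.utc)
    if moment.tzinfo is None:
        raise ValueError("last_updated must be timezone-aware")
    metadata = dict(fields)
    metadata["lastUpdated"] = moment.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    json.dumps(metadata)  # raises TypeError if a value is not JSON-serializable
    return metadata


metadata = make_metadata(
    datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
    department="engineering",
    docType="technical",
    language="en",
    version="2.0",
    source="knowledge-base",
)
```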

Processing Flow

All ingestion requests are processed asynchronously:

Request accepted (QUEUED) - You receive an ingestJobRunId
Initial setup (PRE_PROCESSING) - Validating configuration and setup
Content processing (PROCESSING) - Documents are being processed
Final status (COMPLETED) - All documents have been processed, successfully or with failures

Content Chunking

Your content is automatically chunked based on your namespace's embedding model:

OpenAI Models

Default chunk size: 1000 tokens (good balance between context and specificity)
Default overlap: 100 tokens (10% overlap for context continuity)
Maximum chunk size: 8000 tokens

Cohere Models

Default chunk size: 350 tokens (optimized for model limits)
Default overlap: 50 tokens (about 14% overlap for context preservation)
Maximum chunk size: 450 tokens

Jina Models

Default chunk size: 1000 tokens
Default overlap: 100 tokens
Maximum chunk size: 8000 tokens
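
To see how chunk size and overlap interact, here is a minimal sliding-window sketch in Python. It is illustrative only: RAGaaS's actual chunker and tokenizer are not described in this guide, the function chunk_tokens is hypothetical, and the input is any list of tokens. Each new chunk starts chunk_size minus chunk_overlap tokens after the previous one, so consecutive chunks share chunk_overlap tokens.

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=100):
    """Split a token list into overlapping windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and less than chunk_size")

    stride = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += stride
    return chunks


tokens = list(range(2500))  # stand-in for 2500 real tokens
chunks = chunk_tokens(tokens)  # defaults listed for OpenAI and Jina models
print([len(chunk) for chunk in chunks])  # [1000, 1000, 700]
```

With the defaults shown, windows start at tokens 0, 900, and 1800, and the last 100 tokens of each chunk reappear at the start of the next.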

You can override these defaults by providing a chunkConfig in your request. Smaller chunks give more precise retrieval, and chunkOverlap must be less than chunkSize:

{
  "chunkConfig": {
    "chunkSize": 500,
    "chunkOverlap": 50
  }
}
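
A chunkConfig override can be checked against the limits listed above before the request is sent. The Python sketch below does that; the provider keys ("openai", "cohere", "jina"), the CHUNK_LIMITS table, and resolve_chunk_config are hypothetical names, and the numbers are copied from the lists in this section. Whether the API rejects or clamps out-of-range values is not stated here, so the sketch simply raises an error.

```python
# Defaults and maximums per embedding provider, from the lists in this section
CHUNK_LIMITS = {
    "openai": {"size": 1000, "overlap": 100, "max_size": 8000},
    "cohere": {"size": 350, "overlap": 50, "max_size": 450},
    "jina": {"size": 1000, "overlap": 100, "max_size": 8000},
}


def resolve_chunk_config(provider, chunk_size=None, chunk_overlap=None):
    """Fill in provider defaults and validate the resulting chunkConfig."""
    limits = CHUNK_LIMITS[provider]
    size = limits["size"] if chunk_size is None else chunk_size
    overlap = limits["overlap"] if chunk_overlap is None else chunk_overlap

    if not 0 < size <= limits["max_size"]:
        raise ValueError(
            f"chunkSize must be between 1 and {limits['max_size']} for {provider}"
        )
    if not 0 <= overlap < size:
        raise ValueError("chunkOverlap must be >= 0 and less than chunkSize")
    return {"chunkConfig": {"chunkSize": size, "chunkOverlap": overlap}}


override = resolve_chunk_config("openai", chunk_size=500, chunk_overlap=50)
```

Note that lowering chunk_size below the default overlap without also lowering chunk_overlap fails the second check, which mirrors the rule that overlap must stay smaller than the chunk.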

Processing Status

Because ingestion requests are processed asynchronously, monitor each job's status using its ingestJobRunId:

curl -X GET "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=ns_abc123" \
  -H "Authorization: Bearer $RAGAAS_API_KEY"

The status response includes:

Overall job status:

QUEUED: Request is waiting to be processed
PRE_PROCESSING: Initial setup and validation
PROCESSING: Documents are being processed
COMPLETED: All documents have been processed (successfully or with failures)
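
Since COMPLETED is the only final job status listed, a client can poll until it appears. The following Python sketch shows one way to do that with the standard library. The function names, the poll interval, and the timeout are assumptions; the endpoint path and status names come from this guide. The fetch step is passed in as a callable so the loop can be exercised without network access.

```python
import json
import time
import urllib.parse
import urllib.request

JOB_STATUSES = ("QUEUED", "PRE_PROCESSING", "PROCESSING", "COMPLETED")


def fetch_job_status(api_key, job_run_id, namespace_id):
    """GET /v1/ingest-job-runs/{id}?namespaceId=... and decode the JSON body."""
    query = urllib.parse.urlencode({"namespaceId": namespace_id})
    request = urllib.request.Request(
        f"https://api.ragaas.dev/v1/ingest-job-runs/{job_run_id}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch_status, poll_interval=5.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the job reports COMPLETED or time runs out."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        status = job["status"]
        if status not in JOB_STATUSES:
            raise ValueError(f"unexpected job status: {status!r}")
        if status == "COMPLETED":
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(poll_interval)


# Example wiring (performs real HTTP calls, so it is left commented out):
# job = wait_for_job(lambda: fetch_job_status(API_KEY, "ijr_abc123", "ns_abc123"))
```

Because COMPLETED covers jobs that finished with failures, inspect the per-document statuses in the returned job before treating the run as successful.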

Individual document statuses:

{
  "id": "ijr_abc123",
  "status": "PROCESSING",
  "documents": {
    "queued": [{ "id": "doc_1", "status": "QUEUED", "error": null }],
    "processing": [{ "id": "doc_2", "status": "PROCESSING", "error": null }],
    "completed": [{ "id": "doc_3", "status": "SUCCESS", "error": null }],
    "failed": [
      {
        "id": "doc_4",
        "status": "FAILED",
        "error": "File format not supported"
      }
    ]
  }
}
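
A response shaped like the one above can be reduced to per-bucket counts and a list of failures. The Python sketch below assumes the response always has this structure (a documents object whose values are lists of entries with id, status, and error); summarize_documents is a hypothetical helper.

```python
def summarize_documents(job):
    """Return (counts_by_bucket, errors_by_document_id) for a job status response."""
    documents = job.get("documents", {})
    counts = {bucket: len(entries) for bucket, entries in documents.items()}
    failures = {doc["id"]: doc["error"] for doc in documents.get("failed", [])}
    return counts, failures


job = {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
        "queued": [{"id": "doc_1", "status": "QUEUED", "error": None}],
        "processing": [{"id": "doc_2", "status": "PROCESSING", "error": None}],
        "completed": [{"id": "doc_3", "status": "SUCCESS", "error": None}],
        "failed": [
            {"id": "doc_4", "status": "FAILED", "error": "File format not supported"}
        ],
    },
}

counts, failures = summarize_documents(job)
print(counts)    # {'queued': 1, 'processing': 1, 'completed': 1, 'failed': 1}
print(failures)  # {'doc_4': 'File format not supported'}
```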

Large documents and websites can take longer to process. Use the ingestJobRunId returned in the ingestion response to track both overall job progress and individual document statuses.
