Data Ingestion
RAGaaS provides flexible ways to ingest your documents and content, with real-time status tracking and detailed processing feedback. All content is securely processed before being stored in your own infrastructure.
For detailed API endpoints, request formats, and code examples, see the Data Ingestion API Reference.
Overview
Data ingestion is the first step in building your RAG application. Before ingesting documents:
- Create a namespace with your infrastructure configuration (see Namespaces Guide)
- Choose an appropriate ingestion method for your content
- Prepare any metadata you want to attach to your documents
While data is temporarily processed through our secure infrastructure, it is stored exclusively in your namespace's configured storage and vector database. We do not retain any copies of your data after processing.
Ingestion Methods
RAGaaS offers seven methods for ingesting content:
1. Text Ingestion
Ingest raw text content directly. Perfect for:
- API responses
- Chat messages
- Generated content
- Code snippets
Example:
curl -X POST https://api.ragaas.dev/v1/ingest/text \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespaceId": "ns_abc123",
"ingestConfig": {
"source": "TEXT",
"config": {
"text": "RAGaaS is a developer-first platform...",
"metadata": {
"source": "api",
"timestamp": "2024-01-15T12:00:00Z"
}
}
}
}'
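The same request can be assembled from application code. The following is a minimal Python sketch using only the standard library; it mirrors the curl example above. The helper name `build_text_ingest_request` is hypothetical, and treating `metadata` as optional is an assumption. The request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.ragaas.dev/v1"  # base URL taken from the curl example


def build_text_ingest_request(api_key, namespace_id, text, metadata=None):
    """Build (but do not send) a POST request for the text ingestion endpoint."""
    config = {"text": text}
    if metadata:
        # Assumption: metadata may be omitted when there is nothing to attach.
        config["metadata"] = metadata
    payload = {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": "TEXT", "config": config},
    }
    return urllib.request.Request(
        f"{API_BASE}/ingest/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_text_ingest_request(
    "YOUR_API_KEY",
    "ns_abc123",
    "RAGaaS is a developer-first platform...",
    {"source": "api", "timestamp": "2024-01-15T12:00:00Z"},
)
# To actually send it: urllib.request.urlopen(req)
```

Building the body with `json.dumps` avoids the quoting mistakes that are easy to make when embedding JSON in a shell string.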
2. File Ingestion
Directly upload files from your local machine or cloud storage. Supports:
- PDF documents (.pdf)
- Text-based formats (.csv, .json, .xml, .txt, .md, etc.)
- Word documents (.docx, .doc)
- Excel documents (.xlsx, .xls)
- PowerPoint documents (.pptx, .ppt)
- ZIP files (.zip)
- And more common document formats
Example:
curl -X POST https://api.ragaas.dev/v1/ingest/file \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F 'namespaceId="ns_abc123"' \
-F 'file=@"/path/to/document.pdf"' \
-F 'metadata="{\"title\": \"Sample Document\", \"category\": \"documentation\"}"'
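In the curl example, the `metadata` form field is a hand-escaped JSON string. A small Python sketch of preparing that field programmatically is shown below, together with a local check against the extensions explicitly listed above. Both helper names are hypothetical, and because the list above ends with "and more", a `False` result from the check does not mean the service would reject the file.

```python
import json
from pathlib import Path

# Extensions explicitly listed in this guide; the service may accept others.
LISTED_EXTENSIONS = {
    ".pdf", ".csv", ".json", ".xml", ".txt", ".md",
    ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt", ".zip",
}


def is_listed_format(path):
    """Return True if the file extension appears in the list above."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


def metadata_form_field(metadata):
    """Serialize a metadata dict to the JSON string sent as the form field."""
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a dict")
    return json.dumps(metadata)


field = metadata_form_field(
    {"title": "Sample Document", "category": "documentation"}
)
print(field)
print(is_listed_format("/path/to/document.pdf"))
```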
3. Web Content Ingestion
RAGaaS provides three powerful methods for ingesting web content:
- URL List: Process specific web pages or documents
- Sitemap: Automatically process all pages in a sitemap
- Website: Intelligently crawl all pages in a website with custom rules
Web content ingestion requires additional web scraping configuration in your namespace. See our Web Scraping guide for detailed setup instructions and best practices.
Example URL list ingestion:
curl -X POST https://api.ragaas.dev/v1/ingest/urls \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespaceId": "ns_abc123",
"ingestConfig": {
"source": "URLS_LIST",
"config": {
"urls": [
"https://example.com/docs/guide.pdf",
"https://example.com/docs/api.html"
],
"scrapeOptions": {
"includeSelectors": ["article", "main"],
"excludeSelectors": [".navigation", ".footer"]
},
"metadata": {
"source": "public_docs",
"lastUpdated": "2024-01-15"
}
}
}
}'
For sitemap processing and website crawling examples, see our Web Scraping guide.
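As a rough illustration, the URL list payload shown above can be built and sanity-checked before it is sent. This Python sketch uses a hypothetical helper, `build_urls_payload`; the client-side URL check is an added convenience and not something the API is documented to require. Treating `scrapeOptions` and `metadata` as optional is also an assumption.

```python
from urllib.parse import urlparse


def _is_http_url(url):
    """True for absolute http(s) URLs with a host component."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_urls_payload(namespace_id, urls, include_selectors=None,
                       exclude_selectors=None, metadata=None):
    """Build the JSON body for a URL list ingestion request."""
    bad = [u for u in urls if not _is_http_url(u)]
    if bad:
        raise ValueError(f"not absolute http(s) URLs: {bad}")

    config = {"urls": list(urls)}
    scrape_options = {}
    if include_selectors:
        scrape_options["includeSelectors"] = list(include_selectors)
    if exclude_selectors:
        scrape_options["excludeSelectors"] = list(exclude_selectors)
    if scrape_options:
        config["scrapeOptions"] = scrape_options
    if metadata:
        config["metadata"] = dict(metadata)

    return {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": "URLS_LIST", "config": config},
    }


payload = build_urls_payload(
    "ns_abc123",
    ["https://example.com/docs/guide.pdf", "https://example.com/docs/api.html"],
    include_selectors=["article", "main"],
    exclude_selectors=[".navigation", ".footer"],
    metadata={"source": "public_docs", "lastUpdated": "2024-01-15"},
)
```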
4. Notion Ingestion
Connect your Notion workspace to RAGaaS and automatically ingest all your selected pages and databases.
Notion ingestion requires additional Notion configuration in your namespace. See our Notion Connector guide for detailed setup instructions and best practices.
Example Notion ingestion:
curl -X POST https://api.ragaas.dev/v1/ingest/notion \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespaceId": "ns_abc123",
"ingestConfig": {
"source": "NOTION",
"config": {
"connectionId": "conn_abc123",
"metadata": {
"source": "public_docs",
"lastUpdated": "2024-01-15"
}
}
}
}'
5. Google Drive Ingestion
Connect your Google Drive account to RAGaaS and automatically ingest all your selected files.
Google Drive ingestion requires additional Google Drive configuration in your namespace. See our Google Drive Connector guide for detailed setup instructions and best practices.
Example Google Drive ingestion:
curl -X POST https://api.ragaas.dev/v1/ingest/google-drive \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespaceId": "ns_abc123",
"ingestConfig": {
"source": "GOOGLE_DRIVE",
"config": {
"connectionId": "conn_abc123",
"metadata": {
"source": "google_drive",
"workspace": "My Workspace"
}
}
}
}'
6. Dropbox Ingestion
Connect your Dropbox account to RAGaaS and automatically ingest all your selected files.
Dropbox ingestion requires additional Dropbox configuration in your namespace. See our Dropbox Connector guide for detailed setup instructions and best practices.
Example Dropbox ingestion:
curl -X POST https://api.ragaas.dev/v1/ingest/dropbox \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespaceId": "ns_abc123",
"ingestConfig": {
"source": "DROPBOX",
"config": {
"connectionId": "conn_abc123",
"metadata": {
"source": "dropbox",
"workspace": "My Workspace"
}
}
}
}'
7. OneDrive Ingestion
Connect your OneDrive account to RAGaaS and automatically ingest all your selected files.
OneDrive ingestion requires additional OneDrive configuration in your namespace. See our OneDrive Connector guide for detailed setup instructions and best practices.
Example OneDrive ingestion:
curl -X POST https://api.ragaas.dev/v1/ingest/onedrive \
-H "Authorization: Bearer $RAGAAS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"namespaceId": "ns_abc123",
"ingestConfig": {
"source": "ONEDRIVE",
"config": {
"connectionId": "conn_abc123",
"metadata": {
"source": "onedrive",
"workspace": "My Workspace"
}
}
}
}'
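The four connector examples (Notion, Google Drive, Dropbox, OneDrive) share one request shape and differ only in the endpoint path and the `source` value. A Python sketch that captures this in a single helper follows. The helper name and the lookup table are hypothetical; the paths and source values are copied from the curl examples above.

```python
API_BASE = "https://api.ragaas.dev/v1"

# source value -> endpoint path, as used in the connector examples above
CONNECTOR_ENDPOINTS = {
    "NOTION": "/ingest/notion",
    "GOOGLE_DRIVE": "/ingest/google-drive",
    "DROPBOX": "/ingest/dropbox",
    "ONEDRIVE": "/ingest/onedrive",
}


def build_connector_ingest(source, namespace_id, connection_id, metadata=None):
    """Return (url, payload) for a connector-based ingestion request."""
    if source not in CONNECTOR_ENDPOINTS:
        raise ValueError(f"unknown connector source: {source!r}")
    config = {"connectionId": connection_id}
    if metadata:
        # Assumption: metadata may be omitted.
        config["metadata"] = dict(metadata)
    payload = {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": source, "config": config},
    }
    return API_BASE + CONNECTOR_ENDPOINTS[source], payload


url, payload = build_connector_ingest(
    "GOOGLE_DRIVE", "ns_abc123", "conn_abc123",
    {"source": "google_drive", "workspace": "My Workspace"},
)
```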
Processing & Chunking
RAGaaS processes your content intelligently using the configuration from your namespace:
Document Processing
- Content is extracted from the source
- Text is split into chunks based on your namespace's embedding model limits
- Chunks are embedded using your configured embedding model
- Embeddings are stored in your vector database
- Original content is stored in your file storage
Metadata
You can attach custom metadata to your documents during ingestion. This metadata can be used for:
- Filtering during search
- Content organization
- Version tracking
- Source identification
Example metadata uses:
{
"department": "engineering",
"docType": "technical",
"language": "en",
"version": "2.0",
"source": "knowledge-base",
"lastUpdated": "2024-01-15T12:00:00Z"
}
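When many documents share the same fields, it can help to merge per-document metadata over a set of defaults and stamp `lastUpdated` in the same UTC format used above. The following Python sketch shows one way to do that. The helper `build_metadata` is hypothetical, and RAGaaS is not documented here to require any particular metadata keys.

```python
from datetime import datetime, timezone


def build_metadata(document_fields, defaults=None, now=None):
    """Merge per-document fields over defaults and add a UTC lastUpdated stamp.

    `now` must be a timezone-aware datetime; it defaults to the current time.
    An explicit lastUpdated in either input is left untouched.
    """
    merged = {**(defaults or {}), **document_fields}
    if "lastUpdated" not in merged:
        stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
        merged["lastUpdated"] = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    return merged


metadata = build_metadata(
    {"department": "engineering", "docType": "technical"},
    defaults={"language": "en", "source": "knowledge-base"},
    now=datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc),
)
```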
Processing Flow
All ingestion requests are processed asynchronously:
- Request accepted (QUEUED) - You receive an ingestJobRunId
- Initial setup (PRE_PROCESSING) - Validating configuration and setup
- Content processing (PROCESSING) - Documents are being processed
- Final status (COMPLETED) - All documents have been processed
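Client code can model the four job states listed above as an ordered enumeration. The Python sketch below is illustrative only: the class and helper names are hypothetical, and it assumes jobs move through the states in the order shown without going backwards.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "QUEUED"
    PRE_PROCESSING = "PRE_PROCESSING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"


# Order in which a job is expected to move through its states.
STATUS_ORDER = [
    JobStatus.QUEUED,
    JobStatus.PRE_PROCESSING,
    JobStatus.PROCESSING,
    JobStatus.COMPLETED,
]


def is_terminal(status):
    """True once the job has reached its final state."""
    return JobStatus(status) is JobStatus.COMPLETED


def has_advanced(previous, current):
    """True if `current` is a later state than `previous`."""
    return (STATUS_ORDER.index(JobStatus(current))
            > STATUS_ORDER.index(JobStatus(previous)))


print(is_terminal("PROCESSING"))
print(has_advanced("QUEUED", "PROCESSING"))
```

Parsing the raw string through `JobStatus(...)` raises `ValueError` for an unrecognized value, which surfaces unexpected states early instead of silently ignoring them.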
Content Chunking
Your content is automatically chunked based on your namespace's embedding model:
OpenAI Models
- Default chunk size: 1000 tokens (good balance between context and specificity)
- Default overlap: 100 tokens (10% overlap for context continuity)
- Maximum chunk size: 8000 tokens
Cohere Models
- Default chunk size: 350 tokens (optimized for model limits)
- Default overlap: 50 tokens (about 14% overlap for context preservation)
- Maximum chunk size: 450 tokens
Jina Models
- Default chunk size: 1000 tokens
- Default overlap: 100 tokens
- Maximum chunk size: 8000 tokens
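To make the relationship between chunk size and overlap concrete, here is a generic sliding-window sketch in Python. It is not RAGaaS's actual chunking implementation, which is not described in this guide. It works on a pre-tokenized list, since real token counts depend on the embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and less than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# 2500 stand-in tokens with the OpenAI/Jina defaults listed above
chunks = chunk_tokens(list(range(2500)), chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

With a 1000-token window and 100 tokens of overlap, each window advances 900 tokens, so the last 100 tokens of one chunk reappear at the start of the next.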
You can override these defaults by providing a chunkConfig in your request. Smaller chunks give more precise retrieval, and chunkOverlap must be less than chunkSize:
{
  "chunkConfig": {
    "chunkSize": 500,
    "chunkOverlap": 50
  }
}
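A chunk configuration can be checked locally against the limits listed above before it is sent. The Python sketch below encodes those defaults and maximums. The provider keys (`"openai"`, `"cohere"`, `"jina"`) and the helper name are hypothetical labels, and the service may apply further validation of its own.

```python
# Defaults and maximums from the Content Chunking section, in tokens
CHUNK_LIMITS = {
    "openai": {"default_size": 1000, "default_overlap": 100, "max_size": 8000},
    "cohere": {"default_size": 350, "default_overlap": 50, "max_size": 450},
    "jina": {"default_size": 1000, "default_overlap": 100, "max_size": 8000},
}


def resolve_chunk_config(provider, chunk_size=None, chunk_overlap=None):
    """Fill in defaults and validate; returns the chunkConfig fragment."""
    if provider not in CHUNK_LIMITS:
        raise ValueError(f"unknown provider: {provider!r}")
    limits = CHUNK_LIMITS[provider]
    size = limits["default_size"] if chunk_size is None else chunk_size
    overlap = limits["default_overlap"] if chunk_overlap is None else chunk_overlap

    if not 0 < size <= limits["max_size"]:
        raise ValueError(
            f"chunkSize must be between 1 and {limits['max_size']} for {provider}"
        )
    if not 0 <= overlap < size:
        raise ValueError("chunkOverlap must be >= 0 and less than chunkSize")
    return {"chunkConfig": {"chunkSize": size, "chunkOverlap": overlap}}


print(resolve_chunk_config("openai", chunk_size=500, chunk_overlap=50))
```

Note that overriding only the size can invalidate the default overlap: a Cohere chunk size of 40 with the default overlap of 50 is rejected by this check.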
Processing Status
Because ingestion requests are processed asynchronously, monitor their status using the ingestJobRunId returned when the request is accepted:
curl -X GET "https://api.ragaas.dev/v1/ingest-job-runs/ijr_abc123?namespaceId=ns_abc123" \
-H "Authorization: Bearer $RAGAAS_API_KEY"
The status response includes:
- Overall job status:
  - QUEUED: Request is waiting to be processed
  - PRE_PROCESSING: Initial setup and validation
  - PROCESSING: Documents are being processed
  - COMPLETED: All documents have been processed (successfully or with failures)
- Individual document statuses:
{
"id": "ijr_abc123",
"status": "PROCESSING",
"documents": {
"queued": [{ "id": "doc_1", "status": "QUEUED", "error": null }],
"processing": [{ "id": "doc_2", "status": "PROCESSING", "error": null }],
"completed": [{ "id": "doc_3", "status": "SUCCESS", "error": null }],
"failed": [
{
"id": "doc_4",
"status": "FAILED",
"error": "File format not supported"
}
]
}
}
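Since a COMPLETED job can still contain failed documents, it is worth inspecting the per-document buckets rather than only the top-level status. The Python sketch below summarizes a response of the shape shown above. The helper name is hypothetical, and the response layout is assumed to match the sample exactly.

```python
def summarize_job(job):
    """Return (counts per bucket, {document id: error} for failed documents)."""
    documents = job.get("documents", {})
    counts = {
        bucket: len(documents.get(bucket, []))
        for bucket in ("queued", "processing", "completed", "failed")
    }
    failures = {doc["id"]: doc["error"] for doc in documents.get("failed", [])}
    return counts, failures


sample = {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
        "queued": [{"id": "doc_1", "status": "QUEUED", "error": None}],
        "processing": [{"id": "doc_2", "status": "PROCESSING", "error": None}],
        "completed": [{"id": "doc_3", "status": "SUCCESS", "error": None}],
        "failed": [
            {"id": "doc_4", "status": "FAILED", "error": "File format not supported"}
        ],
    },
}

counts, failures = summarize_job(sample)
print(counts)
print(failures)
```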
For large documents or websites, processing happens asynchronously. Monitor the status using the ingestJobRunId from the response to track both overall job progress and individual document statuses.
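Monitoring usually takes the form of a polling loop. The Python sketch below polls until the job reports COMPLETED or a timeout elapses. The function names, the 5-second interval, and the 10-minute timeout are all assumptions rather than documented recommendations. The status URL follows the curl example in the Processing Status section.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode


def make_fetcher(api_key, job_run_id, namespace_id):
    """Return a function that fetches the current job status as a dict."""
    url = (f"https://api.ragaas.dev/v1/ingest-job-runs/{job_run_id}?"
           + urlencode({"namespaceId": namespace_id}))

    def fetch():
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"}
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch


def wait_for_completion(fetch_status, interval=5.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job is COMPLETED; raise TimeoutError otherwise."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job["status"] == "COMPLETED":
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        sleep(interval)


# Usage (performs real HTTP requests, so it is left commented out):
# job = wait_for_completion(make_fetcher("YOUR_API_KEY", "ijr_abc123", "ns_abc123"))
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP client and makes it straightforward to exercise with canned responses.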