Features
Open source dataset management with a powerful CLI and Live Capture API. Local-first, deploy anywhere. Capture, curate, export — for fine-tuning or RAG.
Your infrastructure. Your data.
AI Curator is built around a fundamental principle: your data belongs to you. Run it on your Mac. Deploy it to Docker. Ship it to your cloud. It's always your infrastructure — ElGap never hosts it, never sees your data, never phones home.
All data lives locally in ~/.curator/. No cloud dependency. Inspect, back up, or delete it anytime. Your Mac. Your Docker container. Your cloud. Same AI Curator. When you deploy to your own server, it's self-hosting, not a SaaS. Your data, your infrastructure, your rules.
Security: own it
AI Curator is open source, provided as-is, in early beta. You can read every line of code. You can audit it yourself. You deploy it, you secure it, you own the infrastructure. We provide the software — you provide the operational security.
Security is on you because it's your infrastructure. The same reason you chose local-first is the reason you own your security posture. As the security landscape evolves, ElGap will revisit this approach; for now, transparency and honesty beat vague promises. Read the full security policy.
Universal Data Capture
Three capture modes cover every workflow: live streaming via HTTP API, file import (JSON, JSONL, CSV), and manual entry in the Web UI. If it can send a POST request, it can feed AI Curator. The same captured data can go to fine-tuning or RAG — you decide at export time.
```bash
curl -X POST http://localhost:3333/api/capture \
  -H "Content-Type: application/json" \
  -d '{
    "source": "slack",
    "records": [{
      "instruction": "How do I reset my password?",
      "output": "Go to Settings → Account → Reset Password...",
      "category": "support",
      "qualityRating": 4
    }]
  }'
```

No sample ships without your approval
Every sample goes through your review — whether it's training data or knowledge base content. Rate quality. Categorize by domain. Reject what doesn't meet your standards. Only approved data makes it to your training set or your RAG index.
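The review gate described above reduces to a simple policy: a sample ships only if it is complete, categorized, and rated highly enough. A minimal sketch of such a policy as a pure function (the `Sample` shape and field names are assumptions mirroring the capture payload, not AI Curator's actual internals):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sample shape; field names mirror the capture payload above.
@dataclass
class Sample:
    instruction: str
    output: str
    category: Optional[str] = None
    quality_rating: int = 0

def review(sample: Sample, min_quality: int = 4) -> str:
    """Apply a simple curation policy: approve only rated, categorized samples."""
    if not sample.instruction or not sample.output:
        return "rejected"      # incomplete samples never ship
    if sample.category is None:
        return "in_review"     # needs a domain label before approval
    return "approved" if sample.quality_rating >= min_quality else "in_review"
```

The same rule can be enforced interactively in the Web UI or scripted against the REST API.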
Export for fine-tuning or RAG
Seven export formats optimized for different training pipelines and RAG indexing. Same curation, different destinations.
| Format | Fine-Tuning | RAG |
|---|---|---|
| alpaca | Standard instruction format | — |
| sharegpt | Multi-turn dialogue | — |
| jsonl | Pipeline-ready streaming | Embedding pipeline input |
| csv | Analysis / spreadsheets | Document metadata management |
| mlx | Apple Silicon (MLX-LM) | — |
| unsloth | Fast, memory-efficient | — |
| trl | HuggingFace ecosystem | — |
All exports support filtering by status, quality rating, category, and tags. Train/test/validation splits with stratification are built in. JSONL and CSV exports work for both fine-tuning pipelines and RAG indexing pipelines.
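Stratified splitting means each category stays proportionally represented in train, test, and validation. The built-in splits behave roughly like this pure-Python sketch (an illustration, not AI Curator's actual implementation; the `"category"` field name is an assumption):

```python
import random
from collections import defaultdict

def stratified_split(samples, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split samples into train/test/validation, preserving the per-category mix.

    `samples` is a list of dicts; `key` names the field to stratify on.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for s in samples:
        by_group[s[key]].append(s)

    train, test, val = [], [], []
    for group in by_group.values():
        rng.shuffle(group)          # deterministic given the seed
        n = len(group)
        a = round(n * ratios[0])
        b = a + round(n * ratios[1])
        train.extend(group[:a])
        test.extend(group[a:b])
        val.extend(group[b:])
    return train, test, val
```

Because the split is applied per category before merging, a rare category still appears in every split instead of vanishing from the smaller ones.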
REST API for Curation
The Live Capture API streams data in. The REST API manages curation out. Together, they close the loop: review, rate, approve, and reject — programmatically. Connect external tools, automation scripts, or AI agents to reduce human involvement in the curation loop.
```python
import requests

BASE = "http://localhost:3333/api"

# Get pending samples with quality rating
samples = requests.get(f"{BASE}/datasets/1/samples", params={
    "status": "draft",
    "minQuality": 3
}).json()

# Auto-approve quality 4+ with categories
for sample in samples:
    if sample["qualityRating"] >= 4 and sample.get("category"):
        requests.post(f"{BASE}/samples/{sample['id']}/approve")
    else:
        requests.patch(f"{BASE}/samples/{sample['id']}", json={
            "status": "in_review",
            "tags": ["needs-review"]
        })
```

REST API endpoints for curation are under active development. See the planned API surface and automation patterns.
CLI & SDK
For when you need to move fast and work at scale: automate imports, exports, and dataset management from the terminal.
```bash
# Bulk import
curator import massive-dataset.jsonl --dataset 1 --workers 8

# Filtered export for fine-tuning
curator export --dataset 3 --format mlx \
  --filter "status=approved AND quality>=4"

# Export for RAG indexing
curator export --dataset 5 --format jsonl \
  --filter "status=approved AND category=docs" --output rag-corpus.jsonl

# HuggingFace search & download
curator search "python programming"
curator download hf:openai/summarize_from_feedback --dataset 3

# Split for training
curator export --split "0.8,0.1,0.1" --seed 42 --format jsonl
```

Ready to start?
Open source. Local-first. Deploy anywhere. Capture, curate, export — for fine-tuning or RAG.