Features

Open source dataset management with a powerful CLI and Live Capture API. Local-first, deploy anywhere. Capture, curate, export — for fine-tuning or RAG.

Your infrastructure. Your data.

AI Curator is built around a fundamental principle: your data belongs to you. Run it on your Mac. Deploy it to Docker. Ship it to your cloud. It's always your infrastructure — ElGap never hosts it, never sees your data, never phones home.

Local by Default
Local SQLite at ~/.curator/. No cloud dependency. Inspect, backup, or delete anytime.
No Telemetry
Zero tracking, analytics, or phone-home. No network requests except ones you initiate.
Deploy Anywhere
Docker, VPS, your cloud — same open source app. Your MacBook for experiments, Kubernetes for production.
MIT License
Full source code transparency. Audit, modify, fork — it's all yours.

Your Mac. Your Docker container. Your cloud. Same AI Curator. When you deploy to your own server, it's self-hosting — not a SaaS. Your data, your infrastructure, your rules.

Security: own it

AI Curator is open source, provided as-is, in early beta. You can read every line of code. You can audit it yourself. You deploy it, you secure it, you own the infrastructure. We provide the software — you provide the operational security.

Open Source
MIT license. Full source on GitHub. Read it, audit it, fork it. Transparency is the most fundamental security guarantee.
Automated Scans
Dependency scanning and static analysis run on every commit. Known vulnerabilities are caught before they ship.
Early Beta
Provided as-is. No warranty, no SLA, no security guarantees. You run it at your own risk. This is honest software.
Human Audit Planned
A professional security audit will be commissioned once development stabilizes out of beta. Results will be published.

Security is on you because it's your infrastructure. The choice that made you go local-first, keeping control of your data, is the same choice that makes your security posture yours to own. As the security landscape evolves, ElGap will revisit this approach, but for now, transparency and honesty beat vague promises. Read the full security policy.

Universal Data Capture

Three capture modes cover every workflow: live streaming via HTTP API, file import (JSON, JSONL, CSV), and manual entry in the Web UI. If it can send a POST request, it can feed AI Curator. The same captured data can go to fine-tuning or RAG — you decide at export time.

Live Capture — any source
 curl -X POST http://localhost:3333/api/capture \
   -H "Content-Type: application/json" \
   -d '{
     "source": "slack",
     "records": [{
       "instruction": "How do I reset my password?",
       "output": "Go to Settings → Account → Reset Password...",
       "category": "support",
       "qualityRating": 4
     }]
   }'
IDE Integrations
VS Code extension captures code explanations and refactoring decisions as they happen.
Log Processors
Pipe application logs to extract error-resolution pairs and user interactions automatically.
OpenWebUI Plugin
Official plugin captures conversations from your self-hosted AI chat interface.
Internal Documents
Import wikis, knowledge bases, and internal docs — curate before indexing for RAG.
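The Log Processors pattern above can be sketched in a few lines of Python. This is a hypothetical example, not shipped tooling: the ERROR/RESOLVED log format and the "app-logs" source name are assumptions, while the endpoint and record shape follow the Live Capture example.

```python
import json
import urllib.request

CAPTURE_URL = "http://localhost:3333/api/capture"  # default local endpoint


def extract_pairs(log_lines):
    """Pair each ERROR line with the next RESOLVED line.

    Assumes a simple "ERROR <msg>" / "RESOLVED <msg>" log convention.
    """
    pairs, pending_error = [], None
    for line in log_lines:
        if line.startswith("ERROR "):
            pending_error = line[len("ERROR "):].strip()
        elif line.startswith("RESOLVED ") and pending_error:
            pairs.append({
                "instruction": pending_error,
                "output": line[len("RESOLVED "):].strip(),
                "category": "support",
            })
            pending_error = None
    return pairs


def ship(records):
    """POST extracted records to the Live Capture API as one batch."""
    body = json.dumps({"source": "app-logs", "records": records}).encode()
    req = urllib.request.Request(
        CAPTURE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Pipe a log file through `extract_pairs` and call `ship` on the result; everything lands in your dataset as draft samples awaiting review.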

No sample ships without your approval

Every sample goes through your review — whether it's training data or knowledge base content. Rate quality. Categorize by domain. Reject what doesn't meet your standards. Only approved data makes it to your training set or your RAG index.

Draft — awaiting review
In Review — being evaluated
Approved — ready to ship
Star Ratings
Rate samples 1–5 for quality. Filter exports by minimum rating.
Categories & Tags
Organize by domain — coding, writing, Q&A, support, internal docs.
Duplicate Detection
Automatically identify and flag duplicated content — critical for both training and RAG retrieval quality.
Staleness Control
Mark outdated documents. Outdated content in a RAG system produces outdated answers — review and reject before indexing.
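One common way duplicate detection works is hashing a normalized form of each sample so near-identical text collides. A minimal sketch of that idea, assuming samples shaped like the Live Capture records; this is an illustration, not AI Curator's actual algorithm:

```python
import hashlib
import re


def fingerprint(text):
    """Lowercase and collapse whitespace, then hash, so trivially
    different copies of the same sample produce the same key."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()


def flag_duplicates(samples):
    """Return ids of samples whose normalized text was already seen."""
    seen, dupes = set(), []
    for s in samples:
        fp = fingerprint(s["instruction"] + " " + s["output"])
        if fp in seen:
            dupes.append(s["id"])
        else:
            seen.add(fp)
    return dupes
```

Exact-match hashing like this catches copy-paste duplicates; catching paraphrased near-duplicates takes embeddings or fuzzy matching on top.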

Export for fine-tuning or RAG

Seven export formats optimized for different training pipelines and RAG indexing. Same curation, different destinations.

Format    | Fine-Tuning                  | RAG
alpaca    | Standard instruction format  | —
sharegpt  | Multi-turn dialogue          | —
jsonl     | Pipeline-ready streaming     | Embedding pipeline input
csv       | Analysis / spreadsheets      | Document metadata management
mlx       | Apple Silicon (MLX-LM)       | —
unsloth   | Fast, memory-efficient       | —
trl       | HuggingFace ecosystem        | —

All exports support filtering by status, quality rating, category, and tags. Train/test/validation splits with stratification are built in. JSONL and CSV exports work for both fine-tuning pipelines and RAG indexing pipelines.
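Stratified splitting means each category keeps the same proportions across train, test, and validation. A minimal sketch of the idea under assumed 0.8/0.1/0.1 ratios; the tool's built-in splitter may differ in detail:

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.8, 0.1, 0.1), key="category", seed=42):
    """Split samples into train/test/validation while preserving
    each category's share in every split. Deterministic for a given seed."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for s in samples:
        by_cat[s[key]].append(s)

    train, test, val = [], [], []
    for group in by_cat.values():
        rng.shuffle(group)
        n = len(group)
        cut1 = int(n * ratios[0])
        cut2 = cut1 + int(n * ratios[1])
        train.extend(group[:cut1])
        test.extend(group[cut1:cut2])
        val.extend(group[cut2:])
    return train, test, val
```

Splitting per category rather than globally is what prevents a rare category from landing entirely in one split.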

REST API for Curation

The Live Capture API streams data in. The REST API manages curation out. Together, they close the loop: review, rate, approve, and reject — programmatically. Connect external tools, automation scripts, or AI agents to reduce human involvement in the curation loop.

Auto-approve above quality threshold
 import requests

 BASE = "http://localhost:3333/api"

 # Get pending samples with quality rating
 samples = requests.get(f"{BASE}/datasets/1/samples", params={
     "status": "draft",
     "minQuality": 3
 }).json()

 # Auto-approve quality 4+ with categories
 for sample in samples:
     if sample["qualityRating"] >= 4 and sample.get("category"):
         requests.post(f"{BASE}/samples/{sample['id']}/approve")
     else:
         requests.patch(f"{BASE}/samples/{sample['id']}", json={
             "status": "in_review",
             "tags": ["needs-review"]
         })
Rule-Based Filters
Auto-approve above quality thresholds. Flag exceptions for human review only.
AI-Assisted Curation
Send samples to another LLM for pre-review. Rate quality, suggest categories, flag duplicates.
CI/CD Integration
Hook your pipeline into curation. Export on schedule with automated quality gates.
Compliance Workflows
Connect internal tools. Integrate with ticketing, content management, and review policies.
Coming

REST API endpoints for curation are under active development. See the planned API surface and automation patterns.

CLI & SDK

When you need to move fast and work at scale. Automate imports, exports, and dataset management from the terminal.

Terminal
 # Bulk import
 curator import massive-dataset.jsonl --dataset 1 --workers 8

 # Filtered export for fine-tuning
 curator export --dataset 3 --format mlx \
   --filter "status=approved AND quality>=4"

 # Export for RAG indexing
 curator export --dataset 5 --format jsonl \
   --filter "status=approved AND category=docs" --output rag-corpus.jsonl

 # HuggingFace search & download
 curator search "python programming"
 curator download hf:openai/summarize_from_feedback --dataset 3

 # Split for training
 curator export --split "0.8,0.1,0.1" --seed 42 --format jsonl
Coming Soon
AI Curator SDK
Embed capture, curate, and export directly in your code — Python, Node, whatever you use. The SDK wraps the Live Capture API and REST API into a first-class programmatic interface. See the roadmap.

Ready to start?

Open source. Local-first. Deploy anywhere. Capture, curate, export — for fine-tuning or RAG.