Features
Open source dataset management with a powerful CLI and Live Capture API. Local-first, deploy anywhere. Capture, curate, export — for fine-tuning or RAG.
Your infrastructure. Your data.
AI Curator is built around a fundamental principle: your data belongs to you. Run it on your Mac. Deploy it to Docker. Ship it to your cloud. It's always your infrastructure — ElGap never hosts it, never sees your data, never phones home.
All data lives locally in ~/.curator/. No cloud dependency. Inspect, back up, or delete it anytime. Your Mac. Your Docker container. Your cloud. Same AI Curator. When you deploy to your own server, it's self-hosting, not a SaaS. Your data, your infrastructure, your rules.
Security: own it
AI Curator is open source, provided as-is, in early beta. You can read every line of code. You can audit it yourself. You deploy it, you secure it, you own the infrastructure. We provide the software — you provide the operational security.
Security is on you because it's your infrastructure. The same reason you chose local-first is the reason you own your security posture. As the security landscape evolves, ElGap will revisit this approach; for now, transparency and honesty beat vague promises. Read the full security policy.
Universal Data Capture
Three capture modes cover every workflow: live streaming via HTTP API, file import (JSON, JSONL, CSV), and manual entry in the Web UI. If it can send a POST request, it can feed AI Curator. The same captured data can go to fine-tuning or RAG — you decide at export time.
```bash
curl -X POST http://localhost:3333/api/capture \
  -H "Content-Type: application/json" \
  -d '{
    "source": "slack",
    "records": [{
      "instruction": "How do I reset my password?",
      "output": "Go to Settings → Account → Reset Password...",
      "category": "support",
      "qualityRating": 4
    }]
  }'
```

No sample ships without your approval
Every sample goes through your review — whether it's training data or knowledge base content. Rate quality. Categorize by domain. Reject what doesn't meet your standards. Only approved data makes it to your training set or your RAG index.
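The review gate described above reduces to a simple policy: a sample ships only if it is complete, categorized, and rated highly enough. A minimal sketch of such a policy as a pure function (the `Sample` shape and field names are assumptions mirroring the capture payload, not AI Curator's actual internals):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sample shape; field names mirror the capture payload above.
@dataclass
class Sample:
    instruction: str
    output: str
    category: Optional[str] = None
    quality_rating: int = 0

def review(sample: Sample, min_quality: int = 4) -> str:
    """Apply a simple curation policy: approve only rated, categorized samples."""
    if not sample.instruction or not sample.output:
        return "rejected"      # incomplete samples never ship
    if sample.category is None:
        return "in_review"     # needs a domain label before approval
    return "approved" if sample.quality_rating >= min_quality else "in_review"
```

The same rule can be enforced interactively in the Web UI or scripted against the REST API.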
Export for fine-tuning or RAG
Seven export formats optimized for different training pipelines and RAG indexing. Same curation, different destinations.
| Format | Fine-Tuning | RAG |
|---|---|---|
| alpaca | Standard instruction format | — |
| sharegpt | Multi-turn dialogue | — |
| jsonl | Pipeline-ready streaming | Embedding pipeline input |
| csv | Analysis / spreadsheets | Document metadata management |
| mlx | Apple Silicon (MLX-LM) | — |
| unsloth | Fast, memory-efficient | — |
| trl | HuggingFace ecosystem | — |
All exports support filtering by status, quality rating, category, and tags. Train/test/validation splits with stratification are built in. JSONL and CSV exports work for both fine-tuning pipelines and RAG indexing pipelines.
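Stratified splitting means each category stays proportionally represented in train, test, and validation. The built-in splits behave roughly like this pure-Python sketch (an illustration, not AI Curator's actual implementation; the `"category"` field name is an assumption):

```python
import random
from collections import defaultdict

def stratified_split(samples, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split samples into train/test/validation, preserving the per-category mix.

    `samples` is a list of dicts; `key` names the field to stratify on.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for s in samples:
        by_group[s[key]].append(s)

    train, test, val = [], [], []
    for group in by_group.values():
        rng.shuffle(group)          # deterministic given the seed
        n = len(group)
        a = round(n * ratios[0])
        b = a + round(n * ratios[1])
        train.extend(group[:a])
        test.extend(group[a:b])
        val.extend(group[b:])
    return train, test, val
```

Because the split is applied per category before merging, a rare category still appears in every split instead of vanishing from the smaller ones.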
REST API for Curation
The Live Capture API streams data in. The REST API manages curation out. Together, they close the loop: review, rate, approve, and reject — programmatically. Connect external tools, automation scripts, or AI agents to reduce human involvement in the curation loop.
```python
import requests

BASE = "http://localhost:3333/api"

# Get pending samples with quality rating
samples = requests.get(f"{BASE}/datasets/1/samples", params={
    "status": "draft",
    "minQuality": 3
}).json()

# Auto-approve quality 4+ with categories
for sample in samples:
    if sample["qualityRating"] >= 4 and sample.get("category"):
        requests.post(f"{BASE}/samples/{sample['id']}/approve")
    else:
        requests.patch(f"{BASE}/samples/{sample['id']}", json={
            "status": "in_review",
            "tags": ["needs-review"]
        })
```

REST API endpoints for curation are under active development. See the planned API surface and automation patterns.
CLI & SDK
For when you need to move fast and work at scale: automate imports, exports, and dataset management from the terminal.
```bash
# Bulk import
curator import massive-dataset.jsonl --dataset 1 --workers 8

# Filtered export for fine-tuning
curator export --dataset 3 --format mlx \
  --filter "status=approved AND quality>=4"

# Export for RAG indexing
curator export --dataset 5 --format jsonl \
  --filter "status=approved AND category=docs" --output rag-corpus.jsonl

# HuggingFace search & download
curator search "python programming"
curator download hf:openai/summarize_from_feedback --dataset 3

# Split for training
curator export --split "0.8,0.1,0.1" --seed 42 --format jsonl
```

Ready to start?
Open source. Local-first. Deploy anywhere. Capture, curate, export — for fine-tuning or RAG.