Export Formats
Seven export formats for fine-tuning and RAG pipelines. Smart splitting, filtering, and stratification built in.
Format Overview
Same curation, different destinations. Export curated data for fine-tuning (Alpaca, MLX, Unsloth, ShareGPT, TRL) or for RAG indexing pipelines (JSONL, CSV). The preparation is identical; only the output format changes.
| Format | Flag | Fine-Tuning | RAG |
|---|---|---|---|
| Alpaca | --format alpaca | Standard instruction format | — |
| ShareGPT | --format sharegpt | Multi-turn conversations | — |
| JSONL | --format jsonl | Pipeline-ready streaming | Embedding pipeline input |
| CSV | --format csv | Analysis / spreadsheets | Document metadata |
| MLX | --format mlx | Apple Silicon (MLX-LM) | — |
| Unsloth | --format unsloth | Fast, memory-efficient | — |
| TRL | --format trl | HuggingFace ecosystem | — |
    # Basic export
    curator export --dataset 2 --format mlx --output train.jsonl

    # Filtered export (approved samples, quality >= 4)
    curator export --dataset 1 --format unsloth --output train.jsonl \
      --filter "status=approved AND quality>=4"

    # Split for train/test/validation with stratification
    curator export --split "0.8,0.1,0.1" --seed 42 --format jsonl \
      --output dataset

Alpaca
The standard instruction format. Compatible with most fine-tuning frameworks.
    [
      {
        "instruction": "Explain this error",
        "input": "TypeError: Cannot read property 'id' of undefined",
        "output": "This error occurs when you try to access a property on an undefined value..."
      }
    ]

JSONL
Line-delimited JSON. Streaming-friendly format for data pipelines.
    {"instruction":"Explain this error","input":"","output":"This error occurs...","category":"coding"}
    {"instruction":"Summarize this document","input":"Long text...","output":"The document describes...","category":"writing"}

CSV
Tabular format with headers. Useful for analysis and spreadsheet workflows.
    instruction,input,output,category,quality
    "Explain this error","TypeError...","This error occurs...","coding",5
    "Summarize this document","Long text...","The document describes...","writing",4

MLX
Optimized for Apple Silicon. Train models on M1/M2/M3/M4 chips using MLX-LM.
    {"text":""}
    {"text":""}

Pair the export with MLX-LM (mlx-lm) for local fine-tuning on a Mac.
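Before handing an MLX export to a training run, it can help to confirm that every line really is a JSON object with a string `text` field, as in the sample above. The following is a minimal sketch of such a check; the function name `check_text_jsonl` is hypothetical and not part of the curator tool or MLX-LM.

```python
import json


def check_text_jsonl(lines):
    """Return (line_number, problem) pairs for lines that are not {"text": <str>} objects."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing newline at end of file
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(obj, dict) or not isinstance(obj.get("text"), str):
            problems.append((number, 'missing string "text" field'))
    return problems


# Illustrative input only; real exports would be read from the output file.
sample = ['{"text": "example one"}', '{"prompt": "no text key"}', "not json"]
problems = check_text_jsonl(sample)
```

Because the function accepts any iterable of lines, an open file handle for `train.jsonl` can be passed in directly.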
Unsloth
Memory-efficient and fast fine-tuning format. Works with the Unsloth library for accelerated training.
    {"instruction":"Explain this error","input":"","output":"This error occurs..."}
    {"instruction":"Summarize this document","input":"Long text...","output":"The document describes..."}

TRL
HuggingFace Transformer Reinforcement Learning format. Integrates with the TRL ecosystem for reward modeling and RLHF.
    {"prompt":"Explain this error","chosen":"This error occurs...","rejected":"I don't know."}
    {"prompt":"Summarize this document","chosen":"The document describes...","rejected":"Not sure."}

Export Options
| Option | Example | Description |
|---|---|---|
| --filter | "status=approved AND quality>=4" | Filter samples by status, quality, category |
| --split | "0.8,0.1,0.1" | Train/test/validation split ratios |
| --seed | 42 | Random seed for reproducible splits |
| --output | train.jsonl | Output file path |
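To make the `--filter` example concrete, here is the same condition written as a plain Python predicate. This sketch only illustrates which samples the expression `status=approved AND quality>=4` is meant to keep; it is not how the tool parses or evaluates filters. The sample records, the `pending` status value, and the `matches` helper are assumptions made for illustration.

```python
# Hypothetical samples; field names follow the filter expression in the table.
samples = [
    {"instruction": "Explain this error", "status": "approved", "quality": 5},
    {"instruction": "Summarize this document", "status": "approved", "quality": 3},
    {"instruction": "Translate this sentence", "status": "pending", "quality": 4},
]


def matches(sample):
    """Python equivalent of the filter: status=approved AND quality>=4."""
    return sample["status"] == "approved" and sample["quality"] >= 4


# Only samples that satisfy both conditions are kept.
kept = [s for s in samples if matches(s)]
```

Both conditions must hold, so an approved sample with quality 3 and an unapproved sample with quality 4 are each dropped.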
Stratified splitting maintains category proportions across train/test/validation sets, ensuring balanced representation.
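The idea behind stratified splitting can be sketched in a few lines of Python: group samples by category, shuffle each group with a fixed seed, and cut every group with the same ratios. This is a minimal illustration under stated assumptions, not the tool's actual implementation; the function name `stratified_split`, its parameters, and the rounding behavior are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=42, key="category"):
    """Split records into train/test/validation, cutting each category with the same ratios."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_category = defaultdict(list)
    for record in records:
        by_category[record[key]].append(record)

    train, test, validation = [], [], []
    for category in sorted(by_category):  # sorted for a deterministic order
        group = by_category[category]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_test = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        test.extend(group[n_train:n_train + n_test])
        validation.extend(group[n_train + n_test:])  # remainder goes to validation
    return train, test, validation


# Illustrative data: 70 "coding" samples and 30 "writing" samples.
records = (
    [{"instruction": f"coding {i}", "category": "coding"} for i in range(70)]
    + [{"instruction": f"writing {i}", "category": "writing"} for i in range(30)]
)
train, test, validation = stratified_split(records)
```

With this input, each of the three sets keeps the original 70/30 mix of coding and writing samples, which a purely random split would only approximate.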