Export Formats

Seven export formats for fine-tuning and RAG pipelines. Smart splitting, filtering, and stratification built in.

Format Overview

Same curation, different destinations. Export curated data for fine-tuning (Alpaca, MLX, Unsloth, ShareGPT, TRL) or for RAG indexing pipelines (JSONL, CSV). The preparation is identical; only the output format changes.

Format     Flag                Fine-Tuning                   RAG
Alpaca     --format alpaca     Standard instruction format   -
ShareGPT   --format sharegpt   Multi-turn conversations      -
JSONL      --format jsonl      Pipeline-ready streaming      Embedding pipeline input
CSV        --format csv        Analysis / spreadsheets       Document metadata
MLX        --format mlx        Apple Silicon (MLX-LM)        -
Unsloth    --format unsloth    Fast, memory-efficient        -
TRL        --format trl        HuggingFace ecosystem         -
Terminal
    # Basic export
    curator export --dataset 2 --format mlx --output train.jsonl

    # Filtered export (approved samples, quality >= 4)
    curator export --dataset 1 --format unsloth --output train.jsonl \
      --filter "status=approved AND quality>=4"

    # Split for train/test/validation with stratification
    curator export --split "0.8,0.1,0.1" --seed 42 --format jsonl \
      --output dataset

Alpaca

The standard instruction format. Compatible with most fine-tuning frameworks.

alpaca.json
    [
      {
        "instruction": "Explain this error",
        "input": "TypeError: Cannot read property 'id' of undefined",
        "output": "This error occurs when you try to access a property on an undefined value..."
      }
    ]
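As an illustration of how the three fields are usually consumed, here is a minimal Python sketch that renders one Alpaca record into a single training string. The template wording follows the widely used Stanford Alpaca prompt; the function name `render_alpaca` is hypothetical, and a given fine-tuning framework may apply its own template instead.

```python
# Prompt templates in the style of the Stanford Alpaca project.
WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def render_alpaca(record: dict) -> str:
    """Render one Alpaca record (hypothetical helper) as prompt + response."""
    # Records with an empty "input" use the shorter template.
    template = WITH_INPUT if record.get("input") else NO_INPUT
    prompt = template.format(
        instruction=record["instruction"], input=record.get("input", "")
    )
    return prompt + record["output"]


example = {
    "instruction": "Explain this error",
    "input": "TypeError: Cannot read property 'id' of undefined",
    "output": "This error occurs when you try to access a property on an undefined value...",
}
rendered = render_alpaca(example)
```

Records with an empty `input` fall back to the shorter template, so the `### Input:` header never appears with nothing under it.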

ShareGPT

Multi-turn conversation format. Ideal for dialogue and chat models.

sharegpt.json
    [
      {
        "conversations": [
          { "from": "system", "value": "You are a helpful assistant." },
          { "from": "human", "value": "How do I reset my password?" },
          { "from": "gpt", "value": "Go to Settings → Account → Reset Password..." }
        ]
      }
    ]
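Many chat templates expect `role`/`content` messages rather than `from`/`value` turns. The sketch below shows one common way to map between the two; the role mapping (`human` to `user`, `gpt` to `assistant`) is a convention assumed here, and `sharegpt_to_messages` is a hypothetical helper rather than part of the export tool.

```python
# Assumed mapping from ShareGPT speaker tags to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(sample: dict) -> list[dict]:
    """Convert one ShareGPT sample into a list of role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]


sample = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "How do I reset my password?"},
        {"from": "gpt", "value": "Go to Settings → Account → Reset Password..."},
    ]
}
messages = sharegpt_to_messages(sample)
```

An unknown speaker tag raises `KeyError`, which surfaces malformed turns early instead of silently dropping them.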

JSONL

Line-delimited JSON. Streaming-friendly format for data pipelines.

dataset.jsonl
    {"instruction":"Explain this error","input":"","output":"This error occurs...","category":"coding"}
    {"instruction":"Summarize this document","input":"Long text...","output":"The document describes...","category":"writing"}
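Because each line is an independent JSON object, a consumer can process the file one record at a time without loading it all into memory. A minimal sketch, using an in-memory list as a stand-in for an open file handle (the helper name `iter_jsonl` is hypothetical):

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for `open("dataset.jsonl")`; a file handle iterates the same way.
sample_lines = [
    '{"instruction":"Explain this error","input":"","output":"This error occurs...","category":"coding"}\n',
    '{"instruction":"Summarize this document","input":"Long text...","output":"The document describes...","category":"writing"}\n',
]

# Count records per category while streaming.
category_counts = Counter(rec["category"] for rec in iter_jsonl(sample_lines))
```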

CSV

Tabular format with headers. Useful for analysis and spreadsheet workflows.

dataset.csv
    instruction,input,output,category,quality
    "Explain this error","TypeError...","This error occurs...","coding",5
    "Summarize this document","Long text...","The document describes...","writing",4
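For quick analysis outside a spreadsheet, the CSV can be read with Python's standard `csv` module. This sketch assumes the column names shown in the sample above and converts `quality` to an integer, since CSV fields are always read as strings; `load_rows` is a hypothetical helper.

```python
import csv
import io

SAMPLE_CSV = '''instruction,input,output,category,quality
"Explain this error","TypeError...","This error occurs...","coding",5
"Summarize this document","Long text...","The document describes...","writing",4
'''


def load_rows(text: str) -> list[dict]:
    """Parse CSV text into dicts, converting the quality column to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["quality"] = int(row["quality"])
        rows.append(row)
    return rows


rows = load_rows(SAMPLE_CSV)
# Keep only the top-rated samples.
top_rated = [r for r in rows if r["quality"] >= 5]
```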

MLX

Optimized for Apple Silicon. Train models on M1/M2/M3/M4 chips using MLX-LM.

train.jsonl
    {"text":"This error occurs..."}
    {"text":"The document describes..."}

Each line is a JSON object with a single text field. Pair the file with MLX-LM for local training on a Mac.

Unsloth

Memory-efficient and fast fine-tuning format. Works with the Unsloth library for accelerated training.

train.jsonl
    {"instruction":"Explain this error","input":"","output":"This error occurs..."}
    {"instruction":"Summarize this document","input":"Long text...","output":"The document describes..."}

TRL

Hugging Face TRL (Transformer Reinforcement Learning) format. Each record pairs a prompt with a chosen and a rejected response, which suits reward modeling and RLHF in the TRL ecosystem.

train.jsonl
    {"prompt":"Explain this error","chosen":"This error occurs...","rejected":"I don't know."}
    {"prompt":"Summarize this document","chosen":"The document describes...","rejected":"Not sure."}
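Preference data is only useful when every record has all three fields and the two responses actually differ. The following sketch is one possible sanity check to run before training; the field names come from the sample above, while `check_pair` and its messages are hypothetical.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def check_pair(record: dict) -> list[str]:
    """Return a list of problems found in one preference record."""
    problems = [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    # Identical responses carry no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


line = '{"prompt":"Explain this error","chosen":"This error occurs...","rejected":"I don\'t know."}'
issues = check_pair(json.loads(line))
```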

Export Options

Option     Example                             Description
--filter   "status=approved AND quality>=4"    Filter samples by status, quality, category
--split    "0.8,0.1,0.1"                       Train/test/validation split ratios
--seed     42                                  Random seed for reproducible splits
--output   train.jsonl                         Output file path
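To make the example filter expression concrete, the sketch below evaluates it against a sample in plain Python. It illustrates what `status=approved AND quality>=4` is expected to select; it is not the tool's actual parser, it supports only the `=` and `>=` operators that appear in the example, and the names `parse_filter` and `matches` are hypothetical.

```python
import re

# One clause: field name, operator (only "=" and ">=" handled here), value.
CLAUSE = re.compile(r"^\s*(\w+)\s*(>=|=)\s*(.+?)\s*$")


def parse_filter(expr: str) -> list[tuple[str, str, str]]:
    """Split an expression on AND and parse each clause."""
    clauses = []
    for part in expr.split(" AND "):
        match = CLAUSE.match(part)
        if match is None:
            raise ValueError(f"unsupported clause: {part!r}")
        clauses.append(match.groups())
    return clauses


def matches(sample: dict, clauses: list[tuple[str, str, str]]) -> bool:
    """Return True only if the sample satisfies every clause."""
    for field, op, value in clauses:
        actual = sample.get(field)
        if actual is None:
            return False
        if op == "=":
            if str(actual) != value:
                return False
        elif float(actual) < float(value):  # op is ">="
            return False
    return True


clauses = parse_filter("status=approved AND quality>=4")
keep = matches({"status": "approved", "quality": 5}, clauses)
```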

Stratified splitting maintains category proportions across train/test/validation sets, ensuring balanced representation.
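The idea behind stratified splitting can be sketched in a few lines of Python: group samples by category, shuffle each group with a seeded generator, and cut every group with the same ratios. This is an illustration under stated assumptions, not the tool's implementation: the stratification key is assumed to be `category`, the ratio order follows the train/test/validation order given in the options table, and rounding leftovers fall into the last split.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=42, key="category"):
    """Split records into (train, test, validation), per category."""
    rng = random.Random(seed)  # seeded for reproducible splits
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    train, test, validation = [], [], []
    for category in sorted(groups):  # fixed order keeps runs deterministic
        group = groups[category][:]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_test = int(len(group) * ratios[1])
        train.extend(group[:n_train])
        test.extend(group[n_train:n_train + n_test])
        validation.extend(group[n_train + n_test:])  # remainder
    return train, test, validation


records = (
    [{"id": i, "category": "coding"} for i in range(10)]
    + [{"id": i, "category": "writing"} for i in range(10, 30)]
)
train, test, validation = stratified_split(records)
```

With 10 `coding` and 20 `writing` samples, each split keeps the same 1:2 category ratio as the full dataset.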