Apple Silicon · MLX · Open-weight models

Overnight, your Mac fine-tunes
and ships a model that runs
on your phone.

With the evals to prove it's actually good. mlx-forge is a Mac-native fine-tuning factory for open-weight models.

Quick start → View on GitHub
🤖 Base model (HF safetensors)
→ ⚡ MLX LoRA fine-tune
→ 📊 Eval score (per task)
→ 🔀 Auto-search ratchet loop
→ 📦 Fuse + export GGUF / HF
→ 📱 Runs on phone / edge
128GB unified memory, no VRAM wall
4 recipes ready to run
120 tests, no model needed
0 auto-publishes: you control shipping

Why local inference?

The next wave of AI isn't in the cloud; it's on the device in your pocket. Open-weight models are getting better faster than proprietary ones, and Apple Silicon makes the economics work for the first time.

🔒

Privacy by default

Training data and inference stay on your machine. No API calls, no data leaving the device.

⚡

No network latency

No round trip to a server, so responses start without waiting on the network.

💰

Zero inference cost

You own the weights. Once trained, inference is free, forever, at any scale.

📦

Open weights are the future

Qwen, Phi, Mistral, Llama: the open-weight ecosystem is accelerating. Local distribution is how they ship.

Evals are the product

Most fine-tuning tools show you a loss curve and call it done. mlx-forge makes evaluation the centre of the workflow. Without a trustworthy eval, a loop just produces bad models faster.

Every recipe ships a real scorer.py that returns a single comparable score (0–1). The auto-search loop uses that score as the ratchet: it only keeps experiments that beat the current best.

The toolcalling scorer, for example, awards a quarter point for each criterion met:

+¼ called_tool: Output is a parseable tool call, not prose
+¼ valid_json: Arguments field parses as a valid JSON dict
+¼ correct_name: Function name matches the expected function
+¼ correct_args: All expected argument keys present with correct values
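The four criteria can be sketched as a small scoring function. This is an illustration only, not the recipe's actual scorer.py: the function name, the `{"name": ..., "arguments": ...}` output shape, and the choice to score the last three criteria independently once the output parses are all assumptions.

```python
import json


def score_tool_call(output: str, expected: dict) -> float:
    """Return a score in {0, 0.25, 0.5, 0.75, 1.0}, one quarter per criterion.

    Hypothetical sketch: assumes the model emits a JSON object shaped like
    {"name": "...", "arguments": {...}} and that `expected` has the same shape.
    """
    score = 0.0

    # called_tool: the output must parse as a JSON object with a "name" field.
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return score
    if not isinstance(call, dict) or "name" not in call:
        return score
    score += 0.25

    # valid_json: "arguments" is a dict, or a string that decodes to one.
    args = call.get("arguments")
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            args = None
    if isinstance(args, dict):
        score += 0.25
    else:
        args = None

    # correct_name: the function name matches the expected one.
    if call["name"] == expected["name"]:
        score += 0.25

    # correct_args: every expected key is present with the expected value.
    if args is not None and all(
        key in args and args[key] == value
        for key, value in expected["arguments"].items()
    ):
        score += 0.25

    return score
```

Because every criterion is worth the same fixed amount, two runs can be compared by averaging this score over the validation set.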

Built-in recipes

Each recipe is self-contained: config, data, and a task-specific scorer. Copy one to start your own.

toolcalling

JSON Tool-calling

Fine-tune any model to reliably call functions. The reference recipe β€” build this first.

scorer: 4-criterion (0–1) · 100 train / 30 valid
edge_android

Edge / Mobile

Compact assistant for on-device deployment. Scores correctness and conciseness: verbose answers fail.

scorer: keyword + word budget · 100 train / 30 valid
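A keyword-plus-word-budget scorer could look roughly like the sketch below. The function name, the parameters, and the rule that the score is the fraction of keywords found are assumptions for illustration; the only behaviour taken from the description above is that an over-budget answer fails outright.

```python
def score_concise_answer(answer: str, keywords: list[str], max_words: int) -> float:
    """Hypothetical scorer: 0.0 if empty or over budget, else the keyword hit rate."""
    words = answer.split()
    # Conciseness gate: an empty or verbose answer fails regardless of content.
    if not words or len(words) > max_words:
        return 0.0
    if not keywords:
        return 0.0
    text = answer.lower()
    # Correctness proxy: case-insensitive substring match on each keyword.
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)
```

Putting the word budget first means a long answer cannot buy back points by mentioning every keyword.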
healthcare_coding

Healthcare Coding

ICD-10-CM code assignment with mandatory abstention on out-of-scope requests. A confident wrong answer scores 0.

scorer: correct code or refusal · synthetic data only
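The "correct code or refusal" rule can be pictured as an exact-match check in which abstaining is one of the valid labels. The `ABSTAIN` marker, the function name, and the normalisation are hypothetical; the recipe's real scorer may represent refusals differently.

```python
ABSTAIN = "ABSTAIN"  # hypothetical refusal marker, not taken from the recipe


def score_code_assignment(prediction: str, expected: str) -> float:
    """Return 1.0 only when the prediction equals the expected label.

    `expected` is either an ICD-10-CM code or ABSTAIN for an out-of-scope
    request, so a confident code on an out-of-scope request scores 0.0.
    """
    normalised = prediction.strip().upper()
    return 1.0 if normalised == expected.strip().upper() else 0.0
```

Under this sketch an unnecessary refusal on an in-scope request also scores 0.0, which is an assumption about the intended behaviour.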
data_flywheel

Data Flywheel

The model generates its own training data. Good outputs get added to train.jsonl. Retrain. Repeat.

generate → judge → append → retrain · 3 rounds default
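As a rough sketch of the generate, judge, append, retrain cycle: the function names, the JSONL record shape, and the acceptance threshold below are assumptions, and `generate`, `judge`, and `retrain` stand in for whatever the recipe actually calls.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def flywheel_round(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    train_path: str,
    threshold: float = 1.0,
) -> int:
    """Append judged-good generations to the training file; return how many were kept."""
    kept = 0
    with Path(train_path).open("a", encoding="utf-8") as handle:
        for prompt in prompts:
            completion = generate(prompt)
            # Only outputs the judge scores at or above the threshold are kept.
            if judge(prompt, completion) >= threshold:
                record = {"prompt": prompt, "completion": completion}
                handle.write(json.dumps(record) + "\n")
                kept += 1
    return kept


def run_flywheel(
    prompts: list[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    retrain: Callable[[str], None],
    train_path: str,
    rounds: int = 3,
) -> list[int]:
    """Run generate -> judge -> append -> retrain for a fixed number of rounds."""
    added_per_round = []
    for _ in range(rounds):
        added_per_round.append(flywheel_round(prompts, generate, judge, train_path))
        retrain(train_path)
    return added_per_round
```

The judge is the weak point of any loop like this: if it accepts bad outputs, each round trains on them, which is why the recipe pairs the loop with a scorer.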

How it works

Five steps from base model to a model running on your phone.

1

Download a base model

Pick any mlx-community model in Hugging Face safetensors format. 4-bit quantised 7B models fit in 4–6GB of unified memory.
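The memory figure follows from simple arithmetic on the weights. The helper below is an illustrative back-of-the-envelope estimate, not part of mlx-forge, and it counts 1 GB as 10^9 bytes.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


print(weight_memory_gb(7e9, 4))   # 3.5
print(weight_memory_gb(7e9, 16))  # 14.0
```

The raw 4-bit weights of a 7B model come to about 3.5GB; quantisation metadata, the KV cache, and activations plausibly account for the rest of the 4–6GB range.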

2

Choose or create a recipe

Pick a built-in recipe (toolcalling, edge_android, healthcare_coding) or copy one and write your own eval.py. Use at least 100 training examples.

3

Fine-tune with LoRA on Apple Silicon

mlx-forge calls mlx_lm lora under the hood. A 500-iteration run on a 7B model takes ~25 minutes on an M2 Pro.
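For orientation, a LoRA run through the `mlx_lm.lora` console script can be assembled like this. Which flags mlx-forge itself passes is not stated here, so the helper, its defaults, and the placeholder model name are assumptions; the flags shown are standard mlx-lm options.

```python
import shlex


def lora_command(model: str, data_dir: str, adapter_path: str, iters: int = 500) -> list[str]:
    """Build (but do not run) an mlx_lm.lora training command."""
    return [
        "mlx_lm.lora",
        "--model", model,                # Hugging Face repo id or local path
        "--train",                       # run training rather than evaluation only
        "--data", data_dir,              # directory holding train.jsonl and valid.jsonl
        "--iters", str(iters),           # number of training iterations
        "--adapter-path", adapter_path,  # where the LoRA adapter is written
    ]


# "mlx-community/<model-name>" is a placeholder, not a real repository id.
cmd = lora_command(
    "mlx-community/<model-name>",
    "recipes/toolcalling/data",
    "adapters/toolcalling",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the run on a machine with mlx-lm installed.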

4

Auto-search overnight (optional)

Run the ratchet loop: propose a hyperparameter change → train → score → keep if better → repeat. Git history is your experiment log.
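The ratchet can be sketched in a few lines. This is a minimal illustration under assumptions, not the code in core.loop: `propose` and `train_and_score` are hypothetical callables, and the two limits mirror the `--n-experiments` and `--target-score` options used in the quick start.

```python
from typing import Callable


def ratchet_search(
    base_config: dict,
    train_and_score: Callable[[dict], float],
    propose: Callable[[dict], dict],
    n_experiments: int = 20,
    target_score: float = 0.90,
) -> tuple[dict, float, list[tuple[dict, float, bool]]]:
    """Keep a candidate config only if it strictly beats the best score so far."""
    best_config = dict(base_config)
    best_score = train_and_score(best_config)
    history = [(best_config, best_score, True)]  # (config, score, kept)
    for _ in range(n_experiments):
        if best_score >= target_score:
            break  # good enough, stop early
        candidate = propose(best_config)    # change one hyperparameter
        score = train_and_score(candidate)  # train, then run the recipe scorer
        kept = score > best_score           # the ratchet: never accept a regression
        history.append((candidate, score, kept))
        if kept:
            best_config, best_score = candidate, score
    return best_config, best_score, history
```

Each `history` entry corresponds to one experiment; committing the config after each step is one way to make Git the experiment log.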

5

Ship when you decide it's ready

Fuse the LoRA adapter into full weights. Export to GGUF for Ollama on-device. Push to Hugging Face. mlx-forge never publishes automatically.

Quick start

Requires Apple Silicon Mac, macOS 14+, and uv.

bash
# clone and install
git clone https://github.com/abdouloued/mlx-forge
cd mlx-forge && uv sync

# run the test suite (no model needed, 120 tests)
uv run pytest -v

# fine-tune
uv run python -m core.train --recipe recipes/toolcalling/recipe.yaml

# evaluate
uv run python -m recipes.toolcalling.eval \
  --model-path adapters/toolcalling \
  --data-path recipes/toolcalling/data/valid.jsonl
# → tool_calling_score=0.XXXX

# run overnight loop (auto-search)
uv run python -m core.loop \
  --recipe recipes/toolcalling/recipe.yaml \
  --n-experiments 20 \
  --target-score 0.90

# fuse adapter into full weights
uv run python -m core.fuse --recipe recipes/toolcalling/recipe.yaml

Get involved

mlx-forge is MIT-licensed. Contributions, new recipes, and bug reports are welcome.

⭐ Star on GitHub

Browse the source, open issues, and follow development.

🐛 Report a bug

Something broken? Open an issue with the bug report template.

🍴 Add a recipe

Read the contributing guide and add your own fine-tuning task.

📋 Changelog

See what's new in each release.