Apple Silicon · MLX · Open-weight models

Overnight, your Mac fine-tunes
and ships a model that runs
on your phone.

With the evals to prove it's actually good. mlx-forge is a Mac-native fine-tuning factory for open-weight models.

Why local inference?

The next wave of AI isn't in the cloud — it's on the device in your pocket. Open-weight models are getting better faster than proprietary ones, and Apple Silicon makes the economics work for the first time.

🔒

Privacy by default

Training data and inference stay on your machine. No API calls, no data leaving the device.

⚡

No network latency

No round-trip to a server. Responses start as soon as the model does, with no connection in the way.

💰

Zero inference cost

You own the weights. Once trained, there is no per-token or per-request fee, however much you run the model.

📦

Open weights are the future

Qwen, Phi, Mistral, Llama — the open-weight ecosystem is accelerating. Local distribution is how they ship.

Evals are the product

Most fine-tuning tools show you a loss curve and call it done. mlx-forge makes evaluation the centre of the workflow. Without a trustworthy eval, a loop just produces bad models faster.

Every recipe ships a real scorer.py that returns a single comparable score (0–1). The auto-search loop uses that score as the ratchet — it only keeps experiments that beat the current best.

+¼ called_tool: Output is a parseable tool call, not prose

+¼ valid_json: Arguments field parses as a valid JSON dict

+¼ correct_name: Function name matches the expected function

+¼ correct_args: All expected argument keys present with correct values
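A minimal sketch of how the four quarter-point checks could combine into one 0–1 score. The function name `score_tool_call` and the `{"name": ..., "arguments": ...}` output shape are assumptions made for illustration; the scorer.py that ships with the recipe may be structured differently.

```python
import json


def score_tool_call(output: str, expected_name: str, expected_args: dict) -> float:
    """Hypothetical 4-criterion scorer: each passed check adds 0.25."""
    score = 0.0

    # called_tool: the output must parse as a JSON object carrying a "name" key.
    # Plain prose fails here and scores 0.
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return score
    if not isinstance(call, dict) or "name" not in call:
        return score
    score += 0.25

    # valid_json: arguments must be a dict, or a string that parses to a dict.
    args = call.get("arguments")
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            args = None
    args_ok = isinstance(args, dict)
    if args_ok:
        score += 0.25

    # correct_name: function name matches the expected function.
    if call["name"] == expected_name:
        score += 0.25

    # correct_args: every expected key is present with the expected value.
    if args_ok and all(k in args and args[k] == v for k, v in expected_args.items()):
        score += 0.25

    return score


good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(score_tool_call(good, "get_weather", {"city": "Paris"}))  # 1.0
```

Because every criterion is worth the same quarter point, two runs can be compared on a single number, which is what a keep-if-better loop needs.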

Built-in recipes

Each recipe is self-contained: config, data, and a task-specific scorer. Copy one to start your own.

toolcalling

JSON Tool-calling

Fine-tune any model to reliably call functions. The reference recipe — build this first.

scorer: 4-criterion (0–1) · 100 train / 30 valid

edge_android

Edge / Mobile

Compact assistant for on-device deployment. Scores correctness and conciseness — verbose answers fail.

scorer: keyword + word budget · 100 train / 30 valid

healthcare_coding

Healthcare Coding

ICD-10-CM code assignment with mandatory abstention on out-of-scope requests. A confident wrong answer scores 0.

scorer: correct code or refusal · synthetic data only

data_flywheel

Data Flywheel

The model generates its own training data. Good outputs get added to train.jsonl. Retrain. Repeat.

generate → judge → append → retrain · 3 rounds default
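The generate → judge → append → retrain cycle could look roughly like the sketch below. `generate`, `judge`, and `retrain` are hypothetical callables standing in for the model, the scorer, and the training step, and the 0.9 acceptance threshold and `{"prompt", "completion"}` record layout are assumptions. Only the train.jsonl file name and the three-round default come from the recipe description.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def flywheel(
    prompts: Iterable[str],
    generate: Callable[[str], str],      # hypothetical: model inference
    judge: Callable[[str, str], float],  # hypothetical: returns a 0-1 score
    retrain: Callable[[Path], None],     # hypothetical: fine-tune on the file
    train_path: Path,
    rounds: int = 3,
    threshold: float = 0.9,              # assumed acceptance cut-off
) -> int:
    """Run the flywheel and return how many examples were appended."""
    prompts = list(prompts)
    added = 0
    for _ in range(rounds):
        # generate + judge: keep only outputs the judge rates highly
        accepted = []
        for prompt in prompts:
            completion = generate(prompt)
            if judge(prompt, completion) >= threshold:
                accepted.append({"prompt": prompt, "completion": completion})

        # append: one JSON object per line, added to the existing file
        with train_path.open("a", encoding="utf-8") as f:
            for record in accepted:
                f.write(json.dumps(record) + "\n")
        added += len(accepted)

        # retrain on the grown dataset before the next round
        retrain(train_path)
    return added
```

A real loop would probably also deduplicate accepted examples between rounds; that is left out here to keep the sketch short.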

How it works

Five steps from base model to a model running on your phone.

Download a base model

Pick any mlx-community model in Hugging Face safetensors format. 4-bit quantised 7B models fit in 4–6 GB of unified memory.

Choose or create a recipe

Pick a built-in recipe (toolcalling, edge_android, healthcare_coding) or copy one and write your own eval.py. Minimum 100 training examples.

Fine-tune with LoRA on Apple Silicon

mlx-forge calls mlx_lm lora under the hood. A 500-iter run on a 7B model takes ~25 minutes on an M2 Pro.

Auto-search overnight (optional)

Run the ratchet loop: propose a hyperparameter change → train → score → keep if better → repeat. Git history is your experiment log.

Ship — when you decide it's ready

Fuse the LoRA adapter into full weights. Export to GGUF for Ollama on-device. Push to Hugging Face. mlx-forge never publishes automatically.
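The auto-search step in the list above can be pictured as a simple ratchet. In this sketch `propose` and `train_and_score` are hypothetical callables and the config is a plain dict; the two stopping parameters mirror the `--n-experiments` and `--target-score` flags used in the quick start. The actual core.loop may work differently, for example in how it records each experiment in Git.

```python
from typing import Callable


def ratchet(
    base_config: dict,
    propose: Callable[[dict], dict],           # hypothetical: mutate one hyperparameter
    train_and_score: Callable[[dict], float],  # hypothetical: train, then return a 0-1 score
    n_experiments: int = 20,
    target_score: float = 0.90,
) -> tuple[dict, float]:
    """Keep a candidate config only if it beats the current best score."""
    best_config = dict(base_config)
    best_score = train_and_score(best_config)

    for _ in range(n_experiments):
        if best_score >= target_score:
            break  # good enough, stop early
        candidate = propose(best_config)
        score = train_and_score(candidate)
        if score > best_score:
            # the ratchet only moves forward: worse or equal runs are discarded
            best_config, best_score = candidate, score

    return best_config, best_score
```

Since a candidate is kept only on a strict improvement, the best score can never go down between experiments, which is why the quality of the scorer matters so much.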

Quick start

Requires an Apple Silicon Mac, macOS 14+, and uv.

bash

# clone and install
git clone https://github.com/abdouloued/mlx-forge
cd mlx-forge && uv sync

# run the test suite (no model needed — 120 tests)
uv run pytest -v

# fine-tune
uv run python -m core.train --recipe recipes/toolcalling/recipe.yaml

# evaluate
uv run python -m recipes.toolcalling.eval \
  --model-path adapters/toolcalling \
  --data-path recipes/toolcalling/data/valid.jsonl
# → tool_calling_score=0.XXXX

# run overnight loop (auto-search)
uv run python -m core.loop \
  --recipe recipes/toolcalling/recipe.yaml \
  --n-experiments 20 \
  --target-score 0.90

# fuse adapter into full weights
uv run python -m core.fuse --recipe recipes/toolcalling/recipe.yaml

Get involved

⭐ Star on GitHub

🐛 Report a bug

🍴 Add a recipe

📋 Changelog
