With the evals to prove it's actually good. mlx-forge is a Mac-native fine-tuning factory for open-weight models.
The next wave of AI isn't in the cloud; it's on the device in your pocket. Open-weight models are getting better faster than proprietary ones, and Apple Silicon makes the economics work for the first time.
Training data and inference stay on your machine. No API calls, no data leaving the device.
No round-trip to a server. Responses start in milliseconds, not seconds.
You own the weights. Once trained, inference is free, forever, at any scale.
Qwen, Phi, Mistral, Llama: the open-weight ecosystem is accelerating. Local distribution is how they ship.
Most fine-tuning tools show you a loss curve and call it done. mlx-forge makes evaluation the centre of the workflow. Without a trustworthy eval, a loop just produces bad models faster.
Every recipe ships a real scorer.py that returns a single comparable score (0 to 1). The auto-search loop uses that score as the ratchet: it only keeps experiments that beat the current best.
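The ratchet described above can be sketched in a few lines of Python. This is a minimal illustration, not mlx-forge's actual loop: `propose`, `ratchet_search`, `toy_score`, and the `train_and_score` callback are hypothetical names, and the halve-or-double proposal rule is an assumption chosen only to keep the example short.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def propose(config: Config, rng: random.Random) -> Config:
    """Return a copy of config with one hyperparameter halved or doubled.

    Hypothetical proposal rule, used here only for illustration.
    """
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] = candidate[key] * rng.choice([0.5, 2.0])
    return candidate


def ratchet_search(
    initial: Config,
    train_and_score: Callable[[Config], float],
    n_experiments: int,
    target_score: float,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Keep a candidate only if it strictly beats the current best score."""
    rng = random.Random(seed)
    best_config = dict(initial)
    best_score = train_and_score(best_config)
    for _ in range(n_experiments):
        if best_score >= target_score:
            break  # good enough, stop early
        candidate = propose(best_config, rng)
        score = train_and_score(candidate)
        if score > best_score:  # ties and regressions are discarded
            best_config, best_score = candidate, score
    return best_config, best_score


def toy_score(config: Config) -> float:
    """Stand-in for 'train, then run the scorer'; peaks at learning_rate = 1e-4."""
    return max(0.0, 1.0 - abs(config["learning_rate"] - 1e-4) * 1e3)


best_config, best_score = ratchet_search(
    {"learning_rate": 1e-3}, toy_score, n_experiments=20, target_score=0.90
)
```

The strict `>` comparison is what makes this a ratchet: a tie or a regression never replaces the current best, so the best score can only go up.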
Each recipe is self-contained: config, data, and a task-specific scorer. Copy one to start your own.
Fine-tune any model to reliably call functions. The reference recipe: build this first.
Compact assistant for on-device deployment. Scores correctness and conciseness: verbose answers fail.
ICD-10-CM code assignment with mandatory abstention on out-of-scope requests. A confident wrong answer scores 0.
The model generates its own training data. Good outputs get added to train.jsonl. Retrain. Repeat.
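To make the scoring rules in the recipes above concrete, here is a rough sketch of how an abstention-aware scorer for the ICD-10-CM recipe might look. It is not the shipped scorer: the `ABSTAIN` sentinel, the function names, and the exact-match rule are all assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical sentinel the model is expected to emit for out-of-scope requests.
ABSTAIN = "ABSTAIN"


def score_example(prediction: str, expected: Optional[str]) -> float:
    """Score one example as 0.0 or 1.0.

    expected is None when the request is out of scope, in which case
    the only correct output is the abstention sentinel.
    """
    pred = prediction.strip().upper()
    if expected is None:
        return 1.0 if pred == ABSTAIN else 0.0
    # In-scope request: a wrong code (or an unnecessary abstention) scores 0.
    return 1.0 if pred == expected.strip().upper() else 0.0


def score_dataset(pairs: Iterable[Tuple[str, Optional[str]]]) -> float:
    """Average per-example scores into a single comparable number in [0, 1]."""
    scores = [score_example(pred, exp) for pred, exp in pairs]
    return sum(scores) / len(scores) if scores else 0.0


examples = [
    ("E11.9", "E11.9"),  # correct code
    ("ABSTAIN", None),   # correct abstention
    ("J45.909", None),   # confident answer to an out-of-scope request
    ("I10", "E11.9"),    # confident wrong code
]
print(score_dataset(examples))  # 0.5
```

Averaging binary per-example scores keeps the result in the 0 to 1 range, so it stays directly comparable between experiments.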
Five steps from base model to a model running on your phone.
Pick any mlx-community model in Hugging Face safetensors format. 4-bit quantised 7B models fit in 4 to 6 GB of unified memory.
Pick a built-in recipe (toolcalling, edge_android, healthcare_coding) or copy one and write your own eval.py. Minimum 100 training examples.
mlx-forge calls mlx_lm lora under the hood. A 500-iter run on a 7B model takes ~25 minutes on an M2 Pro.
Run the ratchet loop: propose a hyperparameter change → train → score → keep if better → repeat. Git history is your experiment log.
Fuse the LoRA adapter into full weights. Export to GGUF for Ollama on-device. Push to Hugging Face. mlx-forge never publishes automatically.
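The memory figure quoted in the model-selection step can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes group-wise quantisation with one 16-bit scale and one 16-bit bias per group of 64 weights, and assumes every parameter is quantised; both are simplifications, and `quantized_weight_gb` is a hypothetical helper, not part of mlx-forge.

```python
def quantized_weight_gb(
    n_params: float,
    bits: int = 4,
    group_size: int = 64,   # assumed group size
    scale_bits: int = 16,   # assumed per-group scale width
    bias_bits: int = 16,    # assumed per-group bias width
) -> float:
    """Approximate weight storage in GB (10**9 bytes) for group-wise quantisation."""
    # Each group of weights carries one scale and one bias, amortised per weight.
    overhead_bits = (scale_bits + bias_bits) / group_size
    return n_params * (bits + overhead_bits) / 8 / 1e9


weights_gb = quantized_weight_gb(7e9)
print(f"{weights_gb:.2f} GB")  # 3.94 GB for the weights alone
```

Under these assumptions the weights take roughly 3.9 GB; the KV cache, activations, and LoRA training state come on top, which is consistent with the 4 to 6 GB range above.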
Requires Apple Silicon Mac, macOS 14+, and uv.
# clone and install
git clone https://github.com/abdouloued/mlx-forge
cd mlx-forge && uv sync

# run the test suite (no model needed, 120 tests)
uv run pytest -v

# fine-tune
uv run python -m core.train --recipe recipes/toolcalling/recipe.yaml

# evaluate
uv run python -m recipes.toolcalling.eval \
  --model-path adapters/toolcalling \
  --data-path recipes/toolcalling/data/valid.jsonl
# → tool_calling_score=0.XXXX

# run overnight loop (auto-search)
uv run python -m core.loop \
  --recipe recipes/toolcalling/recipe.yaml \
  --n-experiments 20 \
  --target-score 0.90

# fuse adapter into full weights
uv run python -m core.fuse --recipe recipes/toolcalling/recipe.yaml
mlx-forge is MIT licensed. Contributions, new recipes, and bug reports are welcome.