Most fine-tuning tutorials assume you have a few hundred dollars of compute sitting around and unlimited patience. This one doesn’t.
After fine-tuning dozens of models across different tasks — classification, generation, summarisation, structured extraction — here’s what I’ve learned actually moves the needle.
The dataset matters more than the model size
Before you touch a single hyperparameter, fix your data. A 7B model with 500 high-quality examples will beat a 70B model on 5,000 noisy ones almost every time.
What “high quality” means in practice:
- Consistent format — your examples should all look the same. If some use markdown and some don’t, the model learns the inconsistency.
- No ambiguity in labels — if you’re not sure what the right answer is for an example, the model won’t be either.
- Representative distribution — your training set should look like your production inputs.
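The first two checks are cheap to automate before you spend a single GPU-hour. Here's a minimal sketch — the `prompt`/`completion` field names are assumptions about your dataset schema, and the format heuristic is deliberately crude:

```python
# Flag two common data problems: inconsistent formatting across examples,
# and the same prompt mapped to conflicting completions (ambiguous labels).

def find_data_issues(examples):
    issues = []
    seen = {}
    reference_is_markdown = examples[0]["completion"].lstrip().startswith("#")
    for i, ex in enumerate(examples):
        prompt, completion = ex["prompt"], ex["completion"]
        # Consistent format: compare a crude markdown heuristic to example 0
        if completion.lstrip().startswith("#") != reference_is_markdown:
            issues.append((i, "inconsistent format"))
        # Ambiguous labels: same prompt, different completion
        if prompt in seen and seen[prompt] != completion:
            issues.append((i, "conflicting label"))
        seen[prompt] = completion
    return issues

data = [
    {"prompt": "Classify: great product", "completion": "positive"},
    {"prompt": "Classify: great product", "completion": "negative"},  # conflict
]
print(find_data_issues(data))  # → [(1, 'conflicting label')]
```

Swap the markdown heuristic for whatever "consistent format" means for your task; the point is that these checks run in seconds and catch problems that cost hours to debug after training.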
LoRA is almost always the right call
Full fine-tuning on consumer hardware is rarely worth it. LoRA (and its variants — QLoRA especially) lets you train adapters on top of a frozen base model with a fraction of the memory.
The key parameters to tune:
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # rank — start here, go up if underfitting
    lora_alpha=32,         # usually 2x rank
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
With QLoRA on a single A100 (or even a 3090), you can fine-tune a 13B model in a few hours.
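A typical QLoRA setup is mostly configuration. The sketch below assumes the `transformers` + `peft` + `bitsandbytes` stack; the model name is just an example, and `lora_config` is the config from the snippet above:

```python
# Config sketch: load the base model quantised to 4-bit, then attach
# LoRA adapters on top of the frozen weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantisation
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",             # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)   # only adapter weights are trainable
```

Only the adapter parameters get gradients, which is where the memory savings come from — the 4-bit base model is never updated.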
Evaluation is the part people skip
Validation loss going down doesn’t mean your model is getting better at your task. Build a small eval set (50–100 examples) and measure what you actually care about.
For generation tasks, I usually track:
- Task-specific metrics (ROUGE, BERTScore, exact match depending on task)
- A small human eval pass on the worst-performing examples
- Regression on a held-out “golden” set that represents your core use case
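Even the golden-set check can be a dozen lines. A minimal sketch, where `generate` stands in for your model's inference call and exact match is the metric (swap in ROUGE or BERTScore as your task demands):

```python
# Tiny eval harness: score a golden set and surface failures
# for a human review pass.

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(generate, golden_set):
    results = [exact_match(generate(ex["input"]), ex["expected"])
               for ex in golden_set]
    score = sum(results) / len(results)
    # Keep the misses — these are the examples worth reading by hand
    failures = [ex for ex, ok in zip(golden_set, results) if not ok]
    return score, failures

golden = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
fake_model = lambda x: {"2+2": "4", "capital of France": "paris"}[x]
score, failures = evaluate(fake_model, golden)
print(score, failures)  # → 1.0 []
```

Run this on every checkpoint, not just the final one — a model whose validation loss keeps dropping can still regress on the golden set.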
The dirty secret: prompting often wins
Before you fine-tune anything, spend a week on your prompt. Seriously. Fine-tuning is expensive and slow to iterate. A well-engineered few-shot prompt with GPT-4 or Claude will outperform a poorly fine-tuned 7B model and cost you nothing upfront.
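The few-shot baseline is also trivially cheap to build. A sketch — the task and examples here are placeholders, and the template format is one reasonable choice, not the only one:

```python
# Minimal few-shot prompt builder: instruction, worked examples, then
# the query in the same format.

def build_few_shot_prompt(instruction, examples, query):
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in examples
    )
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [{"input": "Loved it", "output": "positive"},
     {"input": "Waste of money", "output": "negative"}],
    "Shipping was fast and it works great",
)
print(prompt)
```

Iterating on this template takes minutes per attempt; iterating on a fine-tune takes hours. That asymmetry is the whole argument.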
Fine-tune when:
- You have latency requirements prompting can’t meet
- You need to run on-prem or can’t send data to a third party
- You’ve hit the ceiling of what prompting can do and have the data to go further
More on this soon — next up I’ll cover evaluation pipelines for production fine-tuned models.