Most fine-tuning tutorials assume you have a few hundred dollars of compute sitting around and unlimited patience. This one doesn’t.
After fine-tuning dozens of models across different tasks — classification, generation, summarisation, structured extraction — here’s what I’ve learned actually moves the needle.
The dataset matters more than the model size
Before you touch a single hyperparameter, fix your data. A 7B model with 500 high-quality examples will beat a 70B model on 5,000 noisy ones almost every time.
What “high quality” means in practice:
- Consistent format — your examples should all look the same. If some use markdown and some don’t, the model learns the inconsistency.
- No ambiguity in labels — if you’re not sure what the right answer is for an example, the model won’t be either.
- Representative distribution — your training set should look like your production inputs.
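The first two checks are cheap to automate before you spend a single GPU-hour. Here's a minimal sketch — the `prompt`/`completion` field names are assumptions about your dataset schema, and the format heuristic is deliberately crude:

```python
# Flag two common data problems: inconsistent formatting across examples,
# and the same prompt mapped to conflicting completions (ambiguous labels).

def find_data_issues(examples):
    issues = []
    seen = {}
    reference_is_markdown = examples[0]["completion"].lstrip().startswith("#")
    for i, ex in enumerate(examples):
        prompt, completion = ex["prompt"], ex["completion"]
        # Consistent format: compare a crude markdown heuristic to example 0
        if completion.lstrip().startswith("#") != reference_is_markdown:
            issues.append((i, "inconsistent format"))
        # Ambiguous labels: same prompt, different completion
        if prompt in seen and seen[prompt] != completion:
            issues.append((i, "conflicting label"))
        seen[prompt] = completion
    return issues

data = [
    {"prompt": "Classify: great product", "completion": "positive"},
    {"prompt": "Classify: great product", "completion": "negative"},  # conflict
]
print(find_data_issues(data))  # → [(1, 'conflicting label')]
```

Swap the markdown heuristic for whatever "consistent format" means for your task; the point is that these checks run in seconds and catch problems that cost hours to debug after training.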
LoRA is almost always the right call
Full fine-tuning on consumer hardware is rarely worth it. LoRA (and its variants — QLoRA especially) lets you train adapters on top of a frozen base model with a fraction of the memory.
The key parameters to tune:
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # rank — start here, go up if underfitting
    lora_alpha=32,         # usually 2x rank
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
With QLoRA on a single A100 (or even a 3090), you can fine-tune a 13B model in a few hours.
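A typical QLoRA setup is mostly configuration. The sketch below assumes the `transformers` + `peft` + `bitsandbytes` stack; the model name is just an example, and `lora_config` is the config from the snippet above:

```python
# Config sketch: load the base model quantised to 4-bit, then attach
# LoRA adapters on top of the frozen weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantisation
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",             # example base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)   # only adapter weights are trainable
```

Only the adapter parameters get gradients, which is where the memory savings come from — the 4-bit base model is never updated.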
Evaluation is the part people skip
Validation loss going down doesn’t mean your model is getting better at your task. Build a small eval set (50–100 examples) and measure what you actually care about.
For generation tasks, I usually track:
- Task-specific metrics (ROUGE, BERTScore, exact match depending on task)
- A small human eval pass on the worst-performing examples
- Regression on a held-out “golden” set that represents your core use case
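Even the golden-set check can be a dozen lines. A minimal sketch, where `generate` stands in for your model's inference call and exact match is the metric (swap in ROUGE or BERTScore as your task demands):

```python
# Tiny eval harness: score a golden set and surface failures
# for a human review pass.

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(generate, golden_set):
    results = [exact_match(generate(ex["input"]), ex["expected"])
               for ex in golden_set]
    score = sum(results) / len(results)
    # Keep the misses — these are the examples worth reading by hand
    failures = [ex for ex, ok in zip(golden_set, results) if not ok]
    return score, failures

golden = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
fake_model = lambda x: {"2+2": "4", "capital of France": "paris"}[x]
score, failures = evaluate(fake_model, golden)
print(score, failures)  # → 1.0 []
```

Run this on every checkpoint, not just the final one — a model whose validation loss keeps dropping can still regress on the golden set.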
The dirty secret: prompting often wins
Before you fine-tune anything, spend a week on your prompt. Seriously. Fine-tuning is expensive and slow to iterate. A well-engineered few-shot prompt with GPT-4 or Claude will outperform a poorly fine-tuned 7B model and cost you nothing upfront.
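The few-shot baseline is also trivially cheap to build. A sketch — the task and examples here are placeholders, and the template format is one reasonable choice, not the only one:

```python
# Minimal few-shot prompt builder: instruction, worked examples, then
# the query in the same format.

def build_few_shot_prompt(instruction, examples, query):
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in examples
    )
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [{"input": "Loved it", "output": "positive"},
     {"input": "Waste of money", "output": "negative"}],
    "Shipping was fast and it works great",
)
print(prompt)
```

Iterating on this template takes minutes per attempt; iterating on a fine-tune takes hours. That asymmetry is the whole argument.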
Fine-tune when:
- You have latency requirements prompting can’t meet
- You need to run on-prem or can’t send data to a third party
- You’ve hit the ceiling of what prompting can do and have the data to go further
More on this soon — next up I’ll cover evaluation pipelines for production fine-tuned models.