Tags: ai, llm, python, fine-tuning

Fine-tuning Llama 3.2 on Custom Datasets — What I Learned

Gordon Todd Watts · February 14, 2026 · 2 min read

After three weeks of iteration on fine-tuning Llama 3.2 with custom datasets, here are the practical lessons that saved me from wasting GPU hours.

The Setup

I spent three weeks fine-tuning Llama 3.2 3B on a custom dataset of technical documentation and code review comments. The goal: a model that understands our specific codebase conventions and gives relevant, contextual code reviews.

What I Got Wrong First

Mistake 1: Prioritizing quantity over quality

I started with 50,000 examples scraped from GitHub PRs. The model learned nothing useful because the signal-to-noise ratio was terrible. Random GitHub PRs contain too many edge cases, humor, and context that doesn't generalize.

Once I cut down to 8,000 highly curated examples, filtered for technical depth, actionable feedback, and concrete code references, the model started to become useful.
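As an illustration, a curation pass like this can be sketched as a simple scoring filter. The keyword list, length threshold, and code-reference regex below are hypothetical stand-ins, not the exact filters I used:

```python
import re

# Illustrative vocabulary of "actionable" review language
ACTION_WORDS = {"should", "consider", "replace", "rename", "extract", "add", "remove"}

def is_useful_review(comment: str) -> bool:
    """Keep a review comment only if it is substantive, actionable,
    and refers to concrete code."""
    words = comment.lower().split()
    substantive = len(words) >= 10  # drops "LGTM!", "nice!", etc.
    actionable = any(w.strip(".,") in ACTION_WORDS for w in words)
    # Inline code span (`...`) or a function call like fetchUser()
    has_code_ref = bool(re.search(r"`[^`]+`|\b\w+\(\)", comment))
    return substantive and actionable and has_code_ref

examples = [
    "LGTM!",
    "You should replace the manual loop with `Array.map()` "
    "to avoid the off-by-one in `fetchUsers()`.",
]
kept = [e for e in examples if is_useful_review(e)]  # keeps only the second
```

Filters this crude still cut a surprising amount of noise; the remaining examples got a manual pass on top.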

Mistake 2: Learning rate schedule

I used a constant learning rate and watched the model catastrophically forget everything it knew. Use cosine annealing with warmup. Always.

```python
from transformers import get_cosine_schedule_with_warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=total_steps,
)
```
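To make the shape of that schedule concrete, here is the same curve in plain Python: linear warmup to the peak learning rate, then cosine decay to zero. This is a sketch of the standard formula, not `transformers`' exact implementation:

```python
import math

def cosine_with_warmup(step: int, warmup: int, total: int, peak_lr: float) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The curve starts at 0, peaks at the end of warmup, and decays to ~0:
# cosine_with_warmup(0, 100, 1000, 2e-4)    -> 0.0
# cosine_with_warmup(100, 100, 1000, 2e-4)  -> 2e-4 (peak)
# cosine_with_warmup(1000, 100, 1000, 2e-4) -> ~0.0
```

The warmup is what prevents the early high-loss steps from blowing away the pretrained weights; the decay is what lets training settle instead of oscillating.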

The Dataset Format That Worked

Structured as instruction/output pairs with strong system prompts:

```json
{
  "instruction": "Review this TypeScript function for correctness and style",
  "input": "function fetchUser(id: number) { return fetch('/api/user/' + id) }",
  "output": "Issues found:\n1. Missing async/await — fetch returns a Promise\n2. No error handling\n3. String concatenation instead of template literals\n\nFixed version:..."
}
```
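During training, each record gets rendered into the model's chat template. A minimal formatter, hand-writing Llama 3's special tokens for clarity (in practice I'd lean on `tokenizer.apply_chat_template` rather than hard-coding these):

```python
def format_example(record: dict) -> str:
    """Render one instruction/input/output record as a Llama 3 chat string."""
    user_turn = f"{record['instruction']}\n\n{record['input']}"
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_turn}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{record['output']}<|eot_id|>"
    )
```

Whatever template you use, use the same one at inference time; a train/serve template mismatch quietly wrecks output quality.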

QLoRA — Your GPU Budget's Best Friend

QLoRA pairs 4-bit quantization with LoRA adapters, letting you fine-tune a 7B model on a single 40 GB A100. This is the way.
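The arithmetic behind that claim, weights only (activations, gradients, and optimizer state come on top, which is exactly what LoRA keeps small by training only the adapters):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory to hold just the model weights, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(7e9, 16)  # 14.0 GB: already tight on a 40 GB card
nf4_gb = weight_memory_gb(7e9, 4)    # 3.5 GB: leaves real headroom for training
```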

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load the base model in 4-bit and attach the LoRA adapters
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
```

Results

After 3 epochs on the curated dataset, the model produces code review comments that are:

- Contextually aware of TypeScript specifics
- Actionable (not vague "consider improving X")
- Consistent in format with our team's style

Eval on held-out examples: 73% rated "useful or better" by team members vs 31% for GPT-4 zero-shot on the same examples. Context matters.

What I'd Do Differently

1. Start with 1,000 examples, validate the approach, then scale
2. Use Weights & Biases from day one — tracking is non-negotiable
3. Build the eval harness BEFORE training, not after
4. Have humans rate outputs during training, not just perplexity
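The eval-harness point deserves code. A minimal version of the harness I'd now build first, with `model_fn` and `rate` as placeholders for your generation call and human-rating step:

```python
def evaluate(model_fn, examples, rate, threshold=3):
    """Fraction of held-out examples whose output is rated at or above
    `threshold` (e.g. 3 = "useful" on a 1-5 scale)."""
    useful = sum(
        1 for ex in examples
        if rate(ex, model_fn(ex["instruction"], ex["input"])) >= threshold
    )
    return useful / len(examples)

# Stubbed usage: swap in real generation and a real rating UI.
examples = [
    {"instruction": "Review this function", "input": "function f() {}"},
    {"instruction": "Review this type", "input": "type T = any"},
]
score = evaluate(
    model_fn=lambda instr, inp: "Issues found: ...",
    examples=examples,
    rate=lambda ex, out: 4,  # every output rated 4/5 in this stub
)
# score == 1.0 with the all-4s stub
```

The point is not sophistication; it's that the same function runs unchanged against the base model, each checkpoint, and a GPT-4 baseline, so numbers like the 73% vs 31% above are comparable.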

The model is now running in production for our internal code review pipeline. Inference cost: ~$0.0001 per review via vLLM.

Gordon Todd Watts

Full Stack Developer & AI Builder based in West Palm Beach, FL. Building realtime systems, LLM integrations, and open-source tools.