
Fine-Tuning Custom Models with Open-Source Tools

This guide outlines the technical workflow for fine-tuning a foundation model using Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA. It focuses on the transition from prompt engineering to a proprietary model, emphasizing dataset integrity, hardware efficiency, and rigorous evaluation against a baseline.

6-10 hours (excluding training compute time) · 6 steps

Step 1: Dataset Schema Definition and Formatting

Standardize your raw data into JSONL: one JSON object per line, with a consistent schema. For instruction tuning, use the Alpaca or ShareGPT format. Ensure every entry has a clear 'instruction', an optional 'input' (context), and an 'output'. The output must represent the 'gold standard' you expect from the model.

dataset.jsonl
{"instruction": "Extract the expiration date from the following medical invoice.", "input": "Invoice #1234, Date: 2023-10-01, Expires: 2024-10-01", "output": "2024-10-01"}

⚠ Common Pitfalls

  • Duplicate entries that lead to overfitting on specific phrases; the validation sketch below catches these.
  • Inconsistent delimiters that confuse the tokenizer during training.
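
Before training, a quick validation pass catches malformed records and the duplicate-entry pitfall above. A minimal sketch, assuming the dataset.jsonl layout shown earlier:

validate_dataset.py
import json

seen = set()
with open("dataset.jsonl") as f:
    for n, line in enumerate(f, 1):
        record = json.loads(line)  # raises if a line is not valid JSON
        missing = {"instruction", "output"} - record.keys()
        if missing:
            raise ValueError(f"line {n}: missing keys {missing}")
        key = (record["instruction"], record.get("input", ""))
        if key in seen:
            raise ValueError(f"line {n}: duplicate entry")
        seen.add(key)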

Step 2: Tokenization and Prompt Template Application

Apply the specific chat template required by your base model (e.g., Llama-3, Mistral, or Phi-3). Use the AutoTokenizer from the Transformers library to ensure special tokens like <|end_of_text|> or [INST] are inserted correctly. A template mismatch between training and inference is a common cause of incoherent or malformed output.

preprocess.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
tokenizer.pad_token = tokenizer.eos_token  # Llama-3 ships without a pad token

def tokenize_function(examples):
    # 'text' must already hold the template-rendered conversation (see below)
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)
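
To produce that 'text' column, render each record through the tokenizer's chat template, then map both steps over a Hugging Face dataset. A minimal sketch continuing preprocess.py, assuming the dataset.jsonl from step 1 (records without an 'input' are treated as instruction-only):

from datasets import load_dataset

def to_text(example):
    # apply_chat_template inserts the model's special tokens for us
    user = example['instruction'] + ('\n' + example['input'] if example.get('input') else '')
    messages = [{'role': 'user', 'content': user},
                {'role': 'assistant', 'content': example['output']}]
    return {'text': tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset('json', data_files='dataset.jsonl', split='train')
dataset = dataset.map(to_text).map(tokenize_function, batched=True)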

⚠ Common Pitfalls

  • Using the wrong chat template for the base model architecture.
  • Failing to set the padding token, leading to runtime errors during batching.

Step 3: Configuring QLoRA for Memory Efficiency

To train on consumer or mid-tier enterprise hardware, use 4-bit quantization (QLoRA). Define the LoRA rank (r) and alpha. A rank of 16 or 32 is typically sufficient for domain adaptation. Target the linear layers (q_proj, k_proj, v_proj, o_proj) to maximize parameter efficiency.

peft_config.py
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                     # adapter rank; 16-32 is usually enough for domain adaptation
    lora_alpha=32,            # scaling factor, commonly set to 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
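
The LoraConfig alone does not put the base model in 4-bit; that comes from the quantization config passed when the model is loaded. A minimal sketch continuing peft_config.py, assuming the Llama-3 checkpoint from step 2:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # re-quantizes the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)  # attach the LoRA adapters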

⚠ Common Pitfalls

  • Setting rank (r) too high, which increases VRAM usage without proportional quality gains.
  • Forgetting to target the 'o_proj' or 'gate_proj' layers in newer architectures like Llama-3.

Step 4: Training Execution and Loss Monitoring

Initialize the Trainer with a moderate learning rate (2e-4 is a common default for LoRA) and a cosine learning rate scheduler. Integrate Weights & Biases to track training loss against validation loss. If validation loss begins to diverge or rise while training loss keeps falling, stop training to prevent overfitting.

train.py
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 4 x 4 = 16 per GPU
    learning_rate=2e-4,
    lr_scheduler_type="cosine",      # the cosine schedule described above
    num_train_epochs=3,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
    report_to="wandb"
)
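
Wiring the arguments into the imported Trainer might look like the following, where tokenized_train and tokenized_eval are assumed splits of the dataset prepared in step 2; the causal-LM collator derives the labels from input_ids:

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,                  # the PEFT-wrapped model from step 3
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()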

⚠ Common Pitfalls

  • Setting gradient_accumulation_steps too low, resulting in unstable gradients.
  • Ignoring the 'effective batch size' (batch_size * accumulation_steps * num_gpus).

Step 5: Model Merging and Quantization for Deployment

After training, you have a set of small 'adapter' weights. Merge these back into the original FP16/BF16 base model to eliminate the adapter's inference latency overhead. Once merged, use AutoGPTQ or AutoAWQ to quantize the model to 4-bit for high-throughput serving via vLLM or Ollama.

merge_weights.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in half precision; merging into a 4-bit model is not supported
base_model = AutoModelForCausalLM.from_pretrained("base_model_path", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, "adapter_path")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./final_model")

# Export the matching tokenizer so the deployment directory is self-contained
AutoTokenizer.from_pretrained("base_model_path").save_pretrained("./final_model")
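
For the quantization pass, AutoAWQ's typical flow looks roughly like the following; the quant_config values are common defaults, so check the library's documentation for your version:

quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained("./final_model")
tokenizer = AutoTokenizer.from_pretrained("./final_model")

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration internally
model.save_quantized("./final_model_awq")
tokenizer.save_pretrained("./final_model_awq")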

⚠ Common Pitfalls

  • Trying to merge a quantized 4-bit base model with adapters; you must merge using the FP16/BF16 version of the base model.
  • Mismatched tokenizer files in the final export directory.

Step 6: Automated Evaluation (LLM-as-a-Judge)

Create a hold-out test set of 50-100 prompts. Run inference with both your fine-tuned model and a baseline (e.g., the base model or GPT-3.5). Then use a stronger model (e.g., GPT-4o) to score the outputs against explicit criteria: accuracy, tone, and formatting compliance.
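
A minimal judge sketch using the OpenAI Python client; the rubric wording and the 1-5 scale are illustrative choices, and OPENAI_API_KEY is assumed to be set in the environment:

judge.py
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the RESPONSE to the PROMPT from 1-5 on accuracy, tone, and "
    "formatting compliance. Reply with only a JSON object of the three scores.\n\n"
    "PROMPT: {prompt}\nRESPONSE: {response}"
)

def judge(prompt: str, response: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep scoring as deterministic as possible
        messages=[{"role": "user", "content": RUBRIC.format(prompt=prompt, response=response)}],
    )
    return result.choices[0].message.content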

⚠ Common Pitfalls

  • Evaluating on the training data, which provides a false sense of model capability.
  • Using generic metrics like ROUGE or BLEU, which correlate poorly with human preference in creative or complex reasoning tasks.

What you built

Successful fine-tuning is an iterative loop of data curation, hyperparameter adjustment, and rigorous evaluation. With PEFT and QLoRA, you can achieve domain-specific performance that exceeds that of general-purpose models at a fraction of the inference cost. Prioritize data quality over quantity to avoid model degradation.