tldr: Models that differ only by a finetune can be easily model-diffed under assumptions similar to those of twin studies.
When we train neural networks, especially LLMs, we often forget that these models know huge portions of the internet. Finetuning a model that was pretrained on the same vast amount of data results in two checkpoints that differ only by the finetune.
Two common finetuning methods are chat-finetuning and reasoning chain distillation.
1. Chat-Finetuning: ~50,000:1 to 300,000:1
Real-world evidence:
- Stanford Alpaca: Used 52K instruction-following demonstrations to fine-tune LLaMA-7B, which was pretrained on 1.4 trillion tokens
- Calculation: assuming ~100 tokens per instruction-response pair, 52K pairs ≈ 5.2 million finetuning tokens
- Actual ratio: 1.4 trillion ÷ 5.2 million = ~269,000:1
- InstructGPT: Used 13K prompts for SFT against GPT-3’s hundreds of billions of pretraining tokens
- Conservative estimate: assuming ~200 tokens per example, 13K prompts ≈ 2.6M tokens; 300B ÷ 2.6M ≈ 115,000:1
Estimate: 50,000:1 to 300,000:1
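The arithmetic behind these ratios is a one-liner; here is a minimal sketch. The per-example token counts (~100 for Alpaca, ~200 for InstructGPT) are assumptions, not measured values:

```python
# Back-of-the-envelope pretraining-to-finetuning token ratios.
alpaca_pretrain = 1.4e12          # LLaMA-7B pretraining tokens
alpaca_finetune = 52_000 * 100    # 52K pairs x ~100 tokens each (assumed)

gpt3_pretrain   = 300e9           # conservative GPT-3 pretraining estimate
gpt3_finetune   = 13_000 * 200    # 13K prompts x ~200 tokens each (assumed)

print(f"Alpaca:      {alpaca_pretrain / alpaca_finetune:,.0f} : 1")  # ~269,231 : 1
print(f"InstructGPT: {gpt3_pretrain / gpt3_finetune:,.0f} : 1")      # ~115,385 : 1
```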
2. Reasoning Distillation: ~500,000:1
Here we use the most sample-efficient reasoning-distillation example, the S1 paper, which uses a curated dataset of 1,000 reasoning examples.
Minimal reasoning distillation (Stanford S1, “simple test-time scaling”):
- Used only 1,000 carefully curated reasoning examples (s1K dataset)
- Base model Qwen2.5-32B-Instruct was “exposed to large amounts of reasoning data during pretraining which spans trillions of tokens”
- Total reasoning data: 4.7M tokens for 1,000 samples = ~4,700 tokens per example
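The same arithmetic for S1; the base model's pretraining corpus is only described as "trillions of tokens", so the figure below is an assumed value chosen to show how a ~500,000:1 ratio comes out:

```python
# S1 back-of-the-envelope: tokens per example and pretrain-to-finetune ratio.
s1_finetune_tokens = 4.7e6    # total tokens across the 1,000 s1K examples
pretrain_tokens    = 2.35e12  # assumed; only described as "trillions of tokens"

print(f"Tokens per example: {s1_finetune_tokens / 1_000:,.0f}")   # ~4,700
print(f"Ratio: {pretrain_tokens / s1_finetune_tokens:,.0f} : 1")  # 500,000 : 1 under this assumption
```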
Given the sample efficiency of these commonly used finetuning methods (pretraining-to-finetuning ratios of five orders of magnitude or more), it is natural to ask how much the model's internals actually change. How much do the weights change through these finetunes? How do the activations on the same datapoints change?
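A weight-level diff between a base checkpoint and its finetune is only a few lines. A minimal sketch, assuming the two checkpoints share an architecture; the repo IDs are placeholders, not real models:

```python
# Relative per-tensor weight change between a base checkpoint and its finetune.
import torch
from transformers import AutoModelForCausalLM

BASE_ID  = "org/base-model"       # placeholder
TUNED_ID = "org/finetuned-model"  # placeholder

base  = AutoModelForCausalLM.from_pretrained(BASE_ID,  torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, torch_dtype=torch.float32)

tuned_params = dict(tuned.named_parameters())
for name, p_base in base.named_parameters():
    p_tuned = tuned_params[name]
    # Relative L2 change of each parameter tensor; values near 0 mean the finetune barely touched it.
    rel_change = (p_tuned - p_base).norm().item() / (p_base.norm().item() + 1e-12)
    print(f"{name:60s} {rel_change:.3e}")
```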
Two competing hypotheses about how the representation changes are:
- Hypothesis #1: Finetuning fundamentally changes the relationships between circuits in the model, turning it into an essentially new model.
- Hypothesis #2: Finetuning makes slight, trackable changes to the representation.
If the truth is closer to the second hypothesis, it seems useful to track these incremental changes whenever a model undergoes a finetune that results in a noticeable impact on behavior (e.g. reasoning distillation).
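Tracking such incremental changes at the activation level is similarly cheap: run the same prompts through both checkpoints and compare hidden states layer by layer. A minimal sketch, again with placeholder model IDs and an illustrative prompt:

```python
# Layer-wise activation similarity between a base checkpoint and its finetune on the same input.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID, TUNED_ID = "org/base-model", "org/finetuned-model"  # placeholders

tok   = AutoTokenizer.from_pretrained(BASE_ID)
base  = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID).eval()

prompt = "Prove that the square root of 2 is irrational."  # illustrative datapoint
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    h_base  = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    # Mean cosine similarity across token positions; 1.0 means the layer's activations are unchanged.
    cos = torch.nn.functional.cosine_similarity(a[0], b[0], dim=-1).mean().item()
    print(f"layer {layer:2d}  mean cosine similarity = {cos:.4f}")
```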
Hypothesis #2 is supported by the “Instruction Following without Instruction Tuning” paper, which found that:
- Instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation.
- Tuning the model only on responses, with no instructions, still yields instruction following.
- A hand-written rule-based language model, combined with a pretrained model in a product-of-experts, also yields instruction following (a toy sketch of this composition follows below).
I say these findings support the little-change hypothesis because, although the authors did not perform representation-level analyses, a small number of finetuning datapoints (or, in the product-of-experts case, no gradient updates at all) can produce noticeable changes in behavior, which suggests the underlying capability was already largely present in the pretrained model.
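To make the product-of-experts finding concrete, here is a toy sketch of the composition: multiply the next-token distributions of a pretrained LM and a tiny hand-written “rule” expert (equivalently, add their log-probabilities). The rule below, boosting end-of-sequence after long outputs, is purely illustrative and is not one of the paper's actual rules; the model ID is a placeholder:

```python
# Toy product-of-experts: combine a pretrained LM with a hand-written rule-based expert.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/pretrained-model"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm  = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def rule_logprobs(input_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Hand-written expert: uniform over the vocabulary, except it boosts EOS once the context is long."""
    logits = torch.zeros(vocab_size)
    if input_ids.shape[-1] > 50:          # illustrative rule: encourage stopping after ~50 tokens
        logits[tok.eos_token_id] += 5.0
    return torch.log_softmax(logits, dim=-1)

@torch.no_grad()
def poe_next_token(input_ids: torch.Tensor) -> int:
    lm_logp   = torch.log_softmax(lm(input_ids).logits[0, -1], dim=-1)
    rule_logp = rule_logprobs(input_ids, lm_logp.shape[-1])
    combined  = lm_logp + rule_logp       # product of experts = sum of log-probabilities
    return int(combined.argmax())
```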
Let’s zoom out a bit. Is it possible to map the lineage of changes to a model?