I recently posted about the scaling hypothesis. In the last couple of months:
- DeepSeek took the whole AI world by storm. They were able to train an extremely capable model for cheap.
- They got close to frontier-model performance by being extremely clever: finding algorithmic improvements and being really creative with the constraints they had.
- They have also open-sourced a lot of their research!
- Grok 3 was released, and it's possibly the first model trained on ~1e26 FLOPs.
- GPT-4.5 was released as a research preview.
- It's believed to be OpenAI's largest model yet, though we don't have clear details about the model size or the size of the training run.
- Right now the API for GPT-4.5 is really expensive (roughly 500x more expensive than their cheapest model? There's a back-of-envelope check after this list.)
- To the normal person, it's really difficult to judge the differences between these models. And if DeepSeek can build a frontier-grade model for cheap, should other AI labs focus on optimisation instead of scaling up?
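As a quick sanity check on that 500x figure: the sketch below uses launch-time list prices as assumptions (GPT-4.5 preview at roughly $75 per million input tokens, GPT-4o mini at roughly $0.15). Prices change often, so treat these numbers as illustrative and check the current pricing page.

```python
# Back-of-envelope API cost comparison.
# Prices are assumptions based on launch-time pricing; they change often.
PRICE_PER_1M_INPUT_USD = {
    "gpt-4.5-preview": 75.00,  # assumed: ~$75 per 1M input tokens
    "gpt-4o-mini": 0.15,       # assumed: ~$0.15 per 1M input tokens
}

def input_cost(model: str, tokens: int) -> float:
    """Cost in USD to send `tokens` input tokens to `model`."""
    return PRICE_PER_1M_INPUT_USD[model] / 1_000_000 * tokens

ratio = PRICE_PER_1M_INPUT_USD["gpt-4.5-preview"] / PRICE_PER_1M_INPUT_USD["gpt-4o-mini"]
print(f"Price ratio: {ratio:.0f}x")  # -> 500x
print(f"1M input tokens on GPT-4.5: ${input_cost('gpt-4.5-preview', 1_000_000):.2f}")
```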
Does this mean that scaling is dead? I'm not so certain. It seems to me that:
- Scaling vs optimisation isn’t really a tradeoff. You need both in the long term.
- Scaling is more deterministic. As long as you can deploy capital, you can keep scaling and get predictable gains (there’s a sketch of what ‘predictable’ means after this list).
- You do want to optimise, but optimisation is a search process that can lead to false starts and losses as well as gains.
- Every ‘clever’ hypothesis you try to validate with a training run costs time, and there is an opportunity cost to it: you could be doing a bigger training run with a ‘not so clever’ approach instead.
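To make ‘predictable gains’ concrete: pretraining loss tends to follow smooth power laws in model size and data. The minimal sketch below uses the Chinchilla-style form L(N, D) = E + A/N^α + B/D^β with the constants reported by Hoffmann et al. (2022); treat the exact numbers as illustrative, the point is the shape of the curve.

```python
# Illustrative scaling-law curve: loss as a function of model and data size.
# Functional form and constants follow the Chinchilla paper
# (Hoffmann et al., 2022); the numbers are illustrative, not gospel.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Scale up 10x at a time, splitting compute roughly "Chinchilla-optimally"
# (~20 tokens per parameter), and the loss falls smoothly and predictably.
for n in [1e9, 1e10, 1e11]:  # 1B, 10B, 100B parameters
    d = 20 * n               # tokens (approx. compute-optimal ratio)
    print(f"N={n:.0e}, D={d:.0e} -> loss ~ {predicted_loss(n, d):.3f}")
```

That smooth curve is what makes scaling feel ‘deterministic’: you know roughly what another 10x of compute buys you before you spend it.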
As long as labs are able to raise more capital and throw money at the problem, scaling will dominate. At some point the capital will run out, and people will need to start optimising.
This dynamic somewhat reminds me of the classic ‘Worse is Better’ post.
OK, how is this relevant if you’re not running a research lab and just want to build AI products?
- I think the same dilemma the AI labs face plays out in product companies, albeit at much smaller scales.
- For example: should you fine-tune a model to make it more efficient for your use case (and save on token costs), or should you use few-shot prompting instead? (There’s a sketch of the few-shot route after this list.)
- Should you brute-force your way to a working use case (e.g. use the most cutting-edge model and waste a lot of tokens), or should you optimise your implementation?
- Observing others helps build our own intuition for when to scale and when to optimise.
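For reference, here’s what the few-shot side of that tradeoff can look like. This is a minimal sketch using the OpenAI Python client; the model name, task, and examples are placeholders, not recommendations. The key point is that the examples ride along as input tokens on every single request, which is exactly the recurring cost you’d be weighing against a one-time fine-tune.

```python
# Minimal few-shot prompting sketch using the OpenAI Python client.
# Model name, task, and examples are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The few-shot examples are sent as extra messages on every request.
# That costs input tokens each call: the recurring price you'd compare
# against paying once to fine-tune a smaller model.
messages = [
    {"role": "system", "content": "Classify support tickets as 'billing', 'bug', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button does nothing when I click it."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Do you offer student discounts?"},  # the real query
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```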