While foundation models are often perceived as costly and inference bills are rising, the reality is that the cost of intelligence is in free fall, and this trend shows no signs of slowing. Angel investor Elad Gil shared this chart on the cost of GPT-4-equivalent intelligence from OpenAI, which has fallen 240x in the last 18 months, from $180 per million tokens to less than $1.
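As a rough back-of-the-envelope check, taking those figures at face value and assuming the decline compounded steadily over the 18 months, that works out to roughly a 26% price drop every month:

```python
# Implied monthly decline, assuming a steady compounded 240x drop over 18 months
# (the headline figures above, taken at face value).
factor, months = 240, 18
monthly_decline = 1 - (1 / factor) ** (1 / months)
start_price = 180.0                    # $ per million tokens at the start of the window
end_price = start_price / factor       # ~$0.75 per million tokens today
print(f"~{monthly_decline:.0%} cheaper every month")
print(f"${start_price:.2f} -> ${end_price:.2f} per million tokens")
```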
Several factors are driving the continued decline in the cost of AI intelligence. If this trend continues, we may reach a point where the cost of using AI models becomes negligible for most use cases, effectively approaching zero.
In this piece, we'll explore the key factors contributing to the declining cost of intelligence.
1. Competition and market forces
Competition has driven down costs as more companies enter the market, with open-source models from Meta and others now matching GPT-4's performance. This competitive pressure exists both among model developers and among the inference providers who run these models.
For example, the rate sheet below shows the variety of provider options for Llama 3.1.
Similarly, for a given level of model quality, users have multiple options, which results in pricing pressure and some degree of convergence in prices, as in the chart below.
Even with OpenAI's recent release of o1, historical patterns suggest other models will eventually catch up. While costs may temporarily increase due to the model's higher computational demands at inference time, prices should ultimately continue their downward trend.
2. Improving efficiency in compute
Of course, market forces alone aren't enough to lower prices; the underlying cost of serving these models has to fall as well. Optimizations at both the hardware and infrastructure layers have reduced the cost of running inference, allowing companies to pass these savings on to customers.
Significant hardware innovation is driving costs down, with specialized chips/ASICs like Amazon's Inferentia and new players like Groq. While these solutions are still emerging, they're already demonstrating dramatic improvements in both price and speed.
Amazon says their Inferentia instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 options.
Similarly, as inference workloads scale up and more talent moves into AI, we're getting better at utilizing GPUs effectively, and economies of scale plus optimizations at the software layer are lowering inference costs as well.
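One concrete software-layer example is batching: serving many requests in a single forward pass keeps the hardware busy instead of paying fixed overhead per request. The sketch below is only a CPU toy in NumPy, not a real inference server, but it illustrates the same principle of amortizing per-call overhead across a batch:

```python
import time
import numpy as np

# Toy illustration of batching: one large matrix multiply for many requests
# is much cheaper than many small ones, which is (roughly) why batched
# inference servers get better hardware utilization and lower cost per request.
d_model, n_requests = 1024, 256
weights = np.random.randn(d_model, d_model).astype(np.float32)
batch_inputs = np.random.randn(n_requests, d_model).astype(np.float32)

start = time.perf_counter()
for row in batch_inputs:               # one "forward pass" per request
    _ = row @ weights
sequential_s = time.perf_counter() - start

start = time.perf_counter()
_ = batch_inputs @ weights             # all requests in a single batched call
batched_s = time.perf_counter() - start

print(f"sequential: {sequential_s * 1e3:.1f} ms, batched: {batched_s * 1e3:.1f} ms")
```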
3. The rise of smaller, smarter models
Another key reason for the declining cost of AI is the improvement in performance for a given level of model size — and smaller models are getting smarter over time.
Here's one example: Meta's Llama 3 8B model essentially performs on par with (or better than) their Llama 2 70B model, which was released a year earlier. So within a year, we got a model nearly one-tenth the parameter size with the same performance. Techniques like distillation (using the outputs of larger models to fine-tune smaller, task-specific models) and quantization are making it possible to create increasingly capable compact models.
Notably, Llama 3.1 405B's license permits using its outputs to fine-tune other AI models, which further enables this trend.
According to Llama's community license agreement, "If you use the Llama Materials or any outputs or results of the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include 'Llama' at the beginning of any such AI model name."
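To make the distillation idea more concrete, here is a minimal sketch of one common flavor, soft-label (logit) distillation in the spirit of Hinton et al., where a student model is trained to match a teacher's softened output distribution. The tensors here are random stand-ins rather than real model outputs, and the output-based approach described above (fine-tuning on teacher-generated text) is a related but distinct variant:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: push the student's (softened) output
    distribution toward the teacher's via KL divergence."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy stand-ins: random "teacher" and "student" logits over a 32k-token vocabulary.
batch_size, vocab_size = 4, 32000
teacher_logits = torch.randn(batch_size, vocab_size)
student_logits = torch.randn(batch_size, vocab_size, requires_grad=True)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # in real training this gradient updates the small student model
print(f"distillation loss: {loss.item():.4f}")
```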
4. New model architectures driving efficiency
There’s also a push toward entirely new model architectures that promise to make AI even more efficient. While transformer models still dominate, new architectures such as state space models are emerging as strong contenders, with companies such as Cartesia leading the way.
These new approaches can be faster and more efficient, and they require less compute power to achieve comparable performance. As companies make more progress on these approaches, they could enable small, highly efficient and robust models that further reduce the cost of intelligence through lower inference costs.
As an example, some models in the Mamba family in the 1.5B-3B parameter range outperform leading transformer-based models of the same size.
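To give a sense of where the efficiency comes from, the toy NumPy recurrence below captures the core property of a (linear) state space layer: each token folds into a fixed-size hidden state, so per-token compute and memory stay constant and total work grows linearly with sequence length, rather than quadratically as with full attention. This is a deliberate simplification, not Mamba itself, which adds input-dependent (selective) parameters and hardware-aware scan kernels:

```python
import numpy as np

# Minimal linear state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
d_state, d_in, seq_len = 16, 8, 1024
A = np.random.randn(d_state, d_state) * 0.01   # state transition (kept small for stability)
B = np.random.randn(d_state, d_in) * 0.1       # input projection
C = np.random.randn(d_in, d_state) * 0.1       # output projection

x = np.random.randn(seq_len, d_in)             # input sequence
h = np.zeros(d_state)                          # fixed-size state, independent of seq_len
outputs = []
for t in range(seq_len):                       # O(seq_len) total work
    h = A @ h + B @ x[t]
    outputs.append(C @ h)
y = np.stack(outputs)
print(y.shape)  # (1024, 8), computed without any seq_len x seq_len attention matrix
```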
5. On-device inference
The future of AI isn't just about cloud-based models; it's increasingly about running AI directly on end-user devices. Apple has already announced two proprietary 3B parameter models that will run locally as part of their Apple Intelligence launch: one language model and one diffusion model. While these models are currently limited to Siri and other first-party apps, Apple will likely open them to developers in the future.
Apple’s benchmarks show that while on-device models can’t be used for all queries, users tend to prefer them for many prompts, even compared to larger models, as below.
In addition, as the chips in laptops and phones continue to get more powerful, and as models get smaller and smarter as discussed above, a larger fraction of the most common intelligence needs can be handled by models running locally.
For example, companies like Ollama now enable users to run popular models, including Llama 3 8B, directly on their laptops.
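As a quick sketch of what that looks like in practice, a locally served model can be queried with a plain HTTP call and no per-token bill. This assumes Ollama is installed, its local server is running on the default port 11434, and the llama3 model has already been pulled:

```python
import requests

# Query a model running locally via Ollama's HTTP API (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",   # the 8B model by default
        "prompt": "In one sentence, why do inference costs keep falling?",
        "stream": False,     # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])   # inference runs on-device, so there is no API bill
```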
Beyond offering users privacy and reduced latency, local processing will dramatically reduce costs. When AI runs directly on our devices, the cost of intelligence will effectively drop to zero.
Look beyond today’s intelligence costs
As the price of AI continues to drop, we'll see a wave of new applications and industries embracing these technologies. My advice to founders and builders is not to focus too much on inference costs (as long as they aren't causing significant cash burn) and to avoid over-optimizing too early, as these costs are dropping rapidly.
Instead, I encourage founders and builders to think about which use cases or additional features don't seem feasible yet but could be unlocked once the cost of intelligence drops to one-tenth or one-hundredth of the current price, since we'll likely get there sooner than most people think.
If you're building a company that helps drive down these inference costs at any layer, or leverages the future of low-cost intelligence to solve problems for end users, I'd love to hear from you. Reach me on X.