GPT-4: 8 Models in One; The Secret is Out
Key Takeaway
GPT-4 is reportedly not one single giant model but an ensemble of 8 separate 220-billion-parameter models. This mixture-of-experts approach lets each sub-model specialize, with the experts combining into one powerful system.
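To make the idea concrete, here is a minimal sketch of a mixture-of-experts layer with a learned router that sends each input to one of eight small feed-forward experts. The class name, dimensions, and top-1 routing rule are illustrative assumptions for this example only; GPT-4's actual expert architecture and routing scheme have not been disclosed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    """Toy MoE layer: a router picks one expert per input (top-1 routing)."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256, num_experts: int = 8):
        super().__init__()
        # Each "expert" here is a tiny feed-forward block; in a model like
        # GPT-4 each would be a multi-billion-parameter sub-network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router ("orchestrator") scores how well each expert fits an input.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # (batch, num_experts)
        choice = probs.argmax(dim=-1)               # index of the chosen expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                # Only the selected expert runs for these inputs, scaled by
                # the router's confidence in that expert.
                out[mask] = probs[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route a batch of 4 token embeddings through the layer.
moe = MixtureOfExperts()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64])
```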
Summary
- GPT-4 was widely assumed to be a single giant model with more than 1 trillion parameters, but is now reported to be composed of 8 smaller 220-billion-parameter models.
- This mixture-of-experts approach allows each sub-model to specialize in certain tasks or domains, with an orchestrator (the router) coordinating between them.
- The methodology is well established and is known as a mixture-of-experts (sometimes "hydra") model.
- Combining multiple models in this way allows GPT-4 to match or even exceed the capabilities of a single giant model.
- The approach has advantages such as greater memory and compute efficiency per query, since each input is routed only to the experts best suited to handle it (see the back-of-the-envelope sketch after this list).
- There is still secrecy around many details of GPT-4 like the exact model sizes, architectures, and how queries are routed to different experts.
- GPT-4 has shown major improvements over GPT-3, especially in areas like conversation and human-like writing.
- The model mixing strategy points to a trend of ensembling groups of smaller models rather than creating single behemoth models.
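As a rough illustration of why this is more efficient per query than a dense model of the same total size, the arithmetic below uses the rumored figures (8 experts of ~220 billion parameters each) and assumes 2 experts are active per token, a common MoE design choice rather than a confirmed detail of GPT-4.

```python
# Back-of-the-envelope arithmetic with the rumored configuration.
# The experts-per-token figure is an assumption, not a confirmed detail.
num_experts = 8
params_per_expert = 220e9      # ~220 billion parameters each
experts_per_token = 2          # assumed top-2 routing

total_params = num_experts * params_per_expert           # parameters stored
active_params = experts_per_token * params_per_expert    # parameters used per token

print(f"Total parameters:  {total_params / 1e12:.2f} trillion")   # 1.76 trillion
print(f"Active per token:  {active_params / 1e9:.0f} billion")    # 440 billion
print(f"Per-token compute vs. dense: {active_params / total_params:.0%}")  # 25%
```

The point of the arithmetic: the full ensemble stores roughly 1.76 trillion parameters, but any single query only has to run through a fraction of them, which is where the efficiency advantage over one monolithic model of the same size comes from.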