GPT-4: 8 Models in One; The Secret is Out
Key Takeaway
GPT-4 is reportedly not one single giant model but an ensemble of 8 separate 220-billion-parameter models. This mixture-of-experts approach lets each sub-model specialize, with the experts combining into one powerful system.
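To make the idea concrete, here is a minimal sketch of a mixture-of-experts layer with a learned router that sends each input to one of eight small feed-forward experts. The class name, dimensions, and top-1 routing rule are illustrative assumptions for this example only; GPT-4's actual expert architecture and routing scheme have not been disclosed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    """Toy MoE layer: a router picks one expert per input (top-1 routing)."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256, num_experts: int = 8):
        super().__init__()
        # Each "expert" here is a tiny feed-forward block; in a model like
        # GPT-4 each would be a multi-billion-parameter sub-network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router ("orchestrator") scores how well each expert fits an input.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # (batch, num_experts)
        choice = probs.argmax(dim=-1)               # index of the chosen expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                # Only the selected expert runs for these inputs, scaled by
                # the router's confidence in that expert.
                out[mask] = probs[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route a batch of 4 token embeddings through the layer.
moe = MixtureOfExperts()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64])
```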
Summary
- GPT-4 was widely assumed to be a single giant model with more than 1 trillion parameters, but is now reported to be composed of 8 smaller 220-billion-parameter models.
- This mixture-of-experts approach allows each sub-model to specialize in certain tasks or domains, with an orchestrator (the router) coordinating between them.
- The methodology is well established and is known as a mixture-of-experts (sometimes "hydra") model.
- Combining multiple models in this way allows GPT-4 to match or even exceed the capabilities of a single giant model.
- The approach has advantages such as greater memory and compute efficiency per query, since each input is routed only to the experts best suited to handle it (see the back-of-the-envelope sketch after this list).
- There is still secrecy around many details of GPT-4 like the exact model sizes, architectures, and how queries are routed to different experts.
- GPT-4 has shown major improvements over GPT-3, especially in areas like conversation and human-like writing.
- The model mixing strategy points to a trend of ensembling groups of smaller models rather than creating single behemoth models.
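As a rough illustration of why this is more efficient per query than a dense model of the same total size, the arithmetic below uses the rumored figures (8 experts of ~220 billion parameters each) and assumes 2 experts are active per token, a common MoE design choice rather than a confirmed detail of GPT-4.

```python
# Back-of-the-envelope arithmetic with the rumored configuration.
# The experts-per-token figure is an assumption, not a confirmed detail.
num_experts = 8
params_per_expert = 220e9      # ~220 billion parameters each
experts_per_token = 2          # assumed top-2 routing

total_params = num_experts * params_per_expert           # parameters stored
active_params = experts_per_token * params_per_expert    # parameters used per token

print(f"Total parameters:  {total_params / 1e12:.2f} trillion")   # 1.76 trillion
print(f"Active per token:  {active_params / 1e9:.0f} billion")    # 440 billion
print(f"Per-token compute vs. dense: {active_params / total_params:.0%}")  # 25%
```

The point of the arithmetic: the full ensemble stores roughly 1.76 trillion parameters, but any single query only has to run through a fraction of them, which is where the efficiency advantage over one monolithic model of the same size comes from.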