GPT-4: 8 Models in One; The Secret is Out

Key Takeaway

GPT-4 is reportedly not one single giant model but an ensemble of eight separate 220-billion-parameter models. This mixture-of-experts approach lets each sub-model specialize, and the specialists combine into one powerful system.


  • GPT-4 was widely assumed to be a single giant 1-trillion-plus-parameter model, but is now reported to be composed of eight smaller 220-billion-parameter models.
  • This mixture of experts approach allows each sub-model to specialize in certain tasks/domains, with an orchestrator coordinating between them.
  • The methodology is well established and is known as a mixture-of-experts (MoE) model.
  • Combining multiple models in this way allows GPT-4 to match or even exceed the capabilities of a single giant model.
  • The approach has practical advantages: only the experts best suited to a given input need to run, so each query exercises a fraction of the total parameters.
  • There is still secrecy around many details of GPT-4, such as the exact model sizes, architectures, and how queries are routed among the experts.
  • GPT-4 has shown major improvements over GPT-3, especially in areas like conversation and human-like writing.
  • The model mixing strategy points to a trend of ensembling groups of smaller models rather than creating single behemoth models.
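To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer with top-2 routing. The expert count (8) mirrors the rumored GPT-4 setup, but everything else (toy dimensions, random linear "experts", the softmax gate) is an illustrative assumption, not GPT-4's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8  # mirrors the rumored eight sub-models
D_MODEL = 16     # toy hidden size, purely illustrative
TOP_K = 2        # route each input to its 2 highest-scoring experts

# Each "expert" here is just a random linear map standing in for a full network.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]
# The gating network scores how well each expert matches a given input.
gate_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x):
    """Route input x to its top-k experts and mix their outputs."""
    scores = softmax(x @ gate_w)                    # one score per expert
    top_k = np.argsort(scores)[-TOP_K:]             # pick the best experts
    weights = scores[top_k] / scores[top_k].sum()   # renormalize over top-k
    # Only the selected experts run; the rest are skipped entirely,
    # which is where the per-query compute saving comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))

x = rng.standard_normal(D_MODEL)
y = moe_forward(x)
print(y.shape)
```

In a real MoE transformer the gate and experts are trained jointly, routing happens per token inside each MoE layer, and load-balancing losses keep the experts evenly used; this sketch only shows the core select-and-mix step.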
