Mixture of Experts Explained
Key Takeaway:
Mixture-of-experts (MoE) models enable more compute-efficient pretraining and faster inference than dense models with the same number of parameters, but they have historically struggled to generalize during fine-tuning. Recent work on instruction tuning shows promise for closing this gap.
Summary:
MoEs replace the feedforward (FFN) layers in transformers with sparse MoE layers, each composed of a set of expert FFNs and a gating network (router) that decides which experts process each token. Because only a few experts are active per token, this allows scaling up model size with much less compute.
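To make the layer structure concrete, here is a minimal PyTorch sketch of a sparse MoE layer with top-2 routing. It is an illustration under simplifying assumptions (no capacity limits, no load balancing), and the class and parameter names are invented for this example rather than taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative drop-in replacement for a transformer FFN block."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Each expert is an ordinary FFN, just like the dense layer it replaces.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The gating network (router) produces one score per expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to a stream of tokens.
        tokens = x.reshape(-1, x.shape[-1])
        gate_logits = self.router(tokens)                     # (num_tokens, num_experts)
        weights, selected = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            # Gather the tokens routed to expert i and weight its output by the gate.
            token_idx, slot = (selected == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens in the current batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape(x.shape)
```

The loop over experts is written for readability; real implementations instead batch the dispatch and combine steps so that experts can run in parallel, often across devices.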
MoEs have their roots in work on conditional computation and ensemble methods. Key milestones include 2017 work that scaled LSTMs with sparsely-gated MoE layers and the 2020 GShard work that scaled transformers.
Sparsity introduces challenges such as uneven load: the router may send far more tokens to some experts than to others. Common solutions are an auxiliary load-balancing loss, which encourages a roughly uniform distribution of tokens across experts, and an expert capacity threshold, which caps how many tokens each expert can process.
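The sketch below illustrates both fixes, assuming switch-style top-1 routing: an auxiliary loss that pushes the per-expert token counts and router probabilities toward a uniform split, and a capacity formula that bounds how many tokens an expert may receive. The function names and the default capacity factor are illustrative choices, not prescribed values.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Auxiliary loss that is smallest when tokens are spread evenly across experts."""
    probs = F.softmax(router_logits, dim=-1)             # (num_tokens, num_experts)
    assigned = probs.argmax(dim=-1)                      # top-1 expert chosen per token
    # Fraction of tokens dispatched to each expert ...
    fraction_tokens = F.one_hot(assigned, num_experts).float().mean(dim=0)
    # ... and the mean routing probability each expert receives.
    mean_probs = probs.mean(dim=0)
    # Both quantities should sit near 1 / num_experts when the load is balanced.
    return num_experts * torch.sum(fraction_tokens * mean_probs)

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float = 1.25) -> int:
    """Maximum tokens per expert; overflow tokens skip the layer via the residual path."""
    return int(capacity_factor * num_tokens / num_experts)
```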
The Switch Transformers work addresses MoE training instability (for example, by keeping the numerically sensitive router computation in full precision), routes each token to a single expert (top-1 routing), and experiments with compression, such as distilling the sparse model back into a dense one.
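A simplified sketch of that selective-precision idea follows: run the router's softmax in float32 even when the rest of the model trains in bfloat16, then cast the result back. The class name is my own, and the exact placement of casts in real implementations may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableRouter(nn.Module):
    """Router whose softmax runs in float32 to avoid low-precision rounding issues."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        logits = self.router(tokens)
        # Upcast before the exponentiation in the softmax, then cast back so the
        # rest of the layer keeps the model's (possibly lower) precision.
        probs = F.softmax(logits.float(), dim=-1)
        return probs.to(tokens.dtype)
```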
Encoder experts tend to specialize in token types or shallow concepts (e.g., punctuation or proper nouns), while decoder experts show far less specialization. Adding experts improves sample efficiency, but the gains diminish beyond roughly 256-512 experts.
MoEs are more prone to overfitting during fine-tuning than their dense counterparts. Recent work on instruction tuning shows promise: MoEs benefit from it more than dense models do.
MoEs suit high-throughput deployments with ample memory, since all experts must be loaded even though only a few are active per token; dense models remain a better fit for low-throughput, memory-constrained use cases. Future work is exploring distillation of MoEs into dense models and extreme quantization.