Mixture-of-Experts (MoE) Overview

May 18, 2024

MoE Architecture

"[MoE] consists of a number of experts, each a simple feed-forward NN, and a trainable gating network which selects a sparse combination of the experts to process each input (Shazeer, 2017)...[the MoE selects] a potentially different combination of experts at each [token]. The different experts tend to become highly specialized based on syntax and semantics."

The key ideas behind MoEs are sparsity, parallelism, and meta-learning.

Here is a single MoE block with an example gating network inside it; a minimal code sketch of the same structure follows the figure.

MoE Block
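To make the picture concrete, here is a minimal PyTorch-style sketch of such a block, assuming plain top-k gating in the spirit of Shazeer et al. (2017). The class and argument names (`MoEBlock`, `num_experts`, `top_k`) are mine, not from any particular library, and real implementations add noise to the gate and batch the expert computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a simple feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gating network scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- one row per token
        logits = self.gate(x)                                  # (tokens, experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)    # keep only the k best experts per token
        weights = F.softmax(top_vals, dim=-1)                  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                   # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```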


MoE Benefits:


1. Parallelism: the experts are independent feed-forward networks, so they can be placed on different devices and run concurrently.

2. Efficient Parameter Usage with Sparsity: each token only activates the top-k experts chosen by the gate, so total model capacity grows with the number of experts while per-token compute stays roughly constant (a back-of-the-envelope example follows this list).

3. Meta-Learning: the gating network learns which experts should handle which inputs, so the model learns a routing policy on top of the experts themselves, and the experts end up specializing as described in the quote above.

4. Energy-Efficiency: because only a few experts fire per token, an MoE spends far fewer FLOPs per token than a dense model with the same total parameter count.
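As a rough illustration of point 2, with hypothetical sizes (not taken from any specific model): an 8-expert layer with top-2 routing has to store every expert's weights, but each token only pays for two experts' worth of compute.

```python
# Hypothetical sizes for illustration only; two linear layers per expert, biases ignored.
d_model, d_hidden = 4096, 14336
num_experts, top_k = 8, 2

params_per_expert = 2 * d_model * d_hidden
total_expert_params = num_experts * params_per_expert   # what must be stored
active_expert_params = top_k * params_per_expert        # what one token actually uses

print(f"stored : {total_expert_params / 1e9:.2f}B parameters")
print(f"active : {active_expert_params / 1e9:.2f}B parameters per token")
```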

Miscellaneous:

Balancing of Expert Importance: left to itself, the gate tends to keep routing tokens to the same few experts while the rest go unused. Shazeer et al. (2017) counter this with an auxiliary loss that penalizes imbalance in how much gate mass each expert receives.
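Here is a sketch of that auxiliary "importance" loss, which penalizes the squared coefficient of variation of per-expert gate mass; the function name and the 0.01 weight are placeholders of mine, not values from the paper.

```python
import torch

def importance_loss(gates: torch.Tensor, weight: float = 0.01) -> torch.Tensor:
    """Penalize uneven expert usage via the squared coefficient of variation.

    `gates` is assumed to be the (tokens, experts) matrix of gating weights,
    with zeros for the experts a token did not select.
    """
    importance = gates.sum(dim=0)                               # total gate mass per expert
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return weight * cv_squared
```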

Preventing Synchronization Overheads: when experts live on different devices, every MoE layer involves sending tokens to their experts and gathering the results back; keeping the gate cheap and giving each expert a large enough batch of tokens keeps this communication from dominating the computation.

Downsides:

Higher Risk of Training Instability: the routing decisions are discrete and the auxiliary balancing losses add competing objectives, so MoE training tends to be less stable than training a comparable dense model.

Need Large Batch Sizes: each expert only sees the tokens routed to it, roughly a k/E fraction of the batch, so large global batches are needed to keep every expert well utilized (a small numerical example follows).
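A quick numerical illustration of this shrinking-batch effect, with made-up numbers and the assumption of perfectly balanced routing:

```python
# Made-up numbers; assumes perfectly balanced top-k routing.
global_tokens = 32_768
num_experts, top_k = 8, 2

tokens_per_expert = global_tokens * top_k // num_experts
print(tokens_per_expert)  # 8192 -- each expert's effective batch is 4x smaller than the global one
```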

Need Many GPUs: although per-token compute is sparse, all of the experts' parameters still have to be stored somewhere, so the model usually has to be sharded across many devices just to hold its weights.

Closing Thoughts:

Scaling in low-resource environments

I don't think scaling is limited only to compute-rich places like Google, Meta, and OpenAI. At the heart of scaling techniques is the effective use of parallelism and GPU programming that leverages the characteristics of GPUs, and low-compute environments are exactly the places that need to pay attention to these the most.

It's easy to get discouraged by the sheer number of parameters these MoE models carry, but they also carry ideas that academia should be asking about and building on.


References

[1] Shazeer et al. 2017. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer."