From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

Authors: Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik

Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants with a single fine-tuning step, utilizing only 5B tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance.
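The routing idea described above can be illustrated with a small sketch. This is not the DynaMoE implementation. It assumes, purely for illustration, that each expert is a sub-network made of the first `width` hidden units of one dense feed-forward layer, that the router is a linear scorer squashed to (0, 1), and that a single `threshold` stands in for the sensitivity control. All function names, the random weights, and the two-expert setup are hypothetical.

```python
import math
import random


def make_dense_ffn(d_model, d_hidden, seed=0):
    """Random stand-in for a pre-trained dense FFN (placeholder weights)."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def expert_forward(x, w_in, w_out, width):
    """Apply the FFN using only its first `width` hidden units (ReLU).

    Assumption: a smaller expert is a prefix slice of the dense layer.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(w_in[h], x))) for h in range(width)]
    return [sum(row[h] * hidden[h] for h in range(width)) for row in w_out]


def difficulty_score(x, router_w):
    """Hypothetical router: linear projection followed by a sigmoid."""
    z = sum(w * xi for w, xi in zip(router_w, x))
    return 1.0 / (1.0 + math.exp(-z))


def pick_width(score, small_width, large_width, threshold):
    """Tokens scored at or above `threshold` go to the larger expert."""
    return large_width if score >= threshold else small_width


def dynamic_ffn(tokens, w_in, w_out, router_w, small_width, large_width, threshold):
    """Route each token to one expert and return outputs plus chosen widths."""
    outputs, widths = [], []
    for x in tokens:
        score = difficulty_score(x, router_w)
        width = pick_width(score, small_width, large_width, threshold)
        outputs.append(expert_forward(x, w_in, w_out, width))
        widths.append(width)
    return outputs, widths


if __name__ == "__main__":
    w_in, w_out = make_dense_ffn(d_model=4, d_hidden=8)
    rng = random.Random(1)
    router_w = [rng.gauss(0.0, 1.0) for _ in range(4)]
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
    for threshold in (0.3, 0.5, 0.7):
        _, widths = dynamic_ffn(tokens, w_in, w_out, router_w, 2, 8, threshold)
        print(threshold, widths)
```

In this sketch, raising `threshold` can only move tokens from the large expert to the small one, which lowers compute per token. That mirrors, in a simplified way, how a single adapted model could expose several efficiency-accuracy operating points.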
