A Single Layer to Explain Them All:
Understanding Massive Activations in Large Language Models

Rutgers University · Wake Forest University · Meta AI
Figure 1: Massive activations first appear in the ME Layer, immediately after the FFN.

Figure 2: Schematic diagram of WeMask.

Abstract

We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, the Massive Emergence Layer (ME Layer), consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, the RMSNorm and FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive-activation token's representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive-activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden-state level and shedding new light on principled mitigation strategies.



Emergence of ME Layer

Our analysis shows that massive activations emerge only at the ME Layer, driven by unusually large and directionally aligned RMSNorm and FFN parameters that selectively amplify the massive-activation token.
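To make this concrete, the sketch below scans the per-layer hidden states of an off-the-shelf causal LM for the first layer whose peak activation dwarfs the typical magnitude. The model name and the magnitude heuristic are illustrative assumptions, not the paper's exact criterion.

```python
# Minimal sketch: locate the layer where massive activations first emerge by
# scanning the per-layer hidden states of a Hugging Face causal LM. The model
# name and the magnitude heuristic below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM with an RMSNorm/FFN stack
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] follows block i.
for i, h in enumerate(out.hidden_states):
    a = h.float().abs()
    peak, median = a.max().item(), a.median().item()
    print(f"layer {i:2d}: max |activation| = {peak:9.1f} ({peak / median:.0f}x median)")
    if peak > 100 and peak > 100 * median:  # heuristic: far above typical magnitude
        print("  -> candidate ME Layer: massive activations first appear here")
        break
```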

The Direction of Massive Activation

Once the massive activation emerges at the ME Layer, its hidden state exhibits strong input-invariant directionality and remains stable across subsequent layers.
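This directional stability can be checked in a few lines. Continuing from the sketch above (it reuses `out.hidden_states`), we track the cosine similarity between the massive-activation token's hidden state at each layer and its final-layer state; locating the token by the largest absolute value is an assumption for illustration.

```python
# Minimal sketch, continuing from the snippet above (reuses `out.hidden_states`):
# track how stable the massive-activation token's direction is across layers.
# Locating the token by the largest absolute value in the final hidden state
# is an illustrative assumption.
import torch.nn.functional as F

last = out.hidden_states[-1][0]                      # (seq_len, d_model)
pos = last.abs().max(dim=-1).values.argmax().item()  # position of the peak token

ref = out.hidden_states[-1][0, pos].float()          # final-layer direction
for i, h in enumerate(out.hidden_states):
    cos = F.cosine_similarity(h[0, pos].float(), ref, dim=0).item()
    print(f"layer {i:2d}: cos(h_i, h_final) = {cos:+.3f}")
# Repeating this over different prompts illustrates the input-invariant
# directionality: similarity typically saturates near 1.0 past the ME Layer.
```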

Experimental Results


We propose WeMask to mitigate rigid representations in hidden states and improve LLM performance. We evaluate the effectiveness of WeMask on a variety of tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings.


We use WeMask to fine-tune models on math and safety tasks; the results show that WeMask improves model performance.



We also evaluate WeMask on general tasks, where it likewise has a positive effect.



Additionally, we visualize the emergence of attention sinks before and after the ME Layer using heatmaps, and further illustrate how WeMask mitigates the attention sink phenomenon.
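As a rough proxy for such heatmaps, the sketch below measures the attention mass each layer assigns to the first token, a common sink position. Treating token 0 as the sink is an illustrative assumption, and this is not the authors' WeMask implementation.

```python
# Minimal sketch: quantify attention sinks via the attention mass placed on
# the first token (a common sink position). Reuses `model` and `inputs` from
# the first snippet; treating token 0 as the sink is an illustrative
# assumption, and this is not the authors' WeMask implementation.
# Note: some attention backends return None here; if so, reload the model
# with attn_implementation="eager".
import torch

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for i, attn in enumerate(out.attentions):                # each: (batch, heads, q, k)
    sink_mass = attn[0, :, :, 0].float().mean().item()   # avg attention to token 0
    print(f"layer {i:2d}: mean attention to sink token = {sink_mass:.3f}")
# A sharp jump in sink mass after the ME Layer mirrors the heatmaps above;
# attenuating (rather than zeroing) this mass is the intuition behind WeMask.
```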


Results show that WeMask mitigates attention sinks, suggesting that attenuating their influence may be a more practical and effective solution than eliminating them entirely.



BibTeX