A Single Layer to Explain Them All:
Understanding Massive Activations in Large Language Models

Rutgers University · Wake Forest University · Meta AI
Figure 1: Massive activations first appear in the ME Layer, immediately after the FFN.

Figure 2: Schematic diagram of WeMask.

Abstract

We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, the Massive Emergence Layer (ME Layer), consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, the RMSNorm and FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive-activation token's representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive-activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden-state level and shedding new light on principled mitigation strategies.



Emergence of ME Layer

Our analysis shows that massive activations emerge only at the ME Layer, driven by unusually large and directionally aligned RMSNorm and FFN parameters that selectively amplify the massive-activation token.
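To make this concrete, the sketch below scans the per-layer hidden states of an off-the-shelf causal LM for the first layer whose peak activation dwarfs the typical magnitude. The model name and the magnitude heuristic are illustrative assumptions, not the paper's exact criterion.

```python
# Minimal sketch: locate the layer where massive activations first emerge by
# scanning the per-layer hidden states of a Hugging Face causal LM. The model
# name and the magnitude heuristic below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM with an RMSNorm/FFN stack
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] follows block i.
for i, h in enumerate(out.hidden_states):
    a = h.float().abs()
    peak, median = a.max().item(), a.median().item()
    print(f"layer {i:2d}: max |activation| = {peak:9.1f} ({peak / median:.0f}x median)")
    if peak > 100 and peak > 100 * median:  # heuristic: far above typical magnitude
        print("  -> candidate ME Layer: massive activations first appear here")
        break
```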

The Direction of Massive Activation

Once the massive activation emerges at the ME Layer, its hidden state exhibits strong input-invariant directionality and remains stable across subsequent layers.
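This directional stability can be checked in a few lines. Continuing from the sketch above (it reuses `out.hidden_states`), we track the cosine similarity between the massive-activation token's hidden state at each layer and its final-layer state; locating the token by the largest absolute value is an assumption for illustration.

```python
# Minimal sketch, continuing from the snippet above (reuses `out.hidden_states`):
# track how stable the massive-activation token's direction is across layers.
# Locating the token by the largest absolute value in the final hidden state
# is an illustrative assumption.
import torch.nn.functional as F

last = out.hidden_states[-1][0]                      # (seq_len, d_model)
pos = last.abs().max(dim=-1).values.argmax().item()  # position of the peak token

ref = out.hidden_states[-1][0, pos].float()          # final-layer direction
for i, h in enumerate(out.hidden_states):
    cos = F.cosine_similarity(h[0, pos].float(), ref, dim=0).item()
    print(f"layer {i:2d}: cos(h_i, h_final) = {cos:+.3f}")
# Repeating this over different prompts illustrates the input-invariant
# directionality: similarity typically saturates near 1.0 past the ME Layer.
```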

Experimental Results


We propose WeMask to mitigate rigid representations in hidden states and improve LLM performance. We evaluate the effectiveness of WeMask on a variety of tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings.


We use WeMask to fine-tune models on math and safety tasks; the results show that WeMask improves model performance.



We also evaluate WeMask on general tasks, where it likewise has a positive effect.



Additionally, we visualize the emergence of attention sinks before and after the ME Layer using heatmaps, and further illustrate how WeMask mitigates the attention sink phenomenon.
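As a rough proxy for such heatmaps, the sketch below measures the attention mass each layer assigns to the first token, a common sink position. Treating token 0 as the sink is an illustrative assumption, and this is not the authors' WeMask implementation.

```python
# Minimal sketch: quantify attention sinks via the attention mass placed on
# the first token (a common sink position). Reuses `model` and `inputs` from
# the first snippet; treating token 0 as the sink is an illustrative
# assumption, and this is not the authors' WeMask implementation.
# Note: some attention backends return None here; if so, reload the model
# with attn_implementation="eager".
import torch

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for i, attn in enumerate(out.attentions):                # each: (batch, heads, q, k)
    sink_mass = attn[0, :, :, 0].float().mean().item()   # avg attention to token 0
    print(f"layer {i:2d}: mean attention to sink token = {sink_mass:.3f}")
# A sharp jump in sink mass after the ME Layer mirrors the heatmaps above;
# attenuating (rather than zeroing) this mass is the intuition behind WeMask.
```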


Results show that WeMask mitigates attention sinks, suggesting that attenuating their influence may be a more practical and effective solution than eliminating them entirely.



BibTeX