
variable heads self-organize

Figure: variable attention heads and learned frequencies (conceptual illustration of head specialization)

2026-03-30 · ~8 min read · experiments · architecture

By Badaramoni Avinash

In a standard multi-head attention layer, every head has the same dimensionality. If you have 12 heads and a model dimension of 768, each head gets exactly 64 dimensions. This is an arbitrary choice — a convenience, not a principle.

In Wave Field, the wave parameters create natural specialization. We wanted to know: if different heads are already learning different functions, should they all be the same size?


What trained heads actually learn

After training a 12-head Wave Field model, we examined the learned wave parameters. The pattern was consistent across runs:

| Head | Learned ω |
|------|-----------|
| 0    | 0.23      |
| 1    | 0.58      |
| 2    | 0.89      |
| 3    | 1.11      |
| 4    | 1.47      |
| 5    | 1.82      |
| 6    | 1.98      |
| 7    | 2.57      |
| 8    | 2.84      |
| 9    | 2.86      |
| 10   | 2.77      |
| 11   | 4.27      |

Learned frequencies after training: low ω is slow (long-range), high ω is fast (local)

The heads separate into two clear groups. Low-frequency heads (roughly ω < 1.5) learn slow, far-reaching wave kernels. They connect tokens across paragraphs, track document-level topics, and maintain long-range coherence. High-frequency heads (ω > 2.0) learn rapid, quickly-damped kernels. They handle local grammar, word agreement, punctuation, and syntactic structure.

Nobody assigns these roles. The training loss alone drives the frequencies apart. This is self-organization: the physics of wave propagation creates a natural pressure for heads to diversify, because heads with identical frequencies would be redundant.
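The two kernel families can be sketched numerically. The snippet below assumes a simple damped-cosine kernel, exp(-α·d)·cos(ω·d); the actual Wave Field kernel and its distance scaling may differ, but the qualitative contrast between a slow, weakly damped head and a fast, heavily damped one carries over:

```python
import numpy as np

def wave_kernel(distance, omega, alpha):
    """Illustrative damped-cosine kernel: exp(-alpha*d) * cos(omega*d).

    An assumed simplification of the Wave Field kernel; the exact
    parameterization (and how positions map to distances) may differ.
    """
    return np.exp(-alpha * distance) * np.cos(omega * distance)

d = np.linspace(0.0, 20.0, 201)

# Low-frequency head (like Head 0): slow oscillation, weak damping.
low = wave_kernel(d, omega=0.23, alpha=0.17)

# High-frequency head (like Head 11): fast oscillation, strong damping.
high = wave_kernel(d, omega=4.27, alpha=1.20)

# Far from the origin the low-frequency kernel still carries signal,
# while the high-frequency kernel has effectively vanished.
print(abs(low[-1]), abs(high[-1]))
```

At the far end of this range the low-frequency kernel retains several orders of magnitude more amplitude than the high-frequency one, which is the mechanism behind the long-range vs. local split.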


Low-frequency heads

Head 0 has a learned frequency of ω = 0.23 and damping of α = 0.17. Its wave kernel decays slowly — it reaches thousands of tokens before the amplitude drops below 1% of its peak. In effect, this head "sees" the entire document.

When we probed its behavior on long documents, we found it tracks high-level structure: the topic of a paragraph, the referent of a pronoun five sentences back, the thematic arc of an argument. These are exactly the connections that standard attention handles through explicit position-to-position comparison — but here they emerge from a single slow wave.

What long-range connections look like

Consider a passage where a concept introduced in paragraph one is referenced again in paragraph four. In standard attention, this requires the model to maintain and compare key-value pairs across hundreds of intervening tokens. In Wave Field, the low-frequency head's wave physically propagates the signal from paragraph one to paragraph four. The connection is not computed — it is carried.


High-frequency heads

Head 11 has a learned frequency of ω = 4.27 and damping of α = 1.20. Its wave oscillates rapidly and decays within a few dozen tokens. This head has almost no long-range influence — and it does not need any.

Local grammar is inherently short-range. Subject-verb agreement, article-noun pairing, comma placement, clause boundaries — these patterns rarely span more than 10-20 tokens. A high-frequency, heavily-damped wave kernel is the natural fit. The head attends strongly to immediate neighbors and ignores distant context.
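The relative reach of the two extreme heads follows from the damping values alone. Assuming the amplitude envelope decays as exp(-α·d/τ) for some model-specific distance scale τ (a hypothetical parameter; absolute token counts depend on how Wave Field maps positions to distances), the distance at which the envelope falls to 1% is τ·ln(100)/α, so the ratio between heads is independent of τ:

```python
import math

def reach_to_1pct(alpha, tau=1.0):
    """Distance where an assumed exp(-alpha*d/tau) envelope hits 1%.

    tau is a hypothetical distance scale, so only the ratio between
    heads is meaningful here, not the absolute numbers.
    """
    return tau * math.log(100) / alpha

head0 = reach_to_1pct(alpha=0.17)    # low-frequency, long-range head
head11 = reach_to_1pct(alpha=1.20)   # high-frequency, local head

# Head 0's influence extends roughly 7x farther than Head 11's,
# whatever the absolute scale.
print(round(head0 / head11, 1))  # → 7.1
```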


Variable head dimensions

Given that heads naturally specialize into different roles, we tested whether they also benefit from different sizes. The hypothesis: long-range heads need more representational capacity (more dimensions) because they encode complex document-level relationships, while local grammar heads can work with fewer dimensions because their patterns are simpler.

We compared two configurations, keeping the total parameter count approximately equal:

| Configuration | Heads               | Total dims | Perplexity |
|---------------|---------------------|------------|------------|
| Uniform       | 12 × 64             | 768        | 24.7       |
| Variable      | 4×32 + 4×64 + 4×96  | 768        | 24.5       |

The variable configuration assigns 96 dimensions to the 4 lowest-frequency heads (long-range), 64 dimensions to the 4 mid-frequency heads, and 32 dimensions to the 4 highest-frequency heads (local). The result: equivalent quality. The variable-head model matches the uniform model's perplexity within noise.

This is a useful negative result. It means the architecture is robust to head sizing — the wave parameters dominate the head's behavior, not its dimensionality. You can reallocate dimensions freely without degrading the model.
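A minimal sketch of variable-size head splitting, using NumPy slicing over the model dimension. This is an illustration of the bookkeeping only; a real implementation would use per-head projection matrices rather than slicing the residual stream directly:

```python
import numpy as np

# Assumed split from the experiment above: 4 long-range heads at 96 dims,
# 4 mid-frequency heads at 64, 4 local heads at 32 (sums to d_model = 768).
head_dims = [96] * 4 + [64] * 4 + [32] * 4
assert sum(head_dims) == 768

def split_heads(x, dims):
    """Slice the last axis into variable-size per-head chunks."""
    offsets = np.cumsum([0] + dims)
    return [x[..., offsets[i]:offsets[i + 1]] for i in range(len(dims))]

x = np.random.randn(2, 10, 768)         # (batch, seq, d_model)
heads = split_heads(x, head_dims)

print([h.shape[-1] for h in heads])     # per-head dimensionality
```

Because each head's chunk is addressed by cumulative offsets rather than a fixed stride, reallocating dimensions between heads is a one-line change to `head_dims`.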


Why self-organization matters

In standard transformers, head specialization is a well-known phenomenon — some heads attend locally, some attend to specific positions, some appear to track syntax. But this specialization is fragile, hard to predict, and difficult to interpret. Researchers discover it through post-hoc analysis of attention matrices.

In Wave Field, specialization is legible. Each head's role is encoded in three numbers. You can read the frequency and know immediately whether a head is handling long-range or local patterns. You can compare damping values and understand how far each head's influence reaches. The wave parameters are not just a computational mechanism — they are a description of what each head has learned to do.

This legibility has practical consequences. If a model underperforms on long-range tasks, you can inspect the low-frequency heads directly. If local grammar is weak, you can examine the high-frequency heads. The architecture's internal structure is transparent in a way that attention matrices are not.


The physics forces diversity

Two heads with identical wave frequencies would produce identical attention patterns. Gradient descent naturally pushes them apart, because identical heads waste capacity — the loss improves when they differentiate. But the wave parameterization makes this differentiation smooth and continuous. A small change in ω produces a small change in the head's behavior, which produces a small change in the loss. There are no sharp transitions or mode collapses.
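The smoothness claim is easy to illustrate under the same assumed damped-cosine kernel: the cosine similarity between two heads' kernels is exactly 1 when their frequencies coincide, and it falls off continuously as the frequency gap grows, with no discontinuities for gradient descent to get stuck on.

```python
import numpy as np

def wave_kernel(d, omega, alpha=0.5):
    # Assumed damped-cosine form; stands in for the Wave Field kernel.
    return np.exp(-alpha * d) * np.cos(omega * d)

d = np.linspace(0.0, 20.0, 400)
base = wave_kernel(d, omega=2.0)

def similarity(delta):
    """Cosine similarity between kernels at omega and omega + delta."""
    other = wave_kernel(d, omega=2.0 + delta)
    return float(np.dot(base, other)
                 / (np.linalg.norm(base) * np.linalg.norm(other)))

# Identical frequencies give identical kernels (similarity 1);
# similarity decreases smoothly as the frequencies separate.
print([round(similarity(x), 3) for x in (0.0, 0.1, 0.5, 1.0)])
```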

This is the advantage of parameterizing attention through physics rather than through learned weight matrices. The inductive bias of wave propagation does not just enable efficient computation — it organizes the model's internal representations into interpretable, specialized components.