
how wave propagation replaces attention

Wave interference — deep field, pale crests (conceptual visualization)

2026-03-30 · ~12 min read · interactive

By Badaramoni Avinash

Every large language model today — GPT, Claude, LLaMA, Mistral — uses the same core mechanism: self-attention. Each token compares itself with every other token in the sequence. This is powerful, but it scales quadratically: O(N²).

Wave Field LLM replaces this with something fundamentally different. Tokens don't compare with each other at all. Instead, they deposit information onto a continuous field, and wave physics propagates that information. The result is O(N log N) complexity — and it changes what's computationally possible.


The cost of comparing everything

In standard attention, every token must attend to every other token. The number of operations grows with the square of the sequence length. Drag the slider to see how this scales:

Attention matrix vs wave field — interactive slider. At 2K tokens: standard attention, O(N²), needs ~4M operations; Wave Field, O(N log N), needs ~22K — 182× fewer.

At 2,048 tokens, the difference is modest. At 128K tokens, standard attention needs 16 billion operations. Wave Field needs 2.2 million. At 1M tokens, standard attention is physically impossible — no computer can store the N×N matrix. Wave Field runs fine.
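The scaling claim is easy to reproduce with a back-of-the-envelope operation count. This ignores constant factors, so the exact ratios differ slightly from the widget's figures:

```python
import math

def attention_ops(n: int) -> int:
    # Standard self-attention: every token attends to every other token.
    return n * n

def wave_field_ops(n: int) -> int:
    # FFT-based field convolution: O(N log N), up to constant factors.
    return int(n * math.log2(n))

for n in [2_048, 131_072, 1_048_576]:
    ratio = attention_ops(n) / wave_field_ops(n)
    print(f"{n:>9} tokens: {attention_ops(n):>16,} vs {wave_field_ops(n):>12,} ops ({ratio:,.0f}x)")
```

The gap is the whole story: it is modest at 2K tokens and catastrophic at 1M.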


Scatter, convolve, gather

Instead of direct token-to-token comparison, Wave Field attention works in three steps:

Wave field pipeline

1. Scatter: tokens deposit their values onto a continuous 1D field via bilinear interpolation.
2. Convolve: the field is convolved with each head's wave kernel, propagating the deposited information along it.
3. Gather: each token reads its updated representation back from the field at its own position.

The key operation — convolution — uses the Fast Fourier Transform, which is inherently O(N log N). There is no attention matrix. No softmax. No quadratic bottleneck.
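Assuming scalar values and a circular field for simplicity, the scatter → convolve → gather pipeline can be sketched in a few lines of NumPy. All names here are illustrative, not the model's actual API:

```python
import numpy as np

def wave_field_attention(values, positions, field_size, kernel):
    """One head of scatter -> convolve -> gather (illustrative sketch).

    values:    (N,) one scalar per token (real models use vectors)
    positions: (N,) token positions mapped into [0, field_size)
    kernel:    (field_size,) wave kernel sampled on the field
    """
    # 1. Scatter: deposit each value between its two nearest field cells
    #    (linear interpolation, the 1D case of the bilinear scatter).
    field = np.zeros(field_size)
    left = np.floor(positions).astype(int) % field_size
    frac = positions - np.floor(positions)
    np.add.at(field, left, values * (1 - frac))
    np.add.at(field, (left + 1) % field_size, values * frac)

    # 2. Convolve: circular convolution via FFT -- O(N log N),
    #    no attention matrix, no softmax.
    propagated = np.fft.ifft(np.fft.fft(field) * np.fft.fft(kernel)).real

    # 3. Gather: read the propagated field back at each token's position.
    return propagated[left] * (1 - frac) + propagated[(left + 1) % field_size] * frac

# Toy usage: 8 tokens on a 32-cell field with a damped-cosine kernel.
t = np.arange(32)
kernel = np.exp(-0.3 * t) * np.cos(1.5 * t)
out = wave_field_attention(np.ones(8), np.linspace(0, 24, 8), 32, kernel)
print(out.shape)  # (8,)
```

The only token-count-dependent costs are the scatter, the gather, and the FFT, which is where the O(N log N) comes from.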


The wave kernel

Each attention head has a learnable wave kernel, defined by just three parameters:

k(t) = exp(−α·t) · cos(ω·t + φ)

ω controls frequency — how fast the wave oscillates. α controls damping — how far the signal travels. φ controls phase — head diversity.

Wave kernel explorer — interactive (ω = 1.5, α = 0.30, φ = 0.00)

A medium-frequency wave with moderate reach — connects sentences and paragraphs.

A head with low ω and low α produces a slow, far-reaching wave — useful for connecting tokens thousands of positions apart. A head with high ω and high α oscillates rapidly and decays quickly — useful for local grammar. The model learns the right frequencies during training.
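The kernel itself is one line of code. Head 0's parameters below come from the explorer; the "local" head's values are illustrative, not taken from a trained model:

```python
import numpy as np

def wave_kernel(t, omega, alpha, phi=0.0):
    # k(t) = exp(-alpha * t) * cos(omega * t + phi)
    return np.exp(-alpha * t) * np.cos(omega * t + phi)

t = np.arange(0, 50)
long_range = wave_kernel(t, omega=0.23, alpha=0.17)  # slow, far-reaching
local      = wave_kernel(t, omega=3.0,  alpha=1.0)   # fast, quickly damped (made-up values)

# 30 positions out, the long-range head still carries signal;
# the local head has decayed to effectively zero.
print(abs(long_range[30]), abs(local[30]))
```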

What 12 heads look like

A trained Wave Field model has 12 attention heads, each with different wave parameters. Click any head to see its kernel:

Multi-head attention explorer — click a head

Head 0 — ω=0.23, α=0.17 — slow wave, long-range connections between distant paragraphs

Notice how the heads self-organize: early heads (0-3) have low frequency for long-range connections, while later heads (9-11) have high frequency for local grammar. This emerges naturally during training — nobody assigns these roles.
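One way to see this division of labour is to convert each head's damping into an effective reach: the distance at which the exp(−α·t) envelope has decayed to 1%. Head 0's parameters come from the explorer above; the rest are made-up values following the same low-to-high pattern, and reach is in field units, which need not map one-to-one to token positions:

```python
import math

# Hypothetical per-head (omega, alpha) pairs; only head 0 is from the article.
heads = {0: (0.23, 0.17), 3: (0.5, 0.3), 9: (2.0, 1.2), 11: (3.0, 1.5)}

for h, (omega, alpha) in heads.items():
    # Distance at which the exp(-alpha * t) envelope falls below 1%.
    reach = math.log(100) / alpha
    print(f"head {h:>2}: omega={omega}, alpha={alpha}, reach ~ {reach:.0f} field units")
```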


Measured performance

These are not theoretical projections. They are measured on real hardware, running the same 130M-parameter model in float32.

Throughput comparison — same model, same GPU

On an RTX 5090 — a consumer GPU that costs $2,000 — Wave Field processes 256K tokens while standard attention cannot even handle 32K. The throughput doesn't degrade: Wave Field maintains 157K tokens/sec from 32K through 256K context.


Memory

Standard attention stores a key and value vector for every token. This KV cache grows linearly with sequence length and dominates GPU memory at long contexts.

Wave Field has no KV cache. It stores a field — a compressed wave representation whose footprint stays small even at 1 million tokens.

Memory usage — interactive slider (at 32K context, standard attention needs 5.3× more memory)
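The KV-cache side of the comparison is straightforward to model. The dimensions below (12 layers, 12 heads of size 64, float32) are assumptions roughly matching a 130M-parameter model, not published figures, and the field's footprint is not modeled because its exact size isn't stated:

```python
def kv_cache_bytes(seq_len, n_layers=12, n_heads=12, head_dim=64, dtype_bytes=4):
    # Standard attention caches one key and one value vector
    # per token, per head, per layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

for n in [32_768, 1_048_576]:
    print(f"{n:>9} tokens: {kv_cache_bytes(n) / 2**30:.1f} GiB KV cache")
```

Because the cache is linear in sequence length, going from 32K to 1M context multiplies it by 32, while the field's cost stays on the other side of the gap.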


What this enables

Because Wave Field runs at long context on consumer GPUs, it unlocks capabilities that are impossible with standard attention at the same hardware budget:

A cluster of 8× RTX 5090 GPUs — costing $16,000 total — can train a Wave Field model at 256K context. The same training on standard transformers would require 8× H100 GPUs costing $240,000, and would still be limited to shorter context due to the quadratic memory wall.

The architecture scales. We have verified it up to 512K tokens on a single H100.