Complex Numbers and Deep Learning Part 1

No one knows what an imaginary number is. It’s like asking ML people how this algorithm is meant to work outside of a notebook. That reality just does not exist. Imaginary numbers are very powerful in areas of engineering that deal with signals, rotations, and cycles. Deep learning models can also use complex numbers, but they often do not because standard backpropagation does not carry over cleanly to complex weights. There are ways around this that keep the natural relationship between the real and imaginary parts. Keeping that relationship intact can help the model learn more efficiently and generalize better. Before we get into all that, we need to understand what an imaginary number is and what a complex number is.

Table of Contents

Complex Numbers
Complex Numbers and ML
Wirtinger Calculus has Entered the Fight
Complex Conclusion
Signals Data Preview

Complex Numbers

“Imaginary” is a bad name for these numbers. It makes them sound made up, but they are not. This is hard to grasp in the same way negative numbers were once hard to grasp for many in mathematics, which makes sense: if you have 3 apples and you take 4 apples away, you now have -1 apples. How in the world can you have -1 apples? It sounds absurd, but that doesn’t mean negative numbers are not real. In banking it makes a lot of sense. You are in the hole and have negative dollars. You gotta get out of that hole.

The world’s most trusted source, Wikipedia, defines an imaginary number as: “the product of a real number and the imaginary unit $i$, which is defined by its property $i^2 = -1$.”

An imaginary number is then any real multiple of $i$, written as:

\[bi \quad \text{where} \quad b \in \mathbb{R}\]

A complex number combines a real part and an imaginary part:

\[z = a + bi \quad \text{where} \quad a,b \in \mathbb{R}\]

These definitions don’t help at all. People can read that definition all day long but it won’t click.

I am approaching this concept from an engineering and ML perspective. I am not a pure mathematician, and a pure mathematician would probably disagree with the framing that follows.

I view imaginary and complex numbers as rotations. In the practical sense the problems that deal with complex numbers have some relationship with rotations or cycles.

What I mean by that is, in the practical world, multiplying a number by $i$ rotates it by 90 degrees.

Multiplying a complex number by $i$ rotates it by $90^\circ$ counterclockwise:

\[i(a + bi) = -b + ai\]

In matrix form, this is the same as applying a 2D rotation by $90^\circ$:

\[\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} -b \\ a \end{bmatrix}\]
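
A quick way to convince yourself is to try it. Here is a minimal Python sketch using only built-in complex numbers. The helper name `rotate90` and the sample point are mine, picked for illustration.

```python
def rotate90(a, b):
    """Apply the 2x2 rotation matrix [[0, -1], [1, 0]] to the vector (a, b)."""
    return (0 * a - 1 * b, 1 * a + 0 * b)

z = complex(3, 2)        # a = 3, b = 2
rotated = 1j * z         # multiply by i

print(rotated)           # (-2+3j)
print(rotate90(3, 2))    # (-2, 3)

# Four quarter turns bring the point back to where it started.
back = z
for _ in range(4):
    back = 1j * back
print(back == z)         # True
```

Both routes land on $(-b, a)$, which is the whole claim: multiplying by $i$ and applying the $90^\circ$ rotation matrix are the same operation.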

Multiply by $i$ Visualization

This animation keeps the rotation fixed at $90^\circ$ so you can see how $(a,b)$ maps to $(-b,a)$.

Mapping: $(a,b) \rightarrow (-b,a)$

Polar Form

A complex number can be written in two equivalent ways:

Cartesian form: $z = a + bi$ (horizontal plus vertical parts).
Polar form: $z = re^{i\theta}$ (length plus direction).

They represent the same point in the complex plane:

\[z = a + bi = r(\cos\theta + i\sin\theta) = re^{i\theta}\]

where

\[r = |z| = \sqrt{a^2 + b^2}, \qquad \theta = \arg(z) = \operatorname{atan2}(b,a)\]

So $r$ tells you how far from the origin, and $\theta$ tells you the direction from the positive real axis.

Multiplying by $re^{i\theta}$ does two things at once: scale by $r$ and rotate by $\theta$.

Angles are periodic, so these all point the same direction:

\[\theta,\ \theta + 2\pi,\ \theta + 4\pi,\ \ldots,\ \theta + 2\pi k \quad (k \in \mathbb{Z})\]
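
Python's standard `cmath` module does these conversions, so the polar form is easy to poke at. This is a small sketch with numbers I picked for illustration: it converts $1 + i$ to polar form and back, multiplies by a second number to show the scale-and-rotate effect, and checks that adding $2\pi$ to the angle changes nothing.

```python
import cmath
import math

z = complex(1, 1)
r, theta = cmath.polar(z)              # r = |z|, theta = atan2(b, a)
print(r, math.degrees(theta))          # about 1.414 and 45 degrees

# Rebuild the Cartesian form from the polar form.
z_back = cmath.rect(r, theta)

# Multiplying by 2 * e^{i * pi/2} scales by 2 and rotates by 90 degrees.
w = cmath.rect(2.0, math.pi / 2)
r_out, theta_out = cmath.polar(z * w)
print(r_out / r, math.degrees(theta_out - theta))   # about 2 and 90 degrees

# Adding 2*pi to the angle lands on the same point.
z_wrapped = cmath.rect(r, theta + 2 * math.pi)
print(abs(z_wrapped - z) < 1e-12)      # True
```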

Polar Form Visualization

This animation shows the same complex number in both forms: $z = a + bi$ and $z = re^{i\theta}$.

Rotation Visualization

Here is an animation that shows the original vector and the rotated version.

I hope defining these terms and showing these animations help get the point across that complex numbers deal with rotating and scaling vectors.

Complex Numbers and ML

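So if complex numbers contain rotation information, then why are complex neural networks not that popular? It has to do with how backpropagation works. The ordinary complex derivative does not exist for the real-valued losses we train on, so gradients cannot be pushed through complex weights the usual way, and most hardware and software is optimized for real-valued arithmetic.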

Holomorphic vs Non-Holomorphic

For people new to this or those who forgot what they learned in calculus class like me, here is a simple way to define the terms:

Holomorphic: one consistent complex derivative in every direction.
Non-holomorphic: derivative estimate changes with direction.

The derivative test is:

\[f'(z) = \lim_{h\to 0} \frac{f(z+h)-f(z)}{h}\]

Holomorphic example: $f(z)=z^2$ gives one consistent derivative, $f'(z)=2z$.
Non-holomorphic example: $f(z)=\lvert z \rvert^2=z\overline{z}$ depends on both $z$ and $\overline{z}$, so the limit changes with the direction of $h$ and the ordinary complex derivative does not exist (except at $z=0$).

Most ML losses are real-valued and behave like the second case, which is the root of the problem we will see next.

Direction Visual

This compares two tiny derivative estimates at the same point $z$:

blue uses $h=\varepsilon$ (real-axis step)
orange uses $h=i\varepsilon$ (imag-axis step)

If the arrows match, the derivative is direction-independent (holomorphic behavior).
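
The same comparison can be sketched in a few lines of Python. The function names and the step size are my own choices. It evaluates the difference quotient from the derivative test once with a real step and once with an imaginary step, then reports the gap between the two estimates.

```python
def directional_quotient(f, z, h):
    """Difference quotient (f(z + h) - f(z)) / h for one small complex step h."""
    return (f(z + h) - f(z)) / h

def square(z):
    return z * z

def abs_squared(z):
    return z * z.conjugate()

def gap(f, z, eps=1e-6):
    """Distance between the real-axis estimate and the imag-axis estimate."""
    along_real = directional_quotient(f, z, eps)        # h = eps
    along_imag = directional_quotient(f, z, 1j * eps)   # h = i * eps
    return abs(along_real - along_imag)

z = complex(1.0, 0.8)
print("z^2 gap:  ", gap(square, z))        # tiny, shrinks with eps
print("|z|^2 gap:", gap(abs_squared, z))   # stays near 2|z|, no matter how small eps is
```

For $z^2$ the two estimates agree up to a term of order $\varepsilon$. For $\lvert z \rvert^2$ the real step gives roughly $2\,\mathrm{Re}(z)$ and the imaginary step gives roughly $-2i\,\mathrm{Im}(z)$, so the gap never closes.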

Why Complex Derivatives Are Tough in ML

ML losses are real-valued (like $L = \lvert w-a \rvert^2$), and a non-constant real-valued function of a complex variable cannot be holomorphic. Ordinary complex calculus only offers one derivative, $f'(w)$, and it does not even exist for these losses. So we cannot blindly reuse real-number gradient descent on complex weights.

If you push ahead and differentiate with respect to $w$ as if it were an ordinary variable, something bad happens: the imaginary-axis step gets flipped. The real-axis move is fine, but the vertical move goes the wrong way, so the imaginary error grows and the loss ends up going up instead of down. The graph below shows this directly. Blue walks toward the target. Red drifts away because its imaginary step is flipped.

Wirtinger Calculus has Entered the Fight

So if most ML loss functions are non-holomorphic, are we out of luck?

No. Wirtinger calculus lets us treat a real-valued complex loss in a way that is fully consistent with real gradient descent.

Thankfully some crazy guy named Wilhelm Wirtinger introduced this calculus back in 1927. (Wirtinger Derivatives.)

Explaining Wirtinger

This smart math guy decided: let’s treat $z$ and its conjugate $\overline{z}$ (also written $z^{\ast}$) as two independent variables. A real-valued loss depends on both $z$ and $\overline{z}$. Since $z = u + iv$, that means the loss really just depends on two real numbers: $u$ and $v$. That’s the major piece.

Because we have two numbers, we need two partial derivatives to describe the loss changes. Wirtinger calculus gives us the tools to do this:

\[\frac{\partial L}{\partial z} = \frac{1}{2}\left(\frac{\partial L}{\partial u} - i\frac{\partial L}{\partial v}\right), \qquad \frac{\partial L}{\partial z^*} = \frac{1}{2}\left(\frac{\partial L}{\partial u} + i\frac{\partial L}{\partial v}\right)\]

These are the two real partial derivatives ($\partial L/\partial u$ and $\partial L/\partial v$) shown in their complex form. One combines them with $-i$, the other with $+i$. That sign difference is the big enchilada. It’s what makes the magic happen.
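
If the formulas feel abstract, they can be estimated numerically. Here is a rough sketch, assuming central finite differences are accurate enough for the job. The function `wirtinger` and the test loss are my own names, not a library API. It estimates the two real partials and then combines them exactly as in the two formulas.

```python
def wirtinger(loss, z, eps=1e-6):
    """Estimate (dL/dz, dL/dz*) for a real-valued loss using central differences."""
    dL_du = (loss(z + eps) - loss(z - eps)) / (2 * eps)            # real-axis partial
    dL_dv = (loss(z + 1j * eps) - loss(z - 1j * eps)) / (2 * eps)  # imag-axis partial
    dL_dz = 0.5 * (dL_du - 1j * dL_dv)       # combined with -i
    dL_dzconj = 0.5 * (dL_du + 1j * dL_dv)   # combined with +i
    return dL_dz, dL_dzconj

def loss(z):
    return abs(z) ** 2      # L = u^2 + v^2

dL_dz, dL_dzconj = wirtinger(loss, complex(1, 1))
print(dL_dz)       # close to 1 - 1j
print(dL_dzconj)   # close to 1 + 1j
```

For a real-valued loss the two results are always complex conjugates of each other, which is why only one of them needs to be computed in practice.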

For gradient descent on a real-valued loss, the update rule is:

\[z_{t+1} = z_t - \eta \frac{\partial L}{\partial z^*}\]

That is the only formula you need to remember. It looks like regular gradient descent but uses the conjugate derivative $\partial L/\partial z^{\ast}$.

If you unpack it, this is exactly the same as updating the real and imaginary parts separately:

\[u_{t+1} = u_t - \frac{\eta}{2}\frac{\partial L}{\partial u}, \qquad v_{t+1} = v_t - \frac{\eta}{2}\frac{\partial L}{\partial v}\]

The factor of $1/2$ is the price of using the compact complex notation instead of writing out the two real updates.
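
To check that the compact update and the split update really are the same step, here is a small sketch with a made-up loss (my choice, not from any library): $L = (u-2)^2 + 3v^2$, so $\partial L/\partial u = 2(u-2)$ and $\partial L/\partial v = 6v$.

```python
import math

def split_step(u, v, dL_du, dL_dv, lr):
    """Update the real and imaginary parts separately, each with lr / 2."""
    return u - 0.5 * lr * dL_du, v - 0.5 * lr * dL_dv

def complex_step(z, dL_du, dL_dv, lr):
    """One update z - lr * dL/dz*, with dL/dz* built from the real partials."""
    dL_dzconj = 0.5 * (dL_du + 1j * dL_dv)
    return z - lr * dL_dzconj

# Illustrative starting point and learning rate.
u, v, lr = 0.5, -1.0, 0.1
dL_du, dL_dv = 2 * (u - 2), 6 * v

split_result = split_step(u, v, dL_du, dL_dv, lr)
complex_result = complex_step(complex(u, v), dL_du, dL_dv, lr)

print(split_result)     # roughly (0.65, -0.7)
print(complex_result)   # roughly 0.65 - 0.7j
```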

If you want to go deeper, Kreutz-Delgado’s paper walks through the full calculus with fancy examples.

Why the Wirtinger update uses $\partial L/\partial z^{\ast}$

So what’s the point of using $\partial L/\partial z^{\ast}$ and not $\partial L/\partial z$?

The conjugate derivative $\partial L/\partial z^{\ast}$ combines the real and imaginary partials with $+i$. The other derivative $\partial L/\partial z$ uses $-i$, which flips the imaginary step backward. We will see how that sign difference is the golden nugget.

Take a look at this example. Take $L = u^2 + v^2$ at the point $z = 1 + i$ (so $u=1, v=1$):

Real gradient: $\partial L/\partial u = 2$, $\partial L/\partial v = 2$ → descent step points in the direction $(-1, -1)$
$\partial L/\partial z^{\ast} = 1 + i$ → update step $-\eta(1 + i) = -\eta - i\eta$ moves down-left (correct)
$\partial L/\partial z = 1 - i$ → update step $-\eta(1 - i) = -\eta + i\eta$ moves up-left (wrong, imaginary part flips sign)
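
The same two values fall out of a shortcut, assuming you are comfortable treating $z$ and $z^{\ast}$ as independent symbols: write the loss as $L = z z^{\ast}$ and differentiate with respect to one while holding the other fixed.

\[L = u^2 + v^2 = z z^{\ast} \quad\Rightarrow\quad \frac{\partial L}{\partial z^{\ast}} = z = 1 + i, \qquad \frac{\partial L}{\partial z} = z^{\ast} = 1 - i\]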

The real split update moves $u$ by $-\frac{\eta}{2}\frac{\partial L}{\partial u}$ and $v$ by $-\frac{\eta}{2}\frac{\partial L}{\partial v}$. Perform one complex step and the math works out to:

\[\Delta z = -\frac{\eta}{2}\frac{\partial L}{\partial u} \;-\; i\,\frac{\eta}{2}\frac{\partial L}{\partial v}\]

Now expand the Wirtinger update $-\eta \frac{\partial L}{\partial z^*}$:

\[-\eta \cdot \frac{1}{2}\!\left(\frac{\partial L}{\partial u} + i\frac{\partial L}{\partial v}\right) = -\frac{\eta}{2}\frac{\partial L}{\partial u} \;-\; i\,\frac{\eta}{2}\frac{\partial L}{\partial v}\]

The $-\eta$ multiplied into the $+i$ flips it to $-i$, so the Wirtinger update is identical to the real split update.

The only difference between the two derivatives is the sign on the imaginary term. With the wrong sign, the imaginary part moves away from the minimum: we would be adding where we should be subtracting, and never reach the target.

Correct ($\partial L/\partial z^{\ast}$): $\;-\frac{\eta}{2}\frac{\partial L}{\partial u} \;\mathbf{-}\; i\frac{\eta}{2}\frac{\partial L}{\partial v}$

Wrong ($\partial L/\partial z$): $\;-\frac{\eta}{2}\frac{\partial L}{\partial u} \;\mathbf{+}\; i\frac{\eta}{2}\frac{\partial L}{\partial v}$

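Interactive figure: the imaginary-axis step under each rule. Correct ($\partial L/\partial z^{\ast}$): $\Delta v = -\frac{\eta}{2}\,\partial L/\partial v$. Wrong ($\partial L/\partial z$): $\Delta v = +\frac{\eta}{2}\,\partial L/\partial v$ (flipped!).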

Wirtinger Solves Optimizing Loss Functions with Complex Numbers

This visual shows the loss $L(w) = \lvert w-a \rvert^2$ with a rotating target.

Wirtinger (correct): use $\frac{\partial L}{\partial w^*}$, so the step $w - \eta\frac{\partial L}{\partial w^*}$ matches the real-imag gradient. $w$ tracks the target and loss falls.
Naive complex derivative: use $\frac{\partial L}{\partial w}$, which is the complex conjugate of $\frac{\partial L}{\partial w^*}$. It flips the imaginary step, so $w$ drifts away and loss rises.

Both derivatives are defined because the calculus needs both to describe a non-holomorphic function. The visual shows which one corresponds to real gradient descent.
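
Here is a stripped-down version of that comparison with a fixed target, as a sketch rather than the code behind the visual. The target, start point, learning rate, and step count are illustrative values I picked. It uses the closed forms $\partial L/\partial w^{\ast} = w - a$ and $\partial L/\partial w = \overline{(w-a)}$ for this loss.

```python
def loss(w, a):
    return abs(w - a) ** 2

def descend(w, a, lr, steps, use_conjugate_grad):
    """Run gradient steps on L = |w - a|^2 with either dL/dw* or dL/dw."""
    for _ in range(steps):
        error = w - a
        # dL/dw* = (w - a); dL/dw = conj(w - a)
        grad = error if use_conjugate_grad else error.conjugate()
        w = w - lr * grad
    return w

a = complex(0.8, -0.6)     # fixed target (illustrative)
w0 = complex(-1.2, 0.9)    # starting weight (illustrative)

w_good = descend(w0, a, lr=0.2, steps=16, use_conjugate_grad=True)
w_bad = descend(w0, a, lr=0.2, steps=16, use_conjugate_grad=False)

print("start loss:    ", loss(w0, a))
print("Wirtinger loss:", loss(w_good, a))   # shrinks toward 0
print("naive loss:    ", loss(w_bad, a))    # larger than where it started
```

With the naive gradient the real part still converges, because the real error shrinks by a factor of $1 - \eta$ per step. The imaginary error grows by a factor of $1 + \eta$ per step, which is what eventually dominates the loss.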

Why It Captures Rotation Better

Here is the part that connects back to the rotation theme of this whole article. This is the whole point. All this work to find a way to better capture rotation and geometric relationships in neural networks for problems that deal in complex numbers.

A complex weight $w$ carries both magnitude and angle. When the target $a$ sits at some angle, the error vector $w - a$ points in a specific direction. The Wirtinger gradient $\frac{\partial L}{\partial w^{\ast}} = w - a$ preserves that direction, so each step $-\eta(w - a)$ moves $w$ along the straight line toward $a$.

The naive derivative $\frac{\partial L}{\partial w} = \overline{(w-a)}$ conjugates the error, which mirrors the angle across the real axis. That mirror flip is exactly the rotation problem: the step points in a direction that does not match the real geometry of the loss.

So Wirtinger derivatives “capture rotation” because they respect the complex structure of the error instead of silently flipping the sign and sending you spiraling away from the target.

Rotating Target Visual

We are almost done with all this math. This is starting to saturate my brain.

This plot shows the difference between using the Wirtinger conjugate derivative and using the naive complex derivative. The target $a$ rotates around the origin, so the model has to track a moving angle. The blue path uses $\frac{\partial L}{\partial w^{\ast}}$ and follows the target. The red path uses $\frac{\partial L}{\partial w}$ and drifts away because the imaginary update is going in the wrong direction. For the purpose of the visual I clamp the naive path, or else it falls off the plot.
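
For anyone who wants to reproduce the idea without the plot, here is a rough sketch of the same experiment. It is my own reconstruction, not the code behind the visual, and the step size `dtheta` and step count are assumptions. No clamping is applied, so the naive loss is free to blow up.

```python
import cmath

def track(lr, steps, radius, dtheta, use_conjugate_grad):
    """Follow a target a_t = radius * e^{i * t * dtheta}; return the loss at each step."""
    w = complex(0.0, 0.0)
    losses = []
    for t in range(steps):
        a = cmath.rect(radius, t * dtheta)    # target rotates around the origin
        error = w - a
        losses.append(abs(error) ** 2)
        # dL/dw* = (w - a); dL/dw = conj(w - a)
        grad = error if use_conjugate_grad else error.conjugate()
        w = w - lr * grad
    return losses

losses_wirtinger = track(lr=0.06, steps=300, radius=1.2, dtheta=0.02, use_conjugate_grad=True)
losses_naive = track(lr=0.06, steps=300, radius=1.2, dtheta=0.02, use_conjugate_grad=False)

print("Wirtinger: first", losses_wirtinger[0], "last", losses_wirtinger[-1])
print("naive:     first", losses_naive[0], "last", losses_naive[-1])
```

The Wirtinger path never reaches zero loss because the target keeps moving, so it settles into a small steady lag behind it. The naive path's imaginary error compounds every step.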

Legend: blue is the Wirtinger path (tracks the rotating target); red is the naive path (drifts away because the imaginary axis is flipped, clamped to the plot); the third marker is the rotating target $a$.

Complex Conclusion

TL;DR

Complex numbers encode rotation and scaling together.
Real-valued ML losses are non-holomorphic.
Wirtinger calculus gives us two partial derivatives instead of one.
The conjugate derivative $\partial L/\partial z^{\ast}$ is the one that matches real gradient descent.
Keeping the real and imaginary parts (I and Q in signal data) together lets the model learn the geometric relationship naturally.

The industry has decided that, to keep things simple and easy for hardware and software, we drop the complex structure and split each number into its real and imaginary parts, treating them as separate real-valued pieces.

Most approaches to complex-valued problems throw away this geometric relationship and train on large amounts of data, so that the model eventually learns how the two components relate to each other.

I don’t think this is a bad idea. It takes advantage of our current software-optimized computations and hardware acceleration. It makes training and inference easier.

The disadvantage is you lose the natural geometric relationship. If you want the model to learn how complex numbers relate to scaling and rotating, then you need to keep the two parts together and train using backpropagation with Wirtinger derivatives. That way you can keep the complex relationship intact.

There are many advantages to doing this. In the next article we will compare some complex neural networks to real neural networks in a problem domain that is complex. We will be analyzing IQ (in-phase and quadrature) data of different signal types. Here is a preview of what we are looking at in the next article.

Signals Data Preview

This is the type of data that benefits from keeping the complex relationship together. Complex signals show up in radio, audio, and anywhere you need to model rotations and oscillations. Part 2 will train complex neural networks on IQ data, so here is a preview of what those signals actually look like.

Complex sinusoid. One complex number that spins in time:
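
In code, a complex sinusoid is just the polar form with an angle that grows with time, $z(t) = A e^{i(2\pi f t + \phi)}$. Here is a minimal sketch. The function name, the sample rate, and the sample count are my own assumptions for illustration.

```python
import cmath
import math

def complex_sinusoid(frequency, amplitude, phase, sample_rate, n_samples):
    """Samples of z(t) = amplitude * e^{i (2 pi f t + phase)}.

    The real part is the I (in-phase) channel, the imaginary part is Q (quadrature).
    """
    return [
        cmath.rect(amplitude, 2 * math.pi * frequency * (n / sample_rate) + phase)
        for n in range(n_samples)
    ]

# One full turn sampled at 8 points: each sample is the previous one rotated by 45 degrees.
samples = complex_sinusoid(frequency=1.0, amplitude=0.9, phase=0.0,
                           sample_rate=8.0, n_samples=8)
for s in samples:
    print(f"I = {s.real:+.3f}, Q = {s.imag:+.3f}, |z| = {abs(s):.3f}")
```

The magnitude stays at the amplitude the whole time. Only the angle moves, which is exactly the rotation picture from earlier.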

IQ impairments. Real received symbols are never perfect. The same constellation can look very different depending on what went wrong between the transmitter and receiver:
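
To make that concrete, here is a small sketch of two impairments applied to an ideal QPSK constellation. QPSK, the two impairments, and the strengths used are examples I am picking for illustration, and the function names are hypothetical. The actual signals in Part 2 may use different ones.

```python
import cmath
import math

# Ideal QPSK constellation: four unit-magnitude symbols at 45, 135, 225, 315 degrees.
QPSK = [cmath.rect(1.0, math.radians(deg)) for deg in (45, 135, 225, 315)]

def phase_offset(symbols, radians):
    """Rotate every symbol by the same angle (for example, an uncorrected carrier phase)."""
    return [s * cmath.exp(1j * radians) for s in symbols]

def iq_gain_imbalance(symbols, q_gain):
    """Scale only the Q (imaginary) branch, which squashes or stretches the constellation."""
    return [complex(s.real, q_gain * s.imag) for s in symbols]

rotated = phase_offset(QPSK, math.radians(20))
squashed = iq_gain_imbalance(QPSK, 0.7)

for ideal, rot, sq in zip(QPSK, rotated, squashed):
    print(f"ideal {ideal:.3f}   rotated {rot:.3f}   squashed {sq:.3f}")
```

A phase offset is a pure rotation, so it is a single complex multiplication. A gain imbalance treats I and Q differently, so it cannot be written as one complex multiplication.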

See you in Part 2, assuming you survived this math barrage.