There are too many floats now and no one knows what they are. Everyone knows about 64- and 32-bit floats but over the past 10 years all these “new” types have emerged. You start mentioning brain floats to someone and you have lost the audience. If you work in ML or even if you primarily work on serving models for inference you need to understand these data types. Grab hold of your exponents and let’s see where the mantissa takes us.

Classic IEEE 754

First we need to understand what a floating point number is and how CPUs have been using them. In 1985 IEEE established the standard for floating point numbers. All floats comprise three things:

1) Sign bit

2) Exponent

3) Mantissa

Diagram: Floating Point -> Sign Bit (1 bit, 0 = positive, 1 = negative), Exponent Bits (biased exponent for range), Mantissa Bits (fractional precision)

The sign is 1 bit that is either 0 (positive) or 1 (negative).

The exponent controls the range (how small or large) of the number. The more exponent bits, the wider the range.

The mantissa controls the precision of the number. It stores the fractional part. More mantissa bits means you will have a more precise value.

Examples of 64 and 32 bit numbers

Here are what some examples would look like. These are the types of computations that CPUs have been doing for a very long time.

The IEEE 754 formula

IEEE 754 standard states that the value of a binary32 (FP32) number is:

value = (-1)^sign × 2^(exponent - bias) × 1.mantissa

Where:

  • sign = bit 31 (0 = positive, 1 = negative)
  • exponent = the 8-bit unsigned integer stored in bits 30-23
  • bias = 127 for FP32, 1023 for FP64
  • mantissa = the 23 fraction bits (bits 22-0), with an implicit leading 1 for normal numbers

This formula is like scientific notation in binary: (-1)^sign × 1.m × 2^(E-bias).
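To make the formula concrete, here is a minimal Python sketch that pulls the three fields out of a binary32 value and plugs them back into the formula. The helper names `decode_fp32` and `fp32_value` are made up for this post, and the sketch only covers normal numbers (no zeros, subnormals, infinities, or NaNs).

```python
import struct

def decode_fp32(x):
    """Split x (stored as binary32) into its sign, exponent and fraction fields."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]  # the 32 raw bits as an integer
    sign = raw >> 31                 # bit 31
    exponent = (raw >> 23) & 0xFF    # bits 30-23, still biased
    fraction = raw & 0x7FFFFF        # bits 22-0
    return sign, exponent, fraction

def fp32_value(sign, exponent, fraction, bias=127):
    """Apply (-1)^sign x 2^(exponent - bias) x 1.mantissa (normal numbers only)."""
    return (-1) ** sign * 2.0 ** (exponent - bias) * (1 + fraction / 2 ** 23)

sign, exponent, fraction = decode_fp32(0.15625)
print(sign, exponent, fraction)               # 0 124 2097152
print(fp32_value(sign, exponent, fraction))   # 0.15625
```

The fraction field is read as an integer, so dividing by 2^23 turns it back into the ".mantissa" part of the formula.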

FP32 bit layout examples

0.15625 in FP32

sign | exponent (8 bits) | fraction (23 bits)
0 | 01111100 | 01000000000000000000000
bit 31 | bits 30-23 | bits 22-0

= 0.15625

Applying the formula:

  • sign = 0 -> (-1)^0 = +1
  • exponent = 01111100₂ = 124 -> 2^(124 - 127) = 2^(-3) = 1/8
  • mantissa = .0100...0₂ = 1/4 -> 1 + 1/4 = 1.25

(+1) × 1.25 × 1/8 = 0.15625

-2.5 in FP32

sign | exponent (8 bits) | fraction (23 bits)
1 | 10000000 | 01000000000000000000000
bit 31 | bits 30-23 | bits 22-0

= -2.5

Applying the formula:

  • sign = 1 -> (-1)^1 = -1
  • exponent = 10000000₂ = 128 -> 2^(128 - 127) = 2^1 = 2
  • mantissa = .0100...0₂ = 1/4 -> 1 + 1/4 = 1.25

(-1) × 1.25 × 2 = -2.5

FP64 examples

For FP64, the formula is the same but with bias = 1023:

value = (-1)^sign x 2^(exponent - 1023) x 1.mantissa

0.15625 in FP64

sign | exponent (11 bits) | fraction (52 bits)
0 | 01111111100 | 010000...000000
bit 63 | bits 62-52 | bits 51-0

= 0.15625

Applying the formula:

  • sign = 0 -> (-1)^0 = +1
  • exponent = 01111111100₂ = 1020 -> 2^(1020 - 1023) = 2^-3 = 1/8
  • mantissa = .0100...0₂ = 1/4 -> 1 + 1/4 = 1.25

(+1) x 1.25 x 1/8 = 0.15625

-2.5 in FP64

sign | exponent (11 bits) | fraction (52 bits)
1 | 10000000000 | 010000...000000
bit 63 | bits 62-52 | bits 51-0

= -2.5

Applying the formula:

  • sign = 1 -> (-1)^1 = -1
  • exponent = 10000000000₂ = 1024 -> 2^(1024 - 1023) = 2^1 = 2
  • mantissa = .0100...0₂ = 1/4 -> 1 + 1/4 = 1.25

(-1) x 1.25 x 2 = -2.5
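If you want to reproduce these bit layouts yourself, the sketch below goes the other way: it takes a Python float and prints the sign, exponent, and fraction bits for either width. `bit_layout` is a hypothetical helper name, and it leans on the standard `struct` module to get at the raw bits.

```python
import struct

def bit_layout(x, double=False):
    """Return 'sign | exponent | fraction' for x stored as binary32 or binary64."""
    if double:
        raw = struct.unpack(">Q", struct.pack(">d", x))[0]
        total_bits, exp_bits = 64, 11
    else:
        raw = struct.unpack(">I", struct.pack(">f", x))[0]
        total_bits, exp_bits = 32, 8
    bits = format(raw, f"0{total_bits}b")   # zero-padded binary string, MSB first
    return f"{bits[0]} | {bits[1:1 + exp_bits]} | {bits[1 + exp_bits:]}"

print(bit_layout(0.15625))             # 0 | 01111100 | 01000000000000000000000
print(bit_layout(-2.5, double=True))   # same pattern with 11 exponent and 52 fraction bits
```

Try it on a value like 0.1 to see a fraction field that is not mostly zeros, which is where FP32 and FP64 start to disagree.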

Now that we understand this a bit better, we can move on to all the new types that have become so popular with GPUs and ML.

GPUs get all the floats

Let’s break down these different types and when they were released.

Release Dates

| Type | Year | Hardware | Notes |
| --- | --- | --- | --- |
| FP64, FP32 | 1985 | IEEE 754 | the originals |
| FP16 | 2002 | GeForce FX | first GPU float16 (graphics only) |
| FP16 | 2016 | Pascal GP100 | real FP16 ML compute support |
| BF16 | 2017 | Google TPU v2 | ML focused 16-bit format |
| BF16 | 2020 | Ampere / Cooper Lake | widespread GPU/CPU adoption |
| TF32 | 2020 | Ampere A100 | FP32 speedup |
| FP8 | 2022 | Hopper H100 | 8-bit training |
| FP4 | 2024 | Blackwell B200 | 4-bit inference |

Floating Point Breakdowns

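| Type | Sign | Exponent | Mantissa | Total Bits |
| --- | --- | --- | --- | --- |
| FP64 / double | 1 | 11 | 52 | 64 |
| FP32 / float | 1 | 8 | 23 | 32 |
| TF32 | 1 | 8 | 10 | 19 |
| BF16 | 1 | 8 | 7 | 16 |
| FP16 / half | 1 | 5 | 10 | 16 |
| FP8 (E4M3) | 1 | 4 | 3 | 8 |
| FP8 (E5M2) | 1 | 5 | 2 | 8 |
| FP4 | 1 | 2 | 1 | 4 |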
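The exponent and mantissa widths in this table are enough to work out each format's largest finite value and its precision step near 1.0. The sketch below does that arithmetic. `max_finite` and `epsilon` are illustrative names, and the `special` argument is my own shorthand for how a format treats the all-ones exponent: IEEE-style formats reserve it for infinity and NaN, FP8 E4M3 gives up only a single bit pattern for NaN, and FP4 E2M1 has no special values at all.

```python
def max_finite(exp_bits, man_bits, special="ieee"):
    """Largest finite value for a format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    top = 2 ** exp_bits - 1                   # exponent field with all bits set
    if special == "ieee":                     # all-ones exponent means inf/NaN
        max_exp, max_frac = top - 1, 1 - 2 ** -man_bits
    elif special == "nan_only":               # E4M3: only all-ones exponent AND mantissa is NaN
        max_exp, max_frac = top, 1 - 2 * 2 ** -man_bits
    else:                                     # "none": every bit pattern is a finite number
        max_exp, max_frac = top, 1 - 2 ** -man_bits
    return (1 + max_frac) * 2.0 ** (max_exp - bias)

def epsilon(man_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits

FORMATS = [
    ("FP64", 11, 52, "ieee"),
    ("FP32", 8, 23, "ieee"),
    ("TF32", 8, 10, "ieee"),
    ("BF16", 8, 7, "ieee"),
    ("FP16", 5, 10, "ieee"),
    ("FP8 E4M3", 4, 3, "nan_only"),
    ("FP8 E5M2", 5, 2, "ieee"),
    ("FP4 E2M1", 2, 1, "none"),
]

for name, e, m, special in FORMATS:
    print(f"{name:10s} max={max_finite(e, m, special):.4g}  eps={epsilon(m):.3g}")
```

The output makes the trade-off visible: exponent bits set the max column, mantissa bits set the eps column, and BF16 and TF32 share FP32's range because they share its 8 exponent bits.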

Tensor Float 32 (TF32)

Around 2017 Nvidia started making GPUs with Tensor Cores (Volta V100), and with Ampere (A100) in 2020 they added TF32 support. A TF32 value uses 19 bits in total, and the products are accumulated back into an FP32 accumulator. The precision loss is small and rarely matters for training, so the faster training speed is worth it.

Input:  FP32 (23-bit mantissa)
           ↓
Multiply: TF32 (10-bit mantissa)
           ↓
Accumulate: FP32 (23-bit mantissa)

The way this works is that precision is dropped to 10 mantissa bits for the multiply, but adding the products together is done with the full 23-bit mantissa. The GPU has dedicated tensor cores that are optimized for these calculations. Below is a more complete example.

1) 2.5 and 4.5 enter as FP32 (32-bit fields on the wire)

sign | exponent (8 bits) | kept (10-bit mantissa) | ignored (13 bits)
2.5: 0 | 10000000 | 0100000000 | 0000000000000
4.5: 0 | 10000001 | 0010000000 | 0000000000000

2) Multiply reads only the kept bits (sign, exponent, and the first 10 mantissa bits)
  TF32 multiply: 2.5 × 4.5 = 11.25
3) Each K-step feeds one product into the FP32 accumulator (hardware adder on die):

| K-step | Multiply | Partial product | acc before | acc after | Precision |
| --- | --- | --- | --- | --- | --- |
| 0 | 2.5×4.5 | 11.25 | 0.0 | 11.25 | FP32 |
| 1 | 3.1×2.2 | 6.82 | 11.25 | 18.07 | FP32 |
| 2 | 1.7×0.9 | 1.53 | 18.07 | 19.60 | FP32 |
| 15 | 0.4×8.1 | 3.24 | ... | final dot product | FP32 |
Since the accumulator keeps full FP32 precision during addition, the precision loss stays limited. For scientific computing this would not be a good idea, but for ML training the extra precision is rarely worth the cost, especially for larger multi-billion param models.
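Here is a rough software emulation of that multiply-then-accumulate pattern, just to see the numbers. It is a sketch, not how the hardware is implemented: `to_tf32`, `to_fp32`, and `tf32_dot` are made-up helper names, and I am assuming the low 13 mantissa bits are simply dropped (truncation), where real hardware or libraries may round instead.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_tf32(x):
    """Keep sign, 8 exponent bits and the top 10 mantissa bits; zero the low 13."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]

def tf32_dot(a, b):
    """Dot product with TF32 multiplies and an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_tf32(x) * to_tf32(y)   # reduced-precision inputs
        acc = to_fp32(acc + product)        # accumulate at FP32 precision
    return acc

a = [2.5, 3.1, 1.7]
b = [4.5, 2.2, 0.9]
print(tf32_dot(a, b))                      # TF32 result
print(sum(x * y for x, y in zip(a, b)))    # full-precision reference
```

Values like 2.5 and 4.5 fit in 10 mantissa bits and pass through unchanged, while 3.1 or 2.2 pick up a small error at the multiply, which is the entire cost of TF32.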

Brain Float 16 (BF16)

BF16 and FP16 have exactly the same number of bits, so what gives? BF16 covers a greater range than FP16 but with less precision: BF16 has 8 exponent bits and 7 mantissa bits, while FP16 has 5 exponent bits and 10 mantissa bits.

The primary reason for BF16 was to make training easier by reducing the need to apply loss scaling when training large models (billions of params).

To protect FP16 from underflow and overflow during backprop you need to scale the values. In large models the logits can get big enough to overflow the range of FP16. The other issue is vanishing gradients: tiny gradient values underflow to zero in FP16. So you need to scale the loss, which looks something like this:

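S = 2.0 ** 16                       # loss scale (65536 is GradScaler's default starting value)
loss_scaled = loss * S              # one scalar multiply, cheap
loss_scaled.backward()              # backprop runs: ALL gradients are now ×S in size
# FP16 can represent them because they're no longer tiny

# BEFORE optimizer.step(): unscale the gradients
for param in model.parameters():
    if param.grad is not None:
        param.grad /= S             # done in FP32; exact because S is a power of two

optimizer.step()                    # sees correct-magnitude gradients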
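You can watch the underflow happen without a GPU. The sketch below uses Python's `struct` module, whose `"e"` format is IEEE half precision, to round values to FP16. The gradient value and the scale are hypothetical numbers picked for illustration. One difference from real hardware: Python raises `OverflowError` where a GPU would quietly produce infinity.

```python
import struct

def to_fp16(x):
    """Round x to the nearest FP16 value (raises OverflowError above the FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

S = 2.0 ** 16          # loss scale, a power of two so scaling itself is exact
tiny_grad = 1e-8       # hypothetical gradient, below the smallest FP16 subnormal

unscaled = to_fp16(tiny_grad)        # underflows: the gradient is lost
scaled = to_fp16(tiny_grad * S)      # about 6.55e-4, comfortably inside FP16
recovered = scaled / S               # unscale in higher precision

print(unscaled)                      # 0.0
print(recovered)                     # close to 1e-8 again

try:
    to_fp16(70000.0)                 # above the FP16 maximum of 65504
except OverflowError:
    print("overflow")                # too large a scale pushes values out the top
```

That last case is why picking S is a balancing act: large enough to rescue small gradients, small enough that the largest ones still fit.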

In PyTorch, GradScaler handles this for you:

import torch

scaler = torch.amp.GradScaler("cuda")   # manages S automatically

for batch in dataloader:
    optimizer.zero_grad()              # clear gradients from the previous step
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        loss = model(batch)            # forward in FP16 (model is assumed to return the loss)

    scaler.scale(loss).backward()      # loss × S, then backprop (grads are ×S)
    scaler.unscale_(optimizer)         # divides all grads by S in FP32
    scaler.step(optimizer)             # optimizer sees correct magnitudes; step is skipped on inf/NaN grads
    scaler.update()                    # adjusts S up or down for next step

The problem is that it's sometimes very hard to find the right value to scale by. Many times you can solve overflow but not underflow. This is where BF16 comes to the rescue.

BF16 has the same range as FP32 but with less precision. This is well worth the trade-off since we no longer have to apply loss scaling and deal with these issues. Our gradients can flow without collapsing to zero or blowing up to infinity.
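A quick way to feel the range versus precision trade-off is to round the same numbers to both 16-bit formats. This is a sketch under a couple of assumptions: `to_bf16` is a hypothetical helper that builds BF16 by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even, and it does not handle NaN.

```python
import struct

def to_bf16(x):
    """Round x to BF16 by keeping the top 16 bits of its FP32 encoding (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)       # round to nearest, ties to even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x):
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Range: a tiny gradient-sized value
print(to_fp16(1e-8), to_bf16(1e-8))     # FP16 underflows to 0.0, BF16 keeps a value near 1e-8

# Precision: a small step away from 1.0
print(to_fp16(1.001), to_bf16(1.001))   # 1.0009765625 1.0
```

FP16 loses the tiny value entirely, while BF16 loses the small difference from 1.0. For gradients, the first kind of loss is usually the one that hurts training.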

There are some cases where you may prefer the extra precision of FP16, such as after pre-training and supervised training. By then the weights are more stable, so if you want to do reinforcement learning you may want to switch to FP16 for the extra precision. This is not always true, and many people have used BF16 for RL tuning.

Floating Point 8 (FP8) and Floating Point 4 (FP4)

FP8 comes in two versions:

  • E4M3 (4 exponent, 3 mantissa)
  • E5M2 (5 exponent, 2 mantissa)

Both are needed because activations and gradients have different requirements:

  • E4M3 (range ±448): More precision for weights and activations (forward pass).
  • E5M2 (range ±57344): More range for gradients (backward pass). Gradients span many orders of magnitude and need the wider range to avoid overflow.

FP8 E4M3 (1 sign | 4 exponent | 3 mantissa): weights and activations, max value 448
S | E E E E | M M M
bit 7 | bits 6-3 | bits 2-0

FP8 E5M2 (1 sign | 5 exponent | 2 mantissa): gradients, max value 57344
S | E E E E E | M M
bit 7 | bits 6-2 | bits 1-0
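The two maximum values fall straight out of the bit layouts. The sketch below decodes an 8-bit pattern for either variant so you can check them. `decode_minifloat` is a made-up helper, it assumes the usual bias of 2^(exponent bits - 1) - 1, and it ignores the infinity/NaN encodings, so only feed it finite patterns.

```python
def decode_minifloat(byte, exp_bits, man_bits):
    """Decode a small-float bit pattern into a Python float (no inf/NaN handling)."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if (byte >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (byte >> man_bits) & (2 ** exp_bits - 1)
    mantissa = byte & (2 ** man_bits - 1)
    if exponent == 0:
        # subnormal: no implicit leading 1, exponent pinned at 1 - bias
        return sign * (mantissa / 2 ** man_bits) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / 2 ** man_bits) * 2.0 ** (exponent - bias)

print(decode_minifloat(0b0_1111_110, 4, 3))    # 448.0, largest E4M3 value
print(decode_minifloat(0b0_11110_11, 5, 2))    # 57344.0, largest E5M2 value
print(decode_minifloat(0b0_0000_001, 4, 3))    # 0.001953125, smallest positive E4M3 value
```

E5M2 stops at exponent 11110 because it keeps the all-ones exponent for infinity and NaN, while E4M3 uses 1111 for ordinary numbers and gives up only the 1111 111 pattern.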

The H100 introduced FP8 tensor cores. Training with FP8 follows the same mixed-precision pattern as TF32:

FP8 mixed-precision training flow

Forward: master weights (FP32) -> cast + scale (FP8 E4M3) -> activations (FP8 E4M3) × weights (FP8 E4M3) -> matmul (FP8 multiply) -> accumulate (FP32) -> loss (FP32)
Backward: gradients (FP8 E5M2) -> dequantize (FP32) -> optimizer (FP32) -> update (FP32 master)

scale = FP8_max / tensor_amax  •  dequantize = result / scale  •  forward uses E4M3  •  backward uses E5M2

The major new piece that allows this to work is per-tensor scaling. You need to find a scale value for each tensor in order to fit it into the range of FP8:

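import torch

# per-tensor scaling for FP8
x = torch.randn(4, 4)                          # example tensor standing in for real activations
amax = x.abs().max()                           # largest magnitude in the tensor
scale = 448.0 / amax                           # map max → FP8 max (448 for E4M3)
x_fp8 = (x * scale).to(torch.float8_e4m3fn)    # quantize to FP8
x_restored = x_fp8.to(torch.float32) / scale   # dequantize: divide the scale back out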

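The scale factor is stored alongside the FP8 tensor so the next layer knows how to interpret it. PyTorch itself only exposes the FP8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2); libraries such as NVIDIA Transformer Engine and torchao handle the scaling bookkeeping for you.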

With newer NVIDIA cards like H100 you can actually train in FP8. You don’t even need to worry about the loss scaler since all tensors get scaled properly.

Floating Point 4 (FP4)

FP4 is experimental. It can really save you memory, but in my experience some models start to show degradation at this low precision. If the model was trained using quantization-aware training then it can still be pretty good. I hope we continue to research and try new things at these lower precision data types.

Here is the standard format:

E2M1 (2 exponent, 1 mantissa). With 4 bits total, you can represent exactly 16 values:

FP4 E2M1 (1 sign | 2 exponent | 1 mantissa)
S | E E | M
bit 3 | bits 2-1 | bit 0

All 16 representable values:

| Bits | Value | Bits | Value |
| --- | --- | --- | --- |
| 0 00 0 | +0.0 | 1 00 0 | −0.0 |
| 0 00 1 | +0.5 | 1 00 1 | −0.5 |
| 0 01 0 | +1.0 | 1 01 0 | −1.0 |
| 0 01 1 | +1.5 | 1 01 1 | −1.5 |
| 0 10 0 | +2.0 | 1 10 0 | −2.0 |
| 0 10 1 | +3.0 | 1 10 1 | −3.0 |
| 0 11 0 | +4.0 | 1 11 0 | −4.0 |
| 0 11 1 | +6.0 | 1 11 1 | −6.0 |

So can you train in FP4? Not really or at least not yet. We lose so much precision that when you compute gradients you will run into too much underflow/overflow. This mode is meant to be used for inference only at the moment.

The model is trained in BF16/FP16 and then quantized and scaled to 4-bit for inference.
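To get a feel for what "quantized and scaled to 4-bit" does to a set of weights, here is a toy sketch that snaps values onto the E2M1 grid using a single absmax scale. The function names and the weight values are invented for illustration, and a single scale for the whole list is a simplification: real FP4 deployments typically scale small blocks of values separately.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # magnitudes representable in E2M1

def quantize_fp4(values):
    """Scale values so the largest magnitude maps to 6.0, then snap each to the FP4 grid."""
    amax = max(abs(v) for v in values)
    scale = 6.0 / amax if amax > 0 else 1.0
    quantized = []
    for v in values:
        # nearest representable magnitude, sign handled separately
        magnitude = min(FP4_GRID, key=lambda g: abs(g - abs(v) * scale))
        quantized.append(magnitude if v >= 0 else -magnitude)
    return quantized, scale

def dequantize_fp4(quantized, scale):
    """Undo the scaling to get approximate original values back."""
    return [q / scale for q in quantized]

weights = [0.25, -0.5, 1.0, -1.5, 0.6]       # hypothetical weights
q, scale = quantize_fp4(weights)
print(q)                                     # [1.0, -2.0, 4.0, -6.0, 2.0]
print(dequantize_fp4(q, scale))              # [0.25, -0.5, 1.0, -1.5, 0.5]
```

Four of the five weights happen to land exactly on the grid, but 0.6 comes back as 0.5. With only eight magnitudes to choose from, that kind of error is everywhere in a real model, which is why FP4 is easier to live with at inference than during training.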

There is some research on FP4 training that uses similar techniques to FP8. In a practical sense it is still too unstable for adoption.

Below are some papers about attempting training at FP4:

I think we will have large adoption of FP4 and other smaller precisions in the future.

Conclusion

| Format | Best for | Hardware | Range | Notes |
| --- | --- | --- | --- | --- |
| FP64 | Scientific computing, numerical simulation | CPUs, HPC GPUs (A100, H100, B200) | ±1.8 × 10³⁰⁸ | Overkill for ML: use only when double precision is required |
| FP32 | Master weights, loss scaling, accumulator | All hardware | ±3.4 × 10³⁸ | The reference precision. Everything casts down from here |
| TF32 | Training throughput on NVIDIA Tensor Cores | Ampere+ (A100, H100, B200) | ±3.4 × 10³⁸ | Transparent drop-in for FP32 in PyTorch: free speed on compatible GPUs |
| BF16 | Training large models (7B+) | TPU v2+, Ampere+ GPUs | ±3.4 × 10³⁸ | Best range/precision trade-off for training. No loss scaling needed |
| FP16 | Fine-tuning, inference, smaller models | Most GPUs (Pascal+) | ±65,504 | More precision than BF16 but less range: requires GradScaler for training |
| FP8 (E4M3) | Forward pass, inference quantization | Hopper+ (H100, B200) | ±448 | Weights and activations during FP8 training |
| FP8 (E5M2) | Gradients during FP8 training | Hopper+ (H100, B200) | ±57,344 | Extra range needed for backprop |
| FP4 | Inference quantization | Blackwell B200 (experimental) | ±6 | Training still research-only. Use for memory-bound inference |

Now we have a much better understanding of how these data types work and how they can help ML training. GPUs keep improving and the hardware can be pretty complicated.

Keep these data types in mind when you have to purchase hardware for yourself or your company. You want hardware that supports your training/inference needs. Good luck and maybe try some FP4 training on your own.