Explaining Floats: TF32, BF16, FP8, FP4, Huh?
There are too many floats now and no one knows what they are. Everyone knows about 64- and 32-bit floats, but over the past 10 years all these “new” types have emerged. Start mentioning brain floats to someone and you have lost the audience. If you work in ML, even if you primarily work on serving models for inference, you need to understand these data types. Grab hold of your exponents and let’s see where the mantissa takes us.
Classic IEEE 754
Before looking at the new formats, we need to understand what a floating-point number is and how CPUs have been using them. In 1985 the IEEE established the standard for floating-point numbers. All floats comprise three things:
1) Sign bit
2) Exponent
3) Mantissa
The sign is 1 bit that is either 0 (positive) or 1 (negative).
The exponent controls the range (how small or large) of the number. The more exponent bits, the wider the range.
The mantissa controls the precision of the number. It stores the fractional part. More mantissa bits mean a more precise value.
Examples of 64 and 32 bit numbers
Here is what some examples look like. These are the kinds of computations that CPUs have been doing for a very long time.
The IEEE 754 formula
The IEEE 754 standard states that the value of a normal binary32 (FP32) number is:
value = (-1)^sign × 2^(exponent - bias) × 1.mantissa
Where:
- sign = bit 31 (0 = positive, 1 = negative)
- exponent = the 8-bit unsigned integer stored in bits 30-23
- bias = 127 for FP32, 1023 for FP64
- mantissa = the 23 fraction bits (bits 22-0), with an implicit leading 1 for normal numbers
This formula is like scientific notation in binary: (-1)^sign × 1.m × 2^(E-bias).
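To make the formula concrete, here is a minimal Python sketch that pulls the three fields out of an FP32 bit pattern and rebuilds the value from them. The helper name `decode_fp32` is made up for this post, and the sketch only handles normal numbers (it ignores zero, subnormals, infinity and NaN).

```python
import struct

def decode_fp32(x: float) -> dict:
    """Split a Python float (rounded to FP32) into its IEEE 754 binary32 fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23, stored with bias 127
    fraction = bits & 0x7FFFFF        # bits 22-0
    # Formula for normal numbers only (exponent field not 0 and not 255).
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2**23)
    return {"sign": sign, "exponent": exponent, "fraction": fraction, "value": value}

print(decode_fp32(0.15625))
print(decode_fp32(-2.5))
```

For 0.15625 the exponent field comes out as 124, and for -2.5 it comes out as 128 with the sign bit set.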
FP32 bit layout examples
0.15625 in FP32
0 | 01111100 | 01000000000000000000000
| Field | Bit positions | Bits |
|---|---|---|
| sign | 31 | 0 |
| exponent (8 bits) | 30-23 | 01111100 |
| fraction (23 bits) | 22-0 | 01000000000000000000000 |
= 0.15625
Applying the formula:
sign = 0 → (-1)^0 = +1
exponent = 01111100₂ = 124 → 2^(124 - 127) = 2^(-3) = 1/8
mantissa = .0100...0₂ = 1/4 → 1 + 1/4 = 1.25
(+1) × 1.25 × 1/8 = 0.15625
-2.5 in FP32
1 | 10000000 | 01000000000000000000000
| Field | Bit positions | Bits |
|---|---|---|
| sign | 31 | 1 |
| exponent (8 bits) | 30-23 | 10000000 |
| fraction (23 bits) | 22-0 | 01000000000000000000000 |
= -2.5
Applying the formula:
sign = 1 → (-1)^1 = -1
exponent = 10000000₂ = 128 → 2^(128 - 127) = 2^1 = 2
mantissa = .0100...0₂ = 1/4 → 1 + 1/4 = 1.25
(-1) × 1.25 × 2 = -2.5
FP64 examples
For FP64, the formula is the same but with bias = 1023:
value = (-1)^sign × 2^(exponent - 1023) × 1.mantissa
0.15625 in FP64
0 | 01111111100 | 010000...000000
| Field | Bit positions | Bits |
|---|---|---|
| sign | 63 | 0 |
| exponent (11 bits) | 62-52 | 01111111100 |
| fraction (52 bits) | 51-0 | 010000...000000 |
= 0.15625
Applying the formula:
sign = 0 → (-1)^0 = +1
exponent = 01111111100₂ = 1020 → 2^(1020 - 1023) = 2^(-3) = 1/8
mantissa = .0100...0₂ = 1/4 → 1 + 1/4 = 1.25
(+1) × 1.25 × 1/8 = 0.15625
-2.5 in FP64
1 | 10000000000 | 010000...000000
| Field | Bit positions | Bits |
|---|---|---|
| sign | 63 | 1 |
| exponent (11 bits) | 62-52 | 10000000000 |
| fraction (52 bits) | 51-0 | 010000...000000 |
= -2.5
Applying the formula:
sign = 1 → (-1)^1 = -1
exponent = 10000000000₂ = 1024 → 2^(1024 - 1023) = 2^1 = 2
mantissa = .0100...0₂ = 1/4 → 1 + 1/4 = 1.25
(-1) × 1.25 × 2 = -2.5
Now that we understand this a bit better, we can move on to the new types that have become so popular with GPUs and ML.
GPUs get all the floats
Let’s break down these different types and when they were released.
Release Dates
| Type | Year | Hardware | Notes |
|---|---|---|---|
| FP64, FP32 | 1985 | IEEE 754 | the originals |
| FP16 | 2002 | GeForce FX | first GPU float16 (graphics only) |
| FP16 | 2016 | Pascal GP100 | real FP16 ML compute support |
| BF16 | 2017 | Google TPU v2 | ML focused 16-bit format |
| BF16 | 2020 | Ampere / Cooper Lake | widespread GPU/CPU adoption |
| TF32 | 2020 | Ampere A100 | FP32 speedup |
| FP8 | 2022 | Hopper H100 | 8-bit training |
| FP4 | 2024 | Blackwell B200 | 4-bit inference |
Floating Point Breakdowns
| Type | Sign | Exponent | Mantissa | Total Bits |
|---|---|---|---|---|
| FP64 / double | 1 | 11 | 52 | 64 |
| FP32 / float | 1 | 8 | 23 | 32 |
| TF32 | 1 | 8 | 10 | 19 |
| BF16 | 1 | 8 | 7 | 16 |
| FP16 / half | 1 | 5 | 10 | 16 |
| FP8 (E4M3) | 1 | 4 | 3 | 8 |
| FP8 (E5M2) | 1 | 5 | 2 | 8 |
| FP4 | 1 | 2 | 1 | 4 |
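The exponent and mantissa widths in this table are enough to estimate each format's range and precision. The sketch below is a rough illustration, not a reference implementation: the function name `ieee_like_limits` is invented here, and it assumes the all-ones exponent is reserved for infinity and NaN. That assumption holds for FP64, FP32, TF32, BF16, FP16 and FP8 E5M2, but not for FP8 E4M3 or FP4, which reclaim those encodings for extra values.

```python
def ieee_like_limits(exp_bits: int, man_bits: int) -> tuple[float, float]:
    """Largest finite value and the gap just above 1.0 for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias       # largest non-reserved exponent
    max_value = (2 - 2.0 ** -man_bits) * 2.0 ** max_exponent
    epsilon = 2.0 ** -man_bits                      # gap between 1.0 and the next value
    return max_value, epsilon

formats = {
    "FP64": (11, 52),
    "FP32": (8, 23),
    "TF32": (8, 10),
    "BF16": (8, 7),
    "FP16": (5, 10),
    "FP8 E5M2": (5, 2),
}
for name, (e, m) in formats.items():
    max_value, eps = ieee_like_limits(e, m)
    print(f"{name:9s} max={max_value:.4g}  epsilon={eps:.3g}")
```

The exponent width sets the maximum, which is why FP32, TF32 and BF16 share nearly the same range while their epsilon values differ by orders of magnitude.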
Tensor Float 32 (TF32)
Around 2017 NVIDIA started shipping GPUs with Tensor Cores (Volta V100), and with Ampere (A100) in 2020 they added TF32 support. A TF32 value uses 19 bits in total: it keeps FP32's 8-bit exponent but cuts the mantissa to 10 bits, and the products are accumulated back into a regular FP32 value. The precision lost is small and rarely matters for training, so the faster training speed is usually worth it.
Input: FP32 (23-bit mantissa)
↓
Multiply: TF32 (10-bit mantissa)
↓
Accumulate: FP32 (23-bit mantissa)
The way this works is that the inputs are reduced to a 10-bit mantissa for the multiply, but the products are added together at the full 23-bit mantissa. The GPU's Tensor Cores are built and optimized for exactly this kind of calculation. Below is a more complete example.
[Figure: the two input vectors of a dot product; their first elements are 2.5 and 4.5.]
| K-step | Multiply | Partial Product | acc before | acc after | Precision |
|---|---|---|---|---|---|
| 0 | 2.5×4.5 | 11.25 | 0.0 | 11.25 | FP32 |
| 1 | 3.1×2.2 | 6.82 | 11.25 | 18.07 | FP32 |
| 2 | 1.7×0.9 | 1.53 | 18.07 | 19.60 | FP32 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | |
| 15 | 0.4×8.1 | 3.24 | ... | final dot product | FP32 |
Since the accumulator keeps full FP32 precision during addition, the precision loss is limited to the multiply. For scientific computing this would not be a good idea, but for training larger multi-billion-parameter models the extra multiply precision is rarely worth the cost.
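The multiply-then-accumulate pattern can be imitated on a CPU. This is only a sketch under stated assumptions: `tf32` and `tf32_dot` are names invented for this post, and the mantissa is truncated by masking bits, which is a simplification of whatever rounding real Tensor Cores apply.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def tf32(x: float) -> float:
    """Keep 10 of FP32's 23 mantissa bits (truncation, assumed for simplicity)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFFE000                        # clear the low 13 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def tf32_dot(a: list[float], b: list[float]) -> float:
    """Dot product with TF32-precision inputs and an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = tf32(x) * tf32(y)           # multiply on reduced-precision inputs
        acc = f32(acc + product)              # accumulate at FP32 precision
    return acc

a = [2.5, 3.1, 1.7]
b = [4.5, 2.2, 0.9]
print(tf32_dot(a, b))                         # close to, but not exactly, 19.6
print(sum(x * y for x, y in zip(a, b)))       # full-precision reference
```

Values like 2.5 and 4.5 fit in 10 mantissa bits and pass through unchanged, while 3.1 or 2.2 lose a little precision before the multiply.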
Brain Float 16 (BF16)
BF16 and FP16 use exactly the same number of bits, so what gives? BF16 has a wider range than FP16 but less precision: BF16 has 8 exponent bits and 7 mantissa bits, while FP16 has 5 exponent bits and 10 mantissa bits.
The primary reason for BF16 was to make training easier by reducing the need to apply loss scaling when training large models (billions of params).
To protect FP16 from underflow and overflow during backprop you need to scale the values. For large models the logits can get very large and overflow the range of FP16, while very small gradients can underflow to zero, which makes the vanishing gradient problem worse. So you scale the loss, which looks something like this:
loss_scaled = loss * S        # one scalar multiply, cheap
loss_scaled.backward()        # backprop runs: ALL gradients are now ×S in size
# FP16 can represent them because they're no longer tiny
# BEFORE optimizer.step():
for p in model.parameters():
    if p.grad is not None:
        p.grad /= S           # done in FP32; exact when S is a power of two
optimizer.step()              # sees correct-magnitude gradients
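The underflow that loss scaling guards against is easy to reproduce with the standard library alone. In this sketch the scale `S = 2**16` and the gradient value are hypothetical numbers chosen for illustration, and `to_fp16` is a helper invented here that round-trips a value through the `struct` module's half-precision format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value using struct's 'e' (half precision) format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

S = 2.0 ** 16                  # hypothetical loss scale
grad = 1e-8                    # smaller than FP16's smallest subnormal (about 6e-8)

unscaled = to_fp16(grad)       # flushes to 0.0: the gradient is lost
scaled = to_fp16(grad * S)     # about 6.55e-4, comfortably inside FP16's range
recovered = scaled / S         # unscale in higher precision

print(unscaled)                # the gradient has vanished
print(recovered)               # close to the original 1e-8
```

Because `S` is a power of two, the scaling itself adds no rounding error; the only error left is FP16's own rounding of the scaled value.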
In PyTorch this is handled by GradScaler:
import torch

scaler = torch.amp.GradScaler("cuda")  # manages S automatically
for batch in dataloader:
    optimizer.zero_grad()              # clear gradients from the previous step
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch)            # forward in FP16
    scaler.scale(loss).backward()      # loss × S, then backprop (grads are ×S)
    scaler.unscale_(optimizer)         # divides all grads by S in FP32
    scaler.step(optimizer)             # optimizer sees correct magnitudes; step is skipped on inf/NaN grads
    scaler.update()                    # adjusts S up or down for next step
The problem with this is that it is sometimes very hard to find the right value to scale by. Often you can fix overflow but not underflow. This is where BF16 comes to the rescue.
BF16 has essentially the same range as FP32 but with less precision. This is well worth the trade-off, since we no longer have to apply loss scaling and deal with these issues, and gradients are far less likely to underflow or overflow.
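The range-versus-precision trade can be seen by pushing the same two numbers through both 16-bit formats. This is a standard-library sketch with invented helper names: `to_bf16` rounds an FP32 bit pattern to its top 16 bits (round to nearest even, NaN inputs not handled), and `to_fp16` uses `struct`'s half-precision format, mapping its overflow error to infinity.

```python
import struct

def to_bf16(x: float) -> float:
    """Round an FP32 value to BF16 by keeping its top 16 bits (round to nearest even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)       # rounding bias before truncation
    bits &= 0xFFFF0000                        # drop the low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_fp16(x: float) -> float:
    """Round to FP16; values too large for the format become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:                     # struct refuses to pack out-of-range values
        return float("inf") if x > 0 else float("-inf")

for x in (70000.0, 1.001):
    print(x, "-> BF16:", to_bf16(x), " FP16:", to_fp16(x))
```

70000 survives in BF16 (coarsely, as 70144) but overflows FP16, while 1.001 keeps some of its fraction in FP16 and collapses to 1.0 in BF16.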
There are some cases where you may prefer the extra precision of FP16, such as after pre-training and supervised training, when the weights are more stable. For example, if you want to do reinforcement learning you may want to switch to FP16 for the extra precision. This is not always true, and many people have used BF16 for RL tuning.
Floating Point 8 (FP8) and Floating Point 4 (FP4)
FP8 comes in two versions:
- E4M3 (4 exponent, 3 mantissa)
- E5M2 (5 exponent, 2 mantissa)
Both are needed because activations and gradients have different requirements:
- E4M3 (range ±448): More precision for weights and activations (forward pass).
- E5M2 (range ±57344): More range for gradients (backward pass). Gradients span many orders of magnitude and need the wider range to avoid overflow.
| Field | sign | exponent (4 bits) | mantissa (3 bits) |
|---|---|---|---|
| Bits | S | E E E E | M M M |
| Bit positions | 7 | 6-3 | 2-0 |
| Field | sign | exponent (5 bits) | mantissa (2 bits) |
|---|---|---|---|
| Bits | S | E E E E E | M M |
| Bit positions | 7 | 6-2 | 1-0 |
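The two layouts can be decoded by hand to check the ranges quoted above. The sketch below assumes the encodings commonly used for these formats (they are not spelled out in this post): E4M3 with bias 7, no infinities, and only the all-ones pattern reserved for NaN; E5M2 with bias 15 and IEEE-style infinity and NaN. The function names are invented for illustration.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (assumed: bias 7, no inf, all-ones pattern is NaN)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:                                   # subnormal: no implicit leading 1
        return sign * (man / 8) * 2.0 ** -6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

def decode_e5m2(byte: int) -> float:
    """Decode one FP8 E5M2 byte (assumed: bias 15, all-ones exponent is inf/NaN)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    man = byte & 0x3
    if exp == 0x1F:
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:                                   # subnormal
        return sign * (man / 4) * 2.0 ** -14
    return sign * (1 + man / 4) * 2.0 ** (exp - 15)

print(decode_e4m3(0b0_1111_110))   # largest finite E4M3 value
print(decode_e5m2(0b0_11110_11))   # largest finite E5M2 value
```

The largest finite patterns decode to 448 and 57344, matching the ranges listed for the two formats.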
The H100 introduced FP8 Tensor Cores. Training with FP8 follows the same mixed-precision pattern as TF32: multiply in the low-precision format, accumulate in a higher-precision one.
The major new piece that allows this to work is per-tensor scaling. You need to find a scale value for each tensor in order to fit its values into the range of FP8:
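import torch

# per-tensor scaling for FP8 (x is any floating-point tensor)
amax = x.abs().max().clamp(min=1e-12)         # largest magnitude in the tensor (guards against all zeros)
scale = 448.0 / amax                          # map max → FP8 max (448 for E4M3)
scaled = (x * scale).to(torch.float8_e4m3fn)  # quantize to FP8; keep scale to dequantize later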
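The scale factor is stored alongside the FP8 tensor so the next layer knows how to interpret it. PyTorch exposes the FP8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2), and libraries such as NVIDIA Transformer Engine or torchao manage the scale factors for you.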
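To see why the scale matters, here is a pure-Python imitation of quantizing a tensor of small values to E4M3 with and without per-tensor scaling. It is a sketch, not how any library does it: the grid is built from the assumed E4M3 encoding (bias 7, top pattern reserved for NaN), `quantize` is an invented helper that does a brute-force nearest-value search, and the tensor values are made up.

```python
def e4m3_values() -> list[float]:
    """All non-negative finite E4M3 values (assumed: bias 7, top pattern is NaN)."""
    vals = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue                           # NaN pattern
            if exp == 0:
                vals.append((man / 8) * 2.0 ** -6)
            else:
                vals.append((1 + man / 8) * 2.0 ** (exp - 7))
    return vals

GRID = e4m3_values()

def quantize(x: float) -> float:
    """Round to the nearest E4M3 value, saturating at ±448."""
    mag = min(abs(x), 448.0)
    nearest = min(GRID, key=lambda v: abs(v - mag))
    return nearest if x >= 0 else -nearest

tensor = [0.0004, -0.0008, 0.0005, 0.0002]         # hypothetical small activations
amax = max(abs(v) for v in tensor)
scale = 448.0 / amax                               # same rule as the snippet above

naive = [quantize(v) for v in tensor]                    # no scaling
scaled = [quantize(v * scale) / scale for v in tensor]   # quantize, then dequantize

print(naive)    # every entry collapses to zero
print(scaled)   # each entry is within a few percent of the original
```

Without the scale, every value sits below half of E4M3's smallest step and rounds to zero; with it, the same values land on the well-populated part of the grid.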
With newer NVIDIA GPUs like the H100 you can actually train in FP8. You generally don’t need a separate loss scaler, since every tensor is already scaled into range.
Floating Point 4 (FP4)
FP4 is experimental. It can save you a lot of memory, but in my experience some models start to show degradation at this low precision. If the model was trained with quantization-aware training, it can still be pretty good. I hope we continue to research and try new things at these lower-precision data types.
Here is the standard format:
E2M1 (2 exponent, 1 mantissa). With 4 bits total there are exactly 16 bit patterns, which encode 15 distinct values because +0.0 and −0.0 are both present:
| Field | sign | exponent (2 bits) | mantissa (1 bit) |
|---|---|---|---|
| Bits | S | E E | M |
| Bit positions | 3 | 2-1 | 0 |
| Bits | Value | Bits | Value |
|---|---|---|---|
| 0 00 0 | +0.0 | 1 00 0 | −0.0 |
| 0 00 1 | +0.5 | 1 00 1 | −0.5 |
| 0 01 0 | +1.0 | 1 01 0 | −1.0 |
| 0 01 1 | +1.5 | 1 01 1 | −1.5 |
| 0 10 0 | +2.0 | 1 10 0 | −2.0 |
| 0 10 1 | +3.0 | 1 10 1 | −3.0 |
| 0 11 0 | +4.0 | 1 11 0 | −4.0 |
| 0 11 1 | +6.0 | 1 11 1 | −6.0 |
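The table can be regenerated from the same sign/exponent/mantissa formula used for the larger formats. The sketch assumes an exponent bias of 1 and no infinity or NaN encodings, which is what reproduces the values listed here; `decode_e2m1` is a name invented for this post.

```python
def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 pattern: 1 sign, 2 exponent (assumed bias 1), 1 mantissa bit."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    exp = (nibble >> 1) & 0b11
    man = nibble & 0b1
    if exp == 0:                                   # subnormal: 0 or 0.5
        return sign * (man / 2)
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)

for nibble in range(16):
    print(f"{nibble:04b} -> {decode_e2m1(nibble):+.1f}")
```

Note how the gap between neighbouring values doubles as the exponent grows: 0.5 between the small values, 1 between 2 and 3, and 2 between 4 and 6.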
So can you train in FP4? Not really, or at least not yet. We lose so much precision that computing gradients runs into too much underflow and overflow. For now this format is meant for inference only.
The model is trained in BF16/FP16 and then quantized and scaled to 4-bit for inference.
There is some research on FP4 training that uses techniques similar to FP8. In a practical sense it is still too unstable for adoption.
Below are some papers about attempting training at FP4:
- FP4 All the Way (Chmiel et al., 2025)
- Quartet (Castro et al., 2025)
- Pretraining LLMs with NVFP4 (NVIDIA, 2025)
I think we will have large adoption of FP4 and other smaller precisions in the future.
Conclusion
| Format | Best for | Hardware | Range | Notes |
|---|---|---|---|---|
| FP64 | Scientific computing, numerical simulation | CPUs, HPC GPUs (A100, H100, B200) | ±1.8 × 10³⁰⁸ | Overkill for ML: use only when double precision is required |
| FP32 | Master weights, loss scaling, accumulator | All hardware | ±3.4 × 10³⁸ | The reference precision. Everything casts down from here |
| TF32 | Training throughput on NVIDIA Tensor Cores | Ampere+ (A100, H100, B200) | ±3.4 × 10³⁸ | Near drop-in for FP32 in PyTorch; matmuls may need an opt-in such as torch.set_float32_matmul_precision("high") |
| BF16 | Training large models (7B+) | TPU v2+, Ampere+ GPUs | ±3.4 × 10³⁸ | Best range/precision trade-off for training. No loss scaling needed |
| FP16 | Fine-tuning, inference, smaller models | Most GPUs (Pascal+) | ±65,504 | More precision than BF16 but less range: requires GradScaler for training |
| FP8 (E4M3) | Forward pass, inference quantization | Hopper+ (H100, B200) | ±448 | Weights and activations during FP8 training |
| FP8 (E5M2) | Gradients during FP8 training | Hopper+ (H100, B200) | ±57,344 | Extra range needed for backprop |
| FP4 | Inference quantization | Blackwell B200 (experimental) | ±6 | Training still research-only. Use for memory-bound inference |
Now we have a much better understanding of how these data types work and how they can help ML training. GPUs keep improving and the hardware can be pretty complicated.
Keep these data types in mind when you have to purchase hardware for yourself or your company. You want hardware that supports your training/inference needs. Good luck and maybe try some FP4 training on your own.