There are too many floats now and no one knows what they are. Everyone knows about 64- and 32-bit floats but over the past 10 years all these “new” types have emerged. You start mentioning brain floats to someone and you have lost the audience. If you work in ML or even if you primarily work on serving models for inference you need to understand these data types. Grab hold of your exponents and let’s see where the mantissa takes us.

Classic IEEE 754

First we need to understand what a floating-point number is and how CPUs have been using them. In 1985 the IEEE established the standard for floating-point numbers. Every float is made up of three things:

1) Sign bit

2) Exponent

3) Mantissa

flowchart TD
    A[Floating Point] --> B[Sign Bit]
    A --> C[Exponent Bits]
    A --> D[Mantissa Bits]
    B --> B1["1 bit, 0=Positive, 1=Negative"]
    C --> C1[Biased exponent for range]
    D --> D1[Fractional precision]

The sign is 1 bit that is either 0 (positive) or 1 (negative).

The exponent controls the range (how small or large) of the number. The more exponent bits, the wider the range.

The mantissa controls the precision of the number. It stores the fractional part. More mantissa bits mean a more precise value.

Examples of 64 and 32 bit numbers

Here is what some examples look like. These are the types of computations that CPUs have been doing for a very long time.

The IEEE 754 formula

IEEE 754 standard states that the value of a binary32 (FP32) number is:

value = (-1)^sign × 2^(exponent - bias) × 1.mantissa

Where:

sign = bit 31 (0 = positive, 1 = negative)
exponent = the 8-bit unsigned integer stored in bits 30-23
bias = 127 for FP32, 1023 for FP64
mantissa = the 23 fraction bits (bits 22-0), with an implicit leading 1 for normal numbers

This formula is like scientific notation in binary: (-1)^sign × 1.m × 2^(E-bias).

FP32 bit layout examples

0.15625 in FP32

0 | 01111100 | 01000000000000000000000

sign	exponent (8 bits)								fraction (23 bits)
0	0	1	1	1	1	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
31	30							23	22																						0

= 0.15625

Applying the formula:

sign = 0 → (-1)^0 = +1
exponent = 01111100₂ = 124 → 2^(124 - 127) = 2^(-3) = 1/8
mantissa = .0100...0₂ = 1/4 → 1 + 1/4 = 1.25

(+1) × 1.25 × 1/8 = 0.15625

-2.5 in FP32

1 | 10000000 | 01000000000000000000000

sign	exponent (8 bits)								fraction (23 bits)
1	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
31	30							23	22																						0

= -2.5

Applying the formula:

sign = 1 → (-1)^1 = -1
exponent = 10000000₂ = 128 → 2^(128 - 127) = 2^1 = 2
mantissa = .0100...0₂ = 1/4 → 1 + 1/4 = 1.25

(-1) × 1.25 × 2 = -2.5
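
The two worked examples above can be checked mechanically. The sketch below uses only Python's standard `struct` module to reinterpret a value as its 32 raw bits, splits out the three fields, and applies the formula. The helper name `decode_fp32` is made up for this illustration, and the formula branch only covers normal numbers (not zero, subnormals, infinity, or NaN).

```python
import struct

def decode_fp32(x: float) -> dict:
    """Split a value, stored as binary32, into its IEEE 754 fields."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23
    fraction = bits & 0x7FFFFF        # bits 22-0
    # Formula for normal numbers only (exponent is neither 0 nor 255).
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2**23)
    return {"sign": sign, "exponent": exponent, "fraction": fraction, "value": value}

for x in (0.15625, -2.5):
    f = decode_fp32(x)
    print(f"{x}: {f['sign']} | {f['exponent']:08b} | {f['fraction']:023b} -> {f['value']}")
```

The printed bit strings should line up with the `sign | exponent | fraction` layouts shown above.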

FP64 examples

For FP64, the formula is the same but with bias = 1023:

value = (-1)^sign x 2^(exponent - 1023) x 1.mantissa

0.15625 in FP64

0 | 01111111100 | 010000...000000

sign	exponent (11 bits)											fraction (52 bits)
0	0	1	1	1	1	1	1	1	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
63	62										52	51																																																			0

= 0.15625

Applying the formula:

sign = 0 -> (-1)^0 = +1
exponent = 01111111100₂ = 1020 -> 2^(1020 - 1023) = 2^-3 = 1/8
mantissa = .0100...0₂ = 1/4 -> 1 + 1/4 = 1.25

(+1) x 1.25 x 1/8 = 0.15625

-2.5 in FP64

1 | 10000000000 | 010000...000000

sign	exponent (11 bits)											fraction (52 bits)
1	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
63	62										52	51																																																			0

= -2.5

Applying the formula:

sign = 1 -> (-1)^1 = -1
exponent = 10000000000₂ = 1024 -> 2^(1024 - 1023) = 2^1 = 2
mantissa = .0100...0₂ = 1/4 -> 1 + 1/4 = 1.25

(-1) x 1.25 x 2 = -2.5
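
Going the other direction is just as easy to check. This is a small sketch that assembles the 64 bits from the three fields and lets Python reinterpret them as a double. The function name `build_fp64` is hypothetical, chosen only for this example.

```python
import struct

def build_fp64(sign: int, exponent: int, fraction: int) -> float:
    """Assemble a binary64 value from its three raw fields."""
    assert sign in (0, 1) and 0 <= exponent < 2**11 and 0 <= fraction < 2**52
    bits = (sign << 63) | (exponent << 52) | fraction
    # Reinterpret the 64-bit integer as an IEEE 754 double.
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

quarter = 1 << 50   # fraction bits .0100...0, i.e. 1/4 (bit 51 would be 1/2)

print(build_fp64(0, 1020, quarter))   # exponent 1020 -> 2^(1020 - 1023)
print(build_fp64(1, 1024, quarter))   # exponent 1024 -> 2^(1024 - 1023)
```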

Now that we understand this a bit better, we can move on to all the new types that have become so popular with GPUs and ML.

GPUs get all the floats

Let’s break down these different types and when they were released.

Release Dates

Type	Year	Hardware	Notes
FP64, FP32	1985	IEEE 754	the originals
FP16	2002	GeForce FX	first GPU float16 (graphics only)
FP16	2016	Pascal GP100	real FP16 ML compute support
BF16	2017	Google TPU v2	ML focused 16-bit format
BF16	2020	Ampere / Cooper Lake	widespread GPU/CPU adoption
TF32	2020	Ampere A100	FP32 speedup
FP8	2022	Hopper H100	8-bit training
FP4	2024	Blackwell B200	4-bit inference

Floating Point Breakdowns

Type	Sign	Exponent	Mantissa	Total Bits
FP64 / double	1	11	52	64
FP32 / float	1	8	23	32
TF32	1	8	10	19
BF16	1	8	7	16
FP16 / half	1	5	10	16
FP8 (E4M3)	1	4	3	8
FP8 (E5M2)	1	5	2	8
FP4	1	2	1	4
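
The exponent and mantissa columns are enough to work out each format's range and precision. The sketch below assumes an IEEE-style layout in which the top exponent code is reserved for infinity and NaN. That holds for FP64, FP32, TF32, BF16, FP16 and FP8 E5M2. It does not hold for FP8 E4M3 or FP4, which reclaim most or all of the top exponent code for ordinary numbers, so they are left out here. The function name is an illustration only.

```python
def ieee_style_limits(exp_bits: int, man_bits: int) -> tuple[float, float]:
    """Return (largest finite value, spacing just above 1.0) for an IEEE-style layout."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = bias                        # top exponent code is reserved for inf/NaN
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    epsilon = 2.0 ** -man_bits            # gap between 1.0 and the next value
    return largest, epsilon

formats = {
    "FP64": (11, 52),
    "FP32": (8, 23),
    "TF32": (8, 10),
    "BF16": (8, 7),
    "FP16": (5, 10),
    "FP8 E5M2": (5, 2),
}

for name, (e, m) in formats.items():
    largest, eps = ieee_style_limits(e, m)
    print(f"{name:9s} max ~ {largest:.4g}   epsilon = {eps:.3g}")
```

Notice that TF32 and BF16 share FP32's maximum to within a rounding step because they share its 8 exponent bits, while epsilon depends only on the mantissa width.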

Tensor Float 32 (TF32)

Around 2017 NVIDIA started making GPUs with Tensor Cores (Volta V100), and with Ampere (A100) in 2020 they added TF32 support. A TF32 value uses 19 bits in total, and the products are accumulated back into an FP32 accumulator. The resulting loss of precision is small and rarely matters for training; the faster training speed is worth it.

Input:  FP32 (23-bit mantissa)
           ↓
Multiply: TF32 (10-bit mantissa)
           ↓
Accumulate: FP32 (23-bit mantissa)

The way this works is that precision is dropped to 10 mantissa bits at the multiply, but adding the products together is done with the full 23-bit mantissa. The GPU's Tensor Cores are built and optimized for these calculations. Below is a more complete example.

1) 2.5 and 4.5 enter as FP32 (32-bit fields on the wire)

sign | exponent | kept (10-bit mantissa) | ignored (13 bits)

2.5: 0 | 10000000 | 0100000000 | 0000000000000
4.5: 0 | 10000001 | 0010000000 | 0000000000000

2) Multiply reads only the kept bits (first 10 mantissa bits)
  TF32 multiply: 2.5 × 4.5 = 11.25
3) Each K-step feeds one product into the FP32 accumulator (hardware adder on die):



K-step	Multiply	Partial Product	acc before	acc after	Precision
0	2.5×4.5	11.25	0.0	11.25	FP32
1	3.1×2.2	6.82	11.25	18.07	FP32
2	1.7×0.9	1.53	18.07	19.60	FP32
⋮	⋮	⋮	⋮	⋮	⋮
15	0.4×8.1	3.24	...	final dot product	FP32

Because the accumulator keeps full FP32 precision during addition, the precision loss is confined to the multiply. For scientific computing this would not be a good idea, but for training large multi-billion-parameter models the extra precision is rarely worth the cost.
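
The multiply-then-accumulate flow can be imitated on a CPU. The following is a rough sketch, not a model of the real hardware: it assumes TF32 conversion simply truncates the low 13 mantissa bits (actual Tensor Cores may round instead), and the helper names `to_tf32`, `round_fp32` and `tf32_dot` are made up for this example.

```python
import struct

def round_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_tf32(x: float) -> float:
    """Keep the sign, 8 exponent bits and top 10 mantissa bits of the FP32 encoding.

    Assumption: the low 13 mantissa bits are truncated; hardware may round.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFE000))[0]

def tf32_dot(a, b) -> float:
    acc = 0.0
    for x, y in zip(a, b):
        product = to_tf32(x) * to_tf32(y)   # multiply on reduced-precision inputs
        acc = round_fp32(acc + product)     # accumulate at FP32 precision
    return acc

a = [2.5, 3.1, 1.7]
b = [4.5, 2.2, 0.9]
print("TF32 multiply, FP32 accumulate:", tf32_dot(a, b))
print("FP32 reference:                ", round_fp32(sum(x * y for x, y in zip(a, b))))
```

Values like 2.5 and 4.5 need only a few mantissa bits, so they pass through unchanged; values like 3.1 lose their low bits, which is where the small difference from the reference comes from.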

Brain Float 16 (BF16)

BF16 and FP16 have exactly the same number of bits, so what gives? BF16 covers a wider range than FP16 but with less precision: BF16 has 8 exponent bits and 7 mantissa bits, while FP16 has 5 exponent bits and 10 mantissa bits.

The primary reason for BF16 was to make training easier by reducing the need to apply loss scaling when training large models (billions of params).

To protect FP16 from underflow and overflow during backprop you need to scale the values. In large models the logits can grow large enough to overflow the range of FP16, while small gradients can underflow to zero, which shows up as vanishing gradients. So you scale the loss, which looks something like this:

loss_scaled = loss * S              # one scalar multiply, cheap
loss_scaled.backward()              # backprop: ALL gradients are now ×S in size
# FP16 can represent them because they're no longer tiny

# BEFORE optimizer.step(): unscale every gradient
for param in model.parameters():
    if param.grad is not None:
        param.grad /= S             # done in FP32; exact when S is a power of two

optimizer.step()                    # sees correct-magnitude gradients

In PyTorch this is done with GradScaler:

import torch

scaler = torch.amp.GradScaler("cuda")   # manages S automatically

for batch in dataloader:
    optimizer.zero_grad()              # clear gradients from the previous step
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch)            # forward in FP16 (model returns the loss here)

    scaler.scale(loss).backward()      # loss × S, then backprop (grads are ×S)
    scaler.unscale_(optimizer)         # divides all grads by S in FP32
    scaler.step(optimizer)             # optimizer sees correct magnitudes
    scaler.update()                    # adjusts S up or down for next step
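
To see why the scale factor matters, here is a small sketch that needs no GPU. It leans on the `"e"` format code of Python's `struct` module, which rounds a value to FP16. The scale `S = 2**16` and the gradient value are arbitrary numbers picked for illustration, and `to_fp16` is a helper defined only here.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

S = 2.0 ** 16        # hypothetical loss scale (a power of two)
grad = 1e-8          # tiny gradient, below FP16's smallest subnormal (~6e-8)

unscaled = to_fp16(grad)       # stored directly: flushes to zero
scaled = to_fp16(grad * S)     # stored after scaling: survives
recovered = scaled / S         # unscale in higher precision

print("unscaled:", unscaled)
print("scaled:  ", scaled)
print("recovered:", recovered)

# The flip side: a value that was fine unscaled can overflow once multiplied by S.
# struct raises an error here, where GPU hardware would produce inf instead.
try:
    to_fp16(10.0 * S)
except (OverflowError, struct.error) as exc:
    print("overflow:", exc)
```

The same scale that rescues the tiny gradient pushes the larger value past FP16's maximum of 65,504, which is the tension a loss scaler has to manage.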

The problem with loss scaling is that it is sometimes very hard to find the right scale value. Often you can fix overflow but still suffer from underflow. This is where BF16 comes to the rescue.

BF16 has roughly the same range as FP32 (both use 8 exponent bits) but with less precision. This is well worth the trade-off: we no longer have to apply loss scaling or deal with these issues, and gradients are far less likely to overflow or underflow.
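
Because BF16 keeps FP32's sign and exponent bits, a BF16 value is just the top 16 bits of the FP32 encoding. The sketch below rounds to nearest-even using a common bit trick. It is an illustration under simplifying assumptions (NaN and infinity inputs are not handled), and `to_bf16` is a name invented for this example.

```python
import struct

def to_bf16(x: float) -> float:
    """Round to BF16 by keeping the top 16 bits of the FP32 encoding.

    Uses round-to-nearest-even; NaN/inf inputs are not handled in this sketch.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bf16(70000.0))   # above FP16's max of 65504, still finite in BF16
print(to_bf16(1e-8))      # below FP16's smallest subnormal, still nonzero in BF16
print(to_bf16(1.001))     # the price: only 7 mantissa bits of precision
```

The first two values would overflow or flush to zero in FP16. The third shows the cost: with only 7 mantissa bits, anything closer to 1.0 than about 0.004 collapses back to 1.0.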

There are some cases where you may prefer the extra precision of FP16, such as after pre-training and supervised training. At that point the weights are more stable, so if you want to do reinforcement learning you may want to switch to FP16 for the extra precision. This is not always true, and many people have used BF16 for RL tuning.

Floating Point 8 (FP8) and Floating Point 4 (FP4)

FP8 comes in two versions

E4M3 (4 exponent, 3 mantissa)
E5M2 (5 exponent, 2 mantissa).

Both are needed because activations and gradients have different requirements:

E4M3 (range ±448): More precision for weights and activations (forward pass).
E5M2 (range ±57344): More range for gradients (backward pass). Gradients span many orders of magnitude and need the wider range to avoid overflow.

FP8 E4M3 (1 sign | 4 exponent | 3 mantissa): weights and activations, max value 448

sign	exponent (4 bits)				mantissa (3 bits)
S	E	E	E	E	M	M	M
7	6			3	2		0

FP8 E5M2 (1 sign | 5 exponent | 2 mantissa): gradients, max value 57344

sign	exponent (5 bits)					mantissa (2 bits)
S	E	E	E	E	E	M	M
7	6				2	1	0
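
The maximum values quoted above fall straight out of the bit layouts. Below is a sketch of decoders for both FP8 variants, assuming the encodings commonly used on current hardware: E4M3 with bias 7, no infinities, and only the all-ones pattern reserved for NaN; E5M2 with bias 15 and IEEE-style infinity/NaN. The function names are hypothetical.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode an FP8 E4M3 byte (bias 7; only S.1111.111 is NaN, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:                           # subnormal: no implicit leading 1
        return sign * (man / 8) * 2.0 ** -6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

def decode_e5m2(byte: int) -> float:
    """Decode an FP8 E5M2 byte (bias 15; IEEE-style inf/NaN at exponent 31)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp, man = (byte >> 2) & 0x1F, byte & 0x3
    if exp == 0x1F:
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:                           # subnormal
        return sign * (man / 4) * 2.0 ** -14
    return sign * (1 + man / 4) * 2.0 ** (exp - 15)

def largest_finite(decode) -> float:
    """Scan all 256 byte values and return the largest finite one."""
    return max(v for v in map(decode, range(256)) if math.isfinite(v))

print("E4M3 max:", largest_finite(decode_e4m3))
print("E5M2 max:", largest_finite(decode_e5m2))
```

With only 256 possible bytes per format, brute-force enumeration like this is a practical way to inspect an FP8 format in full.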

The H100 introduced FP8 Tensor Cores. Training with FP8 follows the same mixed-precision pattern as TF32:

FP8 mixed-precision training flow

Forward: master weights (FP32) → cast + scale (FP8 E4M3) → activations (FP8 E4M3) × weights (FP8 E4M3) → matmul (FP8 multiply) → accumulate (FP32) → loss (FP32)

Backward: gradients (FP8 E5M2) → dequantize (FP32) → optimizer (FP32) → update (FP32 master)

scale = FP8_max / tensor_amax  •  dequantize = result / scale  •  forward uses E4M3  •  backward uses E5M2

The major new piece that allows this to work is per-tensor scaling. You need to find a scale value for each tensor in order to fit it into the range of FP8.

import torch

# per-tensor scaling for FP8 (x is the tensor to quantize)
amax = x.abs().max()                          # largest magnitude in the tensor
scale = 448.0 / amax.clamp(min=1e-12)         # map max → FP8 max (448 for E4M3), avoid divide-by-zero
scaled = (x * scale).to(torch.float8_e4m3fn)  # quantize to FP8

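The scale factor is stored alongside the FP8 tensor so the next layer knows how to interpret it. PyTorch provides the FP8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2), and libraries such as NVIDIA Transformer Engine manage the scale factors for you.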
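
To make the round trip concrete without a GPU, here is a pure-Python sketch of per-tensor scaling. It builds the grid of non-negative E4M3 values, snaps each scaled element to the nearest grid point, and then divides by the scale to dequantize. All function names and the sample weights are invented for illustration; real kernels round to nearest-even in hardware rather than searching a list.

```python
E4M3_MAX = 448.0

def e4m3_grid() -> list[float]:
    """All non-negative finite E4M3 values (bias 7; exponent 15 / mantissa 7 is NaN)."""
    grid = [(m / 8) * 2.0 ** -6 for m in range(8)]              # zero + subnormals
    grid += [(1 + m / 8) * 2.0 ** (e - 7)
             for e in range(1, 16) for m in range(8) if (e, m) != (15, 7)]
    return grid

GRID = e4m3_grid()

def quantize_e4m3(x: float) -> float:
    """Snap x to the nearest representable E4M3 value, saturating at ±448."""
    nearest = min(GRID, key=lambda g: abs(g - abs(x)))          # ties pick the smaller magnitude
    return nearest if x >= 0 else -nearest

def quantize_tensor(values: list[float]) -> tuple[list[float], float]:
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0                # guard all-zero tensors
    return [quantize_e4m3(v * scale) for v in values], scale

def dequantize_tensor(q: list[float], scale: float) -> list[float]:
    return [v / scale for v in q]

weights = [0.001, -0.02, 0.5, 3.0]       # hypothetical tensor values
q, scale = quantize_tensor(weights)
restored = dequantize_tensor(q, scale)

for w, r in zip(weights, restored):
    print(f"{w:>8} -> {r:.6f}  (relative error {abs(r - w) / abs(w):.2%})")
```

With 3 mantissa bits, each element in the normal range comes back within roughly 6% of its original value, and the largest element maps exactly onto 448.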

With newer NVIDIA cards like H100 you can actually train in FP8. You don’t even need to worry about the loss scaler since all tensors get scaled properly.

Floating Point 4 (FP4)

FP4 is experimental. It can save a lot of memory, but in my experience some models start to show degradation at this low precision. If the model was trained using quantization-aware training then it can still be pretty good. I hope we continue to research and try new things at these lower-precision data types.

Here is the standard format

E2M1 (2 exponent, 1 mantissa). With 4 bits total there are exactly 16 bit patterns (15 distinct values, since +0 and −0 are both encoded):

FP4 E2M1 (1 sign | 2 exponent | 1 mantissa)

sign	exponent (2 bits)		mantissa (1 bit)
S	E	E	M
3	2	1	0

All 16 representable values:

Bits	Value	Bits	Value
0 00 0	+0.0	1 00 0	−0.0
0 00 1	+0.5	1 00 1	−0.5
0 01 0	+1.0	1 01 0	−1.0
0 01 1	+1.5	1 01 1	−1.5
0 10 0	+2.0	1 10 0	−2.0
0 10 1	+3.0	1 10 1	−3.0
0 11 0	+4.0	1 11 0	−4.0
0 11 1	+6.0	1 11 1	−6.0
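
The table can be regenerated from the bit layout. This sketch assumes an exponent bias of 1, a subnormal at exponent 0, and no infinity or NaN encodings, which is what the values in the table imply. The function name `decode_e2m1` is made up for this example.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (bias 1, no inf/NaN encodings)."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp, man = (code >> 1) & 0b11, code & 0b1
    if exp == 0:                          # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)

for code in range(16):
    print(f"{code:04b} -> {decode_e2m1(code):+.1f}")
```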

So can you train in FP4? Not really or at least not yet. We lose so much precision that when you compute gradients you will run into too much underflow/overflow. This mode is meant to be used for inference only at the moment.

The model is trained in BF16/FP16 and then quantized and scaled to 4-bit for inference.

There is some research on FP4 training that uses similar techniques to FP8. In a practical sense it is still too unstable for adoption.

Below are some papers about attempting training at FP4:

FP4 All the Way (Chmiel et al., 2025)
Quartet (Castro et al., 2025)
Pretraining LLMs with NVFP4 (NVIDIA, 2025)

I think we will have large adoption of FP4 and other smaller precisions in the future.

Conclusion

Format	Best for	Hardware	Range	Notes
FP64	Scientific computing, numerical simulation	CPUs, HPC GPUs (A100, H100, B200)	±1.8 × 10³⁰⁸	Overkill for ML: use only when double precision is required
FP32	Master weights, loss scaling, accumulator	All hardware	±3.4 × 10³⁸	The reference precision. Everything casts down from here
TF32	Training throughput on NVIDIA Tensor Cores	Ampere+ (A100, H100, B200)	±3.4 × 10³⁸	Transparent drop-in for FP32 in PyTorch: free speed on compatible GPUs
BF16	Training large models (7B+)	TPU v2+, Ampere+ GPUs	±3.4 × 10³⁸	Best range/precision trade-off for training. No loss scaling needed
FP16	Fine-tuning, inference, smaller models	Most GPUs (Pascal+)	±65,504	More precision than BF16 but less range: requires GradScaler for training
FP8 (E4M3)	Forward pass, inference quantization	Hopper+ (H100, B200)	±448	Weights and activations during FP8 training
FP8 (E5M2)	Gradients during FP8 training	Hopper+ (H100, B200)	±57,344	Extra range needed for backprop
FP4	Inference quantization	Blackwell B200 (experimental)	±6	Training still research-only. Use for memory-bound inference

Now we have a much better understanding of how these data types work and how they can help ML training. GPUs keep improving and the hardware can be pretty complicated.

Keep these data types in mind when you have to purchase hardware for yourself or your company. You want hardware that supports your training/inference needs. Good luck and maybe try some FP4 training on your own.
