flax nn.tabulate Incorrectly Reports FLOPs and VJP FLOPs

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.4 LTS x86_64
Flax, jax, jaxlib versions:

Name: flax
Version: 0.8.4
---
Name: jax
Version: 0.4.30
---
Name: jaxlib
Version: 0.4.30

Python version: Python 3.12.4
GPU/TPU model and memory: NVIDIA GeForce RTX 3080 Ti
CUDA version: 12.2

Problem you have encountered:

When I run a script that tabulates the model summary with Flax's nn.tabulate function (compute_flops=True, compute_vjp_flops=True), the output shows 0 for both FLOPs and VJP FLOPs. This is unexpected: the model applies two Dense layers, so both counts should be non-zero.

What you expected to happen:

nn.tabulate should compute and display non-zero FLOPs and VJP FLOPs for each layer in the model.

Logs, error messages, etc:

import flax.linen as nn
import jax
import jax.numpy as jnp

class Foo(nn.Module):
    @nn.compact
    def __call__(self, x):
        h = nn.Dense(4)(x)
        return nn.Dense(2)(h)

x = jnp.ones((16, 9))
tabulate_fn = nn.tabulate(
    Foo(), jax.random.PRNGKey(0), compute_flops=True, compute_vjp_flops=True)

print(tabulate_fn(x))

Output:

                                          Foo Summary                                          
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ path    ┃ module ┃ inputs        ┃ outputs       ┃ flops ┃ vjp_flops ┃ params               ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│         │ Foo    │ float32[16,9] │ float32[16,2] │ 0     │ 0         │                      │
├─────────┼────────┼───────────────┼───────────────┼───────┼───────────┼──────────────────────┤
│ Dense_0 │ Dense  │ float32[16,9] │ float32[16,4] │ 0     │ 0         │ bias: float32[4]     │
│         │        │               │               │       │           │ kernel: float32[9,4] │
│         │        │               │               │       │           │                      │
│         │        │               │               │       │           │ 40 (160 B)           │
├─────────┼────────┼───────────────┼───────────────┼───────┼───────────┼──────────────────────┤
│ Dense_1 │ Dense  │ float32[16,4] │ float32[16,2] │ 0     │ 0         │ bias: float32[2]     │
│         │        │               │               │       │           │ kernel: float32[4,2] │
│         │        │               │               │       │           │                      │
│         │        │               │               │       │           │ 10 (40 B)            │
├─────────┼────────┼───────────────┼───────────────┼───────┼───────────┼──────────────────────┤
│         │        │               │               │       │     Total │ 50 (200 B)           │
└─────────┴────────┴───────────────┴───────────────┴───────┴───────────┴──────────────────────┘

Jun 24 '24 17:06 Surya-77

Hmm, unfortunately I cannot repro this (Flax 0.8.5). Running the same snippet in any empty Colab yields this printout:

                                          Foo Summary                                          
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ path    ┃ module ┃ inputs        ┃ outputs       ┃ flops ┃ vjp_flops ┃ params               ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│         │ Foo    │ float32[16,9] │ float32[16,2] │ 1504  │ 4460      │                      │
├─────────┼────────┼───────────────┼───────────────┼───────┼───────────┼──────────────────────┤
│ Dense_0 │ Dense  │ float32[16,9] │ float32[16,4] │ 1216  │ 3620      │ bias: float32[4]     │
│         │        │               │               │       │           │ kernel: float32[9,4] │
│         │        │               │               │       │           │                      │
│         │        │               │               │       │           │ 40 (160 B)           │
├─────────┼────────┼───────────────┼───────────────┼───────┼───────────┼──────────────────────┤
│ Dense_1 │ Dense  │ float32[16,4] │ float32[16,2] │ 288   │ 840       │ bias: float32[2]     │
│         │        │               │               │       │           │ kernel: float32[4,2] │
│         │        │               │               │       │           │                      │
│         │        │               │               │       │           │ 10 (40 B)            │
├─────────┼────────┼───────────────┼───────────────┼───────┼───────────┼──────────────────────┤
│         │        │               │               │       │     Total │ 50 (200 B)           │
└─────────┴────────┴───────────────┴───────────────┴───────┴───────────┴──────────────────────┘
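As a sanity check on the non-zero numbers in this table, here is a minimal hand count in plain Python. It assumes 2 FLOPs per multiply-add in the matmul plus 1 FLOP per output element for the bias add; that counting convention is an assumption for illustration, not a description of how XLA's cost analysis works internally. The helper name `dense_flops` is hypothetical.

```python
def dense_flops(batch, in_features, out_features, count_bias=True):
    """Hand-counted forward FLOPs for y = x @ W + b, with x of shape (batch, in_features)."""
    # Assumption: each multiply-add in the matmul counts as 2 FLOPs.
    matmul = 2 * batch * in_features * out_features
    # Assumption: the bias add counts as 1 FLOP per output element.
    bias = batch * out_features if count_bias else 0
    return matmul + bias


# Shapes from the Foo example: x is float32[16, 9], Dense(4) followed by Dense(2).
dense_0 = dense_flops(16, 9, 4)
dense_1 = dense_flops(16, 4, 2)
print(dense_0, dense_1, dense_0 + dense_1)  # 1216 288 1504
```

Under these assumptions the hand count agrees with the flops column above (1216, 288, and 1504 for the whole module), so these are the values a working install should report.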

Jun 25 '24 23:06 IvyZX

The code does work with the pinned package configurations on Colab and Kaggle, but reports 0 when the same package versions are installed on a local machine. The data I provided comes from a fresh pip install of flax, jax, and jaxlib (CUDA) in a mamba environment, though that shouldn't affect the result.

For reference, the Colab and Kaggle runtimes use system-level CUDA packages, while the pip-installed versions come with their own CUDA wheels.

Here's the minimal dependency list anyway.

Package                  Version
------------------------ ---------
absl-py                  2.1.0
asttokens                2.4.1
chex                     0.1.86
decorator                5.1.1
etils                    1.7.0
exceptiongroup           1.2.0
executing                2.0.1
flax                     0.8.4
fsspec                   2024.6.0
importlib_resources      6.4.0
ipython                  8.25.0
jax                      0.4.26
jax-cuda12-pjrt          0.4.26
jax-cuda12-plugin        0.4.26
jaxlib                   0.4.26
jedi                     0.19.1
markdown-it-py           3.0.0
matplotlib-inline        0.1.7
mdurl                    0.1.2
ml-dtypes                0.4.0
msgpack                  1.0.8
nest-asyncio             1.6.0
numpy                    2.0.0
nvidia-cublas-cu12       12.5.2.13
nvidia-cuda-cupti-cu12   12.5.39
nvidia-cuda-nvcc-cu12    12.5.40
nvidia-cuda-nvrtc-cu12   12.5.40
nvidia-cuda-runtime-cu12 12.5.39
nvidia-cudnn-cu12        8.9.7.29
nvidia-cufft-cu12        11.2.3.18
nvidia-cusolver-cu12     11.6.2.40
nvidia-cusparse-cu12     12.4.1.24
nvidia-nccl-cu12         2.22.3
nvidia-nvjitlink-cu12    12.5.40
opt-einsum               3.3.0
optax                    0.2.2
orbax-checkpoint         0.5.20
parso                    0.8.4
pexpect                  4.9.0
pickleshare              0.7.5
pip                      24.0
prompt_toolkit           3.0.47
protobuf                 5.27.2
ptyprocess               0.7.0
pure-eval                0.2.2
Pygments                 2.18.0
PyYAML                   6.0.1
rich                     13.7.1
scipy                    1.14.0
setuptools               70.1.1
six                      1.16.0
stack-data               0.6.2
tensorstore              0.1.63
toolz                    0.12.1
traitlets                5.14.3
typing_extensions        4.12.2
wcwidth                  0.2.13
wheel                    0.43.0
zipp                     3.19.2

Jun 26 '24 05:06 Surya-77

I am also getting 0 for flops and vjp_flops.

Python 3.12.7 CUDA 12.6 (RTX 4090) jax==0.6.0 jax-cuda12-pjrt==0.6.0 jax-cuda12-plugin==0.6.0 jaxlib==0.6.0

Even for the example listed in the docs:

https://flax-linen.readthedocs.io/en/latest/api_reference/flax.linen/inspection.html

I have now confirmed this on 3 different machines, with CUDA 12.4 and CUDA 12.5 as well. However, computing the FLOPs manually from the compiled function's cost analysis works:

import jax
from flax import linen as nn
import jax.numpy as jnp

class Foo(nn.Module):
  @nn.compact
  def __call__(self, x):
    h = nn.Dense(4)(x)
    return nn.Dense(2)(h)

model = Foo()
rng = jax.random.key(42)
dummy = jnp.ones((1, 224, 224, 3))
variables = model.init(rng, dummy)
# Compile the forward pass for the GPU backend and query XLA's cost analysis.
jit_fwd = jax.jit(lambda x: model.apply(variables, x), backend='gpu')
compiled = jit_fwd.lower(dummy).compile()
cost = compiled.cost_analysis()
print("Total FLOPs:", cost and cost.get("flops"))

All 3 machines report: Total FLOPs: 2007040.0
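
For what it's worth, that figure can be reproduced by hand. The sketch below counts only the two matmuls at 2 FLOPs per multiply-add and ignores the bias adds; that the GPU cost analysis leaves the bias adds out is an assumption inferred from the number, not something confirmed here. The helper name `dense_matmul_flops` is hypothetical.

```python
import math


def dense_matmul_flops(input_shape, out_features):
    """Hand-counted matmul FLOPs for a Dense layer applied to the last axis of the input."""
    *leading, in_features = input_shape
    rows = math.prod(leading)  # every leading axis is treated as a batch axis
    # Assumption: each multiply-add counts as 2 FLOPs; bias adds are not counted.
    return 2 * rows * in_features * out_features


# Shapes from the snippet above: dummy is (1, 224, 224, 3), Dense(4) followed by Dense(2).
first = dense_matmul_flops((1, 224, 224, 3), 4)
second = dense_matmul_flops((1, 224, 224, 4), 2)
print(first + second)  # 2007040
```

So the manual cost analysis returns a plausible count on these machines, which suggests the zeros are specific to how nn.tabulate obtains its numbers rather than to the model.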

May 10 '25 16:05 danjenson