Issues Encountered During Inference Prediction

Open zomosky opened this issue 4 months ago • 0 comments

First of all, thank you very much for sharing the model code and weights. The structure of GraphCast is very interesting! However, I encountered some problems during live inference prediction.

Issue description:

Specifically, the inference process intermittently hangs while generating ensemble member outputs. There is no clear pattern in which forecast task or member triggers the hang.

When I attach to the stuck process with strace -p, I consistently see repeated poll timeouts like:

poll([{fd=12, events=POLLIN}, ...], 11, 100) = 0 (Timeout)
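The poll timeout only shows that the process is waiting, not which Python call it is waiting in. A minimal sketch of one way to narrow this down, using the standard-library faulthandler module: arm a watchdog around each member so that all thread stacks are dumped to a file if the call runs too long. The names hang_watchdog, run_member, and dump_for, and the timeout values, are hypothetical placeholders; run_member stands in for whatever call produces one member's forecast and is not part of graphcast.

```python
import faulthandler
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_file):
    """Dump all thread stacks to log_file if the block runs longer than timeout_s."""
    # The dump is written by a watchdog thread, so it can fire even while
    # the main thread is blocked inside a long-running native call.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file, exit=False)
    try:
        yield
    finally:
        # Disarm the watchdog once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


def run_member(member_id, seconds):
    # Hypothetical stand-in for one ensemble member's rollout.
    time.sleep(seconds)
    return member_id


def dump_for(seconds, timeout_s):
    """Run one fake member under the watchdog and return whatever was dumped."""
    with tempfile.TemporaryFile(mode="w+") as log_file:
        with hang_watchdog(timeout_s, log_file):
            run_member(0, seconds)
        log_file.seek(0)
        return log_file.read()
```

In a real run the timeout would be set well above the normal per-member runtime, and the dump file would be a regular file that survives the process being killed. The dump only shows Python frames, so it identifies the call that entered the blocking native code rather than the cause inside CUDA or XLA.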

Process info:

2957897 zomo   20   0 322.1g 97.6g   1.3g S   0.3  25.9   0:59.43 cuda-EvtHandlr

Prediction configuration:

  • 15-day forecast for 11 ensemble members
  • GPU: NVIDIA H100
  • Driver: 535.161.07
  • CUDA: 12.2
  • jax: 0.4.23
  • jaxlib: 0.4.23+cuda12.cudnn89
  • graphcast version: 0.1

Logs are attached for your reference.


Additional Warning Messages:

Besides the above issue, I also noticed that even for normal inference runs, the following warnings always appear in the log. I'm not sure if they are related or might cause other problems:

2025-08-16 16:02:45.186950: I external/tsl/tsl/platform/default/subprocess.cc:308] Start cannot spawn child process: No such file or directory
2025-08-16 16:02:45.195876: I external/tsl/tsl/platform/default/subprocess.cc:308] Start cannot spawn child process: No such file or directory
2025-08-16 16:02:45.196705: W external/xla/xla/service/gpu/nvptx_compiler.cc:698] The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA version (12.3.107). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
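The third message states its own trigger: the driver's CUDA version (12.2) is older than the ptxas version (12.3.107). A small sketch for pulling those two versions out of such a log line when triaging many runs. The regex and the function names are assumptions written against the message text above; they are not part of XLA or JAX, and the comparison only mirrors what the message describes.

```python
import re

# Matches the wording of the nvptx_compiler warning quoted above.
PATTERN = re.compile(
    r"driver's CUDA version is (?P<driver>\d+(?:\.\d+)*) which is older than "
    r"the ptxas CUDA version \((?P<ptxas>\d+(?:\.\d+)*)\)"
)

SAMPLE = (
    "2025-08-16 16:02:45.196705: W external/xla/xla/service/gpu/nvptx_compiler.cc:698] "
    "The NVIDIA driver's CUDA version is 12.2 which is older than the ptxas CUDA "
    "version (12.3.107). Because the driver is older than the ptxas version, XLA is "
    "disabling parallel compilation, which may slow down compilation."
)


def parse_version(text):
    """Turn '12.3.107' into (12, 3, 107) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def ptxas_mismatch(line):
    """Return (driver, ptxas, driver_is_older), or None if the line is not this warning."""
    match = PATTERN.search(line)
    if match is None:
        return None
    driver = parse_version(match["driver"])
    ptxas = parse_version(match["ptxas"])
    return driver, ptxas, driver < ptxas
```

Calling ptxas_mismatch(SAMPLE) yields the two parsed versions and a flag confirming that the driver is the older of the two. Lines that are not this warning, such as the two subprocess.cc messages, return None.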

It would be greatly appreciated if you could take a look and let me know if there are any solutions or possible causes. Thank you again for the amazing work on GraphCast!


Attachment: 一次出问题的log.txt (log from one run where the problem occurred)

Logs are attached below.

zomosky, Aug 19 '25 07:08