vggt
Cannot train
I have tried many solutions found online and suggested by LLMs, but none of them worked. I have 8x L40S GPUs and 800 GB of RAM. Training runs with #node=6, but with #node=7 or #node=8 the program dies with the following error:
[rank0]: self._check_scale_growth_tracker("unscale_")
[rank0]: File "/data/user/youwyu/conda-env/vggt-env/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 162, in _check_scale_growth_tracker
[rank0]: assert self._scale is not None, (
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: Attempted unscale_ but _scale is None. This may indicate your script did not use scaler.scale(loss or outputs) earlier in the iteration.
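For anyone reading the traceback: the message says `unscale_` ran while `_scale` was still `None`, which suggests that on rank 0 no `scaler.scale(...)` call had happened before it. Below is a minimal pure-Python sketch of that lazy-initialization behaviour. `ToyScaler` and `run_step` are hypothetical stand-ins written for illustration, not PyTorch's or vggt's actual code, and the `skip_scale` flag only imitates an iteration that never reaches the scaling step.

```python
class ToyScaler:
    """Hypothetical stand-in that mimics only the lazy init of a grad scaler."""

    def __init__(self, init_scale=65536.0):
        self._init_scale = init_scale
        self._scale = None  # created lazily on the first scale() call

    def scale(self, loss):
        # The first call initializes the scale factor.
        if self._scale is None:
            self._scale = self._init_scale
        return loss * self._scale

    def unscale_(self, grads):
        # Same kind of check as in the traceback: fails if scale() never ran.
        assert self._scale is not None, "Attempted unscale_ but _scale is None."
        return [g / self._scale for g in grads]


def run_step(scaler, loss, skip_scale=False):
    """One toy iteration; skip_scale=True imitates a step that bypasses scale()."""
    scaled = loss if skip_scale else scaler.scale(loss)
    grads = [scaled * 0.5]  # placeholder standing in for backward()
    return scaler.unscale_(grads)


if __name__ == "__main__":
    print(run_step(ToyScaler(), 2.0))  # scale() ran first, so unscale_ succeeds
    try:
        run_step(ToyScaler(), 2.0, skip_scale=True)
    except AssertionError as err:
        print("AssertionError:", err)
```

If the real training loop behaves the same way, it may be worth checking whether the first iteration on rank 0 can skip the scaled backward pass (for example an early `continue` or a skipped batch) when run with 7 or 8 processes. That is a guess from the error text, not a confirmed diagnosis.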