AiOS
Error running the demo
Thanks for your fantastic work. I ran into a series of problems when running the demo and would really appreciate some help. Here are the problems I encountered:
Environment Error
If I follow the instructions in the README to install pytorch 1.10.1 and then pytorch3d, I get a CUDA version mismatch error: The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.
I worked around this by installing the latest pytorch 2.3.1, then manually downloading the pytorch3d conda package and installing it. I don't know whether I should install an older version of the Nvidia driver on my machine instead.
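The mismatch can be checked before installing pytorch3d by comparing the two version strings. The following is a minimal sketch, not part of AiOS: the helper names `cuda_major` and `same_cuda_major` are hypothetical, and the sketch assumes that only a differing major version (11 vs. 12) is fatal, which has not been confirmed in this thread.

```python
import re


def cuda_major(version: str) -> int:
    """Extract the major number from a CUDA version string such as '12.4'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"not a CUDA version string: {version!r}")
    return int(match.group(1))


def same_cuda_major(compiled: str, detected: str) -> bool:
    """Return True when both version strings share the same major version."""
    return cuda_major(compiled) == cuda_major(detected)


if __name__ == "__main__":
    # Version strings taken from the error message above.
    print(same_cuda_major("11.3", "12.4"))
    # A hypothetical CUDA 12.x build of PyTorch against the same toolkit.
    print(same_cuda_major("12.1", "12.4"))
```

With PyTorch installed, the compiled version is available as `torch.version.cuda`, and the locally installed toolkit version is reported by `nvcc --version`.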
debugpy always waiting
If I don't comment out the line debugpy.wait_for_client(), the code stops there and waits forever for a debugpy client to attach.
def main(args):
    utils.init_distributed_mode_ssc(args)
    # utils.init_distributed_mode(args)
    if args.rank == 0:
        debugpy.listen(("127.0.0.1", 10086))
        debugpy.wait_for_client()
    print('Loading config file from {}'.format(args.config_file))
    shutil.copy2(args.config_file,'config/aios_smplx.py')
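One way to keep the remote-debugging hook without blocking normal runs is to make it opt-in. This is only a sketch under stated assumptions: the function name `maybe_wait_for_debugger` and the environment variable `AIOS_DEBUG` are hypothetical and do not exist in the AiOS code; only `debugpy.listen` and `debugpy.wait_for_client` come from the snippet above.

```python
import os


def maybe_wait_for_debugger(rank, host="127.0.0.1", port=10086, environ=None):
    """Wait for a debugpy client on rank 0, but only when AIOS_DEBUG=1.

    AIOS_DEBUG is a hypothetical variable used for illustration.
    Returns True if a debugger was waited for, False otherwise.
    """
    environ = os.environ if environ is None else environ
    if rank != 0 or environ.get("AIOS_DEBUG") != "1":
        return False
    # Imported lazily so that normal runs do not need debugpy installed.
    import debugpy

    debugpy.listen((host, port))
    debugpy.wait_for_client()  # blocks until a debugger client attaches
    return True
```

In `main`, the two debugpy lines could then be replaced by a call such as `maybe_wait_for_debugger(args.rank)`, so the demo runs straight through unless debugging is requested explicitly.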
Distributed running error
If I use the default mmcv distributed mode in the code, I get the following error, which looks like a bug related to the device type:
[rank0]: Traceback (most recent call last):
[rank0]: File "main.py", line 395, in <module>
[rank0]: main(args)
[rank0]: File "main.py", line 297, in main
[rank0]: inference(model,
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]: outputs, targets, data_batch_nc = model(data_batch)
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 162, in _run_ddp_forward
[rank0]: inputs, kwargs = self.to_kwargs( # type: ignore
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 27, in to_kwargs
[rank0]: return scatter_kwargs(inputs, kwargs, [device_id], dim=self.dim)
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 60, in scatter_kwargs
[rank0]: inputs = scatter(inputs, target_gpus, dim) if inputs else []
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 50, in scatter
[rank0]: return scatter_map(inputs)
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]: return list(zip(*map(scatter_map, obj)))
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 40, in scatter_map
[rank0]: out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]: return list(zip(*map(scatter_map, obj)))
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 33, in scatter_map
[rank0]: return Scatter.forward(target_gpus, obj.data)
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in forward
[rank0]: streams = [_get_stream(device) for device in target_gpus]
[rank0]: File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in <listcomp>
[rank0]: streams = [_get_stream(device) for device in target_gpus]
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 119, in _get_stream
[rank0]: if device.type == "cpu":
[rank0]: AttributeError: 'int' object has no attribute 'type'
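The last frames of the traceback show PyTorch's `_get_stream` reading `device.type`, while mmcv passes it the integer entries of `target_gpus`. A minimal sketch of bridging the two, assuming each integer is a CUDA device index, is to convert the index to a `torch.device` first. The helper name `as_torch_device` is hypothetical and is not part of mmcv or PyTorch.

```python
import torch


def as_torch_device(device):
    """Return a torch.device for an integer GPU index, a string, or a torch.device."""
    if isinstance(device, torch.device):
        return device
    if isinstance(device, int):
        # Assumption: a plain integer is a CUDA device index.
        return torch.device("cuda", device)
    return torch.device(device)


if __name__ == "__main__":
    target_gpus = [0]  # what mmcv passes in the traceback above
    devices = [as_torch_device(d) for d in target_gpus]
    # Each entry now has the .type attribute that _get_stream reads.
    print(devices[0].type, devices[0].index)
```

Under that assumption, the failing list comprehension in `mmcv/parallel/_functions.py` would call `_get_stream(as_torch_device(device))` instead of `_get_stream(device)`. Constructing a `torch.device` does not itself require a GPU, so the helper can be tried on any machine with PyTorch installed.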
If I disable distributed running, another error shows up, which also seems to be related to data type conversion:
[rank0]: Traceback (most recent call last):
[rank0]: File "main.py", line 395, in <module>
[rank0]: main(args)
[rank0]: File "main.py", line 297, in main
[rank0]: inference(model,
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]: outputs, targets, data_batch_nc = model(data_batch)
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
[rank0]: samples, targets = self.prepare_targets(data_batch)
[rank0]: File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
[rank0]: data_batch_coco = []
[rank0]: AttributeError: 'DataContainer' object has no attribute 'float'
My environment for running the code is:
OS: Ubuntu 24.04 LTS x86_64
Kernel: 6.8.0-38-generic
CPU: 13th Gen Intel i9-13900K (32) @ 5.500GHz
GPU: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4
Hi @caolonghao,
- I haven't tested the version you installed. Our code is compatible with most versions of PyTorch and CUDA. The main issue is PyTorch3D, which is used for visualization and which we've only tested with version 0.6.1. I think you can give it a try. If you can install it successfully, I don't think there will be major problems.
- The debugpy.wait_for_client() line is for remote debugging and should have been removed.
- How did you run the code? Could you please try running it using the following command:
sh scripts/inference.sh data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 1
I will update this part to support more ways to run the code.
I still get an error. Maybe you could put together a Colab demo so that the issue can be reproduced more easily.
Traceback (most recent call last):
File "main.py", line 389, in <module>
main(args)
File "main.py", line 291, in main
inference(model,
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/engine.py", line 338, in inference
outputs, targets, data_batch_nc = model(data_batch)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/aios_smplx.py", line 1001, in forward
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 332, in forward
memory, enc_intermediate_output, enc_intermediate_refpoints = self.encoder(
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 642, in forward
output = layer(src=output,
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer_deformable.py", line 62, in forward
src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points,
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/ops/modules/ms_deform_attn.py", line 96, in forward
value = self.value_proj(input_flatten)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]: Traceback (most recent call last):
[rank0]: File "main.py", line 395, in
I have the same problem.
@caolonghao you can still use the default mmcv distributed in the code but with modifications from https://github.com/open-mmlab/mmdetection/issues/10720 and https://github.com/HarborYuan/mmcv_16/commit/ad1a72fe0cbeead2716706ff618dfa0269d2cf4c. Then you should be good to go.
Thanks, this solved my problem. I installed pytorch 2.4.1 (CUDA 12.1) and pytorch3d from the conda file, then modified mmcv as you described. The code now runs as expected.