
Error running the demo

Open caolonghao opened this issue 1 year ago • 3 comments

Thanks for your fantastic work, but I encountered a series of problems when running the demo. I would really appreciate it if you could give me some help. Here are the problems I ran into:

Environment Error

If I follow the instructions in the README to install pytorch 1.10.1 and then pytorch3d, I get a CUDA version mismatch error:

The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.

I solved this by installing the latest pytorch 2.3.1, then manually downloading the pytorch3d conda package and installing it. I don't know whether I should install an older version of the Nvidia driver on my machine.
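For anyone hitting the same message, here is a minimal sketch of the comparison behind it. It assumes that a difference in the major CUDA version (12 vs 11 here) is what turns the check into a hard error. The helper names `parse_cuda_version`, `cuda_major_mismatch` and `installed_torch_cuda` are made up for illustration and are not part of PyTorch. Only `torch.version.cuda` is a real attribute.

```python
def parse_cuda_version(version):
    """Turn a string such as '12.4' into a (major, minor) tuple of ints."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_major_mismatch(detected, compiled):
    """Return True when the two versions differ in their major number."""
    return parse_cuda_version(detected)[0] != parse_cuda_version(compiled)[0]


def installed_torch_cuda():
    """Return the CUDA version PyTorch was built with, or None if unknown."""
    try:
        import torch  # optional: only needed to inspect a real install
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


if __name__ == "__main__":
    # Versions taken from the error message in this report.
    print("mismatch:", cuda_major_mismatch("12.4", "11.3"))
    print("PyTorch built with CUDA:", installed_torch_cuda())
```

Running this inside the conda environment shows which CUDA version the installed PyTorch wheel targets. That is the number the locally installed toolkit has to be compatible with when pytorch3d is compiled.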

debugpy always waiting

If I don't comment out the line debugpy.wait_for_client(), the code just stops there and waits forever for a debugpy client to attach.

def main(args):
    
    utils.init_distributed_mode_ssc(args)
    # utils.init_distributed_mode(args)
    if args.rank == 0:
        debugpy.listen(("127.0.0.1", 10086))
        debugpy.wait_for_client()
    print('Loading config file from {}'.format(args.config_file))
    shutil.copy2(args.config_file,'config/aios_smplx.py')
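One way to keep the remote-debugging hook without blocking normal runs is to make it opt-in. The sketch below is only an illustration, not the project's code. The environment variable name `AIOS_DEBUGPY` and the helper `maybe_wait_for_debugger` are hypothetical. `debugpy.listen` and `debugpy.wait_for_client` are the same calls used in the snippet above.

```python
import os


def maybe_wait_for_debugger(rank, host="127.0.0.1", port=10086,
                            env_var="AIOS_DEBUGPY"):
    """Attach debugpy on rank 0 only when env_var is set to '1'.

    Returns True if a debugger was waited for, False otherwise.
    """
    if rank != 0 or os.environ.get(env_var) != "1":
        return False  # normal run: never block
    import debugpy  # imported lazily so regular runs do not need it
    debugpy.listen((host, port))
    debugpy.wait_for_client()  # blocks until a client attaches
    return True
```

With this in place, main(args) would call maybe_wait_for_debugger(args.rank) and the demo only pauses when the variable is exported on purpose.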

Distributed running errors

If I use the default mmcv distributed setup in the code, I get the following error, which looks like a bug related to device type:

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 162, in _run_ddp_forward
[rank0]:     inputs, kwargs = self.to_kwargs(  # type: ignore
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 27, in to_kwargs
[rank0]:     return scatter_kwargs(inputs, kwargs, [device_id], dim=self.dim)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 60, in scatter_kwargs
[rank0]:     inputs = scatter(inputs, target_gpus, dim) if inputs else []
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 50, in scatter
[rank0]:     return scatter_map(inputs)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]:     return list(zip(*map(scatter_map, obj)))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 40, in scatter_map
[rank0]:     out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]:     return list(zip(*map(scatter_map, obj)))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 33, in scatter_map
[rank0]:     return Scatter.forward(target_gpus, obj.data)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in forward
[rank0]:     streams = [_get_stream(device) for device in target_gpus]
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in <listcomp>
[rank0]:     streams = [_get_stream(device) for device in target_gpus]
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 119, in _get_stream
[rank0]:     if device.type == "cpu":
[rank0]: AttributeError: 'int' object has no attribute 'type'
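The last frames show mmcv passing plain integer GPU indices to `_get_stream`, while the installed PyTorch reads `device.type` and therefore expects a device object. A minimal sketch of the conversion involved follows. It assumes that wrapping each integer in a CUDA device is enough, and I have not checked it against the changes linked later in this thread. `to_device_objects` is a hypothetical helper, and the factory is passed in so the example runs without PyTorch.

```python
from collections import namedtuple


def to_device_objects(target_gpus, make_device):
    """Wrap integer GPU indices as device objects; leave other items as-is.

    make_device is called as make_device("cuda", index), which matches the
    torch.device("cuda", 0) constructor form.
    """
    return [
        make_device("cuda", gpu) if isinstance(gpu, int) else gpu
        for gpu in target_gpus
    ]


if __name__ == "__main__":
    # Stand-in for torch.device so the demo runs without PyTorch installed.
    FakeDevice = namedtuple("FakeDevice", ["type", "index"])
    print(to_device_objects([0], FakeDevice))

    # With PyTorch installed, the factory would be torch.device:
    #   import torch
    #   devices = to_device_objects([0], torch.device)
```

Each resulting object has a `.type` attribute, which is what the failing line `if device.type == "cpu":` needs.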

If I disable distributed running, another error shows up, which also seems to be related to data type conversion:

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
[rank0]:     samples, targets = self.prepare_targets(data_batch)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
[rank0]:     data_batch_coco = []
[rank0]: AttributeError: 'DataContainer' object has no attribute 'float'

My environment for running the code is:

OS: Ubuntu 24.04 LTS x86_64
Kernel: 6.8.0-38-generic
CPU: 13th Gen Intel i9-13900K (32) @ 5.500GHz
GPU: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4 

caolonghao avatar Jul 24 '24 15:07 caolonghao

Hi @caolonghao,

  1. I haven't tested the version you installed. Our code is compatible with most versions of PyTorch and CUDA. The main issue is with PyTorch3D, which is used for visualization, and we've only tested it with version 0.6.1. I think you can give it a try. If you can successfully install it, I don't think there will be major problems.
  2. The debugpy.wait_for_client() line is for remote debugging and should have been removed.
  3. How did you run the code? Could you please try running it with the following command: sh scripts/inference.sh data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 1. I will update this part to support more ways to run the code.

ttxskk avatar Jul 25 '24 05:07 ttxskk

I still get an error. Maybe you could put together a Colab demo so that it can be reproduced more easily.

Traceback (most recent call last):
  File "main.py", line 389, in <module>
    main(args)
  File "main.py", line 291, in main
    inference(model,
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/engine.py", line 338, in inference
    outputs, targets, data_batch_nc = model(data_batch)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/aios_smplx.py", line 1001, in forward
    hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 332, in forward
    memory, enc_intermediate_output, enc_intermediate_refpoints = self.encoder(
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer.py", line 642, in forward
    output = layer(src=output,
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/transformer_deformable.py", line 62, in forward
    src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points,
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/mycode/motion_estimation/AiOS-remove_debugpy/models/aios/ops/modules/ms_deform_attn.py", line 96, in forward
    value = self.value_proj(input_flatten)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/tony/miniconda3/envs/aios_older/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

caolonghao avatar Jul 30 '24 09:07 caolonghao

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
[rank0]:     samples, targets = self.prepare_targets(data_batch)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
[rank0]:     data_batch_coco = []
[rank0]: AttributeError: 'DataContainer' object has no attribute 'float'

the same problem..........

leeooo001 avatar Aug 10 '24 00:08 leeooo001

@caolonghao you can still use the default mmcv distributed in the code but with modifications from https://github.com/open-mmlab/mmdetection/issues/10720 and https://github.com/HarborYuan/mmcv_16/commit/ad1a72fe0cbeead2716706ff618dfa0269d2cf4c. Then you should be good to go.

MoyGcc avatar Aug 26 '24 17:08 MoyGcc

@caolonghao you can still use the default mmcv distributed in the code but with modifications from open-mmlab/mmdetection#10720 and HarborYuan/mmcv_16@ad1a72f. Then you should be good to go.

Thanks, this solved my problem. I installed pytorch 2.4.1 with CUDA 12.1 and pytorch3d from the conda file. After that, I modified mmcv as you described, and the code now runs as expected.

caolonghao avatar Sep 18 '24 03:09 caolonghao