vision chat error
Hi,
I'm trying to run run_vision_chat.sh but getting the following error:
```
(lwm) minyoung@claw2:~/Projects/LWM$ bash scripts/run_vision_chat.sh
I0215 18:19:20.605390 140230836105600 xla_bridge.py:689] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0215 18:19:20.607900 140230836105600 xla_bridge.py:689] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2024-02-15 18:19:29.755994: W external/xla/xla/service/gpu/nvptx_compiler.cc:744] The NVIDIA driver's CUDA version is 12.1 which is older than the ptxas CUDA version (12.3.107). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
Traceback (most recent call last):
  File "/home/minyoung/anaconda3/envs/lwm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/minyoung/anaconda3/envs/lwm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/minyoung/Projects/LWM/lwm/vision_chat.py", line 254, in <module>
    run(main)
  File "/home/minyoung/anaconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/minyoung/anaconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/minyoung/Projects/LWM/lwm/vision_chat.py", line 249, in main
    sampler = Sampler()
  File "/home/minyoung/Projects/LWM/lwm/vision_chat.py", line 42, in __init__
    self.mesh = VideoLLaMAConfig.get_jax_mesh(FLAGS.mesh_dim)
  File "/home/minyoung/Projects/LWM/lwm/llama.py", line 260, in get_jax_mesh
    return get_jax_mesh(axis_dims, ('dp', 'fsdp', 'tp', 'sp'))
  File "/home/minyoung/anaconda3/envs/lwm/lib/python3.10/site-packages/tux/distributed.py", line 140, in get_jax_mesh
    mesh_shape = np.arange(jax.device_count()).reshape(dims).shape
ValueError: cannot reshape array of size 1 into shape (1,newaxis,32,1)
```
These are the model configs I used.
```bash
export llama_tokenizer_path="./LWM-Chat-1M-Jax/tokenizer.model"
export vqgan_checkpoint="./LWM-Chat-1M-Jax/vqgan"
export lwm_checkpoint="./LWM-Chat-1M-Jax/params"
export input_file="./traj0.mp4"
```
FYI, here is what works for me:
```bash
#!/bin/bash
export SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
export PROJECT_DIR="$( cd -- "$( dirname -- "$SCRIPT_DIR" )" &> /dev/null && pwd )"
cd "$PROJECT_DIR"
export PYTHONPATH="$PYTHONPATH:$PROJECT_DIR"

export llama_tokenizer_path="LWM-Chat-1M-Jax/tokenizer.model"
export vqgan_checkpoint="LWM-Chat-1M-Jax/vqgan"
export lwm_checkpoint="LWM-Chat-1M-Jax/params"
export input_file="taylor.jpg"

python3 -u -m lwm.vision_chat \
    --prompt="What is the image about?" \
    --input_file="$input_file" \
    --vqgan_checkpoint="$vqgan_checkpoint" \
    --dtype='fp32' \
    --load_llama_config='7b' \
    --max_n_frames=8 \
    --update_llama_config="dict(sample_mode='text',theta=50000000,max_sequence_length=131072,use_flash_attention=False,scan_attention=False,scan_query_chunk_size=128,scan_key_chunk_size=128,remat_attention='',scan_mlp=False,scan_mlp_chunk_size=2048,remat_mlp='',remat_block='',scan_layers=True)" \
    --load_checkpoint="params::$lwm_checkpoint" \
    --tokenizer.vocab_file="$llama_tokenizer_path" \
    2>&1 | tee ~/output.log
read
```
But I haven't gotten video to work yet; it probably doesn't accept mp4 input.
Also, the `--mesh_dim='!1,-1,32,1'` flag always seems off; it has to be adjusted for your hardware or removed.
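For example, on a single-GPU machine a mesh like the following might work (the `1,1,1,1` value is just my guess matching one visible device, untested beyond my setup):

```bash
# hypothetical single-device mesh: dp=1, fsdp=1, tp=1, sp=1
python3 -u -m lwm.vision_chat \
    --mesh_dim='1,1,1,1' \
    ...  # remaining flags as in the script above
```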
I wish the creators gave minimal running examples using the scripts.
Thanks for sharing, @pseudotensor ! I was also wondering if the .mp4 video file format is not supported.
Is the .avi video format supported?
I got the same problem; it cannot process the .mp4 file.
.mkv format works for me.
Would you mind sharing your script? I tried to use .mkv but still got the same error. Thank you for your help.
The `mesh_dim` argument depends on the number of devices you're using for inference. If you want to do tensor parallelism over 8 GPUs, then `mesh_dim` should be `1,1,8,1`. The default of 32 may be too high if your machine doesn't have 32 devices.
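To illustrate why the reshape fails, here is a minimal sketch of what tux's `get_jax_mesh` does internally, per the traceback above (`mesh_shape` is just a stand-in name for this example): the device IDs are reshaped into the `(dp, fsdp, tp, sp)` mesh, so the mesh axes must multiply out to the device count.

```python
import numpy as np

def mesh_shape(device_count, dims):
    # dims is the (dp, fsdp, tp, sp) tuple parsed from --mesh_dim;
    # -1 means "infer this axis from the device count".
    return np.arange(device_count).reshape(dims).shape

# 8 devices with tensor parallelism over all 8: works.
print(mesh_shape(8, (1, 1, 8, 1)))  # -> (1, 1, 8, 1)

# 1 visible device with tp=32: raises the ValueError from the traceback,
# "cannot reshape array of size 1 into shape (1,newaxis,32,1)".
try:
    mesh_shape(1, (1, -1, 32, 1))
except ValueError as e:
    print(e)
```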
Regarding supported video files, the code here:
https://github.com/LargeWorldModel/LWM/blob/0f441d39e46a607d64ea1e207eca7943306a1e3b/lwm/vision_chat.py#L84
just uses decord to read the video, so any video format that works for decord should work.