Results 3 issues of ZoneFv

When I run inference.py with 7b_tiva_v0, I find that the model can't stably generate image/video/audio. The LLM always produces output without signal tokens, like this: ![image](https://github.com/NExT-GPT/NExT-GPT/assets/13346651/740bef1f-3618-4207-a815-88199a05b249)

We use 8 A100 80G GPUs to train this code and found that most of the time the GPU utilization is 0 while the CPU load is high. Is there something wrong...