LLaVA
[Question] Inconsistency between continuous inference in the same process and single inference in different processes
Question
I am trying to use llava/serve/cli.py to run inference on multiple images.
However, the results differ between two setups: (1) running cli.py multiple times from a shell script, with one image per run, and (2) running cli.py only once and using a for loop to feed in one image per iteration.
For the first setup, I use the original llava/serve/cli.py and the following shell script, which runs cli.py once per image so that each inference happens in its own process:
echo -e "Is the person in the picture sitting in the car?\n\n" | CUDA_VISIBLE_DEVICES=0,1 python llava/serve/cli.py --model-path liuhaotian/llava-v1.6-34b --image-file /root/data/pos/image1.jpg --temperature 0 --max-new-tokens 1024
echo -e "Is the person in the picture sitting in the car?\n\n" | CUDA_VISIBLE_DEVICES=0,1 python llava/serve/cli.py --model-path liuhaotian/llava-v1.6-34b --image-file /root/data/pos/image2.jpg --temperature 0 --max-new-tokens 1024
echo -e "Is the person in the picture sitting in the car?\n\n" | CUDA_VISIBLE_DEVICES=0,1 python llava/serve/cli.py --model-path liuhaotian/llava-v1.6-34b --image-file /root/data/pos/image3.jpg --temperature 0 --max-new-tokens 1024
echo -e "Is the person in the picture sitting in the car?\n\n" | CUDA_VISIBLE_DEVICES=0,1 python llava/serve/cli.py --model-path liuhaotian/llava-v1.6-34b --image-file /root/data/pos/image4.jpg --temperature 0 --max-new-tokens 1024
echo -e "Is the person in the picture sitting in the car?\n\n" | CUDA_VISIBLE_DEVICES=0,1 python llava/serve/cli.py --model-path liuhaotian/llava-v1.6-34b --image-file /root/data/pos/image5.jpg --temperature 0 --max-new-tokens 1024
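The five commands above differ only in the image path, so the same runs could be driven from a short script. The following is a minimal sketch, not part of the original setup: `build_command` and `run_one` are hypothetical helper names, and the flags are copied from the commands above. It assumes `python` on the PATH is the interpreter with LLaVA installed.

```python
import os
import subprocess

QUESTION = "Is the person in the picture sitting in the car?"


def build_command(image_path, model_path="liuhaotian/llava-v1.6-34b"):
    """Return the argv list for one cli.py run (same flags as the shell script)."""
    return [
        "python", "llava/serve/cli.py",
        "--model-path", model_path,
        "--image-file", image_path,
        "--temperature", "0",
        "--max-new-tokens", "1024",
    ]


def run_one(image_path, question=QUESTION, gpus="0,1"):
    """Run cli.py in a fresh process for one image and return its stdout."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    result = subprocess.run(
        build_command(image_path),
        # `echo -e "...\n\n"` sends the question plus three newlines in total
        # (two from the string, one appended by echo).
        input=question + "\n\n\n",
        text=True,
        capture_output=True,
        env=env,
        check=True,
    )
    return result.stdout
```

Calling `run_one("/root/data/pos/image1.jpg")` would then correspond to the first line of the shell script, and the captured output can be stored per image for later comparison.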
For the second setup, I run cli.py only once. Starting from llava/serve/cli.py, I made only the following changes:
parser.add_argument("--model-path", type=str, default="liuhaotian/llava-v1.6-34b")
parser.add_argument("--temperature", type=float, default=0)
parser.add_argument("--max-new-tokens", type=int, default=1024)
image_dir = "/root/data/pos"
image_files = os.listdir(image_dir)
for image_file in image_files:
    args.image_file = os.path.join(image_dir, image_file)
    main(args)
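One detail worth ruling out when comparing the two setups: `os.listdir` returns entries in arbitrary order and includes every file in the directory, so the loop may not visit the images in the same order as the shell script. A small sketch of a deterministic listing follows; `list_images` and the extension tuple are assumptions for illustration, not part of cli.py.

```python
import os

# Assumption: the directory may contain non-image files that should be skipped.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_images(image_dir):
    """Return image paths under image_dir in sorted (lexicographic) order."""
    names = sorted(os.listdir(image_dir))
    return [
        os.path.join(image_dir, name)
        for name in names
        if name.lower().endswith(IMAGE_EXTENSIONS)
    ]
```

This only makes the iteration order reproducible; it is not claimed to explain the differing answers.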
Then for the same image, I got inconsistent results:
| Images | continuous inference in the same process | single inference in different processes |
|---|---|---|
| image1 | Yes, the person in the picture is sitting in the car. They appear to be in the driver's seat, as indicated by the steering wheel and the position of the person relative to the vehicle's controls. | Yes, the person in the picture is sitting in the car. They appear to be in the driver's seat, as indicated by the steering wheel and the position of the person relative to the vehicle's controls. |
| image2 | The image provided is very blurry and lacks clear details, making it difficult to discern any specific objects or people. If there is a person in the picture, it would be challenging to confirm their presence or actions due to the low quality of the image. | Yes, there is a person sitting in the driver's seat of the vehicle in the picture. |
| image3 | The image you've provided is very blurry and lacks clear details, making it difficult to discern any specific objects or people. If there is a person in the picture, it's not possible to determine their actions or whether they are in a car due to the quality of the image. | Yes, there is a person sitting in the driver's seat of the vehicle in the picture. |
| image4 | The image provided is very blurry and lacks clear details, making it difficult to discern any specific objects or people. If there is a person in the car, it's not possible to confirm that based on this image. | Yes, there is a person sitting in the driver's seat of the vehicle in the picture. |
| image5 | The image you've provided is very blurry and lacks any discernible details. It's not possible to determine if there is a person sitting in a car or any other context based on this image. If you have a clearer image or more information, I might be able to assist you better. | Yes, the person in the picture is sitting inside a vehicle, which appears to be a truck or a van. They are wearing a face mask and are seated in the passenger seat. |
The results for the first image are consistent, but the results for the following images are wrong when the same process is used for continuous inference. It seems that earlier results affect later ones.
However, I made only a few changes to llava/serve/cli.py, and I am sure that the model is reloaded for each image, so earlier results should not affect later ones. The only difference between the two setups is whether the process is restarted.
So why are they not consistent? Is there any way to solve it?
And can I load the model only once to infer multiple images and ensure that they do not affect each other, in order to save time on loading the model?
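As a starting point for the load-once question, here is a minimal sketch of the general pattern: load the model a single time, then rebuild all per-image state inside the loop. It is deliberately library-agnostic. `run_batch` and its three callables are hypothetical names; in LLaVA, `load_model` would presumably wrap `load_pretrained_model` from `llava.model.builder` and `new_conversation` would wrap `conv_templates[conv_mode].copy()`, as cli.py does. It assumes the conversation object is the only state carried between images, which this thread has not confirmed, so it is not a verified fix for the inconsistency.

```python
from typing import Any, Callable, Dict, Iterable


def run_batch(
    load_model: Callable[[], Any],
    new_conversation: Callable[[], Any],
    answer: Callable[[Any, Any, str, str], str],
    image_paths: Iterable[str],
    question: str,
) -> Dict[str, str]:
    """Load the model once, then answer the question for each image
    with a freshly created conversation so no history is reused."""
    model = load_model()  # the expensive step, done exactly once
    results: Dict[str, str] = {}
    for path in image_paths:
        conv = new_conversation()  # fresh per-image state
        results[path] = answer(model, conv, path, question)
    return results
```

Comparing the dictionary returned here against the per-process outputs, image by image, would show whether any remaining difference comes from state that lives outside the conversation object.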
Same problem. Is there any update?
Any updates about this issue?