transformers-bloom-inference
The generated results are different when using greedy search during generation
Thank you very much for your work. I ran into a problem when running BLOOM-176B on 8*A100.
I followed the README.md and executed the command below. Specifically, I set do_sample = true and top_k = 1, which I thought was equivalent to greedy search:
python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": true, "top_k": 1}'
However, the outputs generated across several forward passes with the same inputs were sometimes different. This only happened occasionally.
Do you have any clues or ideas about this?
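For reference, here is a minimal sketch of the two settings expressed directly through the transformers API (this is not the repo's inference_server code, and bigscience/bloom-560m is only an assumed small stand-in checkpoint for a quick local check):

```python
# Sketch only: compare "sampling with top_k=1" against plain greedy search.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # assumption: small checkpoint for local testing
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt")

# Setting A: what I used -- sampling restricted to the single most likely token
out_sampled = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=1)

# Setting B: plain greedy search
out_greedy = model.generate(**inputs, max_new_tokens=100, do_sample=False)

print(tokenizer.decode(out_sampled[0]))
print(tokenizer.decode(out_greedy[0]))
```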
My env info:
CUDA 11.7
nccl 2.14.3
accelerate 0.17.1
Flask 2.2.3
Flask-API 3.0.post1
gunicorn 20.1.0
pydantic 1.10.6
huggingface-hub 0.13.2
Hi, do_sample = true with top_k = 1 should be fine, but the correct way to do greedy search is simply do_sample = False.
This is weird. I don't think this is a bug in the code in this repository.
But I will try to give it a shot.
Can you try with just do_sample = False?
Hi @mayank31398, sorry for the late reply.
It was ok with do_sample=False. The results were all the same.
But I still can't figure out why sampling doesn't work properly. Do you know whom or which repo I could turn to for help?
Refer to https://huggingface.co/blog/how-to-generate. Sampling is designed to incorporate randomness into picking the next word.
But here k is 1, so there shouldn't be any randomness. @richarddwang
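For what it's worth, in exact arithmetic top_k = 1 should indeed collapse to greedy: the top-k filter leaves only one token with non-zero probability, so multinomial sampling has a single possible outcome. A toy sketch in plain PyTorch (not the transformers internals) to illustrate:

```python
# After top-k filtering with k=1, only the argmax token keeps non-zero
# probability, so multinomial sampling always returns that token.
import torch

logits = torch.tensor([2.0, 0.5, 3.1, -1.0])       # toy next-token logits
topk_vals, topk_idx = logits.topk(k=1)              # keep only the best token
filtered = torch.full_like(logits, float("-inf"))   # mask out everything else
filtered[topk_idx] = topk_vals
probs = torch.softmax(filtered, dim=-1)             # one entry is 1.0, rest are 0.0
sampled = torch.multinomial(probs, num_samples=1)   # the only possible outcome
assert sampled.item() == logits.argmax().item()
```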