text-generation-inference
text generation details not working when stream=False
System Info
I ran the TGI Docker container with `--model-id` pointing to a Llama 3 model downloaded from Hugging Face, and sent a request with the Python code below:
```python
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient("http://127.0.0.1:8080")
output = await client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)
print(output)
```
but it does not return the details:

```
TextGenerationOutput(generated_text='100% open-source and available on GitHub. It is distributed', details=None)
```
The server log shows that the request arrived with `details: false`, i.e. the client did not forward the flag:

```
2024-05-10T09:32:15.955615Z INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-rtx-a6000"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(12), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="1.425314571s" validation_time="477.908µs" queue_time="66.966µs" inference_time="1.42476984s" time_per_token="118.73082ms" seed="None"}: text_generation_router::server: router/src/server.rs:309: Success
```
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
```python
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient("http://127.0.0.1:8080")
output = await client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)
print(output)
```

Server log:

```
2024-05-10T09:32:15.955615Z INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-rtx-a6000"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(12), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="1.425314571s" validation_time="477.908µs" queue_time="66.966µs" inference_time="1.42476984s" time_per_token="118.73082ms" seed="None"}: text_generation_router::server: router/src/server.rs:309: Success
```
Expected behavior
`text_generation` should return the generation details instead of `None`.
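For illustration, once details are populated the client exposes them as attributes. A minimal sketch, assuming the client from the reproduction above and the `finish_reason` / `generated_tokens` / `seed` fields of the details object:

```python
# Sketch of the expected behavior once details are returned correctly
# (illustrative, not a fix); reuses the `client` defined above.
output = await client.text_generation(
    "The huggingface_hub library is ", max_new_tokens=12, details=True
)
assert output.details is not None
print(output.details.finish_reason)     # e.g. "length" when max_new_tokens is reached
print(output.details.generated_tokens)  # e.g. 12
print(output.details.seed)              # None here, since do_sample=False
```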
@uyeongkim I opened a similar issue at: https://github.com/huggingface/huggingface_hub/issues/2281
Related issue for stream=True: https://github.com/huggingface/text-generation-inference/issues/1530
Since you use stream=False, calling the server directly with requests instead of huggingface_hub should work for you:
```python
import requests

session = requests.Session()

# Use /generate for non-streaming requests; /generate_stream is the streaming endpoint.
url = "http://0.0.0.0:80/generate"
data = {
    "inputs": "Today I am in Paris and",
    # "details": true asks the server to include the generation details in the response
    "parameters": {"max_new_tokens": 20, "details": True},
}
headers = {"Content-Type": "application/json"}

response = session.post(url, json=data, headers=headers, stream=False)
print(response.json())
```
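For reference, with `"details": true` the `/generate` response body carries a `details` object alongside `generated_text`. A minimal sketch of reading it, assuming the `response` from the snippet above and the field names used by the TGI API (`finish_reason`, `generated_tokens`, `seed`):

```python
body = response.json()
print(body["generated_text"])

# "details" is only present when requested; among other fields it includes
# finish_reason, generated_tokens, and seed.
details = body.get("details")
if details is not None:
    print(details["finish_reason"], details["generated_tokens"], details["seed"])
```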
It looks like this is a regression in the huggingface_hub package, since it does not reproduce on older versions such as 0.20.0.
@uyeongkim @kdamaszk This was indeed a regression. A hot-fix release has been shipped: https://github.com/huggingface/huggingface_hub/releases/tag/v0.23.3. See related PR for more details: https://github.com/huggingface/huggingface_hub/pull/2316.
Note: this was not a bug in text-generation-inference itself.
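As a quick sanity check, you can verify that the fixed release is installed (assuming a pip-installed huggingface_hub):

```python
import huggingface_hub

# The hot-fix ships in 0.23.3; older versions may still drop the details.
print(huggingface_hub.__version__)
```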