
text generation details not working when stream=False

Open · uyeongkim opened this issue on May 10, 2024

System Info

I ran the TGI Docker image with --model-id pointing at a Llama 3 model downloaded from Hugging Face, and sent a request with the Python code below:

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient("http://127.0.0.1:8080")

# top-level await requires an async context (e.g. Jupyter); otherwise wrap in asyncio.run()
output = await client.text_generation("The huggingface_hub library is ", max_new_tokens=12, details=True)
print(output)

but it does not return the details:

TextGenerationOutput(generated_text='100% open-source and available on GitHub. It is distributed', details=None)

and the server log is below. Note that the logged request parameters show details: false, even though details=True was passed on the client side:

2024-05-10T09:32:15.955615Z  INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-rtx-a6000"))}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(12), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="1.425314571s" validation_time="477.908µs" queue_time="66.966µs" inference_time="1.42476984s" time_per_token="118.73082ms" seed="None"}: text_generation_router::server: router/src/server.rs:309: Success

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Run the snippet above against a running TGI server: the request succeeds (see the server log above), but the returned TextGenerationOutput has details=None.

Expected behavior

text_generation should return a populated details object instead of None.
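
For reference, a minimal sketch of what consuming the details could look like once they are returned (field names follow huggingface_hub's TextGenerationOutputDetails and should be treated as assumptions):

# continuing from the snippet above, with `output.details` populated
details = output.details             # TextGenerationOutputDetails instead of None
print(details.finish_reason)         # e.g. "length" once max_new_tokens is reached
print(details.generated_tokens)      # e.g. 12
for token in details.tokens:         # per-token text and log-probability
    print(token.text, token.logprob)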

uyeongkim · May 10, 2024

@uyeongkim I opened a similar issue at: https://github.com/huggingface/huggingface_hub/issues/2281

Related issue for stream=True: https://github.com/huggingface/text-generation-inference/issues/1530

Since you use stream=False, simply using requests instead of huggingface_hub should work for you:

import requests

# non-streaming endpoint (a streaming sketch follows below)
url = "http://0.0.0.0:80/generate"
data = {
    "inputs": "Today I am in Paris and",
    # request details from the server directly; this bypasses the client-side bug
    "parameters": {"max_new_tokens": 20, "details": True},
}
headers = {"Content-Type": "application/json"}

response = requests.post(url, json=data, headers=headers)
response.raise_for_status()

# the JSON body contains generated_text plus the details object
print(response.json())
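
For completeness, a sketch of the streaming variant, assuming TGI's /generate_stream endpoint emits server-sent events prefixed with data:, each frame carrying a JSON payload with a token object:

import json
import requests

url = "http://0.0.0.0:80/generate_stream"
data = {"inputs": "Today I am in Paris and", "parameters": {"max_new_tokens": 20}}

with requests.post(url, json=data, headers={"Content-Type": "application/json"}, stream=True) as response:
    for line in response.iter_lines():
        # SSE frames look like: data:{"token": {"id": ..., "text": ..., ...}, ...}
        if line.startswith(b"data:"):
            payload = json.loads(line[len(b"data:") :])
            print(payload["token"]["text"], end="", flush=True)
print()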

fxmarty · May 14, 2024

It looks like this is a regression in the huggingface_hub package, because it does not reproduce with older versions such as 0.20.0.

kdamaszk · May 27, 2024

@uyeongkim @kdamaszk This was indeed a regression. A hot-fix release has been shipped: https://github.com/huggingface/huggingface_hub/releases/tag/v0.23.3. See related PR for more details: https://github.com/huggingface/huggingface_hub/pull/2316.

Note: this was not a bug in text-generation-inference itself.
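
For anyone landing here later, a quick sketch to verify the installed client includes the fix (assumes the packaging library is available):

import huggingface_hub
from packaging import version

# the hot-fix shipped in v0.23.3, per the release linked above
if version.parse(huggingface_hub.__version__) < version.parse("0.23.3"):
    print("upgrade with: pip install -U 'huggingface_hub>=0.23.3'")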

Wauplin · Jun 11, 2024