djl-serving
[Not DJLServing]HFPipeline error - GPT Neox
Description
This is not actually an error in DJLServing; tracking it here, and will raise an issue in HF as well. The HF pipeline tries to generate outputs on the CPU even though device_map="auto" is set in the configuration for the GPT-NeoX 20B model.
The workaround is to call the model.generate method directly, manually moving the input_ids to the GPU.
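A minimal sketch of the workaround described above, assuming a model already loaded with device_map="auto"; the helper name `generate_on_gpu` and the generation parameters are illustrative, not part of the handler:

```python
import torch


def generate_on_gpu(model, tokenizer, prompt, **gen_kwargs):
    """Workaround sketch: skip pipeline() and call model.generate()
    directly, moving input_ids to the model's device first.

    Under device_map="auto", the pipeline may leave input tensors on
    the CPU, which triggers the topk_cpu/Half error during sampling.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move input_ids to the device of the first model shard (e.g. cuda:0).
    input_ids = inputs.input_ids.to(model.device)
    output_ids = model.generate(input_ids, **gen_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (assumes `model` and `tokenizer` are loaded elsewhere,
# e.g. AutoModelForCausalLM.from_pretrained(..., device_map="auto")):
# text = generate_on_gpu(model, tokenizer, "Hello,", do_sample=True,
#                        top_k=50, max_new_tokens=20)
```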
Error Message
Bug: RuntimeError: "topk_cpu" not implemented for 'Half'
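The error can be demonstrated in isolation: on the PyTorch versions current when the linked issues were filed, top-k over a half-precision tensor on CPU is not implemented. This is an illustrative reproduction, not the handler code, and newer PyTorch releases may have since implemented the CPU kernel:

```python
import torch

# fp16 logits left on the CPU, as the pipeline does under device_map="auto"
logits = torch.randn(1, 50257, dtype=torch.float16)

try:
    # top-k sampling step; on older PyTorch this raises:
    # RuntimeError: "topk_cpu" not implemented for 'Half'
    torch.topk(logits, k=50)
    print("topk on CPU fp16 succeeded (this PyTorch implements the kernel)")
except RuntimeError as err:
    print(f"RuntimeError: {err}")
```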
How to reproduce?
Try the GPT-NeoX 20B model with our huggingface.py handler.
This was previously reported as issues in transformers:
https://github.com/huggingface/transformers/issues/18703
https://github.com/huggingface/transformers/issues/19445