Output for High Token Languages like Japanese

Open choprahetarth opened this issue 1 year ago • 2 comments

While the concept is promising, especially for High Token Languages like Japanese, I've encountered a significant encoding issue.

Steps to Reproduce:

1. Input a Japanese text prompt into LLMLingua for compression.
2. Observe the output, which should be a compressed version of the original prompt.

Expected Behavior: The compressed output should retain the original Japanese characters without any encoding errors.

Actual Behavior: The output contains a mix of unrecognized characters along with some correct Japanese script. This mixed encoding makes the compressed prompt unusable when passed into GPT-4.
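For readers trying to picture the symptom, here is a minimal sketch of one way such output can arise. It rests on an assumption this thread does not confirm: that the compression step discards units smaller than a whole character, so multi-byte UTF-8 sequences get split. The helpers `drop_bytes` and `has_encoding_damage` are hypothetical illustrations and not part of the LLMLingua API.

```python
def drop_bytes(text: str, drop_every: int = 4) -> str:
    """Simulate a lossy step that ignores character boundaries.

    Every `drop_every`-th byte of the UTF-8 encoding is discarded,
    then the remainder is decoded with invalid sequences replaced
    by U+FFFD (the "unrecognized character" glyph).
    """
    raw = text.encode("utf-8")
    kept = bytes(b for i, b in enumerate(raw) if (i + 1) % drop_every != 0)
    return kept.decode("utf-8", errors="replace")


def has_encoding_damage(text: str) -> bool:
    """Return True if the text contains the Unicode replacement character."""
    return "\ufffd" in text


# Each of these Japanese characters takes 3 bytes in UTF-8, so dropping
# single bytes breaks some characters while leaving others intact.
original = "日本語のプロンプトを圧縮します"
damaged = drop_bytes(original)

print(damaged)
print("damage detected:", has_encoding_damage(damaged))
```

A check like `has_encoding_damage` could be run on a compressed prompt before sending it to a model, which at least makes the failure visible instead of silently passing broken text along.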

choprahetarth, Jan 18 '24 03:01

Hi @choprahetarth, thank you for your interest in and support of LLMLingua.

This is a known issue, as seen in #4. We'll address it soon as detailed in #51.

iofu728, Jan 18 '24 12:01

Is there anything I can contribute to? I'm quite interested in this. My stack is Python/ML/PyTorch, but I'm not sure which issue to pick first.

choprahetarth, Feb 02 '24 05:02