
Cannot ingest Chinese text file

Open cigoic opened this issue 2 years ago • 17 comments

Hi,

I've tried the following command: python ingest someTChinese.txt

However, it returns the error shown below. Any suggestions? I tried modifying the parameters (chunk_size=500, chunk_overlap=50), but nothing worked.
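(For reference, the chunk_size / chunk_overlap values are the ones passed to the LangChain text splitter in ingest.py; in my checkout the relevant lines look roughly like this, though the exact splitter class may differ between versions. The full error output follows the snippet.)

```python
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# stand-in for the documents produced by the loader in ingest.py
documents = [Document(page_content=open("someTChinese.txt", encoding="utf-8").read())]

# split into chunks before embedding; these are the two parameters I tweaked
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)
```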

llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  =  512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
Using embedded DuckDB with persistence: data will be stored in: db
llama_tokenize: too many tokens
Traceback (most recent call last):
  File "/Users/jmosx/devp/allAboutGPT/privateGPT/ingest.py", line 22, in <module>
    main()
  File "/Users/jmosx/devp/allAboutGPT/privateGPT/ingest.py", line 17, in main
    db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 412, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 380, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 158, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/embeddings/llamacpp.py", line 111, in embed_documents
    embeddings = [self.client.embed(text) for text in texts]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/embeddings/llamacpp.py", line 111, in <listcomp>
    embeddings = [self.client.embed(text) for text in texts]
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 514, in embed
    return list(map(float, self.create_embedding(input)["data"][0]["embedding"]))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 478, in create_embedding
    tokens = self.tokenize(input.encode("utf-8"))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 200, in tokenize
    raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')
RuntimeError: Failed to tokenize: text="b'\xe7\x8b\x82\xe5\xaf\x86\xe8\x88\x87\xe7\x9c\x9f\xe5\xaf\x86\xe7\xac\xac\xe4\xb8\x80\xe8\xbc\xaf\n\n\xe3\x80\x80\n\n\xe5\xb9\xb3\xe5\xaf\xa6\xe5\xb1\x85\xe5\xa3\xab\xe8\x91\x97\n\n \xe6\x9c\x89\xe5\xb8\xab\xe4\xba\x91\xef\xbc\x9a\n\n   \xe3\x80\x8c\xe5\xaf\x86\xe5\xae\x97\xe6\x98\xaf\xe4\xb8\x80\xe5\x80\x8b\xe9\x87\x91\xe5\x89\x9b\xe9\x91\xbd\xe5\xa4\x96\xe5\x9c\x8d\xe6\x93\xba\xe6\xbb\xbf\xe4\xba\x86\xe9\x8d\x8d\xe9\x87\x91\xe5\x9e\x83\xe5\x9c\xbe\xe7\x9a\x84\xe5\xae\x97\xe6\x95\x99\xe3\x80\x82\xe3\x80\x8d\n\n\xe3\x80\x80\n\n \xe5\xb9\xb3\xe5\xaf\xa6\xe6\x96\xbc\xe6\xad\xa4\xe8\xa8\x80\xe5\xbe\x8c\xe6\x9b\xb4\xe5\x8a\xa0\xe8\xa8\xbb\xe8\x85\xb3\xef\xbc\x9a\n\n   \xe3\x80\x8c\xe9\x82\xa3\xe9\xa1\x86\xe9\x87\x91\xe5\x85\x89\xe9\x96\x83\xe9\x96\x83\xe7\x9a\x84\xe9\x91\xbd\xe7\x9f\xb3\xef\xbc\x8c\xe5\x8d\xbb\xe6\x98\xaf\xe7\x8e\xbb\xe7\x92\x83\xe6\x89\x93\xe7\xa3\xa8\xe8\x80\x8c\xe6\x88\x90\xef\xbc\x8c\xe4\xb8\x8d\xe5\xa0\xaa\xe6\xaa\xa2\xe9\xa9\x97\xe3\x80\x82\xe3\x80\x8d\n\n\xe3\x80\x80\n\n\xe8\xb7\x8b\n\n\xe6\x95\xb8\xe5\x8d\x81\xe5\xb9\xb4\xe4\xbe\x86\xef\xbc\x8c\xe8\xa5\xbf\xe8\x97\x8f\xe5\xaf\x86\xe5\xae\x97\xe7\xb3\xbb\xe7\xb5\xb1\xe8\xab\xb8\xe6\xb3\x95\xe7\x8e\x8b\xe8\x88\x87\xe8\xab\xb8\xe4\xb8\x8a\xe5\xb8\xab\xef\xbc\x8c\xe6\x82\x89\xe7\x9a\x86\xe5\x92\x90\xe5\x9b\x91\xe8\xab\xb8\xe5\xbc\x9f\xe5\xad\x90\xef\xbc\x9a\xe3\x80\x8c\xe8\x8b\xa5\xe6\x9c\x89\xe5\xa4\x96\xe4\xba\xba\xe8\xa9\xa2\xe5\x95\x8f\xe5\xaf\x86\xe5\xae\x97\xe6\x98\xaf\xe5\x90\xa6\xe6\x9c\x89\xe9\x9b\x99\xe8\xba\xab\xe4\xbf\xae\xe6\xb3\x95\xef\xbc\x9f\xe6\x87\x89\xe4\xb8\x80\xe5\xbe\x8b\xe7\xad\x94\xe5\xbe\xa9\xef\xbc\x9a\xe3\x80\x8e\xe5\x8f\xa4\xe6\x99\x82\xe5\xaf\x86\xe5\xae\x97\xe6\x9c\x89\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbd\x86\xe7\x8f\xbe\xe5\x9c\xa8\xe5\xb7\xb2\xe7\xb6\x93\xe5\xbb\xa2\xe6\xa3\x84\xe4\xb8\x8d\xe7\x94\xa8\xe3\x80\x82\xe7\x8f\xbe\xe5\x9c\xa8\xe4\xb8\x8a\xe5\xb8\xab\xe9\x83\xbd\xe5\x9a\xb4\xe7\xa6\x81\xe5\xbc\x9f\xe5\xad\x90\xe5\x80\x91\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe6\x89\x80\xe4\xbb\xa5\xe7\x8f\xbe\xe5\x9c\xa8\xe7\x9a\x84\xe5\xaf\x86\xe5\xae\x97\xe5\xb7\xb2\xe7\xb6\x93\xe6\xb2\x92\xe6\x9c\x89\xe4\xba\xba\xe5\xbc\x98\xe5\x82\xb3\xe5\x8f\x8a\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe4\xba\x86\xe3\x80\x82\xe3\x80\x8f\xe3\x80\x8d\n\n\xe7\x84\xb6\xe8\x80\x8c\xe5\xaf\xa6\xe9\x9a\x9b\xe4\xb8\x8a\xef\xbc\x8c\xe8\xa5\xbf\xe8\x97\x8f\xe5\xaf\x86\xe5\xae\x97\xe4\xbb\x8d\xe6\x9c\x89\xe7\x94\x9a\xe5\xa4\x9a\xe4\xb8\x8a\xe5\xb8\xab\xe7\xb9\xbc\xe7\xba\x8c\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xb8\xa6\xe9\x9d\x9e\xe6\x9c\xaa\xe5\x82\xb3\xef\xbc\x8c\xe4\xba\xa6\xe9\x9d\x9e\xe7\xa6\x81\xe5\x82\xb3\xef\xbc\x8c\xe8\x80\x8c\xe6\x98\xaf\xe7\xb9\xbc\xe7\xba\x8c\xe5\x9c\xa8\xe7\x89\xa9\xe8\x89\xb2\xe9\x81\xa9\xe5\x90\x88\xe4\xbf\xae\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe4\xb9\x8b\xe5\xbc\x9f\xe5\xad\x90\xef\xbc\x8c\xe7\x82\xba\xe5\xbd\xbc\xe7\xad\x89\xe8\xab\xb8\xe4\xba\xba\xe6\x9a\x97\xe4\xb8\xad\xe4\xbd\x9c\xe7\xa7\x98\xe5\xaf\x86\xe7\x81\x8c\xe9\xa0\x82\xef\xbc\x8c\xe4\xbb\xa4\xe5\x85\xb6\xe6\x88\x90\xe7\x82\xba\xe5\x8b\x87\xe7\x88\xb6\xe8\x88\x87\xe7\xa9\xba\xe8\xa1\x8c\xe6\xaf\x8d\xef\xbc\x8c\xe7\x84\xb6\xe5\xbe\x8c\xe5\x90\x88\xe4\xbf\xae\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82\xe4\xbb\x8a\xe6\x99\x82\xe5\xaf\x86\xe5\xae\x97\xe5\xa6\x82\xe6\x98\xaf\xe7\xb9\xbc\xe7\xba\x8c\xe6\x9a\x97\xe4\xb8\xad\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\
xb3\x95\xef\xbc\x8c\xe4\xbb\xa5\xe5\xbb\xb6\xe7\xba\x8c\xe5\x85\xb6\xe5\xaf\x86\xe6\xb3\x95\xe4\xb9\x8b\xe5\x91\xbd\xe8\x84\x88\xef\xbc\x9b\xe7\x9d\xbd\xe4\xba\x8e\xe9\x9b\xbb\xe8\xa6\x96\xe6\x96\xb0\xe8\x81\x9e\xe4\xb9\x8b\xe5\xa0\xb1\xe5\xb0\x8e\xe9\x99\xb3\xe5\xb1\xa5\xe5\xae\x89\xe5\x85\xac\xe5\xad\x90\xe9\x99\xb3\xe5\xae\x87\xe5\xbb\xb7\xef\xbc\x8c\xe6\xb1\x82\xe8\x97\x8f\xe5\xa5\xb3\xe7\x82\xba\xe5\x85\xb6\xe7\xa9\xba\xe8\xa1\x8c\xe6\xaf\x8d\xe7\xad\x89\xe8\xa8\x80\xef\xbc\x8c\xe4\xba\xa6\xe5\x8f\xaf\xe7\x9f\xa5\xe7\x9f\xa3\xe3\x80\x82\xe6\x98\xaf\xe6\x95\x85\xe5\xaf\x86\xe5\xae\x97\xe5\xbc\x9f\xe5\xad\x90\xe8\x88\x87\xe4\xb8\x8a\xe5\xb8\xab\xe7\xad\x89\xe4\xba\xba\xef\xbc\x8c\xe9\x9b\x96\xe7\x84\xb6\xe5\xb0\x8d\xe5\xa4\x96\xe5\x8f\xa3\xe5\xbe\x91\xe4\xb8\x80\xe8\x87\xb4\xef\xbc\x8c\xe6\x82\x89\xe7\x9a\x86\xe5\x80\xa1\xe8\xa8\x80\xef\xbc\x9a\xe3\x80\x8c\xe5\xaf\x86\xe5\xae\x97\xe7\x8f\xbe\xe5\x9c\xa8\xe5\xb7\xb2\xe6\x8d\xa8\xe6\xa3\x84\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbb\x8a\xe5\xb7\xb2\xe7\x84\xa1\xe4\xba\xba\xe5\xbc\x98\xe5\x82\xb3\xe6\x88\x96\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82\xe3\x80\x8d\xe5\x85\xb6\xe5\xaf\xa6\xe6\x98\xaf\xe6\xac\xba\xe7\x9e\x9e\xe7\xa4\xbe\xe6\x9c\x83\xe4\xb9\x8b\xe8\xa8\x80\xef\xbc\x8c\xe4\xbb\xa5\xe6\xad\xa4\xe4\xbb\xa4\xe7\xa4\xbe\xe6\x9c\x83\xe5\xa4\xa7\xe7\x9c\xbe\xe5\xb0\x8d\xe5\xaf\x86\xe5\xae\x97\xe4\xb8\x8d\xe8\x87\xb4\xe5\x8a\xa0\xe4\xbb\xa5\xe5\xa4\xaa\xe5\xa4\x9a\xe4\xb9\x8b\xe6\xb3\xa8\xe6\x84\x8f\xef\xbc\x8c\xe4\xbb\xa5\xe4\xbe\xbf\xe7\xb9\xbc\xe7\xba\x8c\xe6\x9a\x97\xe4\xb8\xad\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82'" n_tokens=-723

cigoic avatar May 10 '23 01:05 cigoic

The thing is, langchain doesn't handle Chinese content very well.

dylanxia2017 avatar May 10 '23 17:05 dylanxia2017

Could it have anything to do with the model as well? The default model suggested in this repo is the Hugging Face Groovy one (English). I don't know how good it is with Chinese.

tristan-mcinnis avatar May 11 '23 06:05 tristan-mcinnis

The thing is, langchain doesn't handle Chinese content very well.

That's a real pity.

cigoic avatar May 11 '23 23:05 cigoic

If you are using the most recent branch, try again, as the default token limit has been increased from 512 to 1000. I noticed that when it produces Chinese text it takes a lot more time to print each character, so I would assume it takes more tokens to read as well.

@cigoic At the very end of your error you see n_tokens=-723, which is because the default before commit #58 (which fixes an error from commit #53) was 512 tokens.

By doing exactly what you did before with the new default settings, it should work fine, as the new default is 1000 tokens. Even with English I was having trouble with only 512 tokens.

If it still is not enough, edit your .env file and change 'MODEL_N_CTX=1000' to a higher number.
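For example, this is roughly how the value flows through (a sketch assuming the current repo layout; exact variable names may differ in your checkout):

```python
import os
from dotenv import load_dotenv
from langchain.embeddings import LlamaCppEmbeddings

load_dotenv()  # reads MODEL_N_CTX from the .env file, e.g. MODEL_N_CTX=2048
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))

# the context size is passed straight through to llama.cpp
llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin", n_ctx=model_n_ctx)
```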

Problem solved!

alxspiker avatar May 12 '23 00:05 alxspiker

@alxspiker Thanks for the information!

I pulled the latest version and privateGPT could ingest TChinese file now.

But I noticed one thing: it also prints a lot of gpt_tokenize: unknown token '' while replying to my question.

In addition, it isn't able to answer my questions about the article I ingested. Any suggestions?

cigoic avatar May 12 '23 01:05 cigoic

The thing is, langchain doesn't handle Chinese content very well.

Not exactly; it depends on which embedding model you chose.

Jeru2023 avatar May 13 '23 06:05 Jeru2023

The thing is, langchain doesn't handle Chinese content very well.

Not exactly; it depends on which embedding model you chose.

That actually makes a lot of sense. If you could somehow prompt the ingestion model to ingest in Chinese, maybe it would work better?

alxspiker avatar May 13 '23 07:05 alxspiker

"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.

Jeru2023 avatar May 13 '23 12:05 Jeru2023

GanymedeNil/text2vec-large-chinese

alanhe421 avatar May 14 '23 15:05 alanhe421

Is there any method to translate text files from Chinese to English before ingesting them?

smallyunet avatar May 15 '23 02:05 smallyunet

GanymedeNil/text2vec-large-chinese

It's a derivative of shibing624/text2vec-base-chinese that replaces MacBERT with LERT and keeps the other training conditions unchanged.

Jeru2023 avatar May 15 '23 03:05 Jeru2023

I encountered the same issue (too many tokens) with a short Arabic passage in the PaLM 2 Technical Report PDF recently published by Google, in which they extol how good it is at translation, using many non-English examples of its prowess: https://ai.google/static/documents/palm2techreport.pdf (palm2techreport.pdf), page 54. That page contains a small amount of English Latin text and a large amount of Arabic text. Incidentally, some research papers indicate that tokenizing Arabic is a 'hard' problem not well solved by current techniques.

It was difficult to extract a small section that demonstrates the failure because it seems to depend heavily on the chunking points. The passage came in at about 948 tokens when processed alone or with adjacent pages, but reached over 1100 as part of the whole PDF document.
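(If you want to check how many tokens a chunk actually produces before ingesting, a quick test with llama-cpp-python looks roughly like this; the file name here is just a placeholder, and the model path is whatever you have in your .env:)

```python
from llama_cpp import Llama

# load the same model used for embeddings, with the context size you plan to use
llm = Llama(model_path="./models/ggml-model-q4_0.bin", n_ctx=1536, embedding=True)

chunk = open("extracted_passage.txt", encoding="utf-8").read()
tokens = llm.tokenize(chunk.encode("utf-8"))
print(len(tokens))  # has to stay below n_ctx, otherwise tokenization/embedding fails
```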

I set the number of tokens in the .env file up from 1000 to 1536 (1.5K, a nice round binary number) and ingesting the whole file worked. Thanks @alxspiker for the suggestion.

Are there any adverse implications to increasing this value (it looked like some memory allocation increased by about half)?

johnbrisbin avatar May 16 '23 17:05 johnbrisbin

GanymedeNil/text2vec-large-chinese

I changed the .env file like this:

EMBEDDINGS_MODEL_NAME=GanymedeNil/text2vec-large-chinese

The error is:

gpt_tokenize: unknown token ''

hnuzhoulin avatar May 17 '23 17:05 hnuzhoulin

"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.

The same error repeats:

gpt_tokenize: unknown token ''

hnuzhoulin avatar May 17 '23 17:05 hnuzhoulin

"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.

The same error repeats:

gpt_tokenize: unknown token ''

Validate what you actually get at ingest.py line 75; make sure the parameter is loaded correctly.
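For example, a quick hypothetical check (the variable name is taken from the .env shown above):

```python
import os
from dotenv import load_dotenv

load_dotenv()
# confirm the Chinese embedding model you set is actually what gets loaded
print(os.environ.get("EMBEDDINGS_MODEL_NAME"))
```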

Jeru2023 avatar May 18 '23 02:05 Jeru2023

Can we set up a QQ group to discuss the Chinese issues? My QQ is 84095749.

zxjason avatar May 25 '23 19:05 zxjason

Try Faiss, see if it works.
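Something like this, for example (a rough sketch; building a fresh FAISS index avoids adding 768-dimensional vectors to a Chroma collection created with a 384-dimensional model, and it needs faiss-cpu installed):

```python
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="shibing624/text2vec-base-chinese")

# stand-in for the chunked documents produced by ingest.py
docs = [Document(page_content="狂密与真密 第一辑")]

db = FAISS.from_documents(docs, embeddings)
db.save_local("faiss_index")  # a new index, separate from the old Chroma db folder
```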

On Thu, 18 May 2023 at 01:18, hnuzhoulin wrote:

"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.

When I use this, the last error line is:

chromadb.errors.InvalidDimensionException: Dimensionality of (768) does not match index dimensionality (384)


Jeru2023 avatar May 27 '23 07:05 Jeru2023