private-gpt
Cannot ingest Chinese text file
Hi,
I've tried the following command:
python ingest.py someTChinese.txt
However, it returns the error shown below; any suggestions? I tried modifying the parameters (chunk_size=500, chunk_overlap=50), but nothing worked.
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size = 512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
llama_tokenize: too many tokens
Traceback (most recent call last):
File "/Users/jmosx/devp/allAboutGPT/privateGPT/ingest.py", line 22, in <module>
main()
File "/Users/jmosx/devp/allAboutGPT/privateGPT/ingest.py", line 17, in main
db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 412, in from_documents
return cls.from_texts(
^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 380, in from_texts
chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 158, in add_texts
embeddings = self._embedding_function.embed_documents(list(texts))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/embeddings/llamacpp.py", line 111, in embed_documents
embeddings = [self.client.embed(text) for text in texts]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/langchain/embeddings/llamacpp.py", line 111, in <listcomp>
embeddings = [self.client.embed(text) for text in texts]
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 514, in embed
return list(map(float, self.create_embedding(input)["data"][0]["embedding"]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 478, in create_embedding
tokens = self.tokenize(input.encode("utf-8"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jmosx/miniconda3/envs/privateGPT/lib/python3.11/site-packages/llama_cpp/llama.py", line 200, in tokenize
raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')
RuntimeError: Failed to tokenize: text="b'\xe7\x8b\x82\xe5\xaf\x86\xe8\x88\x87\xe7\x9c\x9f\xe5\xaf\x86\xe7\xac\xac\xe4\xb8\x80\xe8\xbc\xaf\n\n\xe3\x80\x80\n\n\xe5\xb9\xb3\xe5\xaf\xa6\xe5\xb1\x85\xe5\xa3\xab\xe8\x91\x97\n\n \xe6\x9c\x89\xe5\xb8\xab\xe4\xba\x91\xef\xbc\x9a\n\n \xe3\x80\x8c\xe5\xaf\x86\xe5\xae\x97\xe6\x98\xaf\xe4\xb8\x80\xe5\x80\x8b\xe9\x87\x91\xe5\x89\x9b\xe9\x91\xbd\xe5\xa4\x96\xe5\x9c\x8d\xe6\x93\xba\xe6\xbb\xbf\xe4\xba\x86\xe9\x8d\x8d\xe9\x87\x91\xe5\x9e\x83\xe5\x9c\xbe\xe7\x9a\x84\xe5\xae\x97\xe6\x95\x99\xe3\x80\x82\xe3\x80\x8d\n\n\xe3\x80\x80\n\n \xe5\xb9\xb3\xe5\xaf\xa6\xe6\x96\xbc\xe6\xad\xa4\xe8\xa8\x80\xe5\xbe\x8c\xe6\x9b\xb4\xe5\x8a\xa0\xe8\xa8\xbb\xe8\x85\xb3\xef\xbc\x9a\n\n \xe3\x80\x8c\xe9\x82\xa3\xe9\xa1\x86\xe9\x87\x91\xe5\x85\x89\xe9\x96\x83\xe9\x96\x83\xe7\x9a\x84\xe9\x91\xbd\xe7\x9f\xb3\xef\xbc\x8c\xe5\x8d\xbb\xe6\x98\xaf\xe7\x8e\xbb\xe7\x92\x83\xe6\x89\x93\xe7\xa3\xa8\xe8\x80\x8c\xe6\x88\x90\xef\xbc\x8c\xe4\xb8\x8d\xe5\xa0\xaa\xe6\xaa\xa2\xe9\xa9\x97\xe3\x80\x82\xe3\x80\x8d\n\n\xe3\x80\x80\n\n\xe8\xb7\x8b\n\n\xe6\x95\xb8\xe5\x8d\x81\xe5\xb9\xb4\xe4\xbe\x86\xef\xbc\x8c\xe8\xa5\xbf\xe8\x97\x8f\xe5\xaf\x86\xe5\xae\x97\xe7\xb3\xbb\xe7\xb5\xb1\xe8\xab\xb8\xe6\xb3\x95\xe7\x8e\x8b\xe8\x88\x87\xe8\xab\xb8\xe4\xb8\x8a\xe5\xb8\xab\xef\xbc\x8c\xe6\x82\x89\xe7\x9a\x86\xe5\x92\x90\xe5\x9b\x91\xe8\xab\xb8\xe5\xbc\x9f\xe5\xad\x90\xef\xbc\x9a\xe3\x80\x8c\xe8\x8b\xa5\xe6\x9c\x89\xe5\xa4\x96\xe4\xba\xba\xe8\xa9\xa2\xe5\x95\x8f\xe5\xaf\x86\xe5\xae\x97\xe6\x98\xaf\xe5\x90\xa6\xe6\x9c\x89\xe9\x9b\x99\xe8\xba\xab\xe4\xbf\xae\xe6\xb3\x95\xef\xbc\x9f\xe6\x87\x89\xe4\xb8\x80\xe5\xbe\x8b\xe7\xad\x94\xe5\xbe\xa9\xef\xbc\x9a\xe3\x80\x8e\xe5\x8f\xa4\xe6\x99\x82\xe5\xaf\x86\xe5\xae\x97\xe6\x9c\x89\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbd\x86\xe7\x8f\xbe\xe5\x9c\xa8\xe5\xb7\xb2\xe7\xb6\x93\xe5\xbb\xa2\xe6\xa3\x84\xe4\xb8\x8d\xe7\x94\xa8\xe3\x80\x82\xe7\x8f\xbe\xe5\x9c\xa8\xe4\xb8\x8a\xe5\xb8\xab\xe9\x83\xbd\xe5\x9a\xb4\xe7\xa6\x81\xe5\xbc\x9f\xe5\xad\x90\xe5\x80\x91\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe6\x89\x80\xe4\xbb\xa5\xe7\x8f\xbe\xe5\x9c\xa8\xe7\x9a\x84\xe5\xaf\x86\xe5\xae\x97\xe5\xb7\xb2\xe7\xb6\x93\xe6\xb2\x92\xe6\x9c\x89\xe4\xba\xba\xe5\xbc\x98\xe5\x82\xb3\xe5\x8f\x8a\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe4\xba\x86\xe3\x80\x82\xe3\x80\x8f\xe3\x80\x8d\n\n\xe7\x84\xb6\xe8\x80\x8c\xe5\xaf\xa6\xe9\x9a\x9b\xe4\xb8\x8a\xef\xbc\x8c\xe8\xa5\xbf\xe8\x97\x8f\xe5\xaf\x86\xe5\xae\x97\xe4\xbb\x8d\xe6\x9c\x89\xe7\x94\x9a\xe5\xa4\x9a\xe4\xb8\x8a\xe5\xb8\xab\xe7\xb9\xbc\xe7\xba\x8c\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xb8\xa6\xe9\x9d\x9e\xe6\x9c\xaa\xe5\x82\xb3\xef\xbc\x8c\xe4\xba\xa6\xe9\x9d\x9e\xe7\xa6\x81\xe5\x82\xb3\xef\xbc\x8c\xe8\x80\x8c\xe6\x98\xaf\xe7\xb9\xbc\xe7\xba\x8c\xe5\x9c\xa8\xe7\x89\xa9\xe8\x89\xb2\xe9\x81\xa9\xe5\x90\x88\xe4\xbf\xae\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe4\xb9\x8b\xe5\xbc\x9f\xe5\xad\x90\xef\xbc\x8c\xe7\x82\xba\xe5\xbd\xbc\xe7\xad\x89\xe8\xab\xb8\xe4\xba\xba\xe6\x9a\x97\xe4\xb8\xad\xe4\xbd\x9c\xe7\xa7\x98\xe5\xaf\x86\xe7\x81\x8c\xe9\xa0\x82\xef\xbc\x8c\xe4\xbb\xa4\xe5\x85\xb6\xe6\x88\x90\xe7\x82\xba\xe5\x8b\x87\xe7\x88\xb6\xe8\x88\x87\xe7\xa9\xba\xe8\xa1\x8c\xe6\xaf\x8d\xef\xbc\x8c\xe7\x84\xb6\xe5\xbe\x8c\xe5\x90\x88\xe4\xbf\xae\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82\xe4\xbb\x8a\xe6\x99\x82\xe5\xaf\x86\xe5\xae\x97\xe5\xa6\x82\xe6\x98\xaf\xe7\xb9\xbc\xe7\xba\x8c\xe6\x9a\x97\xe4\xb8\xad\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\
x95\xef\xbc\x8c\xe4\xbb\xa5\xe5\xbb\xb6\xe7\xba\x8c\xe5\x85\xb6\xe5\xaf\x86\xe6\xb3\x95\xe4\xb9\x8b\xe5\x91\xbd\xe8\x84\x88\xef\xbc\x9b\xe7\x9d\xbd\xe4\xba\x8e\xe9\x9b\xbb\xe8\xa6\x96\xe6\x96\xb0\xe8\x81\x9e\xe4\xb9\x8b\xe5\xa0\xb1\xe5\xb0\x8e\xe9\x99\xb3\xe5\xb1\xa5\xe5\xae\x89\xe5\x85\xac\xe5\xad\x90\xe9\x99\xb3\xe5\xae\x87\xe5\xbb\xb7\xef\xbc\x8c\xe6\xb1\x82\xe8\x97\x8f\xe5\xa5\xb3\xe7\x82\xba\xe5\x85\xb6\xe7\xa9\xba\xe8\xa1\x8c\xe6\xaf\x8d\xe7\xad\x89\xe8\xa8\x80\xef\xbc\x8c\xe4\xba\xa6\xe5\x8f\xaf\xe7\x9f\xa5\xe7\x9f\xa3\xe3\x80\x82\xe6\x98\xaf\xe6\x95\x85\xe5\xaf\x86\xe5\xae\x97\xe5\xbc\x9f\xe5\xad\x90\xe8\x88\x87\xe4\xb8\x8a\xe5\xb8\xab\xe7\xad\x89\xe4\xba\xba\xef\xbc\x8c\xe9\x9b\x96\xe7\x84\xb6\xe5\xb0\x8d\xe5\xa4\x96\xe5\x8f\xa3\xe5\xbe\x91\xe4\xb8\x80\xe8\x87\xb4\xef\xbc\x8c\xe6\x82\x89\xe7\x9a\x86\xe5\x80\xa1\xe8\xa8\x80\xef\xbc\x9a\xe3\x80\x8c\xe5\xaf\x86\xe5\xae\x97\xe7\x8f\xbe\xe5\x9c\xa8\xe5\xb7\xb2\xe6\x8d\xa8\xe6\xa3\x84\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xef\xbc\x8c\xe4\xbb\x8a\xe5\xb7\xb2\xe7\x84\xa1\xe4\xba\xba\xe5\xbc\x98\xe5\x82\xb3\xe6\x88\x96\xe4\xbf\xae\xe5\xad\xb8\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82\xe3\x80\x8d\xe5\x85\xb6\xe5\xaf\xa6\xe6\x98\xaf\xe6\xac\xba\xe7\x9e\x9e\xe7\xa4\xbe\xe6\x9c\x83\xe4\xb9\x8b\xe8\xa8\x80\xef\xbc\x8c\xe4\xbb\xa5\xe6\xad\xa4\xe4\xbb\xa4\xe7\xa4\xbe\xe6\x9c\x83\xe5\xa4\xa7\xe7\x9c\xbe\xe5\xb0\x8d\xe5\xaf\x86\xe5\xae\x97\xe4\xb8\x8d\xe8\x87\xb4\xe5\x8a\xa0\xe4\xbb\xa5\xe5\xa4\xaa\xe5\xa4\x9a\xe4\xb9\x8b\xe6\xb3\xa8\xe6\x84\x8f\xef\xbc\x8c\xe4\xbb\xa5\xe4\xbe\xbf\xe7\xb9\xbc\xe7\xba\x8c\xe6\x9a\x97\xe4\xb8\xad\xe5\xbc\x98\xe5\x82\xb3\xe9\x9b\x99\xe8\xba\xab\xe6\xb3\x95\xe3\x80\x82'" n_tokens=-723
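For context, chunk_size and chunk_overlap count characters, not tokens, and Chinese characters frequently tokenize to more than one llama token each, so a 500-character chunk can exceed the 512-token context shown in the log above (the n_tokens=-723 in the traceback is a chunk that needed more tokens than that context allowed). A minimal sketch of this kind of splitter call, assuming langchain's RecursiveCharacterTextSplitter; the actual ingest.py in a given checkout may differ:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Character-based splitting: chunk_size=500 means roughly 500 characters per chunk.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
with open("someTChinese.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

# Chunks stay under 500 characters, but their token counts can still blow
# past a 512-token embedding context for CJK text.
print(len(chunks), max(len(c) for c in chunks))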
That's how it is: langchain doesn't handle Chinese content very well.
Could it be anything to do with the model as well? The default model suggested in this repo is the Hugging Face groovy model (English). I don't know how good it is with Chinese.
That's how it is: langchain doesn't handle Chinese content very well.
That's such a pity.
If you are using the most recent branch, try again, as the token limit has been increased from 512 to 1000. I noticed that when it produces Chinese text it takes a lot more time to print each character, so I would assume it takes more tokens to read it too.
@cigoic At the very end of your error
n_tokens=-723
which is because the default before commit #58 (which fixes an error from commit #53) was 512 tokens.
If you do exactly what you did before with the new default settings, it should work fine, as the new default is 1000 tokens. Even with English I was having trouble with only 512 tokens.
If that is still not enough, edit your .env file and change 'MODEL_N_CTX=1000' to a higher number.
Problem solved!
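For reference, a minimal sketch of how a MODEL_N_CTX value from .env would reach the llama.cpp embedder; the variable name matches the suggestion above, but the rest of the wiring is an assumption and the real ingest.py may differ:

import os
from dotenv import load_dotenv
from langchain.embeddings import LlamaCppEmbeddings

load_dotenv()
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))

# n_ctx defaults to 512 in the embedder; without an override, chunks longer
# than 512 tokens fail with "llama_tokenize: too many tokens".
llama = LlamaCppEmbeddings(
    model_path="./models/ggml-model-q4_0.bin",
    n_ctx=model_n_ctx,
)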
@alxspiker Thanks for the information!
I pulled the latest version and privateGPT can ingest the Traditional Chinese file now.
But I notice one thing: it also prints a lot of gpt_tokenize: unknown token '' while replying to my question.
In addition, it isn't able to answer my questions about the article I ingested. Any suggestions?
That's how it is: langchain doesn't handle Chinese content very well.
Not exactly; it depends on which embedding model you chose.
That actually makes a lot of sense. If you could somehow even just prompt the ingestion model to ingest in Chinese, maybe that would work better?
"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.
GanymedeNil/text2vec-large-chinese
Is there any method to translate text files from Chinese to English before ingesting them?
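One possible approach, not taken from this thread, is to machine-translate the file first with a MarianMT model such as Helsinki-NLP/opus-mt-zh-en via Hugging Face transformers; treat this as a rough sketch rather than a recommendation from the maintainers:

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-zh-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(lines):
    # Translate a batch of Chinese lines to English before ingestion.
    batch = tokenizer(lines, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["這是一段測試文字。"]))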
GanymedeNil/text2vec-large-chinese
It's a derivative of shibing624/text2vec-base-chinese that replaces MacBERT with LERT while keeping the other training conditions unchanged.
I encountered the same issue (too many tokens) with a short Arabic passage in the PaLM 2 Technical Report PDF recently published by Google, in which they extol how good the model is at translation, using many non-English examples of its prowess: https://ai.google/static/documents/palm2techreport.pdf, page 54. That page contains a small amount of English Latin-script text and a large amount of Arabic text. By the way, some research papers indicate that tokenizing Arabic is a 'hard' problem not well solved by current techniques.
It was difficult to extract a small section that demonstrates the failure because it seems to depend heavily on the chunking points. The passage would come in at about 948 tokens when processed alone or with adjacent pages, but reached over 1100 when the whole PDF document was processed.
I set the number of tokens in the .env file up from 1000 to 1536 (1.5K, a nice round binary number) and ingesting the whole file worked. Thanks @alxspiker for the suggestion.
Are there any adverse implications to increasing this value (it looked like some memory allocation had increased by about 1/2)?
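On the memory question: in the log at the top of this issue the KV cache ('kv self size') is 512.00 MB at n_ctx = 512, and that allocation grows roughly in proportion to the context size, which matches the ~1.5x increase seen when going from 1000 to 1536. To see how close a given chunk gets to the limit, you can count tokens directly with llama-cpp-python; this is a sketch, and the model path and n_ctx value are assumptions:

from llama_cpp import Llama

llm = Llama(model_path="./models/ggml-model-q4_0.bin", n_ctx=1536, embedding=True)

text = "..."  # paste a chunk that previously failed to ingest
tokens = llm.tokenize(text.encode("utf-8"))
# The count must stay under MODEL_N_CTX, otherwise llama_tokenize fails.
print(len(tokens))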
GanymedeNil/text2vec-large-chinese
I changed the .env file like this:
EMBEDDINGS_MODEL_NAME=GanymedeNil/text2vec-large-chinese
The error is:
gpt_tokenize: unknown token ''
"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.
same error repeat:
gpt_tokenize: unknown token ''
"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.
same error repeat:
gpt_tokenize: unknown token ''
Validate what you actually get at ingest.py line 75 and make sure the parameter is loaded correctly.
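A quick way to check is to print the values after loading the .env file; this sketch assumes the repo uses python-dotenv, and the variable names follow the comments above:

import os
from dotenv import load_dotenv

load_dotenv()
# Should print the model name you set, e.g. GanymedeNil/text2vec-large-chinese.
print(os.environ.get("EMBEDDINGS_MODEL_NAME"))
print(os.environ.get("MODEL_N_CTX"))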
Can we set up a QQ group to discuss the Chinese issues? My QQ is 84095749.
Try Faiss and see if it works.
On Thu, 18 May 2023 at 01:18, hnuzhoulin wrote:
"shibing624/text2vec-base-chinese" is the SOTA sentence embedding model for Chinese vector so far.
when I use this,last error line is:
chromadb.errors.InvalidDimensionException: Dimensionality of (768) does not match index dimensionality (384)
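About the InvalidDimensionException quoted above: the existing Chroma index in the db directory was most likely built with the previous 384-dimensional embedding model, so 768-dimensional text2vec vectors cannot be added to it; deleting the persist directory and re-ingesting, or switching stores as suggested, avoids the mismatch. A minimal FAISS sketch using langchain's wrapper (requires faiss-cpu; the document list is a placeholder):

from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="shibing624/text2vec-base-chinese")
docs = [Document(page_content="一段中文測試文字")]

# FAISS builds a brand-new index sized to whatever the embedding model
# produces, so there is no stale 384-dimensional index to collide with.
db = FAISS.from_documents(docs, embeddings)
db.save_local("faiss_index")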