EmbedAI
gpt_tokenize: unknown token '?
from flask import Flask, jsonify, render_template, flash, redirect, url_for, Markup, request

gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
LLM0 GPT4All Params: {'model': 'models/ggml-gpt4all-j-v1.3-groovy.bin', 'n_predict': 256, 'n_threads': 4, 'top_k': 40, 'top_p': 0.95, 'temp': 0.8}
 * Serving Flask app 'privateGPT'
 * Debug mode: off
[2023-05-31 10:39:11,833] {_internal.py:186} INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.253.1.21:5000
[2023-05-31 10:39:11,834] {_internal.py:186} INFO - Press CTRL+C to quit
Loading documents from source_documents
Loaded 1 documents from source_documents
Split into 90 chunks of text (max. 500 characters each)
[2023-05-31 10:39:47,710] {_internal.py:186} INFO - 127.0.0.1 - - [31/May/2023 10:39:47] "GET /ingest HTTP/1.1" 200 -
[2023-05-31 10:40:04,057] {_internal.py:186} INFO - 127.0.0.1 - - [31/May/2023 10:40:04] "OPTIONS /get_answer HTTP/1.1" 200 -
gpt_tokenize: unknown token '?
gpt_tokenize: unknown token '€'
gpt_tokenize: unknown token '?
(the two "unknown token" messages above repeat many more times)
How can I fix the issue?
For the important_tokens which contain several actual words (like frankie_and_bennys), you can replace the underscore with a space and feed them normally, or add them as a special token. I prefer the first option, because this way you can use the pre-trained embeddings for their subtokens. For the ones which aren't actual words (like cb17dy), you must add them as special tokens.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

your_string = '[PRED] name [SUB] frankie and bennys frankie_and_bennys [PRED] cb17dy'

SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": [
        "[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]",
        "[TRIPLE]", "[SEP]", "[Q]", "[DOM]",
        'frankie_and_bennys', 'cb17dy',
    ],
}

tokenizer.add_special_tokens(SPECIAL_TOKENS)

print(tokenizer(your_string)['input_ids'])
print(tokenizer.convert_ids_to_tokens(tokenizer(your_string)['input_ids']))
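If you prefer the first option (replacing underscores with spaces so the pre-trained subword embeddings are reused), a minimal sketch could look like the following; the multi_word_tokens list is a hypothetical placeholder for whichever underscore-joined names occur in your data:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Hypothetical list of placeholders that are real words joined by underscores.
multi_word_tokens = ["frankie_and_bennys"]

def normalize(text: str) -> str:
    # Replace each underscore-joined placeholder with its space-separated form
    # so the pre-trained BPE vocabulary can tokenize the individual words.
    for tok in multi_word_tokens:
        text = text.replace(tok, tok.replace("_", " "))
    return text

your_string = "[PRED] name [SUB] frankie_and_bennys [PRED] cb17dy"
print(tokenizer.tokenize(normalize(your_string)))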
This looks like a common issue with Python 3.8. You can upgrade to Python 3.10 and it should work.
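If you want the app to fail fast on an unsupported interpreter, a minimal guard near the top of the entry script could look like this (assuming the 3.10+ requirement suggested above; adjust the version as needed):

import sys

# Assumption: Python 3.10 or newer is required, per the suggestion above.
if sys.version_info < (3, 10):
    raise RuntimeError(
        "Python %d.%d detected; please upgrade to Python 3.10 or newer "
        "before running the app." % sys.version_info[:2]
    )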
I upgraded Python to 3.11 and it works. Thanks!