
gpt_tokenize: unknown token ''

Open hajsf opened this issue 1 year ago • 6 comments

I'm trying to run PrivateGPT from Docker, so I created the files below:

  1. Dockerfile:
# Use the python-slim version of Debian as the base image
FROM python:slim

# Update the package index and install any necessary packages
RUN apt-get update -y 
RUN apt-get install -y gcc build-essential gfortran pkg-config libssl-dev g++
RUN pip3 install --upgrade pip
RUN apt-get clean

# Set the working directory to /app
WORKDIR /app

# Copy the requirements.txt and script.py files into the container
COPY requirements.txt .

RUN pip3 install --no-cache-dir -r requirements.txt --force-reinstall

COPY /script /app
COPY /models /app/models
COPY /knowledge /app/knowledge

EXPOSE 5000

# Set the default command to run when the container starts
# In this case, we're using the tail command to continuously output the contents of /dev/null
CMD ["tail", "-f", "/dev/null"]
  2. docker-compose.yml:
version: '3.8' # Specify the version of the docker-compose file format

services: # Define the services that make up your application
  app: # Name of the service
    build: # Configuration for building the Docker image for this service
      context: . # Path to the directory containing the Dockerfile
      dockerfile: Dockerfile # Name of the Dockerfile to use
    image: my-repo/my-image-name # Name of the Docker image to use or build
    container_name: my-container-name # Name of the container to create
  3. .env (how these values are typically read is sketched after these steps):
KNOWLEDGE_PATH=/app/knowledge
PERSIST_DIRECTORY=/app/db
MODEL_TYPE=GPT4All
MODEL_PATH=/app/models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L12-v2
MODEL_N_CTX=1000
  4. requirements.txt:
transformers==4.29.2
torch==2.0.1
numexpr==2.8.4
langchain==0.0.171
pygpt4all==1.1.0
chromadb==0.3.23
llama-cpp-python==0.1.50
urllib3==2.0.2
pdfminer.six==20221105
flask==2.3.2
nicegui==1.2.14
streamlit==1.22.0
streamlit-extras==0.2.7
  5. Downloaded the model:
Invoke-WebRequest -Uri "https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin" -OutFile "models\ggml-gpt4all-j-v1.3-groovy.bin"
  6. Downloaded the [state_of_the_union.txt](https://github.com/imartinez/privateGPT/blob/main/source_documents/state_of_the_union.txt) file

  7. Ran the below:

docker-compose up -d --build
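
For reference, here is a minimal sketch of how settings like the ones in the .env above are typically picked up by the ingest/query scripts. It assumes python-dotenv is available and that the scripts use these exact variable names, which depends on the privateGPT version, so adjust it to match your ingest.py and privateGPT.py:

# Minimal sketch: reading the .env settings shown above.
# Assumes python-dotenv; variable names must match what your scripts expect.
import os
from dotenv import load_dotenv

load_dotenv()  # picks up /app/.env when run from the WORKDIR

model_type = os.environ.get("MODEL_TYPE")                   # e.g. GPT4All
model_path = os.environ.get("MODEL_PATH")                   # path to the .bin model file
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1000"))    # context window size
embeddings_model = os.environ.get("EMBEDDINGS_MODEL_NAME")  # sentence-transformers model
persist_directory = os.environ.get("PERSIST_DIRECTORY")     # where Chroma stores its DB

print(model_type, model_path, model_n_ctx, embeddings_model, persist_directory)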

The Docker image was built successfully and the container ran as well:


At the container terminal I ran the below successfully:

# python ingest.py
# python privateGPT.py
 Enter a query: exit

When I tried to enter another query, I got the error: gpt_tokenize: unknown token ''


hajsf avatar May 20 '23 12:05 hajsf

See https://github.com/imartinez/privateGPT/issues/180 and https://github.com/imartinez/privateGPT/issues/214. This is a duplicate of many other issues.

PulpCattel avatar May 20 '23 12:05 PulpCattel

I have this too. I notice that when I vary the input string, the number of unknown-token errors changes on my system. So I think it has something to do with LangChain's processing of the input string.

I went back to the sample GPT4All program and used it to just read the doc. Got no token errors, but I did have to clean the doc a bit to read it into Python. The state of the union text should be cleaned on the GitHub portal.

I ran this on an i7-8865U @ 1.9 GHz (4 cores, 8 logical) and it still took 5 minutes to run the sample program. Still a bit slow, but it works. We probably need to look at what chromadb shovels over to it.

from gpt4all import GPT4All

with open("./source_documents/state_of_the_union.txt") as f:
    text1 = f.read()

gptj = GPT4All("ggml-gpt4all-j-v1.3-groovy", "./models/")
messages = [{"role": "user", "content": "summerize the following text: " + text1[:2000]}]
res = gptj.chat_completion(messages, streaming=False)

print(res)

jon2allen avatar May 20 '23 15:05 jon2allen
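
On the last point, here is a minimal sketch of one way to see what chromadb actually hands over to the model for a query. It assumes the LangChain/Chroma versions from the requirements.txt above, an installed sentence-transformers, and the persist directory written by ingest.py; the path and query string are illustrative:

# Minimal sketch: inspect the chunks retrieved from the Chroma store for a query.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L12-v2")
db = Chroma(persist_directory="/app/db", embedding_function=embeddings)

docs = db.similarity_search("What did the president say about Ukraine?", k=4)
for i, doc in enumerate(docs):
    print(f"--- chunk {i} ({len(doc.page_content)} chars) ---")
    print(repr(doc.page_content[:200]))  # repr() makes any odd characters visible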

(Quotes the original issue description above.)

https://github.com/imartinez/privateGPT/issues/328#issue-1718160410

Hunter-Stack avatar May 20 '23 19:05 Hunter-Stack

I think there are some strange characters in the default text provided by the author. I changed the content and the error disappeared.

Ellen7ions avatar May 21 '23 07:05 Ellen7ions
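
If the problem is indeed odd characters (curly quotes, long dashes, and so on) in state_of_the_union.txt, a minimal sketch of one way to clean the file before ingesting it could look like the following; the file name is illustrative and this is not necessarily the exact change made above:

# Minimal sketch: strip non-ASCII characters from the source document before ingesting.
import unicodedata

with open("source_documents/state_of_the_union.txt", encoding="utf-8", errors="replace") as f:
    text = f.read()

# Normalize first so accented characters decompose, then keep printable ASCII plus whitespace.
text = unicodedata.normalize("NFKD", text)
cleaned = "".join(ch for ch in text if (ch.isascii() and ch.isprintable()) or ch in "\n\t")

with open("source_documents/state_of_the_union.txt", "w", encoding="ascii") as f:
    f.write(cleaned)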

I think there are some strange characters in the default text provided by the author. I changed the content and the error disappeared.

Can you share the content you used, please, so I can check it? Thanks.

hajsf avatar May 21 '23 07:05 hajsf
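
In the meantime, here is a minimal sketch for checking the file yourself, listing every character that falls outside plain ASCII (the path is illustrative):

# Minimal sketch: count characters outside plain ASCII in the source document.
from collections import Counter

with open("source_documents/state_of_the_union.txt", encoding="utf-8") as f:
    text = f.read()

suspicious = Counter(ch for ch in text if not ch.isascii())
for ch, count in suspicious.most_common():
    print(f"U+{ord(ch):04X} {ch!r}: {count}")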

Fantastic Dockerfile, can you make a repo for that?

zxjason avatar May 23 '23 20:05 zxjason