Added Dockerfile and edited the instructions for it

Open rpj09 opened this issue 2 years ago • 8 comments

Closes #40. I have added the Dockerfile and added instructions for it.

rpj09 avatar Mar 16 '23 13:03 rpj09

Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.

csris avatar Mar 18 '23 04:03 csris

@rpj09 are you still working on this? I'd like to help. I can make some time to work on it

orangetin avatar Apr 22 '23 04:04 orangetin

@rpj09 are you still working on this? I'd like to help. I can make some time to work on it

Yeah, sure. I actually got busy with semester exams.

rpj09 avatar Apr 22 '23 05:04 rpj09

Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.

Hey @csris, apologies for replying this late.

When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can help make this process much smoother.

To implement this approach, you can create a separate container that is responsible for caching and storing the dataset. This container can use a caching system like docker-cache or Squid to ensure that the dataset is always available and up to date across all containers that use it.

To update the dataset, you can set up a script that runs periodically, checking for new data and downloading it if necessary. Then, when launching the main container, you can mount the dataset volume from the caching container into the main container. This way, the main container can access the cached dataset without having to download it again, making the process much faster.

Overall, a caching system makes working with large, frequently updated datasets in Docker containers much more manageable: it minimizes redownloads and keeps the dataset available and up to date across containers.
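As a rough sketch of this idea using a named Docker volume and a one-off downloader container (the volume name, image name, and script path here are assumptions for illustration, not part of this PR):

# Named volume that holds the cached datasets
docker volume create openchatkit-data

# One-off cache refresh: download new data into the volume.
# Re-run this periodically (e.g. from cron) to keep the cache current.
docker run --rm --volume openchatkit-data:/data openchatkit \
    python data/OIG/prepare.py    # hypothetical prepare-script path

# The main container mounts the same volume and reads the cached data,
# so nothing is redownloaded at launch
docker run -it --rm --volume openchatkit-data:/data openchatkit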

rpj09 avatar Apr 22 '23 05:04 rpj09

I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile. I've tested it and it works.

Dockerfile

# Base image
FROM ubuntu:20.04
VOLUME /app

# Set working directory
WORKDIR /app

# Update and install required packages
RUN apt-get update && \
    apt-get install -y git-lfs wget gcc && \
    rm -rf /var/lib/apt/lists/*

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /app/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/app/conda/bin:${PATH}

# Create OpenChatKit environment
COPY environment.yml .
RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f environment.yml

# Set conda to automatically activate the OpenChatKit environment on login
RUN echo ". /app/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate OpenChatKit" >> ~/.bashrc

# Copy OpenChatKit code
COPY . .

# Optional: prepare for fine-tuning by installing Git LFS
# RUN git lfs install

# Set entrypoint to bash shell
ENTRYPOINT ["/bin/bash"]

Right now it just starts a bash shell; we could modify it to directly start training/inference and select a model with optional args.
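For illustration, a dispatcher along those lines might look like the sketch below; the script name, the prepare/train script paths, and the flag handling are assumptions for the example, not an actual implementation:

#!/bin/bash
# entrypoint.sh -- hypothetical dispatcher run by the container
. /app/conda/etc/profile.d/conda.sh
conda activate OpenChatKit

case "$1" in
  prepare)
    shift
    python data/OIG/prepare.py "$@" ;;                      # assumed data-prep entry point
  train)
    shift
    bash training/finetune_Pythia-Chat-Base-7B.sh "$@" ;;   # assumed training script
  *)
    python inference/bot.py "$@" ;;                         # assumed inference entry point
esac

The Dockerfile would then point at the script, e.g. ENTRYPOINT ["/app/entrypoint.sh"], instead of /bin/bash.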

Want me to make a PR to your branch?

orangetin avatar Apr 22 '23 06:04 orangetin

Sure @orangetin

rpj09 avatar Apr 22 '23 11:04 rpj09

I got the Dockerfile working using a somewhat different method. Instead of just opening up an empty shell, I wrote a bash script that executes when the Docker container is run, which then runs the required scripts for prepping the data, training, and/or command-line inference.

Plus, with the volume method mentioned above, it'll be easy to handle downloading the required datasets.

In the meantime, @csris, should I open a new PR for that or merge it in here?

Here's the branch: https://github.com/orangetin/OpenChatKit/tree/docker. It modifies the original Dockerfile and adds a new bash script.

Build command: sudo docker build -t openchatkit .

~~Sample run command: sudo docker run -it --rm --volume $(pwd):/app openchatkit --model togethercomputer/Pythia-Chat-Base-7B~~

EDIT: I've updated the files to use micromamba instead of miniconda/mamba because launching the container took forever with miniconda.
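For context, a micromamba-based build typically looks something like this minimal sketch (based on the mambaorg/micromamba image; the actual branch may differ):

# Hypothetical micromamba variant; see the linked branch for the real file
FROM mambaorg/micromamba:latest    # pin a specific tag in practice
COPY environment.yml /tmp/environment.yml
RUN micromamba install -y -n base -f /tmp/environment.yml && \
    micromamba clean --all --yes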

~~EDIT 2: Other sample commands:~~

~~sudo docker run -it --rm openchatkit prepare --bitsandbytes # run prepare scripts and install bitsandbytes~~
~~sudo docker run -it --rm openchatkit train --model gpt-neox # train the gpt-neox model~~
~~sudo docker run -it --rm openchatkit train # defaults model to 'pythia'~~
~~sudo docker run -it --rm openchatkit --model togethercomputer/GPT-NeoXT-Chat-Base-20B~~

orangetin avatar Apr 22 '23 23:04 orangetin

I've edited the branch mentioned above. I was able to shrink the image size from 20.5 GB to 13.8 GB by clearing the conda cache.

Here are the updated commands:

Inference:

sudo docker create -i -t --name inference --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit --model togethercomputer/Pythia-Chat-Base-7B

sudo docker start inference -a

Prepare for training:

sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit prepare

Train:

sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit train -m pythia

^ The cache directory is where Hugging Face saves downloaded models. By mounting it as a volume, the downloaded model can be shared by multiple containers. This also makes it efficient to launch multiple containers (like multiple inference instances) concurrently without using more disk space.
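For example, two inference containers could share one model cache like this (the container names are arbitrary; the image and flags match the commands above):

sudo docker create -i -t --name inference-1 --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit --model togethercomputer/Pythia-Chat-Base-7B
sudo docker create -i -t --name inference-2 --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit --model togethercomputer/Pythia-Chat-Base-7B

# Start each in its own terminal; the weights in /TEMPCACHE are downloaded only once
sudo docker start inference-1 -a
sudo docker start inference-2 -a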

This needs more testing though.

Edit: The above method successfully loads the model onto the GPU/CPU but does not produce outputs; it goes into an 'EOF' error loop. The reason is that Docker doesn't play nicely with bash inputs from a Python script inside a container. A Docker container should work for training, but performance may not be the best. I'd say we should wait until we have a working Gradio interface before continuing with this for inference.
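One detail that may be related (an assumption, not something verified in this thread): docker start -a attaches only the container's output streams, so a Python input() call inside the container sees a closed stdin and raises EOFError. Attaching stdin explicitly would look like:

sudo docker start -ai inference    # -i keeps stdin open for interactive prompts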

orangetin avatar Apr 24 '23 14:04 orangetin