Added Dockerfile and edited the instructions for it
Closes #40. I have added the Dockerfile and updated the instructions for it.
Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.
@rpj09 are you still working on this? I'd like to help. I can make some time to work on it
Yeah, sure. Actually, I got busy with semester exams.
Hey @csris, apologies for replying this late.
When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can make this process much smoother.
To implement this approach, you can create a separate container responsible for caching and storing the dataset. That container can use a caching system such as docker-cache or Squid so the dataset is always available and up to date across all containers that use it.
To keep the dataset current, you can set up a script that runs periodically, checks for new data, and downloads it if necessary. Then, when launching the main container, you mount the dataset volume from the caching container into the main container. That way, the main container can access the cached dataset without downloading it again, making startup much faster.
Overall, a caching system makes working with large, frequently updated datasets in Docker much more manageable: it minimizes redownloads and keeps the dataset available and up to date across containers.
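For the volume-based part, here's a minimal sketch, assuming a named volume and a placeholder dataset URL (neither is part of this PR):

```bash
# Minimal sketch: the volume name, image tag, and dataset URL below are
# illustrative assumptions, not taken from this PR.
docker volume create dataset-cache

# Refresh step (run periodically, e.g. from cron). wget -N only
# redownloads when the remote file is newer than the cached copy.
docker run --rm --volume dataset-cache:/data ubuntu:20.04 \
  bash -c "apt-get update && apt-get install -y wget && \
           wget -N -P /data https://example.com/dataset.tar.gz"

# The main container mounts the cache read-only, so launching it
# does not trigger a fresh download.
docker run -it --rm --volume dataset-cache:/data:ro openchatkit
```

With `wget -N`, the refresh step only transfers the file when the remote copy is newer than the cached one, so repeated runs are cheap.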
I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile. I've tested it, and it works:
```dockerfile
# Base image
FROM ubuntu:20.04

VOLUME /app

# Set working directory
WORKDIR /app

# Update and install required packages
RUN apt-get update && \
    apt-get install git-lfs wget gcc -y && \
    rm -rf /var/lib/apt/lists/*

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /app/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/app/conda/bin:${PATH}

# Create OpenChatKit environment
COPY environment.yml .
RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f environment.yml

# Automatically activate the OpenChatKit environment on login
RUN echo ". /app/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate OpenChatKit" >> ~/.bashrc

# Copy OpenChatKit code
COPY . .

# Optional: prepare for fine-tuning by installing Git LFS
# RUN git lfs install

# Set entrypoint to a bash shell
ENTRYPOINT ["/bin/bash"]
```
Right now it just starts a bash shell; we could modify it to start training/inference directly and select a model with optional args, along the lines of the sketch below.
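As a rough sketch (not taken from the branch; the script paths are my assumptions about the repo layout), the entrypoint could dispatch on its first argument:

```bash
#!/bin/bash
# Hypothetical entrypoint sketch; the script paths are assumptions
# about the repo layout, not taken from the actual branch.
set -e
source /app/conda/etc/profile.d/conda.sh
conda activate OpenChatKit

case "$1" in
  prepare)
    shift
    python /app/data/OIG/prepare.py "$@"    # assumed data-prep script
    ;;
  train)
    shift
    bash /app/training/finetune.sh "$@"     # assumed training script
    ;;
  *)
    # Default: command-line inference, forwarding flags like --model.
    python /app/inference/bot.py "$@"
    ;;
esac
```

The Dockerfile's `ENTRYPOINT` would then point at this script instead of `/bin/bash`.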
Want me to make a PR to your branch?
Sure @orangetin
I got the Dockerfile working using a somewhat different method. Instead of just opening an empty shell, I wrote a bash script that executes when the container is run and then invokes the required scripts for data prep, training, and/or command-line inference.
Plus, with the volume method mentioned above, it'll be easy to handle downloading the required datasets.
In the meantime, @csris, should I open a new PR for that or merge it in here?
Here's the branch: https://github.com/orangetin/OpenChatKit/tree/docker. It modifies the original Dockerfile and adds a new bash script.
Build command:
```bash
sudo docker build -t openchatkit .
```
~~Sample run command:
`sudo docker run -it --rm --volume $(pwd):/app openchatkit --model togethercomputer/Pythia-Chat-Base-7B`~~
EDIT: I've updated the files to use micromamba instead of miniconda/mamba because launching the container took forever with miniconda.
~~EDIT 2:
Other sample commands:
`sudo docker run -it --rm openchatkit prepare --bitsandbytes` # run prepare scripts and install bitsandbytes
`sudo docker run -it --rm openchatkit train --model gpt-neox` # train the gpt-neox model
`sudo docker run -it --rm openchatkit train` # defaults model to 'pythia'
`sudo docker run -it --rm openchatkit --model togethercomputer/GPT-NeoXT-Chat-Base-20B`~~
I've edited the branch mentioned above. I was able to shrink the image size from 20.5 GB to 13.8 GB by clearing the conda cache.
Here are the updated commands:
Inference:
```bash
sudo docker create -i -t --name inference --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit --model togethercomputer/Pythia-Chat-Base-7B
sudo docker start inference -a
```
Prepare for training:
```bash
sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit prepare
```
Train:
```bash
sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit train -m pythia
```
^ The cache directory is where Hugging Face saves downloaded models. By mounting it as a volume, the downloaded model can be shared by multiple containers. This also makes it efficient to launch multiple containers (like multiple inference instances) concurrently without using up more disk space.
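For example, a second inference instance can reuse the same cache volume (a sketch based on the commands above; the container name `inference2` is illustrative):

```bash
# Hypothetical second inference container sharing the same Hugging Face
# cache volume, so model weights are downloaded and stored only once.
sudo docker create -i -t --name inference2 \
  --volume $(pwd):/app \
  --volume /TEMPCACHE:/root/.cache \
  openchatkit --model togethercomputer/Pythia-Chat-Base-7B
sudo docker start inference2 -a
```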
This needs more testing though.
Edit: The above method successfully loads the model onto the GPU/CPU but does not produce outputs; it goes into an 'EOF' error loop. The reason is that Docker doesn't play nice with interactive stdin reads from a Python script inside a container. A Docker container should work for training, but performance may not be the best. I'd say we should wait until we have a working Gradio interface before continuing with this for inference.
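For reference, a minimal repro of the stdin/EOF behavior, independent of OpenChatKit (the `python:3.10` image is just for illustration):

```bash
# Without -i, the container's stdin is closed, so Python's input()
# raises EOFError immediately.
docker run --rm python:3.10 python -c 'print(input("> "))'

# With -i -t (or `docker start -a -i`), stdin is attached and the
# prompt works as expected.
docker run --rm -it python:3.10 python -c 'print(input("> "))'
```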