instruct-pix2pix-dataset
This repository provides utilities to create a minimal dataset for InstructPix2Pix-like training of diffusion models.
Steps
- Download the original dataset as discussed here. I used this version: clip-filtered-dataset. Note that the download can take as long as 24 hours depending on the internet bandwidth. The dataset also requires at least 600 GB of storage.
- Then run:
python make_dataset.py --data_root clip-filtered-dataset --num_samples_to_use 1000
- make_dataset.py was specifically designed to obtain a 🤗 dataset, so it is most useful when you push the minimal dataset to the 🤗 Hub. You can do so by setting push_to_hub while running make_dataset.py.
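Before starting the download in the first step, it can help to confirm that the target disk has room for the roughly 600 GB mentioned there. The following is a minimal sketch using only the Python standard library; the helper name has_enough_space and the decimal-gigabyte convention (1 GB = 10^9 bytes) are assumptions for illustration and are not part of this repository.

```python
import shutil


def has_enough_space(path: str, required_gb: float = 600.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Hypothetical helper, not part of make_dataset.py.
    Assumes decimal gigabytes (1 GB = 10**9 bytes).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


# Example usage: check the current directory before downloading.
# if not has_enough_space("."):
#     raise SystemExit("Not enough free disk space for the dataset.")
```

Pass the directory where clip-filtered-dataset will be stored, since free space is reported per filesystem.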
Example dataset
https://huggingface.co/datasets/sayakpaul/instructpix2pix-1000-samples
The full version of the CLIP filtered dataset used for InstructPix2Pix training can be found here: https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered
With the dataset on the 🤗 Hub, one can load it with two lines of code:
from datasets import load_dataset
dataset = load_dataset("timbrooks/instructpix2pix-clip-filtered", split="train")
And voila 🤗
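Because the full dataset is large, streaming is one way to look at a few samples without downloading everything first. The sketch below assumes the streaming=True option of load_dataset from the 🤗 datasets library; the helper names peek_samples and peek_full_dataset are hypothetical, and the second one needs network access.

```python
from itertools import islice
from typing import Any, Iterable, List


def peek_samples(samples: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first `n` items of an iterable as a list."""
    return list(islice(samples, n))


def peek_full_dataset(n: int = 3) -> List[Any]:
    """Stream the first `n` samples of the full dataset from the Hub.

    Hypothetical helper; requires the `datasets` package and network access.
    """
    # Imported lazily so peek_samples works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "timbrooks/instructpix2pix-clip-filtered",
        split="train",
        streaming=True,  # iterate over samples instead of downloading all files
    )
    return peek_samples(stream, n)


# Example usage (downloads only what is needed for the first samples):
# for sample in peek_full_dataset(2):
#     print(sample.keys())
```

Printing the keys of a sample is a quick way to check the column names before writing any training code.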
Acknowledgements
The structure of make_dataset.py is inspired by Nate Raw's notebook.