
Tips for running in Google Colab & resuming from checkpoint

Luke2642 opened this issue 4 years ago • 7 comments

Thanks for this repo, it's great!

To get it working in Colab, I copied the bare minimum out of the Dockerfile:

!pip install jsonnet
!apt install -y -q ninja-build
!pip install tensorfn rich
!pip install setuptools
!pip install numpy scipy nltk lmdb cython pydantic pyhocon
!apt install libsm6 libxext6 libxrender1
!pip install opencv-python-headless

It then works despite throwing two compatibility errors:

ERROR: requests 2.23.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.6 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

I then made some manual edits to config/config-t.jsonnet so it runs on Colab:

Under training: {}, set the image size to 128.
Under training: {}, set the batch size to 12 (~650 MB each, so < 8 GB I guess).
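(These could probably also be passed as command-line overrides instead of editing the file, e.g. appending training.size=128 training.batch=12 to the train.py command, assuming tensorfn accepts dotted overrides for these fields; I haven't verified that.)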

In prepare_data.py I commented out line 14 so images are only cropped, not resized. That could be a useful option for some datasets.
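For illustration, a crop-only transform might look like this (a hypothetical sketch, not the repo's actual code):

from PIL import Image

def center_crop_only(img, size):
    # Crop a centered size x size window without rescaling, so the original
    # pixel detail is preserved (images must be >= size in both dimensions).
    left = (img.width - size) // 2
    top = (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))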

In train.py, in the main function around line 322, I commented out the 5 "logger" lines. The logger info didn't work out of the box in Colab: it just hangs and then falls over without an error, but I didn't investigate further.

I also couldn't get --ckpt=checkpoint/010000.pt to resume properly. I tried editing the start iteration in the config too, but no luck; it just seemed to start from zero again.

Also, it may be worth editing train.py to use autocast() for half-precision float16 instead of float32, to improve speed and ease the memory limitations? Or even porting to TPU? https://github.com/pytorch/xla
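Something like this rough, untested sketch of the discriminator step, say (generator, discriminator, d_optim, noise, real_img and d_logistic_loss stand in for whatever train.py actually defines):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

# forward passes run in fp16 where it's safe to do so
with autocast():
    fake_img = generator(noise)
    fake_pred = discriminator(fake_img)
    real_pred = discriminator(real_img)
    d_loss = d_logistic_loss(real_pred, fake_pred)

discriminator.zero_grad()
scaler.scale(d_loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(d_optim)             # unscales gradients, skips the step if they overflowed
scaler.update()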

So then run:

!git clone https://github.com/rosinality/alias-free-gan-pytorch.git

After making these edits:

# upload your zip file or use a Google Drive import
!unzip /content/dataraw.zip -d /content/dataraw

%cd /content/alias-free-gan-pytorch
!python prepare_data.py --out /content/dataset --n_worker 8 --size=128 /content/dataraw

%cd /content/alias-free-gan-pytorch
!python train.py --n_gpu 1 --conf config/config-t.jsonnet path=/content/dataset/

Thanks again!

Luke2642 avatar Jul 14 '21 12:07 Luke2642

You can avoid a lot of those extra steps, and you can also run it at 256 by just reducing the batch size, as long as Colab gives you a GPU with 16 GB. You can make this work with just a few cells:

# Install dependencies and clone repo
!pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3 tensorfn jsonnet
!git clone https://github.com/rosinality/alias-free-gan-pytorch.git
# Move into the cloned repo
%cd alias-free-gan-pytorch
# Here download your dataset, or copy/unzip it from drive, or whatever
# prepare the dataset and crop the images to 256x256
%run prepare_data.py --out my_dataset --n_worker 4 --size 256 "your/dataset/folder"
# now train
%run train.py --n_gpu 1 --conf config/config-t.jsonnet training.batch=4 path=my_dataset

I believe a batch size of 8 should also work without problems, but I set 4 just in case.

pabloppp avatar Jul 14 '21 13:07 pabloppp

Thanks, that's fantastic help, and much easier!

Are you able to offer any advice on resuming from checkpoints, or did I (probably) make a mistake?

Luke2642 avatar Jul 14 '21 14:07 Luke2642

I hadn't tried resuming until now, and you're correct, it seems to be broken 🤔 No matter what you pass as the argument, it's displayed as None when the config is printed. I managed to get it working by adding an extra ckpt parameter inside training, but it's just a bad hack to make it work 🤔

pabloppp avatar Jul 14 '21 18:07 pabloppp

Would it be possible for you to post your hack until the problem is addressed at a deeper level?

MHRosenberg avatar Jul 16 '21 05:07 MHRosenberg

Sure, just edit config.py and, under class Training(Config):, add ckpt: str = None, like this:

class Training(Config):
    size: StrictInt
    iter: StrictInt = 800000
    batch: StrictInt = 16
    n_sample: StrictInt = 32
    r1: float = 10
    d_reg_every: StrictInt = 16
    lr_g: float = 2e-3
    lr_d: float = 2e-3
    augment: StrictBool = False
    augment_p: float = 0
    ada_target: float = 0.6
    ada_length: StrictInt = 500 * 1000
    ada_every: StrictInt = 256
    start_iter: StrictInt = 0
    ckpt: str = None
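(If I'm reading tensorfn's Config right, it's built on pydantic v1, where a None default makes a field implicitly optional, so ckpt: str = None should validate without an explicit Optional[str]. I haven't double-checked that, though.)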

Then in train.py you need to replace the 4 occurrences of conf.ckpt with conf.training.ckpt:

if conf.training.ckpt is not None:
    logger.info(f"load model: {conf.training.ckpt}")

    ckpt = torch.load(conf.training.ckpt, map_location=lambda storage, loc: storage)

    try:
        # infer the starting iteration from the checkpoint filename, e.g. 060000.pt -> 60000
        ckpt_name = os.path.basename(conf.training.ckpt)
        conf.training.start_iter = int(os.path.splitext(ckpt_name)[0])

    except ValueError:
        pass

And that's it. Then, when you run the training script, you can pass the argument training.ckpt="checkpoint/060000.pt" and it should load the checkpoint and resume training instead of restarting from scratch.
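For example, reusing the Colab cells from above (060000.pt being whatever your last saved checkpoint is):

%run train.py --n_gpu 1 --conf config/config-t.jsonnet training.batch=4 training.ckpt="checkpoint/060000.pt" path=my_dataset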

pabloppp avatar Jul 16 '21 07:07 pabloppp

duskvirkus made a notebook in PyTorch Lightning -> https://colab.research.google.com/github/duskvirkus/alias-free-gan-pytorch/blob/main/notebooks/AliasFreeGAN_lightning_basic_training.ipynb

ucalyptus2 avatar Jul 22 '21 19:07 ucalyptus2

Has anyone succeeded at training with 512 x 512 or 1024 x 1024? I succeeded with 256 x 256 but have been struggling with higher resolutions. I'm using the same input dataset in both cases, but I hit PIL errors, which I hacked around via sidphbot's approach; I still appear to have some issue with loading real images and get: "AttributeError: 'bytes' object has no attribute 'seek'". Any ideas?
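My guess is that raw LMDB bytes are being handed straight to PIL somewhere, since Image.open calls .seek() on its input; the usual workaround looks something like this (a minimal sketch under that assumption, not this repo's actual loader code):

import io
from PIL import Image

def decode_image(img_bytes):
    # PIL needs a path or a file-like object that supports .seek(),
    # so raw bytes have to be wrapped in a BytesIO buffer first
    return Image.open(io.BytesIO(img_bytes)).convert("RGB")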

MHRosenberg avatar Aug 19 '21 10:08 MHRosenberg