Daniel Perry

9 comments by Daniel Perry

The error is likely coming from the checkpoint I passed into `train.py` via `model_name_or_path`. Starting fine-tuning using checkpoint `dalle-mini/dalle-mini/mega-1-fp16:latest`, I get the error mentioned, but...

Currently untested:
* Multi GPU setup since I only have one
* Theoretically supports loading from checkpoints that didn't use lookahead originally

Tested Multi-GPU using Azure and verified that it at least ran for >100 iterations and produced expected outputs. That's about as much as I can validate for now.

Also tested resuming training after starting without lookahead to confirm that works as well.

Friendly ping to @lucidrains 😄. My own testing with lookahead produced excellent improvements in outputs when training without attention, and I'm interested to see if others see similar improvements. My...

It looks like something in PyTorch changed in the past year that makes the code not work. I promise it did work when I made the PR 😄. Unfortunately, I...

I'm noticing the same on my own dataset of ~175k text-image pairs, so maybe it's not a dataset size issue (or 200k is also not enough)? To add my own...

My attempt with the larger batch size is still going without any NaNs so far in about 62 hours of training on my 3090. Currently the loss is hovering around...

@jacobwjs Unfortunately, my machine power-cycled itself for some reason, so training on my x-clip model has stopped for now. I wanted to test out lucidrains's imagen model with my text-image...