Gustaf Ahdritz

Results: 170 comments of Gustaf Ahdritz

Are you using PyTorch Lightning's built-in timer? If so, it's a running average of total time elapsed between iterations, including time spent loading data. Usually for me, the time starts...

I'm sorry to hear that you're getting NaNs. Two thoughts: 1. The model should be written in such a way that it just skips over examples that yield NaN loss,...
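The skip-on-NaN idea above can be sketched in a few lines. This is a minimal illustration, not the model's actual training loop; the `train_step` helper and the toy "loss" values are hypothetical, and a real PyTorch loop would also skip the backward pass for the rejected batch.

```python
import math

def train_step(loss_fn, batch):
    """Hypothetical helper: compute the loss and reject non-finite results."""
    loss = loss_fn(batch)
    if not math.isfinite(loss):
        return None  # skip this example instead of poisoning the gradients
    return loss

# Toy "losses": the middle batch yields NaN and is dropped.
batches = [0.9, float("nan"), 0.7]
kept = [l for l in (train_step(lambda b: b, x) for x in batches) if l is not None]
# kept == [0.9, 0.7]
```

The point is only that a NaN loss is detected and discarded before it can propagate into the optimizer state.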

Hm. Peculiar. The `num_workers` parameter is handed straight to a native PyTorch DataLoader, so I can't say off the top of my head why that might be happening. I'll look...

Sorry for the delay. Despite testing with a number of values of `batch_size` and `num_workers`, I am unable to reproduce the behavior you described. How are you changing the batch...

With enough workers, the speedup is about linear in the size of the batches. `scripts/download_all_data.sh` downloads the AlphaFold training set. The validation set is from CAMEO. What exactly happens when...

I've figured out what's happening. Like you said, at some point in the model, activations become NaN. In general, for certain inputs, this seems unavoidable; it's a fundamental limitation of fp16...
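To make the fp16 limitation concrete: half precision tops out at 65504 and has coarse spacing even at moderate magnitudes, so large intermediate activations overflow. This stdlib-only sketch round-trips floats through IEEE 754 half precision via `struct`'s `"e"` format (on real fp16 hardware an overflow silently becomes `inf`, which then turns into NaN downstream, rather than raising as `struct` does here):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16(65504.0))  # largest finite fp16 value survives intact
print(to_fp16(1000.1))   # rounds to 1000.0: fp16 spacing near 1000 is 0.5
try:
    to_fp16(1e9)         # far beyond the fp16 range
except (OverflowError, struct.error):
    print("overflow")    # hardware fp16 would give inf here instead
```

Once any activation hits `inf`, operations like `inf - inf` or `0 * inf` produce NaN, which is why the failure shows up mid-model for certain inputs.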

I still haven't been able to reproduce the batch/worker issue.

I usually get around 500-700% utilization. I've done small overfitting experiments, and I bottomed out near zero loss.

That's issue #197. I'll be fixing it soon.

The `chain_data_cache.json` needs to be generated for the training set. Could you elaborate on `chain_data_cache` being too small?