MarkusSpanring
@VitalyFedyunin is there any update on this?
Quick update @SsnL @VitalyFedyunin @ejguan @NivekT @ngimel: I was able to reproduce the behavior on the following architectures as well (same conda env and same driver):
```
GeForce GTX 1650
Tesla...
```
@thuningxu Unfortunately, I do not have permission to update the driver on the compute nodes I am working on. I will try it out as soon as there is a newer...
@nhtlongcs Using `pkill` is exactly what caused the problem in the first place. @ejguan @btravouillon The drivers have now been updated to `520.61.05` and I tried to reproduce the behavior...
@thoglu My naive guess is that the driver update solved the issue for me. At least I cannot reproduce the behavior after the update. I have tested it with...
@thoglu I must admit that I have not checked (yet). I have kept the hack until now since it did not add much overhead. However, I can test as soon as...
+1 for updating to 1.6.
Quick update on this: even though I thought `persistent_workers=True` cleans up the worker processes properly, I found that something very odd happens: the `BAR1 Memory Usage` is not released. In the...
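For reference, a minimal sketch of the kind of setup where I see this, with a toy dataset substituted for the real one (only `num_workers` and `persistent_workers=True` are the relevant parts):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Placeholder dataset so the workers have something to load."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)


if __name__ == "__main__":
    # persistent_workers=True keeps the worker processes alive between epochs
    # instead of shutting them down after each one.
    loader = DataLoader(
        ToyDataset(),
        batch_size=8,
        num_workers=4,
        persistent_workers=True,
        pin_memory=True,
    )

    for _ in range(2):  # the second epoch reuses the same worker processes
        for batch in loader:
            if torch.cuda.is_available():
                batch = batch.cuda(non_blocking=True)

    # Inspect `BAR1 Memory Usage` at this point, e.g. with
    #   nvidia-smi -q -d MEMORY
    del loader  # dropping the loader is what should finally shut the workers down
```

The BAR1 figures themselves can be checked with `nvidia-smi -q -d MEMORY` while the loader is still alive.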
@scv119 Not yet. FYI, I was able to boil it down to the PyTorch DataLoader. I have already opened an [issue](https://github.com/pytorch/pytorch/issues/66482), but there is no comment or fix yet.
@JiahaoYao if you have time, could you check if `_init_deterministic(True)` is sufficient to replicate `Trainer(deterministic=True)` on all workers?
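In case it helps, a minimal sketch of the kind of check I have in mind; `check_worker_determinism` is just a hypothetical `worker_init_fn`, not anything from Lightning, and `torch.use_deterministic_algorithms(True)` stands in for whatever `_init_deterministic(True)` / `Trainer(deterministic=True)` sets up in the main process:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor(idx)


def check_worker_determinism(worker_id):
    # Hypothetical worker_init_fn: report whether the deterministic-algorithms
    # flag is visible inside this worker process.
    enabled = torch.are_deterministic_algorithms_enabled()
    print(f"worker {worker_id}: deterministic_algorithms={enabled}")


if __name__ == "__main__":
    # Stand-in for the determinism setup done in the main process.
    torch.use_deterministic_algorithms(True)

    loader = DataLoader(
        ToyDataset(),
        num_workers=2,
        worker_init_fn=check_worker_determinism,
    )
    for _ in loader:
        pass
```

The point is just to see whether the deterministic flag actually shows up inside each worker process.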