Jonathan Schmidt
With distributed:true, training hangs at "building line graphs" even when using only one node. However, there seem to be quite a lot of issues with hanging processes in accelerate....
@knc6 just a quick check-in concerning data parallel, since distributed was removed. I am getting some device errors with DataParallel, and I am also not sure whether DataParallel...
Great, will try it out.
Thanks for sharing the branch. I tested it with cached datasets and 2 GPUs, and it was reproducible and consistent with what I would expect from 1 GPU. However, I...
That's a good idea; LMDB datasets definitely work for this. If you would like to use LMDB datasets, there are a few examples of how to use them in...
Thank you very much. Will give it a try this week.
@utf Just a reminder to take a look if you find the time, so I know whether this is going in the intended direction.
Thank you for taking a look, @utf. I will take care of it next week.
The error should be unrelated to this PR.
@utf if you are happy with the corrections, it should be ready to merge.