Frédéric Bastien
As there has been no news and I think it is fixed, I'll close this. If you still see it, just reopen this bug.
I'm curious about your issue. If you run it many times, is it always the same GPUs that have an issue? If so, can you try this: CUDA_VISIBLE_DEVICES=2,3,4,0,1 python your_script.py...
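A quick way to act on this suggestion is to try several orderings in a loop and see whether the failure follows the physical GPU or the logical index. This is just a hypothetical dry run that prints the commands (your_script.py is a placeholder for the failing script):

```shell
# Dry run: print a few CUDA_VISIBLE_DEVICES orderings to try.
# If the same physical GPU fails regardless of its logical position,
# the problem is likely hardware; if the same logical index fails,
# it is more likely software.
for order in "0,1,2,3,4" "2,3,4,0,1" "4,3,2,1,0"; do
    echo "CUDA_VISIBLE_DEVICES=$order python your_script.py"
done
```

Remove the `echo` to actually run the script under each ordering.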
Thanks for the results. What computer is this? 5 GPUs isn't a common config. What GPUs are they? Are they all the same? If not (like on DGX stations), if...
> Thanks for the results. What computer is this? 5 GPUs isn't a common config. What GPUs are they? Are they all the same? If not (like on DGX stations),...
If you limit yourself to only the first 4 GPUs, does it work correctly? Also, what is the motherboard? Few motherboards can support 7 GPUs.
From this page: https://www.supermicro.com/en/support/resources/gpu?rsc=fltr_sku%3DSYS-420GP-TNR The A100 GPU isn't officially supported by this server. https://www.supermicro.com/en/products/system/GPU/4U/SYS-420GP-TNR Sorry, I do not have a magic answer. Did you test other frameworks than JAX?
I can give you one. If the Theano function takes no inputs, it will have less overhead. OK, for medium models this is not significant. Also, it can be called...
Just bumping this again as I saw this problem elsewhere too.
I don't know anyone using Pylearn2 anymore. It was with Lasagne: https://github.com/MarcCote/sb_resnet/blob/master/sb/sb_resnet.py#L133 The fix in that repo: https://github.com/MarcCote/sb_resnet/commit/41074fa2d65befb243dcee395498098000aa43f0
This is a limitation of the random number implementation, which can't generate more than 2**31-1 samples per call. You can lower the batch size to request fewer samples at a...
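The same idea can be sketched outside Theano: split one large request into several smaller calls so no single call exceeds the limit. This uses NumPy in place of the Theano RNG, and the function name and chunk size are my own illustration, not part of any library:

```python
import numpy as np

MAX_PER_CALL = 2**31 - 1  # per-call sample limit mentioned above

def sample_in_chunks(rng, total, chunk=10**6):
    """Draw `total` samples in pieces so no single call exceeds the limit."""
    assert chunk <= MAX_PER_CALL
    parts = []
    remaining = total
    while remaining > 0:
        n = min(chunk, remaining)  # last chunk may be smaller
        parts.append(rng.standard_normal(n))
        remaining -= n
    return np.concatenate(parts)

samples = sample_in_chunks(np.random.default_rng(0), 2_500_000)
```

In Theano itself the equivalent fix is simply lowering the batch size, as suggested above, so each call to the random stream stays under the limit.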