Frédéric Bastien
As there has been no news and I think it is fixed, I'll close this. If you still see it, just reopen this bug.
I'm curious about your issue. If you run it many times, is it always the same GPUs that have an issue? If so, can you try this: CUDA_VISIBLE_DEVICES=2,3,4,0,1 python your_script.py...
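A quick way to act on this suggestion is to try several orderings in a loop and see whether the failure follows the physical GPU or the logical index. This is just a hypothetical dry run that prints the commands (your_script.py is a placeholder for the failing script):

```shell
# Dry run: print a few CUDA_VISIBLE_DEVICES orderings to try.
# If the same physical GPU fails regardless of its logical position,
# the problem is likely hardware; if the same logical index fails,
# it is more likely software.
for order in "0,1,2,3,4" "2,3,4,0,1" "4,3,2,1,0"; do
    echo "CUDA_VISIBLE_DEVICES=$order python your_script.py"
done
```

Remove the `echo` to actually run the script under each ordering.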
Thanks for the results. What computer is this? 5 GPUs isn't a common config. What GPUs are they? Are they all the same? If not (like on DGX stations), if...
> Thanks for the results. What computer is this? 5 GPUs isn't a common config. What GPUs are they? Are they all the same? If not (like on DGX stations),...
If you limit yourself to only the first 4 GPUs, does it work correctly? Also, what is the motherboard? Few motherboards can support 7 GPUs.
From this page: https://www.supermicro.com/en/support/resources/gpu?rsc=fltr_sku%3DSYS-420GP-TNR The A100 GPU isn't officially supported by this server. https://www.supermicro.com/en/products/system/GPU/4U/SYS-420GP-TNR Sorry, I do not have a magic answer. Did you test other frameworks than JAX?
I can give you one. If the Theano function takes no inputs, it will have less overhead. OK, for medium models this is not significant. Also, it can be called...
Just bumping this again as I saw this problem elsewhere too.
I don't know anyone using Pylearn2 anymore. It was with Lasagne: https://github.com/MarcCote/sb_resnet/blob/master/sb/sb_resnet.py#L133 The fix in that repo: https://github.com/MarcCote/sb_resnet/commit/41074fa2d65befb243dcee395498098000aa43f0
This is a limitation of the random number implementation, which can't generate more than 2**31-1 samples per call. You can lower the batch size to request fewer samples at a...
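The same idea can be sketched outside Theano: split one large request into several smaller calls so no single call exceeds the limit. This uses NumPy in place of the Theano RNG, and the function name and chunk size are my own illustration, not part of any library:

```python
import numpy as np

MAX_PER_CALL = 2**31 - 1  # per-call sample limit mentioned above

def sample_in_chunks(rng, total, chunk=10**6):
    """Draw `total` samples in pieces so no single call exceeds the limit."""
    assert chunk <= MAX_PER_CALL
    parts = []
    remaining = total
    while remaining > 0:
        n = min(chunk, remaining)  # last chunk may be smaller
        parts.append(rng.standard_normal(n))
        remaining -= n
    return np.concatenate(parts)

samples = sample_in_chunks(np.random.default_rng(0), 2_500_000)
```

In Theano itself the equivalent fix is simply lowering the batch size, as suggested above, so each call to the random stream stays under the limit.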