Shiqing Fan comments

Results: 6 comments of Shiqing Fan

SocketCaffeNet UT should be enhanced

I'm basically familiar with the CaffeOnSpark codebase and have been developing on it for several months. What I mean is: why not add a complete training test for `socketnet` who's...

SocketCaffeNet UT should be enhanced

Thanks, @anfeng! Actually, in my case I have changed the native CaffeOnSpark code framework, and now I need to verify the correctness of my changes so that it keeps working...

Abnormal memory allocation of transformer with GPipe on V100-16GB

A detailed breakdown of memory allocation from `tfprof` is listed as follows:

BATCH_SIZE = 32
VOCAB_SIZE = 32000
EMBEDDING_DIM = 2048
MAX_SEQUENCE_LENGTH = 1024

node name | requested bytes | total execution time...
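
For a rough sense of scale, the four constants quoted above can be turned into back-of-the-envelope tensor sizes. This is only a sketch and not part of the original `tfprof` output: it assumes float32 storage (4 bytes per element), and the helper `tensor_bytes` and the three tensor names are hypothetical, chosen for illustration. Real allocations depend on the model, the optimizer, and how GPipe partitions and re-materializes activations.

```python
# Constants taken from the comment above.
BATCH_SIZE = 32
VOCAB_SIZE = 32000
EMBEDDING_DIM = 2048
MAX_SEQUENCE_LENGTH = 1024

# Assumption: every element is stored as float32.
BYTES_PER_FLOAT32 = 4


def tensor_bytes(*shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in bytes of a dense tensor with the given shape."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bytes_per_element


# Hypothetical tensors a transformer with these settings would typically hold.
embedding_table = tensor_bytes(VOCAB_SIZE, EMBEDDING_DIM)
hidden_activation = tensor_bytes(BATCH_SIZE, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
output_logits = tensor_bytes(BATCH_SIZE, MAX_SEQUENCE_LENGTH, VOCAB_SIZE)

for name, size in [
    ("embedding table", embedding_table),
    ("one hidden activation", hidden_activation),
    ("output logits", output_logits),
]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions, a single logits tensor of shape (batch, sequence, vocab) is far larger than the embedding table or one hidden activation, so it is a natural first place to look when comparing against the requested-bytes column of a profile.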

MOE training Loss inconsistent after resume from old checkpoint

Hi @guozhen1997, we are also debugging this issue. I will ping you as soon as we find the root cause.

MOE training Loss inconsistent after resume from old checkpoint

Hi @guozhen1997, this issue is caused by an incorrect implementation of the dual-optimizer state loading function. The fix MR is under review and will be published soon.

MOE training Loss inconsistent after resume from old checkpoint

Hi @guozhen1997 and @binxuan, this issue has already been fixed by this [commit](https://github.com/NVIDIA/Megatron-LM/commit/1505db4cc4e9e94ee22583c76f7e425ea34f5aea).