Oleksii Kuchaiev

Results: 9 comments by Oleksii Kuchaiev

@blisc @junkin and @ryanleary what's the status of this? Do we still need this PR? If not, could someone please close it?

Can you try lowering the learning rate to 1/10 or 1/100 of whatever you are using? Also, what is the range of your labels, e.g. 1-5 or some other range?
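The suggestion above can be sketched as a small sweep over scaled-down learning rates. This is a hypothetical illustration, not the reporter's actual training code: `train_and_eval` is a stand-in for whatever training loop is in use, and the base learning rate is made up.

```python
def scaled_learning_rates(base_lr, factors=(1, 10, 100)):
    """Return candidate learning rates: base_lr, base_lr/10, base_lr/100."""
    return [base_lr / f for f in factors]

def lr_sweep(base_lr, train_and_eval):
    """Try each scaled learning rate and return (lr, val_error) pairs.

    `train_and_eval` is a hypothetical callable: lr -> validation error.
    """
    return [(lr, train_and_eval(lr)) for lr in scaled_learning_rates(base_lr)]

# Example with a dummy evaluation function (purely illustrative):
results = lr_sweep(1.0, lambda lr: lr)  # pretend lower lr gives lower error
```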

Yes, I think @paulhendricks is right - wide middle layer with *large dropout* allows it to learn robust representations. Regarding first layers (e.g. first encoder layer) and last layer (e.g....
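For context on the "large dropout" point, here is a minimal pure-Python sketch of inverted dropout, the standard training-time form: each unit is zeroed with probability `p` and survivors are scaled by `1/(1-p)` so the expected activation is unchanged. This is a generic illustration, not NeMo's actual implementation.

```python
import random

def inverted_dropout(x, p, rng):
    """Apply inverted dropout to a list of activations.

    Each value is dropped (set to 0) with probability p; kept values are
    scaled by 1/(1-p) so the expectation matches the no-dropout case.
    """
    keep = 1.0 - p
    return [(v / keep) if rng.random() < keep else 0.0 for v in x]

rng = random.Random(0)  # seeded for a reproducible example
out = inverted_dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=rng)
# Surviving units are scaled to 2.0; dropped units become 0.0.
```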

We did train for longer than 36 epochs. But we monitored *validation* error and then picked a checkpoint (was around step 40 I think) with the best validation error.
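The checkpoint-selection strategy described above (train past the epoch budget, then pick the checkpoint with the lowest validation error) can be sketched like this. The history values are invented for illustration; real code would read them from a training log or checkpoint metadata.

```python
def best_checkpoint(val_errors):
    """Pick the checkpoint with the lowest validation error.

    val_errors maps checkpoint step -> validation error;
    returns the (step, error) pair with the minimum error.
    """
    step = min(val_errors, key=val_errors.get)
    return step, val_errors[step]

# Hypothetical validation-error history recorded during training:
history = {10: 0.31, 20: 0.24, 30: 0.21, 40: 0.19, 50: 0.20}
step, err = best_checkpoint(history)  # selects step 40 here
```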

@whrichd can you please fix the code style? Run `pip install -r requirements/requirements_test.txt` and then `python setup.py style --fix`.

@stevehuang52 we'd like to "deprecate" non-Megatron transformers in NeMo. Can you please have a look at whether you can use the Megatron-based ones instead?

Can you please share the link to the Tuda-de dataset you are using? Also, 127 hours seems too small for Jasper10x5 - perhaps you could try a smaller version first?