Oleksii Kuchaiev
@1-800-BAD-CODE is this still a draft, or is it ready for review?
@blisc @junkin and @ryanleary what's the status of this? Do we still need this PR? If not, could someone please close it?
Can you try lowering the learning rate to 1/10 or 1/100 of whatever you are using? Also, what is the range of your labels, e.g. 1-5 or some other range?
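To make the suggestion concrete, here is a minimal plain-Python sketch of sweeping 1/10 and 1/100 of the current learning rate; `base_lr` and the loop are hypothetical examples, not values from this thread:

```python
# Hypothetical example: the base LR the original run might have used.
base_lr = 1e-3

# Candidate runs to compare by validation loss, per the suggestion above.
candidate_lrs = [base_lr / 10, base_lr / 100]

for lr in candidate_lrs:
    # In a real training script this would configure the optimizer,
    # e.g. torch.optim.Adam(model.parameters(), lr=lr).
    print(f"trying lr={lr:g}")
```

Each candidate run would then be compared on held-out validation error rather than training loss.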
Yes, I think @paulhendricks is right - a wide middle layer with *large dropout* allows it to learn robust representations. Regarding the first layers (e.g. the first encoder layer) and the last layer (e.g....
We did train for longer than 36 epochs, but we monitored *validation* error and then picked the checkpoint with the best validation error (it was around step 40, I think).
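The checkpoint-selection step described above can be sketched as follows; the step numbers and error values here are made up for illustration, not the actual run's numbers:

```python
# Hypothetical validation-error log: step -> validation error.
# Training continues past the best point; we keep the best checkpoint.
val_errors = {10: 0.42, 20: 0.31, 30: 0.28, 40: 0.25, 50: 0.27}

# Pick the step whose checkpoint had the lowest validation error.
best_step = min(val_errors, key=val_errors.get)
print(best_step)
```

This is why training "too long" is harmless here: the extra steps only matter if they produce a checkpoint with a better validation score.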
@whrichd can you please fix the code style? Run

```
pip install -r requirements/requirements_test.txt
```

and then

```
python setup.py style --fix
```
Is this necessary for r1.12?
@stevehuang52 we'd like to "deprecate" non-Megatron transformers in NeMo. Can you please have a look at whether you can use the Megatron ones instead?
Can you please share the link to the Tuda-de dataset you are using? Also, 127 hours seems too small for Jasper10x5; perhaps you could try a smaller version of the model first?