Henry Mao comments

Results: 57 comments of Henry Mao

when apply rezero to bert or gpt, get NAN gradients

Try initializing the embedding matrix with values drawn from a uniform distribution over ± `1 / d`.
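A minimal sketch of that initialization, assuming `d` is the embedding dimension and that "± `1 / d`" means the closed interval `[-1/d, 1/d]` (both are my reading of the comment, not something it states). The helper name `init_embedding` is hypothetical. It uses only the standard library; in PyTorch the same idea would be `torch.nn.init.uniform_(embedding.weight, -1 / d, 1 / d)`.

```python
import random


def init_embedding(vocab_size, d, seed=None):
    """Return a vocab_size x d matrix with entries uniform in [-1/d, 1/d].

    Assumption: d is the embedding dimension (hypothetical helper).
    """
    rng = random.Random(seed)
    bound = 1.0 / d
    return [
        [rng.uniform(-bound, bound) for _ in range(d)]
        for _ in range(vocab_size)
    ]


# Example: a small embedding table with 1000 tokens and d = 64.
emb = init_embedding(vocab_size=1000, d=64, seed=0)
```

The small bound keeps the initial embeddings close to zero, which is the apparent intent of the suggestion; whether it resolves the NaN gradients depends on the rest of the model.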

when apply rezero to bert or gpt, get NAN gradients

@sooheon It depends on the particular implementation of your Transformer. Some implementations (Huggingface) scale the embedding by 1 / d before passing it into higher layers while initializing the embedding...

weight decay for the resweight?

Yes, it would seem reasonable to not decay resweights since other parameters are already being decayed.
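One way to exclude resweights from decay is to put them in their own optimizer parameter group. The sketch below assumes the residual weights can be recognized by the substring `resweight` in their parameter names; that naming, and the helper `split_param_groups`, are assumptions for illustration. The returned list follows the per-group dictionary format that `torch.optim` optimizers accept.

```python
def split_param_groups(named_params, weight_decay, no_decay_key="resweight"):
    """Split (name, param) pairs into decayed and non-decayed groups.

    Assumption: residual weights contain `no_decay_key` in their name.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if no_decay_key in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Demonstration with placeholder strings standing in for tensors.
groups = split_param_groups(
    [("layer.0.linear.weight", "w0"), ("layer.0.resweight", "r0")],
    weight_decay=0.01,
)
# With a real model, something like:
#   groups = split_param_groups(model.named_parameters(), 0.01)
#   optimizer = torch.optim.AdamW(groups, lr=1e-3)
```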

gsutil cat intermittently fails

@rom1504 @tmbdev I'm running into a similar issue with `gsutil cat` or `gsutil cp`. Midway through training on large shards (10GB+ per shard), some network errors occur which the data...

gsutil cat intermittently fails

@tmbdev For shards of size 10GB, this would require waiting for the entire file to download prior to loading the data? I guess that's a tradeoff but it will avoid...

gsutil cat intermittently fails

@tmbdev Thanks for the info - I'm running this on a server outside of GCloud (but the region is nearby). Downloading the entire object works well - the issue happens...
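The trade-off discussed in this thread, downloading the whole shard before reading it instead of streaming it with `gsutil cat`, could look roughly like the sketch below. The retry loop with exponential backoff is my own addition and is not proposed anywhere in the thread; `download_shard` and its parameters are hypothetical names. The `run` argument exists only so the command runner can be replaced in tests.

```python
import subprocess
import time


def download_shard(url, local_path, retries=3, backoff_s=1.0, run=subprocess.run):
    """Copy a whole shard to local disk with `gsutil cp`, retrying on failure.

    Returns local_path on success; raises RuntimeError after `retries` failures.
    """
    for attempt in range(1, retries + 1):
        result = run(["gsutil", "cp", url, local_path])
        if result.returncode == 0:
            return local_path
        if attempt < retries:
            # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
            time.sleep(backoff_s * 2 ** (attempt - 1))
    raise RuntimeError(f"gsutil cp failed {retries} times for {url}")
```

The cost is that the loader waits for the full file and needs local disk space for it; the benefit is that a mid-transfer network error only triggers a re-download rather than corrupting a stream that training is already consuming.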

The NOVA Maven server is down

@RX14 I don't remember owning the domains?

The NOVA Maven server is down

I added a CNAME ci.novaapi.net -> current.rx14.co.uk

The NOVA Maven server is down

Done

The NOVA Maven server is down

If someone else wishes to lead the project, feel free to take over and set up all dependencies.