Henry Mao
Try initializing the embedding matrix from a uniform distribution over the range `-1 / d` to `+1 / d`.
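A minimal sketch of that initialization, assuming `d` means the embedding dimension (the comment does not say so explicitly). The function name `init_embedding`, the `vocab_size` parameter, and the fixed seed are hypothetical and only here for illustration; a real model would use its framework's own initializer.

```python
import random


def init_embedding(vocab_size, d, seed=0):
    """Return a vocab_size x d matrix with entries drawn uniformly from [-1/d, 1/d].

    Assumption: d is the embedding dimension. Plain Python lists are used so
    the sketch has no dependencies.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    bound = 1.0 / d
    return [
        [rng.uniform(-bound, bound) for _ in range(d)]
        for _ in range(vocab_size)
    ]


emb = init_embedding(vocab_size=100, d=16)
```

Every entry of `emb` lies within `1 / 16` of zero, so the initial embeddings shrink as `d` grows.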
@sooheon It depends on the particular implementation of your Transformer. Some implementations (Hugging Face) scale the embedding by 1 / d before passing it into higher layers while initializing the embedding...
Yes, it would seem reasonable to not decay resweights since other parameters are already being decayed.
@rom1504 @tmbdev I'm running into a similar issue with `gsutil cat` or `gsutil cp`. Midway through training on large shards (10GB+ per shard), some network errors occur which the data...
@tmbdev For shards of size 10GB, this would require waiting for the entire file to download prior to loading the data? I guess that's a tradeoff but it will avoid...
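The tradeoff discussed here (fetch the whole shard first, then load it locally) could be sketched roughly as below. This is an assumption-laden illustration, not what any library in this thread does: `download_with_retries` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever actually copies the shard to local disk.

```python
import time


def download_with_retries(fetch, retries=3, delay=0.0, retry_on=(OSError,)):
    """Call fetch() until it succeeds or the retry budget is used up.

    fetch is a hypothetical zero-argument callable that downloads one full
    shard and returns its local path. Only exceptions listed in retry_on
    trigger another attempt; anything else propagates immediately.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except retry_on as err:
            last_error = err
            # Linear backoff: wait longer after each failed attempt.
            time.sleep(delay * (attempt + 1))
    raise last_error
```

The cost is the one raised above: nothing from a shard can be read until the whole file has arrived, but a network error then only costs a retried download instead of a broken stream mid-training.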
@tmbdev Thanks for the info - I'm running this on a server outside of GCloud (but the region is nearby). Downloading the entire object works well - the issue happens...
@RX14 I don't remember owning the domains?
I added a CNAME ci.novaapi.net -> current.rx14.co.uk
If someone else wishes to lead the project, feel free to take over and set up all dependencies.