hivemind
Simple mistakes trigger unclear error messages in the ALBERT example
Simple mistakes trigger unclear error messages in the ALBERT example, that is:
- [x] Absence of the unpacked data for the trainer (currently triggers `requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer`)
- [ ] Running all peers in `--client_mode` (currently triggers `AllReduce failed: could not find a group`)

It would be great to show a clear error message in these cases.
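
For the first case, one possible shape of such a check is sketched below. It is only an illustration, not the example's actual code: `require_local_dir` is a hypothetical helper, and the path `data/tokenizer` is inferred from the URL in the error above. The idea is to verify that the directory exists locally before anything tries to load from it, so a missing directory is reported directly instead of surfacing as a 404 from the Hugging Face Hub.

```python
from pathlib import Path


def require_local_dir(path: str, what: str) -> Path:
    """Return `path` as a Path if it is an existing directory, else raise a descriptive error.

    Hypothetical helper for illustration; it is not part of hivemind or transformers.
    """
    resolved = Path(path)
    if not resolved.is_dir():
        # Fail early with a message that names the missing directory and the likely fix.
        raise FileNotFoundError(
            f"{what} not found at '{resolved}'. "
            "Download and unpack the training data before starting the trainer."
        )
    return resolved
```

A trainer script could call `require_local_dir("data/tokenizer", "Tokenizer directory")` right before loading the tokenizer, assuming that is where the unpacked data is expected.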
Hi Alexander, I find that running with only one regular trainer (plus the training monitor) also triggers the second problem. I guess it should skip the averaging when there's only one trainer?
Hi @soodoshll,
Thanks for the report! In this case, training proceeds normally, so you can go on to connect more peers despite the error.
To be honest, I was not sure if we should remove it, since hivemind is designed to be used with several trainers, and in real training runs a trainer finding itself alone is more likely a symptom of connection issues.
However, I now agree that this message is confusing for people running our example for the first time, so we should remove it.