Simple mistakes trigger unclear error messages in the ALBERT example

Open borzunov opened this issue 4 years ago • 2 comments

Simple mistakes in the ALBERT example trigger unclear error messages, namely:

  • [x] Missing unpacked data for the trainer (currently triggers requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer)
  • [ ] Running all peers in --client_mode (currently triggers AllReduce failed: could not find a group)

It would be great to show a clear error message in these cases.
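
For the first case, one possible approach is a pre-flight check that runs before the tokenizer is loaded. The URL in the 404 error suggests that the local path data/tokenizer was missing and was then treated as a model name on the Hugging Face Hub. The sketch below is only an illustration under that assumption: `require_local_dir` and the error wording are hypothetical and are not part of hivemind or the ALBERT example.

```python
from pathlib import Path


def require_local_dir(path: str, what: str) -> Path:
    """Return `path` as a Path, or raise a descriptive error if it is not a directory.

    Hypothetical helper for illustration; not part of hivemind or the ALBERT example.
    """
    resolved = Path(path)
    if not resolved.is_dir():
        # Fail early with an actionable message instead of letting a later
        # loader fall back to a remote lookup and fail with an HTTP 404.
        raise FileNotFoundError(
            f"{what} not found at '{resolved}'. "
            "Download and unpack the training data before starting the trainer."
        )
    return resolved
```

A trainer script could call `require_local_dir("data/tokenizer", "Tokenizer directory")` before loading the tokenizer, so a missing directory is reported as such rather than as a failed request to huggingface.co.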

borzunov avatar Sep 21 '21 20:09 borzunov

Hi Alexander, I found that running with only one regular trainer (plus the training monitor) also triggers the second problem. I guess it should skip the averaging when there's only one trainer?

soodoshll avatar Dec 28 '21 01:12 soodoshll

Hi @soodoshll,

Thanks for the report! In this case, training proceeds normally, so you can go on to connect more peers despite the error.

To be honest, I was not sure whether we should remove this message, since hivemind is designed to be used with several trainers, and a trainer finding itself alone is more likely a symptom of connection issues in real training runs.

However, I now agree that this message is confusing for people running our example for the first time, so we should remove it.

borzunov avatar Dec 28 '21 02:12 borzunov