
Add distributed candidate evaluation support

Open picarus opened this issue 6 years ago • 3 comments

Hello, I am running AdaNet 0.5.0 on GCP with runtime version 1.10, using a CPU configuration with multiple nodes. The training phase is very fast, but it is slowed down considerably by the evaluations. The evaluations don't seem to take advantage of the multiple nodes, and the logs are flooded with "Waiting for chief to finish" messages coming from the workers, generated by the AdaNet Estimator. I think support for running the evaluation phase across multiple nodes should be added, and that it should be a priority: not only are the nodes sitting unused, you also keep paying for them. Is that feasible? Thanks in advance, Jose

picarus avatar Dec 21 '18 02:12 picarus

@picarus: This is a known issue when using the adanet.Evaluator in distributed training.

One way to make evaluation much faster is to pass the steps argument to its constructor (e.g. steps=100). This ends evaluation after that many batches instead of evaluating over the full dataset. Alternatively, if you do not pass an Evaluator to the adanet.Estimator, the Estimator will use a moving average of the train loss to determine the best candidate and skip evaluation altogether. This option is fine if you have only a single candidate per iteration.
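
For concreteness, here is a minimal sketch of both options. The eval_input_fn, my_generator, and max_iteration_steps values are placeholders, not from this thread; only adanet.Evaluator, its steps argument, and the evaluator parameter of adanet.Estimator are the API being described:

```python
import adanet
import tensorflow as tf

def eval_input_fn():
  # Placeholder evaluation data; substitute your real eval dataset.
  features = {"x": tf.random_normal([8, 2])}
  labels = tf.zeros([8, 1])
  return tf.data.Dataset.from_tensors((features, labels)).repeat()

# Option 1: cap evaluation at 100 batches instead of the full dataset.
evaluator = adanet.Evaluator(input_fn=eval_input_fn, steps=100)

estimator = adanet.Estimator(
    head=tf.contrib.estimator.regression_head(),
    subnetwork_generator=my_generator,  # placeholder: your adanet.subnetwork.Generator
    max_iteration_steps=1000,
    # Option 2: omit `evaluator` entirely to fall back to the moving
    # average of the train loss and skip the evaluation phase altogether.
    evaluator=evaluator)
```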

You're right that distributed evaluation should be a supported feature; unfortunately, it is non-trivial to implement. Do you have any suggestions for how to shard evaluation across all the workers given an arbitrary input_fn?

cweill avatar Dec 21 '18 15:12 cweill

@cweill, I lack the deep knowledge you surely have about AdaNet or even TF, but unless you are suggesting that the difficulty of implementing this lies in TF itself, I don't see additional complexity beyond the fact that you are evaluating multiple networks. Is it a TF issue?

picarus avatar Jan 07 '19 01:01 picarus

@picarus: Unfortunately nothing is very straightforward in TF. :)

The challenges I see are:

  • Making sure this works for any number of workers and candidate subnetworks.
  • Synchronizing the workers so they don't look at the same data; otherwise you may compute incorrect evaluation metrics. This is easy on a single worker, but it's not obvious to me how to do it across multiple servers (a rough sketch of one direction follows this list).
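
For what it's worth, here is a rough sketch of one direction, assuming the input_fn returns a tf.data.Dataset and that num_workers and worker_index can be obtained from the TF_CONFIG cluster spec. This is not an implemented adanet feature, and aggregating the per-worker metrics back into a single result is the part it leaves unsolved:

```python
import tensorflow as tf

def make_sharded_input_fn(input_fn, num_workers, worker_index):
  """Wraps an input_fn so each worker evaluates a disjoint shard of the data.

  Hypothetical sketch: assumes input_fn returns a tf.data.Dataset.
  """
  def sharded_input_fn():
    dataset = input_fn()
    # Worker i keeps every num_workers-th record starting at offset i,
    # so the shards are disjoint and together cover the whole dataset.
    return dataset.shard(num_workers, worker_index)
  return sharded_input_fn
```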

If you have any suggestions or a pull request, I'm happy to chat more.

cweill avatar Jan 07 '19 02:01 cweill