PaddleCloud Some points found in debugging.

If the pserver can't save a checkpoint, the job can't be fault-tolerant. Should the whole job exit?
If num_sample == 0 when we run the evaluator, what cost should be displayed?
Should we support splitting a job by file list, not only by recordio format? Then users could use their own format without converting to recordio files.
- And should we support splitting txt files into tasks?
The name cloud_reader is confusing: it reads training data, so one might assume it can also read test data from the cloud. Should we rename it to get_task_train_data?

Oct 24 '17 12:10 gongweibao

If the pserver can't save a checkpoint, the job can't be fault-tolerant. Should the whole job exit?

We don't currently consider pserver fault tolerance. It's not a blocking issue, but we need to fix it later.

If num_sample == 0 when we run the evaluator, what cost should be displayed?

If num_sample == 0, trainer.py would never call event_handler, so no cost would be displayed at all.
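
A rough sketch of why that would hold, assuming the handler is invoked once per batch. The `train` function, the event dict, and `empty_reader` are hypothetical stand-ins for illustration, not the actual trainer.py code:

```python
def train(reader, event_handler):
    """Hypothetical training loop: the handler fires once per batch."""
    for batch_id, batch in enumerate(reader()):
        # Stand-in for the real cost computation on a batch.
        cost = sum(batch) / len(batch)
        event_handler({"batch_id": batch_id, "cost": cost})


def empty_reader():
    """A reader that yields no batches, i.e. num_sample == 0."""
    return iter([])


events = []
train(empty_reader, events.append)
# With no batches the loop body never runs, so the handler is never called.
```

Under this assumption the evaluator never sees a zero-sample batch, so there is no cost value to choose a display for.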

Should we support splitting a job by file list, not only by recordio format?

I agree. I remember there was an issue for that.
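
For discussion, a minimal sketch of what splitting by file list could look like, assuming a task is simply a group of file paths. `split_into_tasks` and `files_per_task` are hypothetical names, not part of any existing PaddleCloud API:

```python
def split_into_tasks(file_list, files_per_task):
    """Group file paths into tasks of at most files_per_task files each."""
    if files_per_task <= 0:
        raise ValueError("files_per_task must be positive")
    return [
        file_list[i:i + files_per_task]
        for i in range(0, len(file_list), files_per_task)
    ]


# Three txt files split into tasks of up to two files each.
tasks = split_into_tasks(["a.txt", "b.txt", "c.txt"], 2)
```

Because each task only carries paths, the file format (txt, recordio, or anything else) would be left to whatever reader the user supplies.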

Oct 24 '17 13:10 typhoonzero

If the pserver can't save a checkpoint, the job can't be fault-tolerant. Should the whole job exit?

Did you find that the pserver can't save a checkpoint? I have a fix here: https://github.com/PaddlePaddle/Paddle/pull/5053

Should we support splitting a job by file list, not only by recordio format?

That is a good point, but probably not for this experiment. (This can be fixed after the refactor, when everything becomes an OP.)

The name cloud_reader is confusing: it reads training data, so one might assume it can also read test data from the cloud. Should we rename it to get_task_train_data?

Do you mean we need to express the fact that cloud_reader is just for training, not for testing? Yes, that's a good point. get_task_train_data is one candidate; maybe we should think of more candidates before we decide. Let's not worry about this one now, and focus on finishing the experiment and fixing all the bugs found by the experiment.

Oct 24 '17 20:10 helinwang

Yes, all of these are just for discussion and should not block the experiment.

Oct 25 '17 01:10 gongweibao
