Model Saved with TF-2.5.0

Open doufs opened this issue 3 years ago • 3 comments

Hi, excuse me, I have a problem with model saving. When compat.export_saved_model is called, the model is saved to a different path on the chief node than on the non-chief nodes. How do we then load the model for prediction when the paths differ?

Also, if the cloud file system used in our cluster is not supported by TF, can we save the model to a local path and then copy the model files to the cloud file system manually in code (for example, os.system("hadoop fs -put localPath remotePath"))?

Do we only need to save the model files on the chief node? Could you explain this in detail? Thank you very much.

doufs, Sep 23 '21 09:09

The compat code was mostly there to account for API and behavior differences across TensorFlow versions, since users of TFoS were on a range of TensorFlow versions.

Anyhow, if you set the export_dir to an HDFS path, all chief/worker nodes would attempt to write to the same shared location, leading to various I/O errors. So, this just redirects non-chief workers to write to local disk, while the chief worker writes to HDFS. In this case, we just consider the HDFS model as the source of truth, while the various worker_models are discarded when the executor containers shut down.
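As a rough illustration of the redirect described above (not the actual TFoS compat code), a minimal sketch might look like the following. The function name `resolve_export_dir`, its parameters, and the `worker_model_<index>` directory naming are all assumptions made for this example.

```python
import os
import tempfile


def resolve_export_dir(export_dir, is_chief, worker_index):
    """Pick where a given node should write its SavedModel.

    Hypothetical helper: the chief keeps the shared path (e.g. HDFS),
    while every other worker is pointed at a throwaway local directory
    so that nodes never write to the same shared location.
    """
    if is_chief:
        # The chief's copy is the one treated as the source of truth.
        return export_dir
    # Non-chief copies land on local disk and are simply discarded
    # when the executor container shuts down.
    return os.path.join(tempfile.gettempdir(),
                        "worker_model_{}".format(worker_index))
```

Every node still calls save; only the destination differs, which is why only the chief's path needs to be used for loading later.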

If your cloud filesystem isn't supported by TF, then yes, you could just save to local disk and then copy it to your filesystem later. In this case, each chief/worker would write to its own local disk, so you won't have any I/O conflicts, and you shouldn't need this compat code at all. Instead, you could just use:

model.save(export_dir, save_format='tf')
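For the "save locally, then copy" approach, a minimal sketch could look like this. It assumes the `hadoop` CLI is on the PATH of the node doing the copy, and the helper names (`build_put_command`, `save_and_upload`), the `is_chief` flag, and the injectable `run` parameter are hypothetical names for illustration only.

```python
import subprocess


def build_put_command(local_path, remote_path):
    """Build the `hadoop fs -put` command as an argument list."""
    return ["hadoop", "fs", "-put", local_path, remote_path]


def save_and_upload(model, local_path, remote_path, is_chief,
                    run=subprocess.check_call):
    """Save to local disk on every node, then upload from the chief only.

    `model` is expected to be a Keras model; `run` is injectable so the
    upload step can be swapped out (assumption made for this sketch).
    Returns True if this node uploaded its copy, False otherwise.
    """
    # Each node writes to its own local disk, so there are no conflicts.
    model.save(local_path, save_format='tf')
    if is_chief:
        # Raises CalledProcessError if the copy fails.
        run(build_put_command(local_path, remote_path))
        return True
    return False
```

Passing the command as a list to `subprocess.check_call` avoids shell quoting problems that `os.system` with a single string can run into, and it surfaces a failed copy as an exception.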

leewyang, Sep 23 '21 20:09

@leewyang Thanks for your reply. As you said, the various worker_models are discarded when the executor containers shut down, so we only keep the model files saved by the chief node. Is that a complete model (I mean, will anything be missing or lost)? Sorry, I may be confused about the logic of saving a model in distributed training.

doufs, Sep 27 '21 07:09

Yes, it is the complete model, and unfortunately, this is just how TF works at the moment.

leewyang, Sep 27 '21 17:09