TensorFlowOnSpark
Model Saved with TF-2.5.0
Hi, I have a problem with model saving. When compat.export_saved_model is called, the model is saved to different paths on the chief node and the non-chief nodes. How do we then load the model for prediction when the paths differ?
Also, if the cloud file system used in our cluster is not supported by TF, can we save the model to a local path and then copy the model files to the cloud file system manually in code, for example with os.system("hadoop fs -put localPath remotePath")?
Do we only need to save the model files on the chief node? Could you explain this in detail? Thank you very much.
The compat code was mostly there to account for different APIs and behaviors between different versions of TensorFlow, since different users of TFoS were on different versions of TensorFlow.
Anyhow, if you set the export_dir to an HDFS path, all chief/worker nodes would attempt to write to the same shared location, leading to various I/O errors. So, this just redirects non-chief workers to write to local disk, while the chief worker writes to HDFS. And in this case, we just consider the HDFS model as the source of truth, while the various worker_models are discarded when the executor containers shut down.
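To make the redirect concrete, here is a minimal sketch of the idea. It is not the actual TFoS compat code. The helper name resolve_export_dir, the task_index argument, and the worker_model_N directory naming are assumptions made for illustration only.

```python
import os
import tempfile


def resolve_export_dir(export_dir, is_chief, task_index):
    """Pick the directory this node should write its SavedModel to.

    Hypothetical helper: the chief writes to the shared export_dir (e.g. HDFS),
    every other worker writes to a throwaway directory on its local disk.
    """
    if is_chief:
        return export_dir
    # Local scratch copy; it disappears with the executor container.
    return os.path.join(tempfile.gettempdir(), "worker_model_{}".format(task_index))


# Usage inside the training function (model is a tf.keras.Model):
#   model.save(resolve_export_dir(export_dir, is_chief, task_index), save_format='tf')
```

Every node still calls save, but only the chief's output lands in the shared location, so that is the path to load from for prediction.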
If your cloud filesystem isn't supported by TF, then yes, you could just save to local disk and then copy it to your filesystem later. And in this case, each chief/worker would write to separate local disks, so you won't have any I/O conflicts and shouldn't need this compat code at all. Instead, you could just use:
model.save(export_dir, save_format='tf')
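For the manual copy step, something along these lines might work. This is a rough sketch, not tested on a real cluster. It assumes the hadoop CLI is on the PATH of the executor and is configured for your filesystem. build_put_command and save_and_upload are hypothetical names, and uploading only from the chief is my assumption, based on the chief's copy being the one that is kept.

```python
import subprocess


def build_put_command(local_dir, remote_dir):
    """Build the `hadoop fs -put` argument list (hypothetical helper)."""
    return ["hadoop", "fs", "-put", local_dir, remote_dir]


def save_and_upload(model, local_dir, remote_dir, is_chief):
    """Save to local disk on every node, then upload from the chief only."""
    # Each node writes to its own local disk, so there are no I/O conflicts.
    model.save(local_dir, save_format='tf')
    if is_chief:
        # check=True raises CalledProcessError if the upload fails.
        subprocess.run(build_put_command(local_dir, remote_dir), check=True)
```

Passing the command as a list to subprocess.run avoids the shell quoting problems you can hit with os.system when paths contain spaces.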
@leewyang
Thanks for your reply. As you said, the various worker_models are discarded when the executor containers shut down, so only the model saved on the chief node is kept. Is that a complete model (I mean, will anything be missing or lost)? Sorry, I may be confused about how model saving works in distributed training.
Yes, it is the complete model, and unfortunately, this is just how TF works at the moment.