
[Orca] Save model states on hdfs needs to be enhanced

Open sgwhat opened this issue 2 years ago • 0 comments

Description

When users run Orca programs with the pyspark backend on Yarn, they must specify the IP and port of HDFS when saving model states. This is error-prone, and the resulting errors are difficult for users to diagnose and fix.

Current Usage

Currently, we require users to specify the IP and port of HDFS in model_dir when saving model states:

estimator.from_keras(model, loss, backend="spark", model_dir="hdfs://ip:port/work-dir")

When loading data from HDFS or saving a TF model to HDFS, however, users don't need to specify the IP and port:

estimator.save("hdfs://work-dir/checkpoint")

Error Caused

Users tend to specify model_dir the same way they do when loading data or saving a checkpoint, i.e. without the IP and port ("hdfs://work-dir"). This fails with the error below:

  File "/yarn/nm/usercache/manfei/appcache/application_1646614241228_1372/container_1646614241228_1372_01_000003/environment/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_pyspark_worker.py", line 144, in train_epochs
    save_pkl(state_dict, os.path.join(self.model_dir, "state.pkl"))
  File "/yarn/nm/usercache/manfei/appcache/application_1646614241228_1372/container_1646614241228_1372_01_000003/environment/lib/python3.7/site-packages/bigdl/orca/learn/utils.py", line 471, in save_pkl
    fs = pa.hdfs.connect(host=host_port[0], port=int(host_port[1]))
IndexError: list index out of range
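
The failure mode can be reproduced without a cluster. The snippet below is a minimal sketch, assuming the host and port are extracted by splitting the authority part of the path on ":" (the traceback only shows that `host_port` is indexed at 0 and 1). `split_host_port` is a hypothetical stand-in, not the actual function in utils.py, and the address `172.16.0.1:9000` is a placeholder.

```python
def split_host_port(path):
    # Hypothetical stand-in for the parsing step: take the authority part
    # of "hdfs://<authority>/<path>" and split it on ":".
    authority = path.split("://")[1].split("/")[0]
    return authority.split(":")


# With an explicit ip:port the list has two elements, so indexing works.
with_port = split_host_port("hdfs://172.16.0.1:9000/work-dir")
print(with_port[0], int(with_port[1]))

# Without ip:port, "work-dir" is taken as the authority and the list has
# only one element, so host_port[1] raises IndexError.
host_port = split_host_port("hdfs://work-dir/checkpoint")
try:
    port = int(host_port[1])
except IndexError as err:
    print("IndexError:", err)
```

Under this assumption, any model_dir whose first component after "hdfs://" contains no ":" hits the same IndexError.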

Related Code

https://github.com/intel-analytics/BigDL/blob/45136d037940d62d8cd0412d64e91a05f9a2f370/python/orca/src/bigdl/orca/learn/utils.py#L464

Requiring users to specify the IP and port of HDFS may not be a good solution, since some users don't know how to specify them. We should enhance the code so that paths without an IP and port are also handled. @jenniew @hkvision
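
One possible direction, as a rough sketch rather than a proposed patch: parse the path with `urllib.parse.urlparse` and fall back to a default host and port when none is given. The helper name `resolve_hdfs_target` is hypothetical. The fallback values `"default"` and `0` are an assumption, based on my understanding that the legacy `pa.hdfs.connect` treats them as "use the filesystem configured in the Hadoop config"; this should be verified before relying on it. The sketch also assumes that a path without a port never names an HA nameservice, which may not hold on every cluster.

```python
from urllib.parse import urlparse


def resolve_hdfs_target(path, default_host="default", default_port=0):
    """Return (host, port, file_path) for an hdfs:// path.

    Hypothetical helper. If the path has no explicit port, the first
    component is treated as part of the file path and the defaults are
    returned (assumption: the defaults mean "use the configured HDFS").
    """
    parsed = urlparse(path)
    if parsed.scheme != "hdfs":
        raise ValueError("expected an hdfs:// path, got: " + path)
    if parsed.port is not None:
        # Explicit "hdfs://ip:port/dir" form.
        return parsed.hostname, parsed.port, parsed.path or "/"
    # "hdfs://work-dir/checkpoint" form: urlparse puts "work-dir" into
    # netloc, so it is moved back to the front of the file path.
    if parsed.netloc:
        file_path = "/" + parsed.netloc + parsed.path
    else:
        file_path = parsed.path
    return default_host, default_port, file_path


print(resolve_hdfs_target("hdfs://172.16.0.1:9000/work-dir"))
print(resolve_hdfs_target("hdfs://work-dir/checkpoint"))
```

The returned host and port could then be passed to whatever HDFS client the save and load helpers use, so that both path styles behave consistently.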

sgwhat avatar Aug 02 '22 03:08 sgwhat