ipex-llm
[Orca] Save model states on hdfs needs to be enhanced
Description
When users run Orca programs with the pyspark backend on Yarn, they must specify the ip and port of hdfs when saving model states. This can cause errors that are difficult for users to fix.
Current Usage
Currently, we require users to specify the ip and port of hdfs when saving model states:
estimator.from_keras(model, loss, backend="spark", model_dir="hdfs://ip:port/work-dir")
When loading data from hdfs or saving a tf model on hdfs, users don't need to specify the ip and port, like:
estimator.save("hdfs://work-dir/checkpoint")
Error Caused
Users are likely to specify model_dir the same way as when loading data or saving a checkpoint ("hdfs://work-dir"), which returns an error like the one below:
File "/yarn/nm/usercache/manfei/appcache/application_1646614241228_1372/container_1646614241228_1372_01_000003/environment/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_pyspark_worker.py", line 144, in train_epochs
save_pkl(state_dict, os.path.join(self.model_dir, "state.pkl"))
File "/yarn/nm/usercache/manfei/appcache/application_1646614241228_1372/container_1646614241228_1372_01_000003/environment/lib/python3.7/site-packages/bigdl/orca/learn/utils.py", line 471, in save_pkl
fs = pa.hdfs.connect(host=host_port[0], port=int(host_port[1]))
IndexError: list index out of range
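The traceback shows a list named host_port being indexed at [0] and [1]. A minimal sketch of how this kind of failure can arise, assuming the host and port are obtained by splitting the authority part of the path on ":" (the helper parse_host_port below is hypothetical, written for illustration, and is not copied from utils.py):

```python
def parse_host_port(path):
    # Hypothetical helper: take the part between "://" and the first "/",
    # then split it on ":" expecting [host, port].
    authority = path.split("://")[1].split("/")[0]
    return authority.split(":")


# With an explicit ip and port, the split yields two elements.
ok = parse_host_port("hdfs://172.16.0.1:9000/work-dir/state.pkl")

# Without a port, the split yields a single element, so index 1 is missing.
bad = parse_host_port("hdfs://work-dir/state.pkl")

try:
    port = int(bad[1])
    error = None
except IndexError as exc:
    error = str(exc)
```

Under this assumption, any model_dir without an explicit "ip:port" authority fails before an hdfs connection is attempted.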
Related Code
https://github.com/intel-analytics/BigDL/blob/45136d037940d62d8cd0412d64e91a05f9a2f370/python/orca/src/bigdl/orca/learn/utils.py#L464
Requiring users to specify the ip and port of hdfs may not be a good solution, since some users don't know how to specify them. We should enhance the code so that model_dir also works without them. @jenniew @hkvision
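One possible direction, sketched here only as a suggestion: parse model_dir with the standard library and fall back to default connection values when no port is given. The function name split_hdfs_uri and the rule "no port means the authority is really the first path component" are assumptions made for illustration. The fallback values host="default" and port=0 are assumed to match the defaults of the legacy pa.hdfs.connect, which defer to the cluster's Hadoop configuration; this should be verified against the pyarrow version in use.

```python
from urllib.parse import urlparse


def split_hdfs_uri(uri, default_host="default", default_port=0):
    """Return (host, port, path) for an hdfs URI.

    Hypothetical helper. If the URI carries no port, assume the user
    wrote "hdfs://work-dir/..." meaning a path on the default filesystem.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError("not an hdfs URI: %s" % uri)
    if parsed.port is not None:
        # Explicit "ip:port" authority, e.g. hdfs://172.16.0.1:9000/work-dir
        return parsed.hostname, parsed.port, parsed.path or "/"
    # No port: fold the authority (if any) back into the path.
    path = "/" + parsed.netloc + parsed.path if parsed.netloc else parsed.path
    return default_host, default_port, path


with_port = split_hdfs_uri("hdfs://172.16.0.1:9000/work-dir")
without_port = split_hdfs_uri("hdfs://work-dir/checkpoint")
```

The returned host and port could then be passed to the hdfs connection call in place of host_port[0] and host_port[1]. A hostname given without a port would be misread as a path under this rule, so the final design needs to decide how to treat that case.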