BigDL-2.x
[BigDL2.0] autoestimator_pytorch cannot save the model to an HDFS path on k8s
http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL-NB-K8s-ExampleTests/152/console
(pid=244, ip=172.30.27.4) /opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/model/base_pytorch_model.py:180: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
(pid=244, ip=172.30.27.4)   return torch.from_numpy(inp)
(pid=244, ip=172.30.27.4)
0%| | 0/16 [00:00<?, ?it/s]/usr/local/envs/pytf1/lib/python3.7/site-packages/torch/autograd/__init__.py:132: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
(pid=244, ip=172.30.27.4)   allow_unreachable=True) # allow_unreachable flag
(pid=244, ip=172.30.27.4)
Loss: 0.6922382116317749: 0%| | 0/16 [00:00<?, ?it/s]
Loss: 0.4504893720149994: 6%|▋ | 1/16 [00:00<00:00, 50.22it/s]
(pid=244, ip=172.30.27.4)
Loss: 0.27864789962768555: 12%|█▎ | 2/16 [00:00<00:00, 82.55it/s]
Loss: 0.18915259838104248: 19%|█▉ | 3/16 [00:00<00:00, 106.19it/s]
Loss: 0.112899050116539: 25%|██▌ | 4/16 [00:00<00:00, 124.31it/s]
Loss: 0.09547075629234314: 31%|███▏ | 5/16 [00:00<00:00, 138.47it/s]
Loss: 0.029641583561897278: 38%|███▊ | 6/16 [00:00<00:00, 150.55it/s]
Loss: 0.056755051016807556: 44%|████▍ | 7/16 [00:00<00:00, 160.61it/s]
Loss: 0.019430123269557953: 50%|█████ | 8/16 [00:00<00:00, 170.19it/s]
Loss: 0.002557608764618635: 56%|█████▋ | 9/16 [00:00<00:00, 178.60it/s]
Loss: 0.004579346626996994: 62%|██████▎ | 10/16 [00:00<00:00, 185.35it/s]
Loss: 0.0019340637372806668: 69%|██████▉ | 11/16 [00:00<00:00, 192.40it/s]
Loss: 0.00223898165859282: 75%|███████▌ | 12/16 [00:00<00:00, 198.61it/s]
Loss: 0.005255652591586113: 81%|████████▏ | 13/16 [00:00<00:00, 200.80it/s]
Loss: 0.00018203322542831302: 88%|████████▊ | 14/16 [00:00<00:00, 206.26it/s]
Loss: 0.055765699595212936: 94%|█████████▍| 15/16 [00:00<00:00, 212.25it/s]
Loss: 0.055765699595212936: 100%|██████████| 16/16 [00:00<00:00, 225.74it/s]
(pid=245, ip=172.30.27.4) /opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/model/base_pytorch_model.py:180: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
(pid=245, ip=172.30.27.4)   return torch.from_numpy(inp)
(pid=245, ip=172.30.27.4)
0%| | 0/16 [00:00<?, ?it/s]/usr/local/envs/pytf1/lib/python3.7/site-packages/torch/autograd/__init__.py:132: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
(pid=245, ip=172.30.27.4)   allow_unreachable=True) # allow_unreachable flag
(pid=245, ip=172.30.27.4)
Loss: 0.6456587314605713: 0%| | 0/16 [00:00<?, ?it/s]
(pid=244, ip=172.30.27.4) 2021-11-04 00:35:35,556 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=244, ip=172.30.27.4) Traceback (most recent call last):
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=244, ip=172.30.27.4)     self._entrypoint()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=244, ip=172.30.27.4)     self._status_reporter.get_checkpoint())
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=244, ip=172.30.27.4)     output = fn()
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 325, in train_func
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
(pid=244, ip=172.30.27.4)     if remote_ckpt_basename not in get_remote_list(remote_dir):
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 46, in get_remote_list
(pid=244, ip=172.30.27.4)     s_output, _ = process(args)
(pid=244, ip=172.30.27.4) TypeError: cannot unpack non-iterable NoneType object
(pid=244, ip=172.30.27.4) Exception in thread Thread-2:
(pid=244, ip=172.30.27.4) Traceback (most recent call last):
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=244, ip=172.30.27.4)     self.run()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=244, ip=172.30.27.4)     raise e
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=244, ip=172.30.27.4)     self._entrypoint()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=244, ip=172.30.27.4)     self._status_reporter.get_checkpoint())
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=244, ip=172.30.27.4)     output = fn()
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 325, in train_func
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
(pid=244, ip=172.30.27.4)     if remote_ckpt_basename not in get_remote_list(remote_dir):
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 46, in get_remote_list
(pid=244, ip=172.30.27.4)     s_output, _ = process(args)
(pid=244, ip=172.30.27.4) TypeError: cannot unpack non-iterable NoneType object
(pid=244, ip=172.30.27.4)
(pid=245, ip=172.30.27.4)
Loss: 0.4749995172023773: 6%|▋ | 1/16 [00:00<00:00, 48.86it/s]
Loss: 0.3644247055053711: 12%|█▎ | 2/16 [00:00<00:00, 81.42it/s]
Loss: 0.19700123369693756: 19%|█▉ | 3/16 [00:00<00:00, 105.65it/s]
Loss: 0.15083497762680054: 25%|██▌ | 4/16 [00:00<00:00, 123.93it/s]
Loss: 0.1125955805182457: 31%|███▏ | 5/16 [00:00<00:00, 138.76it/s]
Loss: 0.07053384184837341: 38%|███▊ | 6/16 [00:00<00:00, 150.92it/s]
Loss: 0.04681260883808136: 44%|████▍ | 7/16 [00:00<00:00, 161.47it/s]
Loss: 0.02035798318684101: 50%|█████ | 8/16 [00:00<00:00, 170.66it/s]
Loss: 0.012909774668514729: 56%|█████▋ | 9/16 [00:00<00:00, 178.95it/s]
Loss: 0.0078040556982159615: 62%|██████▎ | 10/16 [00:00<00:00, 186.17it/s]
Loss: 0.04752806946635246: 69%|██████▉ | 11/16 [00:00<00:00, 192.78it/s]
Loss: 0.019220085814595222: 75%|███████▌ | 12/16 [00:00<00:00, 198.82it/s]
Loss: 0.010350744239985943: 81%|████████▏ | 13/16 [00:00<00:00, 200.81it/s]
Loss: 0.0005109629710204899: 88%|████████▊ | 14/16 [00:00<00:00, 206.25it/s]
(pid=244, ip=172.30.27.4)
(pid=244, ip=172.30.27.4) /bin/sh: hdfs: command not found
@yushan111
AutoEstimator currently supports distributed mode only on clusters that have HDFS (the `hdfs` command must be available on the workers), so it does not support k8s for now.
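For anyone hitting the same trace: the log suggests the `TypeError` is a secondary symptom, since the shell reports `hdfs: command not found` and `get_remote_list` then tries to unpack a `None` result. The sketch below is not the BigDL code; it is a hypothetical stand-alone helper (`list_remote_dir`, `run_command`, and `HdfsUnavailableError` are names made up for illustration) showing how a missing `hdfs` client could be detected up front so the failure message points at the real cause. It assumes the worker has Python 3.7+ and that listing is done with `hdfs dfs -ls -C`.

```python
import shutil
import subprocess


class HdfsUnavailableError(RuntimeError):
    """Raised when the `hdfs` command-line client is not on PATH (hypothetical name)."""


def run_command(args):
    """Run a command and return (stdout, stderr); raise if it exits non-zero."""
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(args)} failed: {result.stderr.strip()}")
    return result.stdout, result.stderr


def list_remote_dir(remote_dir, hdfs_bin="hdfs"):
    """Return the base names of the entries under remote_dir.

    Checks for the client first, so a missing binary produces a clear error
    instead of a later failure while unpacking an empty result.
    """
    if shutil.which(hdfs_bin) is None:
        raise HdfsUnavailableError(
            f"'{hdfs_bin}' was not found on PATH; an HDFS checkpoint path "
            "cannot be used on this worker"
        )
    # `-C` asks `hdfs dfs -ls` to print paths only, one per line.
    stdout, _ = run_command([hdfs_bin, "dfs", "-ls", "-C", remote_dir])
    return [line.rsplit("/", 1)[-1] for line in stdout.splitlines() if line.strip()]
```

On a k8s worker image without the Hadoop client, calling `list_remote_dir("hdfs://...")` would raise `HdfsUnavailableError` immediately, which is easier to diagnose than the unpacking error in the trace above.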