simclr
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0: File system scheme '[local]' not implemented (file: 'pretrain_model/model.ckpt-0_temp_8abb8541792446a4a05c117665873ac7/part-00000-of-00001')
Hi, thank you so much for sharing your work!
I have been trying to reproduce the CIFAR-10 results on Colab using the following command from the README:
!python run.py --train_mode=pretrain --train_batch_size=256 --train_epochs=400 \
--learning_rate=0.2 --learning_rate_scaling=sqrt --proj_out_dim=64 --num_proj_layers=2 \
--weight_decay=1e-4 --temperature=0.2 \
--dataset=cifar10 --data_dir=cifar10/ \
--image_size=32 --eval_split=test --resnet_depth=18 --use_blur=False --color_jitter_strength=0.5 \
--model_dir=pretrain_model \
--use_tpu=True --tpu_name=grpc://10.97.12.250:8470 --cache_dataset=True
but it gives me the following error:
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0: File system scheme '[local]' not implemented (file: 'pretrain_model/model.ckpt-0_temp_8abb8541792446a4a05c117665873ac7/part-00000-of-00001') [[node save/SaveV2 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Full logs are available here: https://pastebin.pl/view/26301940
Any help will be highly appreciated. Thanks!
@chentingpc
Not entirely sure, but it seems model_dir is not compatible with your file system. Maybe try running locally (e.g. unset tpu_name) and see if there's still an issue with the file system?
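For readers who hit the same message: Cloud TPU workers run on a separate host and generally cannot write to the Colab VM's local disk, so a relative model_dir such as pretrain_model is a likely (though not confirmed in this thread) cause of the "File system scheme '[local]' not implemented" error. The sketch below is a hypothetical pre-flight check. The helper name check_model_dir is not part of the SimCLR code, and the bucket name in the usage lines is a placeholder.

```python
from urllib.parse import urlparse


def check_model_dir(model_dir: str, use_tpu: bool) -> str:
    """Hypothetical helper (not part of run.py).

    Returns "ok", or a warning string when a TPU is requested but
    model_dir is not a Google Cloud Storage (gs://) path.
    """
    scheme = urlparse(model_dir).scheme  # "" for plain local paths
    if use_tpu and scheme != "gs":
        return (
            "warning: model_dir=%r is not a gs:// path; "
            "a remote TPU worker may be unable to write checkpoints there"
            % model_dir
        )
    return "ok"


# Usage with the value from the command above and a placeholder bucket.
print(check_model_dir("pretrain_model", use_tpu=True))
print(check_model_dir("gs://my-bucket/pretrain_model", use_tpu=True))
```

If the check warns, pointing --model_dir at a bucket the TPU can access is the usual thing to try; whether it resolves this particular setup is an assumption.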
Hi @chentingpc thanks for your reply!
By running locally, do you mean running on my pc? If so, unfortunately, I can't do that because of system constraints. However, as per your suggestion, I tried unsetting tpu_name in colab itself. It gave me this error:
Traceback (most recent call last):
  File "run.py", line 449, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run.py", line 396, in main
    cluster = tf.distribute.cluster_resolver.TPUClusterResolver()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 262, in __init__
    raise ValueError('Please provide a TPU Name to connect to.')
ValueError: Please provide a TPU Name to connect to.
Did you set use_tpu=False when you disabled tpu_name?
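The traceback above shows run.py still constructing a TPUClusterResolver after tpu_name was dropped, which suggests the TPU code path stays active unless use_tpu is turned off explicitly. A minimal sketch of that flag pairing follows. The function check_tpu_flags is hypothetical and only illustrates the rule; it is not how run.py validates its flags.

```python
from typing import Optional


def check_tpu_flags(use_tpu: bool, tpu_name: Optional[str]) -> None:
    """Hypothetical check: a TPU run needs a TPU name or address."""
    if use_tpu and not tpu_name:
        raise ValueError(
            "use_tpu is True but tpu_name is empty; "
            "pass --use_tpu=False to run without a TPU"
        )


# Dropping only tpu_name reproduces the inconsistent combination.
try:
    check_tpu_flags(use_tpu=True, tpu_name=None)
except ValueError as err:
    print(err)

# Disabling the TPU path as well is consistent and raises nothing.
check_tpu_flags(use_tpu=False, tpu_name=None)
```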
Hi, after removing the use_tpu and tpu_name args from the command, the above error was resolved. However, this is the new error:
ValueError: mesh_shape must be a vector of size 3 with positive entries; got [2 2 1 2]
My dependency versions are as follows (from requirements.txt):
tensorflow==1.15.4, tensorflow-datasets==3.1.0, tensorflow-hub==0.8.0
Do you have any idea how this can be resolved?
Thanks again for your help :)