simclr

ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0: File system scheme '[local]' not implemented (file: 'pretrain_model/model.ckpt-0_temp_8abb8541792446a4a05c117665873ac7/part-00000-of-00001')

Open Riya-11 opened this issue 3 years ago • 4 comments

Hi, thank you so much for sharing your work!

I have been trying to reproduce the CIFAR-10 results on Colab using the following command from the README:

!python run.py --train_mode=pretrain  --train_batch_size=256 --train_epochs=400  \
--learning_rate=0.2 --learning_rate_scaling=sqrt --proj_out_dim=64 --num_proj_layers=2 \
--weight_decay=1e-4 --temperature=0.2 \
--dataset=cifar10 --data_dir=cifar10/ \
--image_size=32 --eval_split=test --resnet_depth=18   --use_blur=False --color_jitter_strength=0.5  \
--model_dir=pretrain_model \
--use_tpu=True --tpu_name=grpc://10.97.12.250:8470 --cache_dataset=True

but it gives me the following error:

ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0: File system scheme '[local]' not implemented (file: 'pretrain_model/model.ckpt-0_temp_8abb8541792446a4a05c117665873ac7/part-00000-of-00001') [[node save/SaveV2 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Full logs are available here: https://pastebin.pl/view/26301940

Any help will be highly appreciated. Thanks!

@chentingpc

Riya-11, May 02 '21 21:05

Not entirely sure, but it seems model_dir is not compatible with your file system. Maybe try running locally (e.g. unset tpu_name) and see if there's still an issue with the file system?

chentingpc, May 03 '21 14:05
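The file-system suggestion above can be checked before launching a run. The following is a minimal sketch, not part of run.py, and it rests on an assumption: that a remote TPU worker cannot write to the Colab VM's local disk, so a path without a remote scheme such as gs:// is likely to trigger the "File system scheme '[local]' not implemented" error. The helper names and the bucket name `my-bucket` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: only Google Cloud Storage paths are treated as reachable
# from a remote TPU worker in this sketch.
REMOTE_SCHEMES = {"gs"}


def path_scheme(path: str) -> str:
    """Return the URL scheme of `path`, or 'local' if it has none."""
    scheme = urlparse(path).scheme
    return scheme if scheme else "local"


def check_model_dir(model_dir: str, use_tpu: bool) -> Optional[str]:
    """Return a warning string if model_dir looks unusable on a TPU, else None."""
    if not use_tpu:
        # Local paths are fine when training runs on the local machine.
        return None
    scheme = path_scheme(model_dir)
    if scheme not in REMOTE_SCHEMES:
        return (
            f"model_dir={model_dir!r} has scheme {scheme!r}; "
            "a remote TPU worker may not be able to write checkpoints there."
        )
    return None


if __name__ == "__main__":
    # The path from the original command, and a hypothetical bucket path.
    print(check_model_dir("pretrain_model", use_tpu=True))
    print(check_model_dir("gs://my-bucket/pretrain_model", use_tpu=True))
```

If the check returns a warning, pointing --model_dir (and possibly --data_dir) at a storage bucket that the TPU can reach is one thing to try.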

Hi @chentingpc thanks for your reply!

By running locally, do you mean running on my PC? If so, unfortunately, I can't do that because of system constraints. However, as per your suggestion, I tried unsetting tpu_name in Colab itself. It gave me this error:

Traceback (most recent call last):
  File "run.py", line 449, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run.py", line 396, in main
    cluster = tf.distribute.cluster_resolver.TPUClusterResolver()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 262, in __init__
    raise ValueError('Please provide a TPU Name to connect to.')
ValueError: Please provide a TPU Name to connect to.

Riya-11, May 03 '21 18:05

Did you set use_tpu=False when you disabled tpu_name?

chentingpc, May 04 '21 03:05
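The point about keeping use_tpu and tpu_name consistent can be illustrated with a small command builder. This is a sketch only: `build_run_args` is a hypothetical helper, not something in the repository, and it uses only flags that appear in the command quoted earlier in this thread. It refuses the combination that produced the "Please provide a TPU Name to connect to." error.

```python
from typing import List, Optional


def build_run_args(
    model_dir: str, use_tpu: bool, tpu_name: Optional[str] = None
) -> List[str]:
    """Build an argument list for run.py with consistent TPU flags."""
    if use_tpu and not tpu_name:
        # Mirrors the failure in the traceback: a TPU run needs a TPU name.
        raise ValueError("use_tpu=True requires a tpu_name")
    args = [
        "python",
        "run.py",
        "--train_mode=pretrain",
        "--dataset=cifar10",
        f"--model_dir={model_dir}",
        f"--use_tpu={use_tpu}",  # rendered as --use_tpu=True or --use_tpu=False
    ]
    if use_tpu:
        # Only pass --tpu_name when a TPU is actually requested.
        args.append(f"--tpu_name={tpu_name}")
    return args


if __name__ == "__main__":
    # Non-TPU run: use_tpu is set to False explicitly and tpu_name is omitted.
    print(" ".join(build_run_args("pretrain_model", use_tpu=False)))
```

Passing use_tpu=False explicitly, rather than only dropping tpu_name, keeps the run from depending on whatever default the flag has.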

Hi, after removing the use_tpu and tpu_name args from the command, the above error was resolved. However, I now get a new error:

ValueError: mesh_shape must be a vector of size 3 with positive entries; got [2 2 1 2]

My dependency versions are as follows (from requirements.txt):

tensorflow==1.15.4, tensorflow-datasets==3.1.0, tensorflow-hub==0.8.0

Do you have any idea how this can be resolved?

Thanks again for your help :)

Riya-11, May 04 '21 16:05