
FileNotFoundError on file.read

zafercavdar opened this issue on Oct 11 '19 · 17 comments

I uploaded a file to GCS and was able to read it without problem. After that, I deleted the file and uploaded it again with the same name. When I called the file.read function again, I received a FileNotFoundError raised from _fetch_range, even though the file exists. Full traceback:

    byte_str = csv_file.read(4096)
  File "/opt/conda/default/lib/python3.6/site-packages/fsspec/spec.py", line 1040, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/opt/conda/default/lib/python3.6/site-packages/fsspec/core.py", line 464, in _fetch
    self.cache = self.fetcher(start, end + self.blocksize)
  File "</opt/conda/default/lib/python3.6/site-packages/decorator.py:decorator-gen-22>", line 2, in _fetch_range
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 54, in _tracemethod
    return f(self, *args, **kwargs)
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 1067, in _fetch_range
    headers=head)
  File "</opt/conda/default/lib/python3.6/site-packages/decorator.py:decorator-gen-2>", line 2, in _call
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 54, in _tracemethod
    return f(self, *args, **kwargs)
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 462, in _call
    validate_response(r, path)
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 157, in validate_response
    raise FileNotFoundError
FileNotFoundError

I invalidated the cache and cleared the instance cache before reading the file, but neither helped.

Versions: gcsfs==0.3.0, dask==2.1.0

zafercavdar · Oct 11 '19 22:10

Can you please give the sequence of calls you ran so I can understand better?

martindurant · Oct 11 '19 23:10

@zafercavdar does the problem still occur if you invalidate the cache,

fs.invalidate_cache()

before reading?

Are you uploading the file with gcsfs, or another system?

TomAugspurger · Nov 08 '19 20:11
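
For reference, the suggested check is roughly this (a sketch; the bucket and file names are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem()
fs.invalidate_cache()  # drop any cached directory listings
with fs.open("my-bucket/data.csv", "rb") as f:  # hypothetical path
    byte_str = f.read(4096)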

I'm getting the exact same behavior in a cluster of worker processes, and I'm having a hard time reproducing it. It looks something like this:

  1. Master machine creates file at path X
  2. Master enqueues a work item to process path X
  3. Worker picks up the task about two minutes later and raises FileNotFoundError when reading X (with the same traceback as above)

Caching was my first suspicion, but this looks like a pretty weird place to be failing if caching is the culprit. While I can't reproduce it by running a Python shell and calling .open() and friends, I have a bunch of workers that seem to be in a state that trips over this pretty frequently -- happy to probe more if you can suggest a direction to investigate.

(gcsfs==0.3.0, fsspec==0.6.0)

JohnEmhoff · Nov 24 '19 12:11

@JohnEmhoff can you try with gcsfs 0.4.0 first?

TomAugspurger · Nov 25 '19 12:11

Sure, I'll give 0.4.0 a try. On that note, would you consider pinning dependencies in releases, especially fsspec? I could be wrong, but this feels like a regression -- we're pinned to 0.3.0 and occasionally see things like this come up.

JohnEmhoff · Nov 25 '19 15:11

Typically applications pin dependencies, not libraries. Why do you think this is a version-dependency issue between gcsfs and fsspec?

TomAugspurger · Nov 25 '19 15:11

I assume it's an issue between gcsfs and fsspec because this started happening out of the blue even though we were still using the same gcsfs version. We've been bitten by updates to fsspec in the past so I just thought the chances were good it's happening again.

Yes, we're pinning to 0.3.0 but I guess we could pin fsspec as well. It's a bit muddy because we're also building a library. All else being equal, it would just be nice to have a version we can depend on being stable over time.

JohnEmhoff · Nov 25 '19 19:11

So how about we make the dircache attribute of FSs (those that use it at all) also be pluggable, so that users can choose a dummy one that doesn't store anything or, say, one that expires entries after a given time or in an LRU fashion?

martindurant · Nov 26 '19 22:11
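
A rough sketch of what such a pluggable, expiring dircache could look like (illustrative names only, not an existing gcsfs/fsspec API):

import time

class TTLDirCache(dict):
    """Dict-like listings cache whose entries expire after ttl seconds."""

    def __init__(self, ttl=60):
        super().__init__()
        self.ttl = ttl
        self._times = {}

    def __setitem__(self, path, listing):
        self._times[path] = time.time()
        super().__setitem__(path, listing)

    def __getitem__(self, path):
        if time.time() - self._times.get(path, 0) > self.ttl:
            # Stale entry: evict it so the lookup misses and the
            # filesystem re-lists the path from GCS.
            self.pop(path, None)
        return super().__getitem__(path)

A dummy variant that ignores __setitem__ entirely would give fresh listings on every call, at the cost of extra API requests.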

I'm of the opinion that caching should be done at the application level rather than in the library, although I know there are a lot of different use cases out there.

JohnEmhoff · Nov 27 '19 18:11

Having this same issue with pandas==1.0.0, gcsfs==0.6.0, and fsspec==0.6.2 when loading via pd.read_csv

ZaxR · Feb 04 '20 19:02

@ZaxR can you post a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

My recollection was that this issue was resolved.

TomAugspurger · Feb 04 '20 19:02

I can confirm that it's a cache issue like you suggested above:

import gcsfs
import pandas as pd
from google.cloud import storage

bucket_name = "testing"
blob_name = "test.csv"
client = storage.Client()
bucket = client.get_bucket(bucket_name)

data = "some data"
bucket.blob(blob_name).upload_from_string(data)

# Test #1 succeeds: this first read creates the filesystem instance
# and caches the bucket listing
df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")

# Delete the blob and re-upload it under the same name
client.get_bucket(bucket_name).delete_blob(blob_name)
bucket.blob(blob_name).upload_from_string(data)

# Test #2 fails: pandas reuses the cached instance and its stale listing
df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")

# Test #3 fails: GCSFileSystem() returns the same cached instance
fs = gcsfs.GCSFileSystem()
with fs.open(f"{bucket_name}/{blob_name}", 'rb') as f:
    print(f.read())

# Test #4 succeeds once the listings cache is invalidated
fs.invalidate_cache()
with fs.open(f"{bucket_name}/{blob_name}", 'rb') as f:
    print(f.read())

ZaxR · Feb 04 '20 19:02

fsspec caches file-system instances, so that when you call df = pd.read_csv() the first time, you create the instance, and later calls get this same instance, with its cached file listing.

martindurant · Feb 04 '20 19:02
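
To illustrate the behavior (a sketch, assuming default storage options):

import gcsfs

fs1 = gcsfs.GCSFileSystem()
fs2 = gcsfs.GCSFileSystem()  # same options, so fsspec returns the cached instance
assert fs1 is fs2            # one shared instance, one shared listings cache

# pd.read_csv("gs://...") resolves to this same cached instance internally,
# so a listing cached by an earlier call can shadow a re-uploaded file.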

I see. I guess I'd agree with @JohnEmhoff re: caching living at the application level, though I'm not sure of all the implications of that. In my case, I just want to always be able to get a file if it currently exists on GCS. Alternatively, this could be considered an issue with pandas' read_csv, which to my knowledge doesn't have a flag to let me clear the cache or alter the fs object. In the meantime, any suggestions to get the desired effect?

ZaxR · Feb 04 '20 19:02
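
One possible workaround in the meantime (a sketch based on the repro above; because pandas resolves to the same cached instance, invalidating through a fresh handle clears the shared listings):

import gcsfs
import pandas as pd

# GCSFileSystem() with default options returns the cached instance that
# pd.read_csv will also use, so this clears the stale listing it holds.
gcsfs.GCSFileSystem().invalidate_cache()
df = pd.read_csv("gs://testing/test.csv")  # bucket/blob names from the repro above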

Having the same problem when spawning multiple workers to read from the same file via read(). Some of them randomly fail, even though I create the FS instance with gcsfs.GCSFileSystem(skip_instance_cache=True)

Using gcsfs version 2021.05.0

EDIT: Using the workaround proposed by @ZaxR, I was able to implement a single one-time retry (try/except FileNotFoundError) that calls fs.invalidate_cache(). However, this bug is rather annoying, as the file can actually be missing in our system.

Trollgeir · May 25 '21 14:05
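
The retry described above looks roughly like this (a sketch; the path is a placeholder):

import gcsfs

def read_with_one_retry(fs, path):
    # The first attempt may hit a stale listing; retry once after invalidating.
    try:
        with fs.open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        fs.invalidate_cache()
        with fs.open(path, "rb") as f:  # raises again if the file is truly gone
            return f.read()

fs = gcsfs.GCSFileSystem(skip_instance_cache=True)
data = read_with_one_retry(fs, "my-bucket/some/file.bin")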

I wonder if some of this is simply "eventual consistency" on Google's side? When a file is created or written, it is not necessarily visible to all clients immediately.

martindurant · Jun 14 '21 13:06

To my knowledge, GCS provides strong consistency for read-after-write:

https://cloud.google.com/storage/docs/consistency#strongly_consistent_operations

I'm running into read-after-write issues implementing GCS in https://github.com/rstudio/pins-python that I don't see with the S3 filesystem, so I can dig into this a bit!

machow · May 16 '22 18:05