gcsfs
FileNotFoundError on file.read
I uploaded a file to GCS and was able to read it without a problem. After that I deleted the file and uploaded it again with the same name. When I called the file.read function again, I received a FileNotFoundError from _fetch_range even though the file exists. Full traceback:
byte_str = csv_file.read(4096)
File "/opt/conda/default/lib/python3.6/site-packages/fsspec/spec.py", line 1040, in read
out = self.cache._fetch(self.loc, self.loc + length)
File "/opt/conda/default/lib/python3.6/site-packages/fsspec/core.py", line 464, in _fetch
self.cache = self.fetcher(start, end + self.blocksize)
File "</opt/conda/default/lib/python3.6/site-packages/decorator.py:decorator-gen-22>", line 2, in _fetch_range
File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 54, in _tracemethod
return f(self, *args, **kwargs)
File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 1067, in _fetch_range
headers=head)
File "</opt/conda/default/lib/python3.6/site-packages/decorator.py:decorator-gen-2>", line 2, in _call
File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 54, in _tracemethod
return f(self, *args, **kwargs)
File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 462, in _call
validate_response(r, path)
File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 157, in validate_response
raise FileNotFoundError
FileNotFoundError
I have invalidated the cache and cleared the instance cache before reading the file, but that didn't help.
Versions: gcsfs==0.3.0 dask==2.1.0
Can you please give the sequence of calls you ran so I can understand better?
@zafercavdar if you invalidate the cache, does the problem still occur? That is, calling fs.invalidate_cache() before reading?
Are you uploading the file with gcsfs, or another system?
I'm getting the exact same behavior in a cluster of worker processes and I'm having a hard time reproducing. It looks something like this:
- Master machine creates file at path X
- Master enqueues a work item to process path X
- Worker picks up task about two minutes later, raises FileNotFound when reading X (with the same backtrace as above)
Caching was my first suspicion, but this looks like a pretty weird place to be failing if caching is the culprit. While I can't reproduce by running a python shell and calling .open()
and friends, I have a bunch of workers that seem to be in a state that trips over this pretty frequently -- happy to probe more if you can suggest a direction to investigate?
(gcsfs==0.3.0, fsspec==0.6.0)
@JohnEmhoff can you try with gcsfs 0.4.0 first?
Sure, I'll give 0.4.0 a try. On that note, would you consider pinning dependencies in releases, especially fsspec? I could be wrong but this feels like a regression -- we're pinned to 0.3.0 and occasionally see things like this come up.
Typically applications pin dependencies, not libraries. Why do you think this is a version-dependency issue between gcsfs and fsspec?
I assume it's an issue between gcsfs and fsspec because this started happening out of the blue even though we were still using the same gcsfs version. We've been bitten by updates to fsspec in the past so I just thought the chances were good it's happening again.
Yes, we're pinning to 0.3.0 but I guess we could pin fsspec as well. It's a bit muddy because we're also building a library. All else being equal, it would just be nice to have a version we can depend on being stable over time.
So how about we make the dircache attribute of FSs (those that use it at all) also be pluggable, so that users can choose a dummy one that doesn't store anything or, say, one that expires entries after a given time or in an LRU fashion? A sketch of what that could look like is below.
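As a rough sketch of what such pluggable caches could look like (hypothetical; fsspec does not expose this plug point in the versions discussed here, and the class names are made up), both variants only need to behave like a mapping:

import time
from collections.abc import MutableMapping

class NullDirCache(MutableMapping):
    """Dummy dircache that never retains listings, forcing fresh lookups."""
    def __setitem__(self, key, value):
        pass  # drop everything
    def __getitem__(self, key):
        raise KeyError(key)  # always a miss
    def __delitem__(self, key):
        raise KeyError(key)
    def __iter__(self):
        return iter(())
    def __len__(self):
        return 0

class TTLDirCache(MutableMapping):
    """Dircache whose entries expire after `ttl` seconds."""
    def __init__(self, ttl=60):
        self.ttl = ttl
        self._data = {}
    def __setitem__(self, key, value):
        self._data[key] = (time.time(), value)
    def __getitem__(self, key):
        stamp, value = self._data[key]
        if time.time() - stamp > self.ttl:
            del self._data[key]
            raise KeyError(key)
        return value
    def __delitem__(self, key):
        del self._data[key]
    def __iter__(self):
        return iter(list(self._data))
    def __len__(self):
        return len(self._data)

If such a plug point existed, either object could be assigned in place of the default dict, since the filesystem only needs the standard mapping interface from it.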
I'm of the opinion that caching should be done at the application level rather than in the library, although I know there are a lot of different use cases out there.
Having this same issue with pandas==1.0.0, gcsfs==0.6.0, and fsspec==0.6.2 when loading via pd.read_csv
@ZaxR can you post a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
My recollection was that this issue was resolved.
I can confirm that it's a cache issue like you suggested above:
import gcsfs
import pandas as pd
from google.cloud import storage
bucket_name = "testing"
blob_name = "test.csv"
client = storage.Client()
bucket = client.get_bucket(bucket_name)
data = "some data"
bucket.blob(blob_name).upload_from_string(data)
# Test #1 Succeeds
df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")
client.get_bucket(bucket_name).delete_blob(blob_name)
bucket.blob(blob_name).upload_from_string(data)
# Test #2 Fails
df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")
# Test #3 Fails
fs = gcsfs.GCSFileSystem()
with fs.open(f"{bucket_name}/{blob_name}", 'rb') as f:
    print(f.read())
# Test #4 Succeeds
fs.invalidate_cache()
with fs.open(f"{bucket_name}/{blob_name}", 'rb') as f:
    print(f.read())
fsspec caches file-system instances, so when you do df = pd.read_csv() the first time, you create the instance, and later calls get this same instance, with its cached file listing.
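For illustration (a sketch, not gcsfs documentation; the bucket/object names are placeholders, this assumes pandas built the filesystem with default parameters, and skip_instance_cache needs a newer fsspec than the one in the original report):

import gcsfs
import pandas as pd

# First read: fsspec constructs a GCSFileSystem and caches the instance,
# so later gs:// reads with the same parameters reuse it, together with
# its cached directory listings.
df = pd.read_csv("gs://my-bucket/test.csv")

# Workaround 1: get the (same, cached) instance and drop its listings
# before re-reading a path that was deleted and re-created.
fs = gcsfs.GCSFileSystem()
fs.invalidate_cache()
df = pd.read_csv("gs://my-bucket/test.csv")

# Workaround 2: bypass the instance cache and hand pandas an open file.
fs = gcsfs.GCSFileSystem(skip_instance_cache=True)
with fs.open("my-bucket/test.csv", "rb") as f:
    df = pd.read_csv(f)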
I see. I guess I'd agree with @JohnEmhoff re: caching living at the application level, though not sure of all the implications of that. In my case, I just want to be able to always get a file if it currently actually exists on gcs. I guess alternatively this could just be considered an issue I'm having with pandas' read_csv, which to my knowledge doesn't have a flag to let me clear the cache or alter a fs object. In the meantime, any suggestions to get the desired effect?
Having the same problem when spawning multiple workers to read from the same file via read(). Some of them randomly fail, even though I start the FS instance with gcsfs.GCSFileSystem(skip_instance_cache=True).
Using gcsfs version 2021.05.0
EDIT:
Using the workaround proposed by @ZaxR, I was able to implement a single one-time retry (try/except FileNotFoundError) where I call fs.invalidate_cache(). However, this bug is rather annoying, as the file can actually be missing in our system.
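For reference, a sketch of that one-time retry (fs and path stand in for whatever the workers already use):

import gcsfs

fs = gcsfs.GCSFileSystem(skip_instance_cache=True)

def read_with_one_retry(path):
    """Read a GCS object, retrying once after dropping cached listings."""
    try:
        with fs.open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        # The listing cache may be stale: invalidate and try exactly once more.
        # If the object is truly gone, this re-raises FileNotFoundError.
        fs.invalidate_cache()
        with fs.open(path, "rb") as f:
            return f.read()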
I wonder if some of this is simply "eventual consistency" on Google's side? When a file is created/written, it is not necessarily visible to all clients immediately.
To my knowledge, GCS provides strong consistency for read-after-write
https://cloud.google.com/storage/docs/consistency#strongly_consistent_operations
I'm running into read-after-write issues implementing GCS in https://github.com/rstudio/pins-python that I don't see with the S3 filesystem, so I can dig into this a bit!