
FileNotFoundError on file.read

zafercavdar opened this issue on Oct 11 '19 · 17 comments

I uploaded a file to GCS and was able to read it without problem. After that, I deleted the file and uploaded it again with the same name. When I called the file.read function again, I received a FileNotFoundError raised from _fetch_range, even though the file exists. Full traceback:

    byte_str = csv_file.read(4096)
  File "/opt/conda/default/lib/python3.6/site-packages/fsspec/spec.py", line 1040, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/opt/conda/default/lib/python3.6/site-packages/fsspec/core.py", line 464, in _fetch
    self.cache = self.fetcher(start, end + self.blocksize)
  File "</opt/conda/default/lib/python3.6/site-packages/decorator.py:decorator-gen-22>", line 2, in _fetch_range
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 54, in _tracemethod
    return f(self, *args, **kwargs)
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 1067, in _fetch_range
    headers=head)
  File "</opt/conda/default/lib/python3.6/site-packages/decorator.py:decorator-gen-2>", line 2, in _call
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 54, in _tracemethod
    return f(self, *args, **kwargs)
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 462, in _call
    validate_response(r, path)
  File "/opt/conda/default/lib/python3.6/site-packages/gcsfs/core.py", line 157, in validate_response
    raise FileNotFoundError
FileNotFoundError

I invalidated the cache and cleared the instance cache before reading the file, but neither helped.

Versions: gcsfs==0.3.0, dask==2.1.0

zafercavdar · Oct 11 '19 22:10

Can you please give the sequence of calls you ran so I can understand better?

martindurant · Oct 11 '19 23:10

@zafercavdar does the problem still occur if you invalidate the cache,

fs.invalidate_cache()

before reading?

Are you uploading the file with gcsfs, or another system?

TomAugspurger · Nov 08 '19 20:11
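
For reference, the suggested check is roughly this (a sketch; the bucket and file names are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem()
fs.invalidate_cache()  # drop any cached directory listings
with fs.open("my-bucket/data.csv", "rb") as f:  # hypothetical path
    byte_str = f.read(4096)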

I'm getting the exact same behavior in a cluster of worker processes, and I'm having a hard time reproducing it. It looks something like this:

  1. Master machine creates file at path X
  2. Master enqueues a work item to process path X
  3. Worker picks up the task about two minutes later and raises FileNotFoundError when reading X (with the same traceback as above)

Caching was my first suspicion, but this looks like a pretty weird place to be failing if caching is the culprit. While I can't reproduce it by running a Python shell and calling .open() and friends, I have a bunch of workers that seem to be in a state that trips over this pretty frequently -- happy to probe more if you can suggest a direction to investigate.

(gcsfs==0.3.0, fsspec==0.6.0)

JohnEmhoff · Nov 24 '19 12:11

@JohnEmhoff can you try with gcsfs 0.4.0 first?

TomAugspurger · Nov 25 '19 12:11

Sure, I'll give 0.4.0 a try. On that note, would you consider pinning dependencies in releases, especially fsspec? I could be wrong, but this feels like a regression -- we're pinned to 0.3.0 and occasionally see things like this come up.

JohnEmhoff · Nov 25 '19 15:11

Typically applications pin dependencies, not libraries. Why do you think this is a version-dependency issue between gcsfs and fsspec?

TomAugspurger · Nov 25 '19 15:11

I assume it's an issue between gcsfs and fsspec because this started happening out of the blue even though we were still using the same gcsfs version. We've been bitten by updates to fsspec in the past so I just thought the chances were good it's happening again.

Yes, we're pinning to 0.3.0 but I guess we could pin fsspec as well. It's a bit muddy because we're also building a library. All else being equal, it would just be nice to have a version we can depend on being stable over time.

JohnEmhoff · Nov 25 '19 19:11

So how about we make the dircache attribute of FSs (those that use it at all) also be pluggable, so that users can choose a dummy one that doesn't store anything or, say, one that expires entries after a given time or in an LRU fashion?

martindurant · Nov 26 '19 22:11
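
A rough sketch of what such a pluggable, expiring dircache could look like (illustrative names only, not an existing gcsfs/fsspec API):

import time

class TTLDirCache(dict):
    """Dict-like listings cache whose entries expire after ttl seconds."""

    def __init__(self, ttl=60):
        super().__init__()
        self.ttl = ttl
        self._times = {}

    def __setitem__(self, path, listing):
        self._times[path] = time.time()
        super().__setitem__(path, listing)

    def __getitem__(self, path):
        if time.time() - self._times.get(path, 0) > self.ttl:
            # Stale entry: evict it so the lookup misses and the
            # filesystem re-lists the path from GCS.
            self.pop(path, None)
        return super().__getitem__(path)

A dummy variant that ignores __setitem__ entirely would give fresh listings on every call, at the cost of extra API requests.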

I'm of the opinion that caching should be done at the application level rather than in the library, although I know there are a lot of different use cases out there.

JohnEmhoff · Nov 27 '19 18:11

Having this same issue with pandas==1.0.0, gcsfs==0.6.0, and fsspec==0.6.2 when loading via pd.read_csv

ZaxR · Feb 04 '20 19:02

@ZaxR can you post a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

My recollection was that this issue was resolved.

TomAugspurger · Feb 04 '20 19:02

I can confirm that it's a cache issue like you suggested above:

import gcsfs
import pandas as pd
from google.cloud import storage

bucket_name = "testing"
blob_name = "test.csv"
client = storage.Client()
bucket = client.get_bucket(bucket_name)

data = "some data"
bucket.blob(blob_name).upload_from_string(data)

# Test #1 succeeds: this first read creates the filesystem instance
# and caches the bucket listing
df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")

# Delete the blob and re-upload it under the same name
client.get_bucket(bucket_name).delete_blob(blob_name)
bucket.blob(blob_name).upload_from_string(data)

# Test #2 fails: pandas reuses the cached instance and its stale listing
df = pd.read_csv(f"gs://{bucket_name}/{blob_name}")

# Test #3 fails: GCSFileSystem() returns the same cached instance
fs = gcsfs.GCSFileSystem()
with fs.open(f"{bucket_name}/{blob_name}", 'rb') as f:
    print(f.read())

# Test #4 succeeds once the listings cache is invalidated
fs.invalidate_cache()
with fs.open(f"{bucket_name}/{blob_name}", 'rb') as f:
    print(f.read())

ZaxR · Feb 04 '20 19:02

fsspec caches file-system instances, so that when you call df = pd.read_csv() the first time, you create the instance, and later calls get this same instance, with its cached file listing.

martindurant · Feb 04 '20 19:02
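
To illustrate the behavior (a sketch, assuming default storage options):

import gcsfs

fs1 = gcsfs.GCSFileSystem()
fs2 = gcsfs.GCSFileSystem()  # same options, so fsspec returns the cached instance
assert fs1 is fs2            # one shared instance, one shared listings cache

# pd.read_csv("gs://...") resolves to this same cached instance internally,
# so a listing cached by an earlier call can shadow a re-uploaded file.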

I see. I guess I'd agree with @JohnEmhoff re: caching living at the application level, though I'm not sure of all the implications of that. In my case, I just want to always be able to get a file if it currently exists on GCS. Alternatively, this could be considered an issue with pandas' read_csv, which to my knowledge doesn't have a flag to let me clear the cache or alter the fs object. In the meantime, any suggestions to get the desired effect?

ZaxR · Feb 04 '20 19:02
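
One possible workaround in the meantime (a sketch based on the repro above; because pandas resolves to the same cached instance, invalidating through a fresh handle clears the shared listings):

import gcsfs
import pandas as pd

# GCSFileSystem() with default options returns the cached instance that
# pd.read_csv will also use, so this clears the stale listing it holds.
gcsfs.GCSFileSystem().invalidate_cache()
df = pd.read_csv("gs://testing/test.csv")  # bucket/blob names from the repro above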

Having the same problem when spawning multiple workers to read from the same file via read(). Some of them randomly fail, even though I create the FS instance with gcsfs.GCSFileSystem(skip_instance_cache=True)

Using gcsfs version 2021.05.0

EDIT: Using the workaround proposed by @ZaxR, I was able to implement a single one-time retry (try/except FileNotFoundError) that calls fs.invalidate_cache(). However, this bug is rather annoying, as the file can actually be missing in our system.

Trollgeir · May 25 '21 14:05
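
The retry described above looks roughly like this (a sketch; the path is a placeholder):

import gcsfs

def read_with_one_retry(fs, path):
    # The first attempt may hit a stale listing; retry once after invalidating.
    try:
        with fs.open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        fs.invalidate_cache()
        with fs.open(path, "rb") as f:  # raises again if the file is truly gone
            return f.read()

fs = gcsfs.GCSFileSystem(skip_instance_cache=True)
data = read_with_one_retry(fs, "my-bucket/some/file.bin")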

I wonder if some of this is simply "eventual consistency" on Google's side? When a file is created or written, it is not necessarily visible to all clients immediately.

martindurant · Jun 14 '21 13:06

To my knowledge, GCS provides strong consistency for read-after-write:

https://cloud.google.com/storage/docs/consistency#strongly_consistent_operations

I'm running into read-after-write issues implementing GCS in https://github.com/rstudio/pins-python that I don't see with the S3 filesystem, so I can dig into this a bit!

machow · May 16 '22 18:05