
"Resolved to no files" error when using dd.read_csv with a globstring

Open PoradaKev opened this issue 5 years ago • 4 comments

Hi!

dask 2.8.0, gcsfs 0.4.0. I get the following error when trying to use a glob string in a file path:


---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-14-559125deb7ec> in <module>
      5 ) and clusterid !=32003
      6 ''',
----> 7 compute=True)

<ipython-input-10-3ef47a2c7f4f> in df_from_bq(query, table, compute, output, dtype)
     60         pass
     61 
---> 62     df = dd.read_csv(gs_path+'{0}'.format(destination_file),  storage_options={'token': key}, dtype=dtype, low_memory=False)
     63 
     64     if compute==True and output==True:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    576             storage_options=storage_options,
    577             include_path_column=include_path_column,
--> 578             **kwargs
    579         )
    580 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    403         compression=compression,
    404         include_path=include_path_column,
--> 405         **(storage_options or {})
    406     )
    407 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\bytes\core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
     94 
     95     if len(paths) == 0:
---> 96         raise IOError("%s resolved to no files" % urlpath)
     97 
     98     if blocksize is not None:

OSError: gs://BQ-Extracts/result_20191118_111323_*.csv resolved to no files


Everything works fine when I specify a file path without a globstring, like gs://BQ-Extracts/result_20191118_111323_00000.csv; however, it fails with the above error when a globstring is used. I need this fixed because the code is used inside a large function.

The issue appeared after upgrading gcsfs to the latest version. Everything works with a globstring if I downgrade gcsfs to 0.3.0.

UPD: I just ran the same query a few more times with gcsfs 0.3.0 and received the 'resolved to no files' error. Probably the newer version of dask or fsspec 0.6.0 causes the issue.

PoradaKev avatar Nov 18 '19 11:11 PoradaKev

Can you reproduce without dask?

Can you bisect to find the commit that broke it?


TomAugspurger avatar Nov 18 '19 12:11 TomAugspurger
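To check whether glob resolution works outside dask, as suggested above, one can call fsspec's glob API directly. Since the GCS call needs credentials, the sketch below exercises the same API against a local temporary directory; for GCS the equivalent would be roughly `gcsfs.GCSFileSystem(token=key).glob("BQ-Extracts/result_20191118_111323_*.csv")` (a hypothetical call, reusing the bucket path from the traceback).

```python
# Minimal sketch: exercise fsspec's glob directly, without dask.
# For GCS (assumption, requires credentials) it would be roughly:
#   import gcsfs
#   fs = gcsfs.GCSFileSystem(token=key)
#   fs.glob("BQ-Extracts/result_20191118_111323_*.csv")
# The same glob API, demonstrated against the local filesystem:
import os
import tempfile

import fsspec

# Create a few CSV files matching the naming pattern from the issue
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"result_{i:05d}.csv"), "w") as f:
        f.write("a,b\n1,2\n")

fs = fsspec.filesystem("file")
matches = fs.glob(os.path.join(tmpdir, "result_*.csv"))
print(len(matches))  # 3
```

If this returns an empty list for the GCS path while the concrete path opens fine, the regression is in the filesystem's glob, not in dask.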

I didn't understand your question about the commit.

I didn't try to reproduce it without dask, but I've just figured out that fsspec 0.6.0 and gcsfs 0.4.0 cause the issue.

Everything works fine with dask 2.8.0, gcsfs 0.3.0 and fsspec 0.5.2.

PoradaKev avatar Nov 18 '19 13:11 PoradaKev
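As an interim workaround while the regression is investigated, the last known-good combination mentioned above can be pinned (a sketch assuming a pip-based environment):

```shell
pip install "dask==2.8.0" "gcsfs==0.3.0" "fsspec==0.5.2"
```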

CSV glob resolution seems to be working on the latest master of gcsfs and fsspec:

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_csv("gcs://anaconda-public-data/airline/*.csv")

In [3]: df = dd.read_csv("gcs://anaconda-public-data/airline/*.csv", storage_options={'token': 'anon'})

In [4]: df.npartitions
Out[4]: 196

martindurant avatar Nov 18 '19 14:11 martindurant

This error happened when I create the gcsfs file system first and then modify/reload/.. files.

After a file was modified, reusing the same gcsfs file system raises FileNotFoundError.

This error existed in both pandas and Dask.

One solution I found is to set "cache_timeout=0" when creating the gcsfs file system.

baitsnyc avatar Nov 19 '19 16:11 baitsnyc
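The cache_timeout=0 workaround above disables gcsfs's directory-listing cache, so a re-glob after modifying files sees a fresh listing rather than a stale one: roughly `gcsfs.GCSFileSystem(token=key, cache_timeout=0)` (a hypothetical call based on the comment above). fsspec filesystems also allow dropping a cached listing explicitly via `invalidate_cache()`; a runnable sketch using fsspec's in-memory filesystem (chosen because gcsfs itself needs credentials):

```python
# Sketch of the stale-listing workaround using fsspec's in-memory
# filesystem. For GCS, per the comment above, one would instead do:
#   fs = gcsfs.GCSFileSystem(token=key, cache_timeout=0)
import fsspec

fs = fsspec.filesystem("memory")
with fs.open("/BQ-Extracts/result_00000.csv", "wb") as f:
    f.write(b"a,b\n1,2\n")

# Drop any cached directory listing before re-globbing, so newly
# written or modified files are re-listed instead of served stale.
fs.invalidate_cache()
matches = fs.glob("/BQ-Extracts/result_*.csv")
print(len(matches))
```

With the listing cache disabled (or invalidated before each read), the FileNotFoundError on modified files should not recur, at the cost of an extra listing call per access.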