'Resolved to no files' error using dd.read_csv and a globstring
Hi!
dask 2.8.0, gcsfs 0.4.0. I get the following error when trying to use a glob string in a file path:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-14-559125deb7ec> in <module>
5 ) and clusterid !=32003
6 ''',
----> 7 compute=True)
<ipython-input-10-3ef47a2c7f4f> in df_from_bq(query, table, compute, output, dtype)
60 pass
61
---> 62 df = dd.read_csv(gs_path+'{0}'.format(destination_file), storage_options={'token': key}, dtype=dtype, low_memory=False)
63
64 if compute==True and output==True:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
576 storage_options=storage_options,
577 include_path_column=include_path_column,
--> 578 **kwargs
579 )
580
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
403 compression=compression,
404 include_path=include_path_column,
--> 405 **(storage_options or {})
406 )
407
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\bytes\core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
94
95 if len(paths) == 0:
---> 96 raise IOError("%s resolved to no files" % urlpath)
97
98 if blocksize is not None:
OSError: gs://BQ-Extracts/result_20191118_111323_*.csv resolved to no files
Everything works fine when I specify a file path without a globstring, like gs://BQ-Extracts/result_20191118_111323_00000.csv, but it fails with the above error when I use a globstring. I need to fix this issue, as the code is used inside a larger function.
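Roughly, the two calls are (the token path here is illustrative; in my function it is loaded beforehand):

import dask.dataframe as dd

key = "service-account.json"  # illustrative credentials token

# Works: an explicit object name
df = dd.read_csv(
    "gs://BQ-Extracts/result_20191118_111323_00000.csv",
    storage_options={"token": key},
)

# Fails with "resolved to no files" under gcsfs 0.4.0
df = dd.read_csv(
    "gs://BQ-Extracts/result_20191118_111323_*.csv",
    storage_options={"token": key},
)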
The issue appeared after upgrading gcsfs to the latest version. Everything works with the globstring if I downgrade gcsfs to 0.3.0.
UPD: I just ran the same query a few more times with gcsfs 0.3.0 and received the 'resolved to no files' error again. The newer version of dask, or fsspec 0.6.0, probably causes the issue.
Can you reproduce without dask?
Can you bisect to find the commit that broke it?
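For example, something like this (no dask involved; bucket/pattern taken from your report, token illustrative) would show whether glob resolution itself is the problem:

import gcsfs

fs = gcsfs.GCSFileSystem(token="service-account.json")  # illustrative token

# On a broken install this should print an empty list for the glob pattern
print(fs.glob("gs://BQ-Extracts/result_20191118_111323_*.csv"))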
I didn't understand your question regarding the commit.
I didn't try to reproduce it without dask, but I've just figured out that fsspec 0.6.0 and gcsfs 0.4.0 cause the issue.
Everything works fine with dask 2.8.0, gcsfs 0.3.0 and fsspec 0.5.2.
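For anyone comparing environments, the installed versions can be checked with:

import dask, fsspec, gcsfs

# Failing combination reported here: dask 2.8.0 + fsspec 0.6.0 + gcsfs 0.4.0
print(dask.__version__, fsspec.__version__, gcsfs.__version__)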
CSV resolving seems to be working on the latest master of gcsfs and fsspec:
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv("gcs://anaconda-public-data/airline/*.csv")
In [3]: df = dd.read_csv("gcs://anaconda-public-data/airline/*.csv", storage_options={'token': 'anon'})
In [4]: df.npartitions
Out[4]: 196
This error happened for me when I created the gcsfs file system first and then modified/reloaded/... files.
After a file was modified, using the same gcsfs file system raised a FileNotFoundError.
The error existed in both pandas and Dask.
One solution I found is to set cache_timeout=0 when creating the gcsfs file system.
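A minimal sketch of that workaround (anonymous token just for illustration; the same option can also be passed through storage_options in dask):

import gcsfs
import dask.dataframe as dd

# cache_timeout=0 disables gcsfs's directory-listing cache, so files
# modified or re-created after the filesystem was built are still found
fs = gcsfs.GCSFileSystem(token="anon", cache_timeout=0)

# Equivalent when going through dask: pass the option in storage_options
df = dd.read_csv(
    "gcs://anaconda-public-data/airline/*.csv",
    storage_options={"token": "anon", "cache_timeout": 0},
)

Calling fs.invalidate_cache() after modifying files should have a similar effect if you'd rather keep the listing cache enabled.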