Best practice for using `rename_target_files`

Open rsignell opened this issue 1 year ago • 5 comments

We've created references for some NetCDF3 64-bit-offset files. We had to generate them from local copies, since we couldn't read the files from object storage.

To make the local combined64.json point to the files on object storage, we ran:

from kerchunk.utils import rename_target_files

rename_target_files(
    'combined64.json',
    {
        '/shared/users/rsignell/data/jzambon/nc64/his_20231027.nc': 's3://rsignellbucket1/jzambon/his_20231027.nc',
        '/shared/users/rsignell/data/jzambon/nc64/his_20231029.nc': 's3://rsignellbucket1/jzambon/his_20231029.nc',
        '/shared/users/rsignell/data/jzambon/nc64/his_20231030.nc': 's3://rsignellbucket1/jzambon/his_20231030.nc',
        '/shared/users/rsignell/data/jzambon/nc64/his_20231031.nc': 's3://rsignellbucket1/jzambon/his_20231031.nc',
    },
    'combined64_s3.json',
)

This works fine for our test case (4 files), but we're guessing there is a smarter way to handle lots of URLs, right?

rsignell avatar Nov 09 '23 14:11 rsignell

You can phrase the dict as a comprehension:

{k: k.replace('/shared/users/rsignell/data/jzambon/nc64/', 's3://rsignellbucket1/jzambon/') for k in fs.glob('/shared/users/rsignell/data/jzambon/nc64/*.nc')}

where fs is an fsspec LocalFileSystem instance.

That's all I can immediately think of.
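A variation on the same idea, as a rough sketch rather than anything built into kerchunk: instead of globbing the local directory, the rename dict could be derived from the targets already recorded in the reference JSON, so it covers exactly the files the references point at. The helper names (`collect_targets`, `build_renames`, `rewrite_references`) are made up for illustration, and the sketch assumes the references store plain paths with no templates.

```python
import json


def collect_targets(refs):
    """Return the set of target paths/URLs found in a kerchunk reference dict."""
    # Version 1 reference files nest the mapping under "refs"; flat dicts do not.
    refs = refs.get("refs", refs)
    targets = set()
    for value in refs.values():
        # Chunk references are lists like [url, offset, length] or [url];
        # inlined data and metadata are plain strings and are skipped.
        if isinstance(value, list) and value and isinstance(value[0], str):
            targets.add(value[0])
    return targets


def build_renames(targets, old_prefix, new_prefix):
    """Map every target under old_prefix to the same file name under new_prefix."""
    # Normalise both prefixes to end in exactly one "/" to avoid "//" in the result.
    old = old_prefix.rstrip("/") + "/"
    new = new_prefix.rstrip("/") + "/"
    return {t: new + t[len(old):] for t in targets if t.startswith(old)}


def rewrite_references(json_in, json_out, old_prefix, new_prefix):
    """Hypothetical wrapper: build the rename dict, then let kerchunk apply it."""
    # Imported here so the helpers above can be used without kerchunk installed.
    from kerchunk.utils import rename_target_files

    with open(json_in) as f:
        renames = build_renames(collect_targets(json.load(f)), old_prefix, new_prefix)
    rename_target_files(json_in, renames, json_out)
```

For the case in this issue, the call would look like rewrite_references('combined64.json', 'combined64_s3.json', '/shared/users/rsignell/data/jzambon/nc64', 's3://rsignellbucket1/jzambon/'). This only applies to JSON references; it has not been tried with parquet.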

Have you tried rename_target_files with parquet? I don't think that's come up yet.

martindurant avatar Nov 09 '23 14:11 martindurant

  1. I like the dict comprehension!
  2. I have not tried rename_target_files with parquet. And I guess we would need that if we were working with NetCDF3 or NetCDF3-64-bit-offset files where the references got too big and we want to access them from object storage!

rsignell avatar Nov 09 '23 15:11 rsignell

kerchunk.netCDF3 does support scanning directly from remote.

Following https://github.com/fsspec/kerchunk/pull/391 (I think), version= should be inferred, so there is no need to pass it, and it enables writing references directly to parquet during the initial file scan.

martindurant avatar Nov 09 '23 15:11 martindurant

https://github.com/fsspec/kerchunk/pull/391/files#diff-5fc74e71e7b4cdb2921590ed60a21bae7a9fe30c8ffeb62a3fb13066ebb01bbdR73 (and actually, it won't allow version= to override the value here, which maybe I should fix)

martindurant avatar Nov 09 '23 15:11 martindurant

OK, should now work whether you pass version= or not.

martindurant avatar Nov 09 '23 15:11 martindurant