kerchunk
Best practice for using `rename_target_files`
We've created references for some NetCDF3 64-bit-offset files. We had to do this locally, since we couldn't access the files from object storage.
To convert the local `combined64.json` so that it points to the files on object storage, we did:
    from kerchunk.utils import rename_target_files

    rename_target_files(
        'combined64.json',
        {
            '/shared/users/rsignell/data/jzambon/nc64/his_20231027.nc': 's3://rsignellbucket1/jzambon/his_20231027.nc',
            '/shared/users/rsignell/data/jzambon/nc64/his_20231029.nc': 's3://rsignellbucket1/jzambon/his_20231029.nc',
            '/shared/users/rsignell/data/jzambon/nc64/his_20231030.nc': 's3://rsignellbucket1/jzambon/his_20231030.nc',
            '/shared/users/rsignell/data/jzambon/nc64/his_20231031.nc': 's3://rsignellbucket1/jzambon/his_20231031.nc',
        },
        'combined64_s3.json',
    )
which works fine for our test case (4 files), but we are guessing there is a smarter way for lots of URLs, right?
You can phrase the dict as a comprehension:

    {k: k.replace('/shared/users/rsignell/data/jzambon/nc64', 's3://rsignellbucket1/jzambon') for k in fs.glob('/shared/users/rsignell/data/jzambon/nc64/*.nc')}

where `fs` is a local filesystem instance, e.g. `fsspec.filesystem("file")`.
That's all I can immediately think of.
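For lots of files, the comprehension above could be wrapped in a small helper. This is only a sketch: `build_rename_map` and `rewrite_references` are hypothetical names, not part of kerchunk, and it assumes the paths stored in the references are exactly the strings that `fs.glob` returns for the local directory. If they differ (for example a `file://` prefix), the keys will not match and those references may be left unchanged.

```python
def build_rename_map(local_paths, local_root, remote_root):
    """Map each local path to the same relative path under remote_root.

    Hypothetical helper for illustration; not part of kerchunk.
    """
    # Normalise trailing slashes so the result never contains "//".
    local_root = local_root.rstrip("/")
    remote_root = remote_root.rstrip("/")
    renames = {}
    for path in local_paths:
        if not path.startswith(local_root + "/"):
            raise ValueError(f"{path!r} is not under {local_root!r}")
        # Keep everything after the local root, so subdirectories survive.
        relative = path[len(local_root) + 1:]
        renames[path] = f"{remote_root}/{relative}"
    return renames


def rewrite_references(refs_in, refs_out, local_root, remote_root):
    """Glob the local files and write a renamed copy of the references.

    Assumes fsspec and kerchunk are installed; they are imported here so
    that build_rename_map stays usable without them.
    """
    import fsspec
    from kerchunk.utils import rename_target_files

    fs = fsspec.filesystem("file")  # local filesystem
    local_paths = fs.glob(f"{local_root.rstrip('/')}/*.nc")
    renames = build_rename_map(local_paths, local_root, remote_root)
    rename_target_files(refs_in, renames, refs_out)
```

With the paths from the original post, the call would look like `rewrite_references('combined64.json', 'combined64_s3.json', '/shared/users/rsignell/data/jzambon/nc64', 's3://rsignellbucket1/jzambon')`. Building the map in a separate function makes it easy to print and check a few entries before rewriting anything.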
Have you tried `rename_target_files` with parquet? I don't think that's come up yet.
- I like the dict comprehension!
- I have not tried `rename_target_files` with parquet. And I guess we would need that if we were working with NetCDF3 or NetCDF3 64-bit-offset files where the references got too big and we want to access them from object storage!
`kerchunk.netCDF3` does support scanning files directly from remote storage.
Following https://github.com/fsspec/kerchunk/pull/391 (I think), `version=` should be inferred rather than needing to be passed, and it enables writing references directly to parquet during the initial file scan.
https://github.com/fsspec/kerchunk/pull/391/files#diff-5fc74e71e7b4cdb2921590ed60a21bae7a9fe30c8ffeb62a3fb13066ebb01bbdR73 (and actually, it won't allow version= to override the value here, which maybe I should fix)
OK, should now work whether you pass version= or not.