`"source"` encoding for datasets opened from `fsspec` objects

Open keewis opened this issue 10 months ago • 5 comments

When opening files from path-like objects (str, pathlib.Path), the backend machinery (_dataset_from_backend_dataset) sets the "source" encoding. This is useful if we need the original path for additional processing, like writing to a similarly named file, or to extract additional metadata. This would be useful as well when using fsspec to open remote files.

In this PR, I'm extracting the path attribute that most fsspec objects have to set that value. I've considered using isinstance checks instead of the getattr-with-default, but the list of potential classes is too big to be practical (at least 4 classes just within fsspec itself).
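To illustrate the getattr-with-default idea, here is a minimal sketch rather than the actual change in the PR. The helper name `source_from` and the stand-in class `FakeOpenFile` are hypothetical; the only assumption carried over from the description is that most fsspec file objects expose a `path` attribute.

```python
import io
import os
import pathlib


def source_from(filename_or_obj):
    """Return a value suitable for encoding["source"], or None if unknown.

    Hypothetical helper, not part of xarray.
    """
    # Path-like inputs (str, pathlib.Path) are converted to a plain string.
    if isinstance(filename_or_obj, (str, os.PathLike)):
        return os.fspath(filename_or_obj)
    # File-like inputs: use the `path` attribute if there is one, instead of
    # checking isinstance against a long list of fsspec classes.
    return getattr(filename_or_obj, "path", None)


class FakeOpenFile:
    """Stand-in for an fsspec file object; only the `path` attribute matters."""

    def __init__(self, path):
        self.path = path


print(source_from(pathlib.Path("data") / "a.nc"))
print(source_from(FakeOpenFile("bucket/data/a.nc")))
print(source_from(io.BytesIO()))  # no `path` attribute, so no source is set
```

Objects without a `path` attribute simply yield `None`, so no "source" would be recorded for them.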

If this sounds like a good idea, I'll update the documentation of the "source" encoding to mention this feature.

  • [x] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

keewis avatar Apr 09 '24 19:04 keewis

Without knowing much (I generally call ds.reset_encoding()), it does sound like a good idea!

max-sixty avatar Apr 09 '24 20:04 max-sixty

Shouldn't _normalize_path or _find_absolute_paths be able to handle this?

Illviljan avatar Apr 09 '24 21:04 Illviljan

the main use case is indeed to extract additional data, which you'd do immediately after open_dataset (after which you could drop the encoding).

> Shouldn't _normalize_path or _find_absolute_paths be able to handle this?

As far as I can tell, they only convert path-likes to string (which these objects are not, they are file-like, not path-like). Are you suggesting we should change that?

keewis avatar Apr 09 '24 21:04 keewis

I think this is fine, but our long-term goal is to delete encoding so you might consider a different solution to your problem.

dcherian avatar Apr 23 '24 16:04 dcherian

my impression of that discussion was that we wanted to either return the encoding in a separate object, or somehow remove the encoding after the first operation (i.e. not carry it around). Either way would be fine with me, since I would still have access to it immediately after opening.

keewis avatar Apr 23 '24 16:04 keewis

Would a dataset with this in encoding be round tripped without error? Would be good to test that

dcherian avatar Jun 24 '24 12:06 dcherian

> Would a dataset with this in encoding be round tripped without error? Would be good to test that

I'm not opposed to adding an explicit test (since I can't find any existing one right now), but if it would cause problems we'd also have those with string paths / urls – and those have been working just fine since long ago.

As far as I can tell, "source", as well as "original_shape", are dropped from the encoding before doing anything else (search for safe_to_drop for where that happens).
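As a rough illustration of that behaviour (a simplified sketch, not xarray's actual code): the function name `clean_encoding` and the `valid_keys` argument are hypothetical, and only the idea of a `safe_to_drop` set containing "source" and "original_shape" comes from the comment above.

```python
# Keys that are silently removed before the encoding is validated.
SAFE_TO_DROP = {"source", "original_shape"}


def clean_encoding(encoding, valid_keys):
    """Drop bookkeeping keys, then reject anything the backend doesn't know.

    Hypothetical helper for illustration only.
    """
    cleaned = dict(encoding)  # work on a copy, leave the caller's dict alone
    for key in SAFE_TO_DROP:
        cleaned.pop(key, None)
    invalid = sorted(k for k in cleaned if k not in valid_keys)
    if invalid:
        raise ValueError(f"unexpected encoding parameters: {invalid}")
    return cleaned


# The value stored under "source" is never inspected, so its type is irrelevant.
encoding = {"source": object(), "original_shape": (3, 4), "zlib": True}
print(clean_encoding(encoding, valid_keys={"zlib", "complevel"}))
```

Under this assumption, it would not matter whether "source" holds a string, a URL, or some other object, since the key is removed before any validation or writing takes place.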

keewis avatar Jun 24 '24 12:06 keewis

Ah, thanks. My mistake, I thought we were sticking in the fsspec object, not just the path.

dcherian avatar Jun 24 '24 12:06 dcherian

as far as I can tell, we could write anything in that encoding (fsspec objects, strings, or other things), and it would simply be ignored / dropped before writing.

keewis avatar Jun 24 '24 13:06 keewis