pystac icon indicating copy to clipboard operation
pystac copied to clipboard

Problem with relative, unix-like hrefs on Windows

Open maawoo opened this issue 3 years ago • 3 comments

Hi everyone! I'm not sure what exactly I'm doing wrong or if this is a possible bug. So any help or advice is appreciated! I have a bunch of STAC items with relative, unix-like asset hrefs. From these items I simply want to create a self-contained
catalog. That's about it. I've reduced my code to a simple example using only one collection (of an overall larger, tile-based catalog structure) and one item:

import pystac
import os

data_dir = 'C:\\my\\data\\dir'
tile = 'tile_id'
scene = 'scene_name'

sp_extent = pystac.SpatialExtent([None, None, None, None])
tmp_extent = pystac.TemporalExtent([None, None])

test_collection = pystac.Collection(id=tile, description='test collection', 
                                    extent=pystac.Extent(sp_extent, tmp_extent))

item_path = os.path.join(data_dir, tile, scene, scene + '.json')
item = pystac.read_file(href=item_path)

test_collection.add_item(item=item)
extent = test_collection.extent.from_items(items=[item])
test_collection.extent = extent

Now, when I normalize all hrefs and save my collection, as shown here in the docs, I kind of get what I want. All hrefs, including those of item assets, are relative 👍🏻 but they use windows-style double backslashes 👎🏻. This causes problems with STAC Browser for example.

test_collection.normalize_hrefs(root_href=os.path.join(data_dir, tile)
test_collection.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)

case_A

If I replace the backward slashes from the root_href to normalize against, it results in a collection with relative, unix-style hrefs 👍🏻 but, for whatever reason, also in absolute hrefs for all item assets 👎🏻

test_collection.normalize_hrefs(root_href=os.path.join(data_dir, tile).replace('\\', '/'))
test_collection.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)

image10

I've also played around with the make_all_asset_hrefs_relative method, but it didn't help unfortunately.

maawoo avatar Sep 02 '21 20:09 maawoo

@maawoo Thanks for raising this and apologies for the long delay in responding.

The first attempt that you describe (ending up with relative HREFs using Windows-style double-backslashes) seems to me to be the expected behavior for PySTAC given that you are running on Windows. However, it brings up a question for me about what a "self-contained" catalog should look like and what the STAC tool ecosystem needs to support for self-contained catalogs.

The STAC Best Practices docs on Self-contained Catalogs does not say anything about the style of the path separators, but in order for a STAC to be truly portable across systems it seems like we should have some standards around that.

It seems like we could take one of two approaches here:

  1. Update the Best Practices to prescribe a particular style of path separators for relative links in self-contained catalogs (probably Unix-style) and update tooling like PySTAC to be able to navigate these relative links regardless of the OS.
  2. Allow either style of relative path and update STAC tooling like STAC Browser to be able to navigate either kind of link (possible based on the underlying OS).

@lossyrob @m-mohr @gadomski I'm curious to get your thoughts on how to handle this and whether there have already been discussions about this during the development of the spec.

duckontheweb avatar Jan 20 '22 19:01 duckontheweb

Since the JSON schema specifies the href type as an iri-reference (e.g. https://github.com/radiantearth/stac-spec/blob/789e90fba4c7119bc925d26f75f47bacf44c91e6/item-spec/json-schema/item.json#L239-L244), it seems like option 1 would align best. If we decided on that path, I agree that some Best Practices would be in order to clarify that Windows-style path separators are not supported -- there's not really much in there about path separators right now as far as I can tell.

gadomski avatar Jan 20 '22 20:01 gadomski

As STAC is for HTTP (see https://github.com/radiantearth/stac-spec/blob/master/principles.md, second bullet point) and the path separator for HTTP is always / (see https://www.rfc-editor.org/rfc/rfc3986#section-3.3), I think PySTAC should always generate / for HTTP usage regardless of the underlying OS that is used for creating the catalog.

Edit: RFC 3986 (URIs) is not mentioned directly in STAC spec, but RFC 8288 (Web Linking) is mentioned in https://github.com/radiantearth/stac-spec/blob/master/overview.md#foundations as a foundation of STAC. RFC 8288 is requiring URIs to conform to RFC 3986. So it's documented, but not very explicit. So I'd actually consider that as being a bug in PySTAC.

test_collection.normalize_hrefs(root_href=os.path.join(data_dir, tile).replace('\\', '/'))

Using OS libraries for URIs is a common pitfall, I recently had the same issue in Node. They are not meant to be used for this. So for example instead of os.path.join, I think urllib.parse.urljoin should be used (disclaimer: my Python knowledge is still limited, so there might be better alternatives ;-) ).

m-mohr avatar Jan 20 '22 21:01 m-mohr

I can take this one.

jsignell avatar Feb 20 '23 14:02 jsignell

Just to check my understanding: the idea is to not allow windows-style paths as hrefs on objects? This feels like the type of thing that should start raising a deprecation warning and then the real change should wait for a major release.

jsignell avatar Feb 20 '23 15:02 jsignell

the idea is to not allow windows-style paths as hrefs on objects?

I think we want to always write /-delimited paths, e.g. when serializing to a dictionary. It'd be nice if we could still accept \-delimited paths on input, e.g. for StacObject.from_file -- otherwise, the user experience of having to convert to / before reading is poor.

Is that enough guidance, or does that leave open questions?

gadomski avatar Feb 20 '23 15:02 gadomski

I think we want to always write /-delimited paths, e.g. when serializing to a dictionary.

This part is clear to me :+1:

It'd be nice if we could still accept \-delimited paths on input, e.g. for StacObject.from_file -- otherwise, the user experience of having to convert to / before reading is poor.

This is the part I was unclear about. So we should treat C:\\data\\diras equivalent to c://data/dir internally?

jsignell avatar Feb 20 '23 15:02 jsignell

So we should treat C:\data\diras equivalent to c://data/dir internally?

I think, if possible, we should use/mimic pathlib (e.g. https://docs.python.org/3/library/pathlib.html#pathlib.WindowsPath), so the preferred behavior would be (on a windoze system) (I think):

item = Item.from_file("C:\\data\item.json")
self_href = item.get_self_href()
assert self_href
assert self_href == "c:/data/item.json"

gadomski avatar Feb 20 '23 15:02 gadomski

ok yeah I got confused about what best practices was referring to. I was missing that in the case of pystac it's the generated output that needs to conform to best practices, not the user input.

jsignell avatar Feb 20 '23 16:02 jsignell