Consider storing `data_uri` in component parts
In the Catalog assets table, we store data_uri as one string:
sqlite> select * from assets;
id data_uri is_directory hash_type hash_content size time_created time_updated
-- --------------------------------------- ------------ --------- ------------ ---- ------------------- -------------------
1 file://localhost/tmp/tmp_qqdv5m4/data/y 1 2024-02-20 12:52:14 2024-02-20 12:52:14
Prompted by a comment from @callumforrester, I'm wondering we should be storing this in component parts e.g. scheme (file), netloc (localhost), and path (/tmp/tmp_qqdv5m4/data/y).
Why? This would enables us to index them separately and easily:
- Filter by scheme ("Show file-based URIs only...")
- Extract (and possibly rewrite) the paths
- In the future, extract (and possibly rewrite) the domain (~
netloc)
How? The complete specification is large. The httpx.URL docstring has useful ASCII art:
The components of a URL are broken down like this:
https://jo%40email.com:a%20secret@müller.de:1234/pa%20th?search=ab#anchorlink
[scheme] [ username ] [password] [ host ][port][ path ] [ query ] [fragment]
[ userinfo ] [ netloc ][ raw_path ]
We should give some thought to how finely to carve it, and whether or not we want to include everything at the start. (I'm not sure about fragment and I have concerns about userinfo.)
We could keep everything in one column but make it a JSON[B] column. That would make the components index-able. However I think that because the structure is well-defined and each entry is homogeneous, making a column for each component is the way to go.
I think this would be my first pass:
scheme VARCHAR
netloc VARCHAR
path VARCHAR
query JSON[B] # array of [key, value] arrays
- Omit
fragmentbecause I've never seen a data_uri that uses it, and we can easily add it if we find one. - Omit
userinfobecause we should not be handling plaintext credentials in our database, full stop. We can revisit how to handle this kind of situation when we add support for data drawn from non-file-based sources.