tiled icon indicating copy to clipboard operation
tiled copied to clipboard

Consider storing `data_uri` in component parts

Open danielballan opened this issue 1 year ago • 1 comments

In the Catalog assets table, we store data_uri as one string:

sqlite> select * from assets;
id  data_uri                                 is_directory  hash_type  hash_content  size  time_created         time_updated       
--  ---------------------------------------  ------------  ---------  ------------  ----  -------------------  -------------------
1   file://localhost/tmp/tmp_qqdv5m4/data/y  1                                            2024-02-20 12:52:14  2024-02-20 12:52:14

Prompted by a comment from @callumforrester, I'm wondering we should be storing this in component parts e.g. scheme (file), netloc (localhost), and path (/tmp/tmp_qqdv5m4/data/y).

Why? This would enables us to index them separately and easily:

  • Filter by scheme ("Show file-based URIs only...")
  • Extract (and possibly rewrite) the paths
  • In the future, extract (and possibly rewrite) the domain (~netloc)

How? The complete specification is large. The httpx.URL docstring has useful ASCII art:

    The components of a URL are broken down like this:

       https://jo%40email.com:a%20secret@müller.de:1234/pa%20th?search=ab#anchorlink
    [scheme]   [  username  ] [password] [ host ][port][ path ] [ query ] [fragment]
               [       userinfo        ] [   netloc   ][    raw_path    ]

We should give some thought to how finely to carve it, and whether or not we want to include everything at the start. (I'm not sure about fragment and I have concerns about userinfo.)

We could keep everything in one column but make it a JSON[B] column. That would make the components index-able. However I think that because the structure is well-defined and each entry is homogeneous, making a column for each component is the way to go.

danielballan avatar Feb 20 '24 13:02 danielballan

I think this would be my first pass:

scheme VARCHAR
netloc VARCHAR
path VARCHAR
query JSON[B]  # array of [key, value] arrays
  • Omit fragment because I've never seen a data_uri that uses it, and we can easily add it if we find one.
  • Omit userinfo because we should not be handling plaintext credentials in our database, full stop. We can revisit how to handle this kind of situation when we add support for data drawn from non-file-based sources.

danielballan avatar Feb 20 '24 13:02 danielballan