Implement built-in writable file-backed server
[Notes from discussion with @tacaswell]
Basic idea
- With `tiled serve directory ...` the user has no direct control over the metadata, except via customizing Adapters in advance. It is completely up to the Adapters to infer structure_family, structure, metadata, and specs from files.
- `tiled serve directory` enables writing only via side-band (direct access to the filesystem), never via the client. It continually watches the directory for changes. It will use a SQLite file to cache information, for speed on second startup.
- `tiled serve directory --static` will make full use of that SQLite cache by assuming that nothing will change, and return error status codes if something does. It will not discover new files or cope with altered files without an explicit update of the cache. In this mode, the SQLite file is a cache (it can be fully regenerated) but we are treating it as a point of truth.
- With `tiled serve writable`, the user assumes control over structure_family, structure, metadata, and specs when data is uploaded. The user can also request to index existing files that were not uploaded by the client, falling back to the same inference mechanism used by `tiled serve directory`. In this mode, tiled will place a special file (`.tiled-writable`?) to indicate that the SQLite file contains more information than could be inferred from just the files; i.e. it is not just a cache.
`tiled serve writable --layout {literal, scalable} PATH`
This keeps a SQLite file with metadata, specs, structure, etc. The read-only `tiled serve directory ...` will also gain a SQLite file for the same reason. It will gain a `--static` flag to turn off the directory-walker and rely only on SQLite for the directory structure information.
In static mode, we may want to regenerate the SQLite files, so we may need a command to do that out of band. We have open questions about how this interacts with writable.
`tiled register --all PATH`
How the file-walker works
On first access of each node, capture metadata and structure in a SQLite file. The location of the file is:
- in the directory, or
- in `~/.cache/tiled/`, somehow connected to the path of the directory, or
- overridden by an explicit parameter
On second access, rely on the SQLite file for fast information about that node.
If we are in watch (non-static) mode and a file is added, do nothing until it is accessed. It might still be being written to and not yet valid to parse. If we are watching and a file is removed, mark it as stale in the index. If we are watching and a file has changed, mark it as stale in the index.
If we go to access a file and the observed structure does not match the indexed structure, 404 with a clear error message that something is there but a re-discovery is needed.
If we are started in watch mode and there is an existing database, mark everything as stale. It may have changed while we were not watching.
If we are started in static or writable mode, assume the database is correct. No new files will be discovered. Any files that have changed or been removed may respond with an error code.
Schema
```
key
structure_family
structure
metadata        # as zstd-compressed msgpack
specs
mimetype
data_uris
created_at
updated_at
last_mtime
last_filesize
stale
parent          # as the tiled path to the parent, like /a/b/c
```
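The metadata column above is stored as zstd-compressed msgpack. A round-trip sketch using stdlib stand-ins (zlib in place of the `zstandard` package, json in place of `msgpack`), since the exact serialization code is not part of this issue:

```python
import json
import zlib

def pack_metadata(metadata: dict) -> bytes:
    # Stand-in for msgpack.packb(metadata) followed by zstd compression.
    return zlib.compress(json.dumps(metadata).encode())

def unpack_metadata(blob: bytes) -> dict:
    # Stand-in for zstd decompression followed by msgpack.unpackb(...).
    return json.loads(zlib.decompress(blob).decode())
```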
```python
class Node(Timestamped, Base):
    """
    This describes a single Node and sometimes inlines descriptions of all its children.
    """

    __tablename__ = "nodes"
    # This id is internal, never exposed to the user.
    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    key = Column(Unicode(1023), index=True, nullable=False)
    parent = Column(Unicode(1023), index=True, nullable=False)
    structure_family = Column(Enum(StructureFamily), nullable=False)
    structure = Column(JSONDict, nullable=True)
    metadata_ = Column("metadata", JSONDict, nullable=False)
    specs = Column(JSONList, nullable=False)
    stale = Column(Boolean, default=False, nullable=False)

    __table_args__ = (
        UniqueConstraint("key", "parent", name="_key_parent_unique_constraint"),
    )
```
```python
class DataSource(Timestamped, Base):
    """
    This describes a file/blob or group of files/blobs.

    The mimetype can be used to look up an appropriate Adapter.
    The Adapter will accept the data_uri (which may be a directory in this case)
    and optional parameters.

    The parameters are used to select the data of interest for this DataSource.
    Then, within that, Tiled may use the standard Adapter API to subselect the data
    of interest for a given request.
    """

    __tablename__ = "data_sources"
    # This id is internal, never exposed to the user.
    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    node_id = Column(Integer, ForeignKey("nodes.id"), nullable=False)
    fields = Column(Unicode(4095), nullable=True)
    # This data_uri may resolve to a directory, with parameters used to select out
    # specific files. See Assets for data_uris that always resolve to a specific file/blob.
    data_uri = Column(
        Unicode(1023), nullable=False
    )  # never contains templates like resource_path does
    mimetype = Column(
        Unicode(1023), nullable=False
    )  # if directory, use multipart/related;type=image/tiff for example
    parameters = Column(
        JSONDict(1023), nullable=True
    )  # which part of the directory or file
    node = relationship("Node", backref="data_sources")
```
```python
class Asset(Base):
    """
    This tracks individual files/blobs.

    It is intended for introspection and forensics. It is not actually used
    when doing routine data access.

    Importantly, it does so without any code execution. For example, it will
    include all the HDF5 subsidiary files for HDF5 files that use external
    links.
    """

    __tablename__ = "assets"
    # This id is internal, never exposed to the user.
    id = Column(Integer, primary_key=True, index=True, autoincrement=True)
    data_source_id = Column(Integer, ForeignKey("data_sources.id"), nullable=False)
    # This always resolves to one file/blob, never a directory.
    data_uri = Column(Unicode(1023), unique=True, nullable=False)
    # We could potentially add a content-based hash here.
    data_source = relationship("DataSource", backref="assets")
```
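To make the three-table relationship concrete, here is a stripped-down rendering in plain SQL via stdlib sqlite3, showing how one node resolves through a DataSource to its backing Assets. The columns are simplified and the example data (paths, mimetype) is invented for illustration:

```python
import sqlite3

# Minimal versions of the nodes / data_sources / assets tables above,
# just enough to show how a node resolves to its underlying files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY, key TEXT, parent TEXT,
    UNIQUE (key, parent));
CREATE TABLE data_sources (
    id INTEGER PRIMARY KEY,
    node_id INTEGER REFERENCES nodes(id), mimetype TEXT);
CREATE TABLE assets (
    id INTEGER PRIMARY KEY,
    data_source_id INTEGER REFERENCES data_sources(id),
    data_uri TEXT UNIQUE);
""")
conn.execute("INSERT INTO nodes VALUES (1, 'image', '/a/b')")
conn.execute(
    "INSERT INTO data_sources VALUES (1, 1, 'multipart/related;type=image/tiff')"
)
conn.execute("INSERT INTO assets VALUES (1, 1, 'file:///data/img_000.tif')")
conn.execute("INSERT INTO assets VALUES (2, 1, 'file:///data/img_001.tif')")

# Forensics query: every file/blob backing the node at /a/b/image.
rows = conn.execute("""
    SELECT assets.data_uri FROM nodes
    JOIN data_sources ON data_sources.node_id = nodes.id
    JOIN assets ON assets.data_source_id = data_sources.id
    WHERE nodes.parent = '/a/b' AND nodes.key = 'image'
    ORDER BY assets.id
""").fetchall()
```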
Latest thought: drop data_uri from DataSource and rely on Assets for this. Some DataSources will have one Asset and some will have many.
This is going to take some time to get right, and I think we need stable `/v1` routes before then. I am removing this from that label. It may in fact be possible to add all this in a backward-compatible way, and if not we will just use a `/v2` to do what we need to do.
Most of this landed in #445. The remainder will be covered by #451. There are some useful ideas in here for #451 so I will leave this open.
I think everything from this has been captured. Closing.