TagStudio
Switch to SQLite DB storage
The only reason this is going up in this state is that it's sat on my computer too long already; maybe if it's out there I'll move faster on the actual code writing.
- `src\core\sql_library.py` representing the in-memory data structures
- `src\core\create_db.sql` representing the schema for the database
Incredibly rough draft that isn't close to done. Only `Location` and `Entry` have had a cursory initial pass for features, with many features of the existing library still missing. First thought is that `Library` handles all instantiation and processing, with the remaining objects primarily being memory caches to prevent slowdowns in the initial phases, when things like `src\qt\modals\tag_database.py` would otherwise try to read the database one tag at a time.
Initial DB Schema Graphic
The thread about the DB is very tl;dr, so if you don't mind, some questions about the final DB schema attached:
- What is the `page` attribute in the `entry_page` table for?
- What's the difference between `entry.path` and `location.path` (referred to via `entry.location`)?
- I don't see `entry_attribute` used in the code yet (assuming it's still very much WIP), so I'll ask the relevant questions when I see what that's about.
`entry_page` is part of the group (formerly collations) functionality. So `page` would be what page of that UI view it appears on.
Locations could probably be better referred to as directories, allowing 2 requested features:
- multiple directories within a single library
- allowing the TagStudio database and other TagStudio-generated files to be placed anywhere at creation time, not just in the root of the library folder
`entry_attribute` replaces all references to fields and tags in the data storage, so it's essentially the storage of all the attrs of the key:attr pairs. It maps entries to their metadata, with tags as keys (stored in the tag table) and the attrs stored in the `entry_attribute` table.
~~Though thinking this through some more, I think multiple tags might have been a missed case: I think this schema needs one row per tag, and that would cause a primary key clash if you had a tagbox (tag group) with more than one child tag. So it might need an integer primary key rather than the current title_tag/entry key.~~ It's been so long since I thought about this part that I forgot it's actually just tags that get assigned, and the tag box grouping is handled on the UI side, if I remember correctly.
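To illustrate that mapping (the table and column names below are illustrative, not the actual `create_db.sql` schema): tags live in their own table, and `entry_attribute` holds one row per (entry, tag) pair linking an entry to each tag assigned to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entry (id INTEGER PRIMARY KEY, path TEXT);
    -- One row per (entry, tag) pair; the pair itself is the primary key.
    CREATE TABLE entry_attribute (
        entry_id INTEGER REFERENCES entry(id),
        tag_id   INTEGER REFERENCES tag(id),
        PRIMARY KEY (entry_id, tag_id)
    );
""")
conn.execute("INSERT INTO tag VALUES (1, 'favorite'), (2, 'screenshot')")
conn.execute("INSERT INTO entry VALUES (10, 'pics/cat.png')")
conn.executemany("INSERT INTO entry_attribute VALUES (?, ?)", [(10, 1), (10, 2)])

# All tag names for an entry, resolved through the junction table.
tags = [name for (name,) in conn.execute(
    "SELECT t.name FROM entry_attribute ea JOIN tag t ON t.id = ea.tag_id "
    "WHERE ea.entry_id = ?", (10,)).fetchall()]
```

With the pair as the primary key, attaching the same tag twice is impossible, while multiple distinct tags per entry are fine.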
Question on ignored extensions: is the plan that files with those extensions are ignored by the database (no entries generated), or just hidden from the UI? Trying to see if that info should be stored as a UI settings item or as another table in the DB (currently commented out).
I was intending on them being hidden on the UI side, so the library doesn't have to rescan whenever you make changes to the ignore list.
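A sketch of that UI-side filtering (function and parameter names are hypothetical): the entries stay in the database untouched, and the view simply skips ignored extensions, so editing the ignore list never triggers a rescan.

```python
from pathlib import PurePath

def visible_entries(paths: list[str], ignored_exts: set[str]) -> list[str]:
    """Hide ignored extensions at display time; DB entries are untouched."""
    # Normalize to bare lowercase extensions so ".TMP" and "tmp" both match.
    ignored = {ext.lower().lstrip(".") for ext in ignored_exts}
    return [
        p for p in paths
        if PurePath(p).suffix.lstrip(".").lower() not in ignored
    ]

entries = ["a.png", "b.tmp", "c.JPG", "d.part"]
shown = visible_entries(entries, {".tmp", "PART"})
```

Because the filter runs at display time, toggling an extension in the ignore list only requires refreshing the view, not touching storage.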
This PR could get quite big (which is okay) as it's reimplementing many core features. But could we use this as a starting point to refactor some components out of here before continuing? Namely:
Decouple the Library from the storage backend
Let the storage backend handle data storage in the DB. The library can manage CRUD, caching, linking, and other management-related features but let the storage backend handle (and optimize) the implementation.
Note: I would also not create a Python object for each entry as if we expect hundreds of thousands to millions of files; I don't think objects managed by GC would be ideal
Isolate the filesystem implementation from TagStudio internals
This would fix the inability to reference files if they move or are deleted. We should have a module that handles the management of files, e.g., their IDs, location, system metadata, etc., and provide an API for libraries to interact with them.
- This would use inodes on *nix OSes, `BY_HANDLE_FILE_INFORMATION` for NTFS systems, and the respective mechanisms for other systems internally to manage OS-specific metadata and provide a single API for consumers, like libraries
- This would own the API implementation for things like watching for filesystem changes, such as in #125, for example
Scopes and defaults
Instead of each library managing its implementation of Tags, their storage, and their defaults, have the Tag implementation be separate.
Since the storage is already abstracted, the Tag Manager can handle this by creating, managing, and storing tags and their relationships (not sure if this is a goal, but tag relationships could be more than just parent-child) in Global Scope. Then, the application could provide a UI to manage these (and import across libraries), and individual libraries can manage local tags and their file associations.
This would also allow for easy imports of tags and moving them around libraries in a user-friendly fashion. In the future, if we want user plugins for adding tags (like image classification or OCR plugins), that would interop with this API for adding tags and then the libraries API for linking them.
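A rough sketch of what such a Tag Manager could look like (all names here are hypothetical, not an existing TagStudio API). The relationship kind is kept generic rather than hard-coded as parent-child, per the note above:

```python
from collections import defaultdict

class TagManager:
    """Owns tags and their relationships in a global scope; individual
    libraries would only hold links from their own files to tag ids."""

    def __init__(self):
        self._tags: dict[int, str] = {}
        # kind -> set of (from_id, to_id) pairs
        self._relations: dict[str, set[tuple[int, int]]] = defaultdict(set)
        self._next_id = 1

    def create(self, name: str) -> int:
        tag_id = self._next_id
        self._tags[tag_id] = name
        self._next_id += 1
        return tag_id

    def relate(self, a: int, b: int, kind: str = "parent-child") -> None:
        # 'kind' leaves the door open for relationships beyond parent-child.
        self._relations[kind].add((a, b))

    def related(self, tag_id: int, kind: str = "parent-child") -> set[int]:
        return {b for (a, b) in self._relations[kind] if a == tag_id}

mgr = TagManager()
art = mgr.create("art")
digital = mgr.create("digital art")
mgr.relate(art, digital)
```

A plugin (image classification, OCR) would talk to this API to create tags, then to a library's API to link them to files, exactly as described above.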
I'm happy to start on some of these (like filesystem and storage), but it's up to @CyanVoxel to see if he thinks this is a good direction.
Thanks for the comments; yeah, this really would be a big one. I just wanted to get the discussion going and loop in some of the GitHub crowd.
> This PR could get quite big (which is okay) as it's reimplementing many core features. But could we use this as a starting point to refactor some components out of here before continuing? Namely:
>
> Decouple the Library from the storage backend
>
> Let the storage backend handle data storage in the DB. The library can manage CRUD, caching, linking, and other management-related features but let the storage backend handle (and optimize) the implementation.
I believe this was one of the end goals for this, though definitely not touched on in the first stages. To make sure I'm on the same page: this is basically saying the project architecture shifts, and you now have a library acting as middleware? It never touches the disk and never touches the GUI; it just acts as the connection point/API for both storage backends and GUIs?
> Note: I would also not create a Python object for each entry as if we expect hundreds of thousands to millions of files; I don't think objects managed by GC would be ideal
Agreed; it was never the intention for an entire library to live in memory at once long term, but since that's how it's currently implemented, I was looking at incremental changes to make that more possible.
> Isolate the filesystem implementation from TagStudio internals
>
> This would fix the inability to reference files if they move or are deleted. We should have a module that handles the management of files, e.g., their IDs, location, system metadata, etc., and provide an API for libraries to interact with them.
> - This would use inodes in *nix OS's, BY_HANDLE_FILE_INFORMATION for NTFS systems, and the respective for other systems internally to manage OS specific metadata and provide a single API for consumers, like libraries
> - This would own the API implementation for things like watching for filesystem changes, such as in Automatic detection of filesystem changes #125, for example
This level of filesystem interaction is well beyond my existing knowledge, but I would be interested in learning about it. I'm not seeing clear ways for these metadata structures to resolve back to their file data, so that things like thumbnails and opening with system default viewers would be achievable without falling back to system calls to resolve the filename. Or is the thought more that this implementation would scan a directory, resolve the filesystem IDs from the file names, and use that to internally translate between file names and OS-level file identifiers? (E.g., I move `C:\users\loran425\downloads\test.png` to `C:\users\loran425\pictures\test.png`; the file path has changed but the OS-level file identifier hasn't, so if I was scanning both downloads and pictures, the existing TagStudio metadata would automatically be applied because it's tied to that ID, not the path of the file?)
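That is exactly the behavior on a POSIX filesystem: a rename within the same volume keeps the (device, inode) pair, so an identifier built from it survives the move. A quick self-contained check (Unix-only; mirrors the downloads-to-pictures example above):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    downloads = os.path.join(root, "downloads")
    pictures = os.path.join(root, "pictures")
    os.mkdir(downloads)
    os.mkdir(pictures)

    src = os.path.join(downloads, "test.png")
    with open(src, "wb") as f:
        f.write(b"fake png")

    before = os.stat(src)
    uid_before = (before.st_dev, before.st_ino)

    # Move the file to a different directory on the same filesystem.
    dst = os.path.join(pictures, "test.png")
    os.rename(src, dst)

    after = os.stat(dst)
    uid_after = (after.st_dev, after.st_ino)
    # The path changed, but the (device, inode) identifier did not.
    same_uid = uid_before == uid_after
```

Note the caveat from the thread still applies: a move across volumes (or a Windows drive change) allocates a new identifier, so the ID is stable only within a filesystem.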
> Scopes and defaults
>
> Instead of each library managing its implementation of Tags, their storage, and their defaults, have the Tag implementation be separate.
> Since the storage is already abstracted, the Tag Manager can handle this by creating, managing, and storing tags and their relationships (not sure if this is a goal, but tag relationships could be more than just parent-child) in Global Scope. Then, the application could provide a UI to manage these (and import across libraries), and individual libraries can manage local tags and their file associations.
> This would also allow for easy imports of tags and moving them around libraries in a user-friendly fashion. In the future, if we want user plugins for adding tags (like image classification or OCR plugins), that would interop with this API for adding tags and then the libraries API for linking them.
I think this is sort of being shifted towards just by having the tags live in the database: there wouldn't be a list of defaults in the source code, it would instead be pulled from storage, and the current defaults would just be created as defaults in the storage solution, since that simplifies the transition. From what I've seen, having Global Scope items hasn't really been discussed; multiple directories within the filesystem, and allowing the storage location and entries to live in different places, have been discussed as likely improvements.
> I believe this was one of the end goals for this, though definitely not touched on in the first stages. To make sure I'm on the same page this is basically saying the project architecture shifts and now you have a library acting as middleware? it never touches the disk and never touches the GUI just acts as the connection point/API for both storage backends and GUIs?
Kind of; essentially, I'm saying to Separate Concerns. For now, abstract the storage implementation specifics out of the TagStudio Library class/implementation. We could do a Factory or Prototype pattern, or just provide an abstract implementation. The Library should be agnostic to the storage backend; each storage implementation would then handle figuring out how to actually implement the methods. (And avoid tangling the GUI with any of this; it becomes a big hot mess really fast.) See projects like Napari for an idea of structuring larger PyQt projects.
```python
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    @abstractmethod
    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None:
        pass

    @abstractmethod
    def link_tags(self, tag1: Tag, tag2: Tag, association: Association) -> None:
        pass

    ...
```
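To make the idea concrete, a hedged sketch of one backend implementing such an interface with `sqlite3`. The `Tag`/`Entry` dataclasses, the `SqliteStorage` name, and the `entry_tag` table layout are placeholders for illustration, not the real TagStudio ones:

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Tag:
    id: int
    name: str

@dataclass
class Entry:
    id: int
    path: str

class StorageInterface(ABC):
    @abstractmethod
    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None: ...

class SqliteStorage(StorageInterface):
    """One possible backend; the Library never sees the SQL below."""

    def __init__(self, db_path: str = ":memory:"):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entry_tag "
            "(entry_id INTEGER, tag_id INTEGER, PRIMARY KEY (entry_id, tag_id))"
        )

    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None:
        self._conn.execute(
            "INSERT OR IGNORE INTO entry_tag VALUES (?, ?)", (entry.id, tag.id)
        )

    def tags_for(self, entry: Entry) -> list[int]:
        rows = self._conn.execute(
            "SELECT tag_id FROM entry_tag WHERE entry_id = ?", (entry.id,)
        ).fetchall()
        return [tag_id for (tag_id,) in rows]

store = SqliteStorage()
store.attach_tag_entry(Tag(1, "art"), Entry(10, "pics/cat.png"))
```

A JSON or in-memory backend could implement the same interface, and the Library would not need to change at all; that is the separation being argued for.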
> or is the thought more that this implementation would scan a directory, resolve the filesystem ids from the file names and use that to internally translate between file names and OS level file identifiers?
We don't need to translate between file names and the ID. The file name, path, ID, and other metadata are already attached to the file. If we use the path as the identifier, we run into linking issues as files get moved around, and if we use a hash, when internal data is modified (like if you crop a photo), the hash changes.
The ID is a more consistent identifier (it's not guaranteed to always be the same; on Windows, for example, if the file moves between drives, the volume ID, a part of the whole ID, changes). But take, for example, the directory below, where `Pictures` is the monitored library directory:
```
Pictures/
├── Screen Shots/
│   └── lol_screenshot.png
└── Games/
    └── LOL/
```
Say I have all my tags already associated with the png, and I then move the file under Games:
```
Pictures/
├── Screen Shots/
└── Games/
    └── LOL/
        └── lol_screenshot.png
```
We would lose the association, as the path has changed. This could get really bad if you're moving around more than just a few files after you've spent time tagging them. And if I happen to crop or modify the file in some way afterwards, just about any hash I know of (md5, sha, crc64) would change (and hashes are also expensive to calculate as the file size grows). The ID would not change, preserving the links. Not perfect, but I believe it's better.
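The hash-vs-ID point is easy to demonstrate on a POSIX system: modifying a file in place changes its md5 but not its inode. (An editor that saves by writing a temp file and replacing the original would allocate a new inode, which is one of the imperfect cases.)

```python
import hashlib
import os
import tempfile

def md5_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "lol_screenshot.png")
    with open(path, "wb") as f:
        f.write(b"original pixels")

    hash_before = md5_of(path)
    ino_before = os.stat(path).st_ino

    # Modify the file in place, standing in for a crop/edit.
    with open(path, "ab") as f:
        f.write(b"cropped")

    hash_changed = md5_of(path) != hash_before
    ino_same = os.stat(path).st_ino == ino_before
```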
An example implementation for this:
```python
import ctypes
import ctypes.wintypes
import os
from datetime import datetime, timezone

def _filetime_to_dt(ft):
    # FILETIME is 100ns intervals since 1601-01-01; shift to the Unix epoch.
    us = (ft.dwHighDateTime << 32) + ft.dwLowDateTime
    us = us // 10 - 11644473600000000
    return datetime.fromtimestamp(us / 1e6, tz=timezone.utc)

def _get_windows_metadata(file_path: str):
    try:
        file_handle = ctypes.windll.kernel32.CreateFileW(
            file_path, 0x00, 0x01 | 0x02 | 0x04, None, 0x03, 0x02000000, None
        )
        if file_handle == -1:
            raise ctypes.WinError()
        info = ctypes.wintypes.BY_HANDLE_FILE_INFORMATION()
        if not ctypes.windll.kernel32.GetFileInformationByHandle(
            file_handle, ctypes.byref(info)
        ):
            raise ctypes.WinError()
        ctypes.windll.kernel32.CloseHandle(file_handle)
        return {
            "path": file_path,
            # Separators keep distinct (volume, index) pairs from colliding.
            "uid": f"{info.dwVolumeSerialNumber}-{info.nFileIndexHigh}-{info.nFileIndexLow}",
            "size": (info.nFileSizeHigh << 32) + info.nFileSizeLow,
            "creation_time": _filetime_to_dt(info.ftCreationTime),
            "last_access_time": _filetime_to_dt(info.ftLastAccessTime),
            "last_write_time": _filetime_to_dt(info.ftLastWriteTime),
        }
    except Exception as e:
        return {"error": str(e)}

def _get_unix_metadata(file_path):
    try:
        stats = os.stat(file_path)
        return {
            "path": file_path,
            "uid": f"{stats.st_dev}-{stats.st_ino}",
            "size": stats.st_size,
            # Note: on Linux, st_ctime is inode-change time, not creation time.
            "creation_time": datetime.fromtimestamp(stats.st_ctime),
            "last_access_time": datetime.fromtimestamp(stats.st_atime),
            "last_write_time": datetime.fromtimestamp(stats.st_mtime),
        }
    except Exception as e:
        return {"error": str(e)}
```
> Kind of; essentially, I'm saying to Separate Concerns. For now, abstract out the storage implementation specifics from the TagStudio Library class/implementation. We could do a Factory or Prototype pattern or just provide an Abstract implementation. The Library should be agnostic to the storage backend. Then each storage implementation would handle figuring out how actually to implement the methods. (and avoid tangling the GUI with any of this, it becomes a big hot mess really fast) See projects like Napari for an idea of structuring larger PyQt projects.
I can see the flexibility gain of such a system; I'll look into the abstract classes and prototypes a bit more. I'll admit I tend to lean away from them because I'm not normally writing things that need plugins or configurable backends.
For Napari, I see they went with prototypes, but that repo is a lot to take in to try to understand the structure of what they did and why. I'll see if I can look it over a bit more when I have more time.
> We don't need to translate between file names and the ID. The file name, path, ID, and other metadata are already attached to the file. If we use the path as the identifier, we run into linking issues as files get moved around, and if we use a hash, when internal data is modified (like if you crop a photo), the hash changes.
I think I agree and am following this. So to look up the tags for a file, you would select a file, parse the system metadata, and use the system ID as the `Entry` ID, so that no matter where that file lives (Windows drive changes excluded), the tags and other metadata are applied correctly.
Or, in an active use scenario: you have a GUI, and it loads a library. That library has a storage-system-agnostic way of retrieving a list of files that are part of the library (if a file moves outside the library it won't be displayed, but unless the metadata was cleaned up, it would relink once it was returned to the library). Then, to collect the TagStudio-specific metadata, it at some point (instantiation, searching, or displaying tags) parses the file ID and requests the info from the library. So the GUI or another module of the Library is still operating on directories and filenames to know where to look, but the internal referencing of the metadata is based on this file ID. Is that basically what you are recommending?
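A toy end-to-end version of that flow (Unix-only; `file_uid`, `scan`, and the in-memory `tags` dict are hypothetical stand-ins for the library internals): scan the library directory, key the tag store by the stat-derived uid, move the file, re-scan, and the tags resolve again without any relinking.

```python
import os
import tempfile

def file_uid(path: str) -> str:
    st = os.stat(path)
    return f"{st.st_dev}-{st.st_ino}"

def scan(directory: str) -> dict[str, str]:
    """Map uid -> current path for every file under the library directory."""
    found = {}
    for dirpath, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(dirpath, name)
            found[file_uid(path)] = path
    return found

with tempfile.TemporaryDirectory() as library:
    os.mkdir(os.path.join(library, "downloads"))
    os.mkdir(os.path.join(library, "pictures"))
    src = os.path.join(library, "downloads", "test.png")
    with open(src, "wb") as f:
        f.write(b"png")

    # Tag metadata is keyed by uid, never by path.
    tags = {file_uid(src): ["screenshot"]}

    moved_path = os.path.join(library, "pictures", "test.png")
    os.rename(src, moved_path)

    # After the move, a fresh scan still resolves the same uid to its tags.
    resolved = {path: tags.get(uid, []) for uid, path in scan(library).items()}
```

The GUI still works in directories and filenames (the `scan` result), while the metadata lookup goes through the uid, which is the split described above.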
> I can see the flexibility gain of such a system, I'll look into the Abstract classes and Prototypes a bit more, I'll admit I tend to lean away from them because I'm not normally writing things that need plugins or configurable backends. For napari I see they went prototypes but that repo is a lot to take in to try and understand the structure of what and why they might have done something. I'll see if I can look over it a bit more when I have more time.
Napari is a great project, and I recommend giving it a look, but it has a different goal. We don't need to copy its systems per se -- the idea is just that they've been able to manage the separation of concerns pretty well in a larger Python Qt project. PyQt is nice as it's really easy to get started and have an MVP fast, but as soon as it grows in complexity and in contributors, the difficulty can ramp up fast. Separation of concerns, types, and documentation all really help here.
> I think I agree and am following on this. So to lookup tags from a file you would select a file, parse the system metadata and use the system ID as the Entry id, so that no matter where that file lives (windows drive changes excluded) the tags and other metadata are applied correctly. Or in an active use scenario the you have a GUI it loads a library. that library has a storage system agnostic way of retrieving a list of files that are part of the library (if a file moves outside the library then it won't be displayed but unless the metadata was cleaned up it would relink once it was returned to the library). Then to collect the TagStudio specific metadata it at some point (instantiation, searching or displaying tags) parses the file ID and requests the info from the library. So the GUI or another module of the Library is still operating on Directories & Filenames to know where to look but the internal referencing of the metadata is based on this file ID. Is that basically what you are recommending?
Exactly! This should minimize relinking and broken-link annoyances for the user. They can move files around, delete and restore them, have files with the same name, etc., all while the metadata (Tags) for the files stays magically linked. (We'd want some recycle bin and archival features as well for deleting files.)
> Napari is a great project, and I recommend giving it a look, but it has a different goal. We don't need to copy its systems per se -- the idea is just that they've been able to manage the separation of concerns pretty well in a larger Python Qt project. PyQt is nice as it's really easy to get started and have an MVP fast, but as soon as it grows in complexity and in contributors, the difficulty can ramp up fast. Separation of concerns, types, and documentation all really help here.
Yeah, I wouldn't think about copying it verbatim; I'm just looking for an understanding of the separation. After exploring for a little bit, and especially with the potential for future plugins, I'll be looking at protocols for this PR, but I'm still open to changes if there's a better suggestion.
> Exactly! This should minimize relinking and broken link annoyances for the user. They can move files around, delete and restore them, have files with the same name, etc. all while the metadata (Tags) for the files are magically linked. (We'd want some recycle bin and archival features as well for deleting files.)
Not going to lie, that sounds pretty appealing. I'm sure there are still some cases that this won't catch, but we would have those either way. I'll probably start working that way unless I hear direction otherwise or there are solid points against this.