TMSU
[Proposal] Non-Sequential IDs in SQLite3 Database
Proposal
Instead of using sequential integer `id`s (for the `file`, `tag`, and `value` tables, etc.), using non-sequential (e.g. random) integers would allow things like database merging.
Use Case
My intended use case is to maintain a consistent tag database across multiple machines. Currently, I just naively sync the sqlite3 db file across machines. However, this is dangerous as local database copies can get out of sync if I'm not careful to always pull the latest version before adding a file.
Ideally, if my databases get out of sync, I want to be able to seamlessly merge the changes a la git. My initial attempt at a solution was to dump the tmsu database into an SQL file and maintain that in a git repo. That way I could merge the SQL files with git and then just rebuild the database from the merged version.
However, since IDs are sequential, merging these SQL files causes file/tag/value `id`s to collide. Since it doesn't seem that tmsu is making use of the sequential nature of the current `id`s, changing these to random integers would allow use cases like the merging above without any user-visible effect elsewhere.
The changes pass all tests as of commit 15694e6c and should work seamlessly with both new and pre-existing databases.
This pull request doesn't touch `id`s in pre-existing databases. Only new entries will get random integers. For the sake of thoroughness, it might be worth creating a migration function. This would cover the (albeit rare) case where old databases need to be merged in the future.
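To make the idea concrete, generating such an id might look roughly like this (a minimal sketch of the general approach, not the actual patch; the table and column names are guesses rather than TMSU's real schema):

```go
// Sketch: assign a random 63-bit id to a new tag row instead of relying on
// SQLite's sequential rowid. Purely illustrative; the real change lives in
// the referenced commit and is structured differently.
package main

import (
	"database/sql"
	"log"
	"math/rand"

	_ "github.com/mattn/go-sqlite3" // assumed driver; TMSU has its own storage layer
)

// insertTag inserts a tag under a randomly chosen id. A real implementation
// would seed the generator (or use crypto/rand) and retry on the rare
// PRIMARY KEY violation.
func insertTag(db *sql.DB, name string) (int64, error) {
	id := rand.Int63() // uniform over [0, 2^63), fits SQLite's signed INTEGER
	_, err := db.Exec("INSERT INTO tag (id, name) VALUES (?, ?)", id, name)
	return id, err
}

func main() {
	db, err := sql.Open("sqlite3", "tmsu.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := insertTag(db, "example"); err != nil {
		log.Fatal(err)
	}
}
```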
Hi, random ids cannot fully resolve the problem because a collision is still possible (at very low probability, of course) if two databases happen to pick the same value. Why not use an auto-increment primary key and adjust the `sqlite_sequence` table so that each database starts from a sufficiently different value?
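For reference, that partitioning idea might be sketched like this (assuming, hypothetically, that the id columns were declared with AUTOINCREMENT, which is what populates `sqlite_sequence`; TMSU's actual schema may not do this, and the table names are assumptions):

```go
// Sketch: give each machine its own id range by bumping sqlite_sequence.
// Only meaningful if the tables use INTEGER PRIMARY KEY AUTOINCREMENT.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func reserveRange(db *sql.DB, start int64) error {
	for _, table := range []string{"file", "tag", "value"} {
		// A sqlite_sequence row only exists after the first insert into the
		// table, so this is a no-op on a completely fresh database.
		_, err := db.Exec(
			"UPDATE sqlite_sequence SET seq = ? WHERE name = ? AND seq < ?",
			start, table, start)
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	db, err := sql.Open("sqlite3", "tmsu.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// e.g. machine A keeps the default range, machine B starts at one billion.
	if err := reserveRange(db, 1000000000); err != nil {
		log.Fatal(err)
	}
}
```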
@ghlpo
Yes, collisions are theoretically possible, but if `math/rand` works as expected, we're producing a uniform distribution of 64-bit integers. By the birthday bound, the database would need on the order of 600 million rows before there's even a 1% chance of any id collision, which is far beyond any realistic tag database.
For this use case, random integers should be sufficient. Keep in mind that git commit hashes are essentially random, but we don't worry about collisions there either.
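For a rough sense of scale, the birthday-bound arithmetic looks like this (nothing TMSU-specific, just the approximation p ≈ n²/2N):

```go
// Back-of-the-envelope: how many random 64-bit ids can be drawn before the
// probability of any collision reaches p, using the birthday bound p ≈ n²/2N.
package main

import (
	"fmt"
	"math"
)

func main() {
	N := math.Pow(2, 64) // size of the id space
	for _, p := range []float64{0.01, 0.5} {
		fmt.Printf("p = %.2f needs roughly %.1e rows\n", p, math.Sqrt(2*N*p))
	}
	// Prints roughly 6.1e+08 rows for a 1% chance and 4.3e+09 for a coin flip.
}
```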
Anyway, the reason I can't just use an offset on the ids for one database is that I'm not looking merely to merge databases; I'm trying to keep the same database in sync across multiple machines. If I forget to sync and add two separate files on two separate machines, having non-colliding ids allows a seamless merge.
Let me know if my intent is unclear; I tried to explain it in the "Use Case" description above, but perhaps I didn't do a good job. My hope is that this could be combined nicely with something like git-annex and be useful to someone other than just me.
If you want to be able to merge easily, wouldn't hash-based IDs work better?
For example, you have a database D (the 'original database'). You access it on computer A and tag some files with tag T. You access it on computer B and tag some other files with tag T.
Now you have two databases that you need to merge, D[A] and D[B].
Under the 'random' scheme, merging results in two "different" tags T.
Under a hash-based scheme, the tags T are independently (and AFAICS correctly) assigned the same id -> no duplication.
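A minimal sketch of that hash-based assignment, using an FNV-1a hash of the tag name purely for illustration (the thread doesn't pick a particular hash, and nothing here is taken from TMSU):

```go
// Sketch of the hash-based scheme: derive the tag id from the tag name so two
// machines independently creating tag "T" end up with the same id.
package main

import (
	"fmt"
	"hash/fnv"
)

func tagID(name string) int64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	// Mask to 63 bits so the value stays positive in SQLite's signed INTEGER column.
	return int64(h.Sum64() & 0x7fffffffffffffff)
}

func main() {
	// Same name, same id, regardless of when or where the tag was created.
	fmt.Println(tagID("T"), tagID("T"))
}
```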
A separate issue: I do not agree with the idea of doing this always (regardless of whether ids are random or hash based). If it is implemented, IMO it should be optional.
@0ion9 Using hashes for tags and values does seem ideal. However, I opted for random numbers because, when using hash ids, we have to be careful when renaming or merging tags; those operations would force a global scan and update of the database. Also, for files, I'm not sure it's reasonable at all to have them keyed on hashes since file contents and path are mutable.
More than anything, this pull request is intended to start a discussion about the possibility of making tmsu usable in decentralized/distributed setups. For this pull request, I opted simply for easy implementation as a proof of concept kind of thing. Really just think of this as an 'issue' with example code attached.
I do not agree with the idea of doing this always...
What qualms do you have? In what ways are you thinking that sequential keys are/could be usefully leveraged?
As far as I can tell, the user cannot tell any difference between `id` assignment schemes at the moment, but am I missing something here?
You always have to do a global scan when merging anyway, as silent corruption of data is not acceptable and cannot be avoided under any id assignment scheme.
(git notably 'solves' this problem by reducing the chance of a collision to 1 in (2**159), by using a 160-bit hash [sha1]. Maybe possible for file ids, but utter overkill IMO)
Also, for files, I'm not sure it's reasonable at all to have them keyed on hashes since file contents and path are mutable.
I don't understand this objection. File table data can always be out of date (renames, moves, modifications, as you say), but I don't see how that changes things: the file table is still tmsu's 'best guess' of what is going on, and if it's wrong, a correct merge cannot be done in any case. Can you clarify please?
In what ways are you thinking that sequential keys are/could be usefully leveraged?
There are two:
- They store smaller. Considering that file_tag is a table of (file_id, tag_id, value_id) tuples, this can have a significant impact, proportional to the number of taggings in your db.
- Sequential keys enable chronological sorting of tags, values, files, and taggings (assuming you do not have more than 2**64 of anything ;). I use this for taggings ('latest X files tagged with Y').
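For illustration, a 'latest X files tagged with Y' query leaning on sequential ids might look something like this (table and column names are guesses at TMSU's schema, not taken from it):

```go
// Sketch of 'latest X files tagged with Y' leaning on sequential ids: under
// sequential assignment a larger file id roughly means 'added later', which is
// exactly the property random ids would lose.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func latestFilesWithTag(db *sql.DB, tag string, limit int) ([]string, error) {
	rows, err := db.Query(`
		SELECT f.directory || '/' || f.name
		FROM file_tag ft
		JOIN file f ON f.id = ft.file_id
		JOIN tag  t ON t.id = ft.tag_id
		WHERE t.name = ?
		ORDER BY f.id DESC -- 'newest first' only holds for sequential ids
		LIMIT ?`, tag, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var paths []string
	for rows.Next() {
		var p string
		if err := rows.Scan(&p); err != nil {
			return nil, err
		}
		paths = append(paths, p)
	}
	return paths, rows.Err()
}

func main() {
	db, err := sql.Open("sqlite3", "tmsu.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	paths, err := latestFilesWithTag(db, "Y", 10)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(paths)
}
```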
What qualms do you have? As far as I can tell, the user cannot tell any difference between id assignment schemes at the moment, but am I missing something here?
I think at the time I was a bit mixed up: I was thinking about your presumption that it was possible to do a quick (non-scanning) merge. Since this isn't true [i.e. collisions must, in any case, be handled], my objections are wholly covered in the above two points.
You always have to do a global scan when merging anyway, as silent corruption of data is not acceptable and cannot be avoided under any id assignment scheme.
Perhaps you're right. My patch may work just as well by replacing the random integers for tag and value `id`s with some hash instead. I just naively thought that this might cause complications where random integers wouldn't, so in the interest of implementation speed, I opted for the latter.
Maybe submitting this as a pull request was the wrong option, as it gives the impression that this is intended to be some kind of optimal solution to a problem? In my previous comment, I tried to emphasize that this is simply intended as a proof of concept and discussion starter.
I'm hoping the discussion here is more about the main idea than the particulars of my quick hack. Keep in mind that what I really want is to have my multiple (let's say) rsynced folders keep consistent tags on the files no matter where I tag from. I could just let rsync copy around the database, but that gives rise to problems when two separate locations add different tags, etc.
Is that something others may be interested in doing? I certainly think so.
They store smaller. Considering that file_tag is a table of (file_id, tag_id, value_id) tuples, this can have a significant impact, proportional to the number of taggings in your db.
My initial feeling here is that this is somewhat of a minor quibble, but I looked into it further to get some real numbers. An empty database is about 100 kB. My database with around 1000 entries and random `id`s is about 200 kB. With sequential `id`s this shrinks to about 150 kB. So at this size, random `id`s do look to be about twice the size per entry compared to sequential.
That said, this is a pretty tiny database. I've already got a hundred files in it, so to grow the database to anything appreciable, say about 100 MB, I'd have to be managing something like 100,000 files in it. These seem like pretty reasonable numbers to me. And I think having cool decentralized support in tmsu would be worth a few MB on my hard drive.
The whole point of this discussion, though, is to see what everyone else thinks! =)
Sequential keys enable chronological sorting of tags, values, files, and taggings
Huh. I'm surprised you use this. Anyway, even at present, `id`s don't perfectly preserve chronological information, as you may rename tags and such, so sorting on them is kind of exploiting an implementation side effect. To me, "chronological data" and "unique identifier" seem like odd things to conflate. Maybe it'd be worth creating an issue to add a `last_modified` date field to the database to allow for proper chronological sorting?
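If that route were taken, a minimal sketch of such a migration might look like this (hypothetical column and table names, invented for illustration; not an actual TMSU migration):

```go
// Hypothetical migration adding a last_modified timestamp to the tag table, so
// chronological sorting no longer piggybacks on id assignment.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "tmsu.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Unix timestamp; existing rows are backfilled with 'now'.
		`ALTER TABLE tag ADD COLUMN last_modified INTEGER`,
		`UPDATE tag SET last_modified = strftime('%s', 'now') WHERE last_modified IS NULL`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```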
I'm hoping the discussion here is more about the main idea than the particulars of my quick hack. Keep in mind that what I really want is to have my multiple (let's say) rsynced folders keep consistent tags on the files no matter where I tag from. I could just let rsync copy around the database, but that gives rise to problems when two separate locations add different tags, etc.
In case I haven't been clear: I'm not against supporting this, I just think that the premise 'we can assign ids such that we can just `select distinct` from each of the tables to merge them' is flawed. Random assignment doesn't achieve that, partitioning the ID space doesn't achieve that, hashing doesn't achieve that.
Consider the problem of merging two `tag` tables, O and P.
P was originally a branch of O, but O has also been updated since. We can presume that it's ok to simply update O (which we can call the 'master' copy) and then copy it to P to achieve sync.
Then:
- tags with the same ID and same name in both databases can be ignored (no update required). Under the `random` ID scheme, only tags that were in O before O and P diverged are likely to fit this. Under the `hash` ID scheme, 99% of tags that have the same name also have the same ID, whenever and wherever they are created.
- tags with the same name and different ID. Either O.id or P.id must be arbitrarily selected, and records in `P.file_tag` referring to the 'dropped' id must be updated to use the 'kept' id as they are copied into the final file_tag table. Since we are calling O the 'master' copy, let's say we always use O.id to resolve this conflict. Very common under the `random` scheme, pretty rare under the `hash` scheme.
- tags with the same ID and different name. This can occur through bad luck (`random` scheme) or hash collision (`hash` scheme). In this case, we must search for an unoccupied ID, which we could make efficient by storing the total set of `O.id`s ahead of time. Then we need to update records from `P.file_tag` from the old ID to the newly picked ID as they are copied across to the master copy.
The above process is actually fairly simple/efficient IMO: examine both the tag and value tables, generate two temporary lookup tables P.id -> O.final_id, and update the final `tag` / `value` tables from the final id specified in the lookup table. Then update P.file_tag tag and value ids according to the lookup tables.
Then comes the slow part:
Update the `file_tag` table with all items from P.file_tag that are not already present.
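To make the shape of that concrete, here is a rough sketch of the tag half of such a merge using an attached database (schema names are assumptions; the colliding-id case, the `value` and `file` tables, and untaggings are deliberately left out):

```go
// Rough sketch of merging P's tags and taggings into the master copy O along
// the lines described above. Nothing here is taken from TMSU's actual code.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "O.db") // O is the 'master' copy
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Pin a single connection so ATTACH and the temp table stay visible
	// across the statements below.
	db.SetMaxOpenConns(1)

	stmts := []string{
		`ATTACH DATABASE 'P.db' AS p`,

		// Lookup table P.id -> O.id for tags whose name exists in both copies
		// (the 'same name, different id' case: O.id wins).
		`CREATE TEMP TABLE tag_map AS
		   SELECT pt.id AS p_id, ot.id AS o_id
		   FROM p.tag pt JOIN tag ot USING (name)`,

		// Tags that exist only in P are copied across, keeping P's id. If that
		// id is already taken by a different tag in O this fails; a real merge
		// would pick a fresh id here, as described above.
		`INSERT INTO tag (id, name)
		   SELECT id, name FROM p.tag
		   WHERE name NOT IN (SELECT name FROM tag)`,

		// Rewrite P's taggings to O's tag ids and copy the missing rows.
		// Assumes file_tag's primary key covers these columns, so OR IGNORE
		// skips exact duplicates. File and value ids would need the same
		// mapping treatment, omitted here.
		`INSERT OR IGNORE INTO file_tag (file_id, tag_id, value_id)
		   SELECT ft.file_id, COALESCE(m.o_id, ft.tag_id), ft.value_id
		   FROM p.file_tag ft
		   LEFT JOIN tag_map m ON ft.tag_id = m.p_id`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```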
I believe this is a minimal description. There are other complications to be addressed, such as:
- 'what if either O or P contains intentional untaggings? Won't a naive merge wrongly restore these?' (ditto for intentional removal of tags or values -- you would have to be able to know that the 'removed' state was new and the 'unremoved' state was old. It seems to me some kind of log would be needed, as obviously your 'last_modified' idea isn't relevant to a record that no longer exists ;)
- If the user ran `tmsu merge foo bar` on P, how could we handle that? The process I describe above would mean that `foo` would reappear in the final DB, but any taggings of 'bar' in P would be preserved as bar.
- getting even more pathological, but still pretty realistic: what if the user runs `tmsu merge foo bar` in O, and `tmsu merge bar foo` in P?
- what if a file is renamed in database P, and also modified?
- implications also need to be handled (and have the same ambiguous-update issues)
- what if P's (file|directory|symlink)FingerprintAlgorithm is changed so that it differs from O? (answer: everyone's head explodes)
- many of these can be addressed through the idea of logging, but we should keep hammering out every possible case so we avoid silent corruption.
The doubling of size is interesting -- it suggests that SQLite's storage of small values could be more efficient, as typical tag / value / file ids should be 8-16 bits, and your generated IDs seem to be 64-bit [i.e. I would have expected a factor of more than 2].
A few more sample points:
- I've got a database with about 2 million taggings / 845k files; it's 425M in size.
- Also a database with 56k taggings / 19k files @ 11M.
- both of these have roughly 3200 tags, 400 values.
Anyway, even at present, ids don't perfectly preserve chronological information, as you may rename tags and such, so sorting on them is kind of exploiting an implementation side effect.
Actually, I made a mistake: sorting taggings is not dependent on ID at all, only on on-disk insertion order (which is still an implementation detail of sqlite, admittedly; I don't know whether VACUUMing the database would modify this order, etc.).
I don't sort tags or values chronologically, although I can see some use for doing so. Anyway, the insertion-order trick can be applied here, so this is independent of ID assignment too AFAIK.
In case I haven't been clear: I'm not against supporting this, I just think that the premise 'we can assign ids such that we can just `select distinct` from each of the tables to merge them' is flawed. Random assignment doesn't achieve that, partitioning the ID space doesn't achieve that, hashing doesn't achieve that.
Stoked that you're thinking about this. And yeah, this current patch is super naive. Currently, I just bolted tmsu and git together by keeping an SQL dump of the database checked into a git repo and letting git handle all the merging via git-hooks, since it has solved that problem pretty well.
Anyway, the idea I really have in my head is something akin to a bunch of tmsu databases that know about each other, with an example workflow something like this:
$ tmsu init
$ tmsu remote add foo_remote [email protected]/tmsu.db
$ tmsu tag file.foo foo bar
$ tmsu sync
And then the tags on `foo_remote` "just work". Instead of putting a bunch of complicated database merging logic and stuff in tmsu itself, wouldn't it be cleaner to replace sqlite with some distributed database system a la rqlite or the like?
Anyway, appreciate your input!
Currently, I just bolted tmsu and git together by keeping an SQL dump of the database checked into a git repo and letting git handle all the merging via git-hooks, since it has solved that problem pretty well.
Do your git-hooks translate id numbers in file_tag in diffs to human readable values? If so, would you consider publishing them?
Anyway, the idea I really have in my head is something akin to a bunch of tmsu databases that know about each other, with an example workflow something like this:
And then the tags on foo_remote "just work". Instead of putting a bunch of complicated database merging logic and stuff in tmsu itself, wouldn't it be cleaner to replace sqlite with some distributed database system a la rqlite or the like?
rqlite / raft seems like it does the right logging stuff to achieve the desired end. However, purely from a maintenance standpoint, either approach requires serious justification; I think I'd want to know what @oniony thinks about them before proceeding further down that track.
Do your git-hooks translate id numbers in file_tag in diffs to human readable values? If so, would you consider publishing them?
https://github.com/xelxebar/tmsu-sql-filters Just for you =)
It's a clean and smudge filter for the SQL database dump. Since it depends on database internals, it's probably pretty fragile, though. Oh, also, just so you know, these scripts throw several things in `/tmp` and don't bother deleting them, just in case things go awry.
I think I'd want to know what @oniony thinks about them before proceeding further down that track.
Definitely.