History deduplication/normalization
I've been thinking about this for a little while, and if it were me, I'd split the history data into separate tables. For example:
- A table called `commands`, or something like that. This would contain fields specific to the command itself, not to executions of it: at least two fields, the command string and the interpreter program's name-or-path, with a unique constraint on the combination of the two.
- A table called something like `executions`. This would have a foreign key to entries in the `commands` table, and would include all the contextual information that might vary between executions which Atuin is already storing or considering storing: timestamp, working directory, environment variables, the result (for shell commands that's the exit status, but an integration could in principle want to capture other things: a shell might want to pipe stderr through itself so that it could save a copy of that too, Python might want to store a "pickle" serialization of the return value or raised exception, etc.), the host it was executed on, and so on.
- I would normalize further in this vein as I noticed big sources of redundancy. For example, almost all commands probably happen on one or a handful of hosts, and thousands of commands happen with the same exact set of environment variables. Those could also be pulled out of `executions` into their own tables and linked back by foreign keys.
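To make the split concrete, here is a minimal sketch using SQLite via Python's `sqlite3`. All table and column names are illustrative stand-ins, not Atuin's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commands (
    id          INTEGER PRIMARY KEY,
    command     TEXT NOT NULL,       -- the command string itself
    interpreter TEXT NOT NULL,       -- interpreter name-or-path, e.g. '/bin/zsh'
    UNIQUE (command, interpreter)    -- one row per distinct command
);
CREATE TABLE executions (
    id          INTEGER PRIMARY KEY,
    command_id  INTEGER NOT NULL REFERENCES commands(id),
    timestamp   INTEGER NOT NULL,    -- when it ran
    cwd         TEXT,                -- working directory
    hostname    TEXT,                -- host it was executed on
    exit_status INTEGER              -- result, for shell commands
);
""")

def record(command, interpreter, **ctx):
    # Upsert the command, then record one execution pointing at it.
    conn.execute(
        "INSERT OR IGNORE INTO commands (command, interpreter) VALUES (?, ?)",
        (command, interpreter),
    )
    (command_id,) = conn.execute(
        "SELECT id FROM commands WHERE command = ? AND interpreter = ?",
        (command, interpreter),
    ).fetchone()
    conn.execute(
        "INSERT INTO executions (command_id, timestamp, cwd, hostname, exit_status)"
        " VALUES (?, ?, ?, ?, ?)",
        (command_id, ctx.get("timestamp", 0), ctx.get("cwd"),
         ctx.get("hostname"), ctx.get("exit_status")),
    )

# Running the same command twice stores the command string once
# and one small executions row per run.
record("ls -la", "/bin/zsh", timestamp=1, exit_status=0)
record("ls -la", "/bin/zsh", timestamp=2, exit_status=0)
```

The point of the sketch is the shape of the one-to-many relationship: repetition cost moves from "command string per run" to "one small row of varying context per run".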
Mainly because:
- the vast majority of the commands I run in almost any shell or REPL are repetitions of commands I ran before, and a tiny minority of commands makes up a large majority of repetitions,
- by far the most common (approximately the only) reason I search my history is to replay commands,
- I have my shells/REPLs deduplicate history when I can, because to me a command that was useful even once is far more valuable to keep than repetitions of the same command, and is almost always wanted again eventually, and
- my data-normalization senses are tingling.
TL;DR: some pieces of history information, like a command string, are much more timeless and repetitive than other pieces, like return value; there is a natural one-to-many relationship between them (and there is a good chance that leads to significant data size and query speed gains "at scale" / for large histories).
I've had similar thoughts for a while now. I'm open to trying it, but we do need to be careful with our DB migrations.
So there's this approach to migrations I've been advocating for years:
- keep the old tables with the old data in the old schemas,
- add new tables with the new schemas (slapping a `2` at the end of table names if necessary),
- write new entries to the new tables,
- (optional) when a row would get updated in an old table, delete that data and insert it into the new table, and
- read from both tables (it's often possible to have this look like just the new table to queries/code: for example, in SQL we can use `UNION ALL` on the select we would do on the new table(s) and the select we would do on the old table(s), with the latter munged as needed to have its rows look like the former's. This can be done once in a view definition, decoupling everything else from it).
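The steps above can be sketched in SQLite (again via Python's `sqlite3`, with made-up table names standing in for the real ones):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Old schema, kept as-is with its existing data.
CREATE TABLE history (
    command   TEXT,
    timestamp INTEGER
);
INSERT INTO history VALUES ('ls', 1);

-- New schema (a '2' slapped on the table name); new entries go here.
CREATE TABLE history2 (
    command     TEXT,
    interpreter TEXT,
    timestamp   INTEGER
);
INSERT INTO history2 VALUES ('cargo build', '/bin/zsh', 2);

-- One view UNION ALLs both, munging old rows to look like new ones,
-- so the rest of the code reads from a single place.
CREATE VIEW history_all AS
    SELECT command, interpreter, timestamp FROM history2
    UNION ALL
    SELECT command, NULL AS interpreter, timestamp FROM history;
""")

rows = conn.execute(
    "SELECT command FROM history_all ORDER BY timestamp"
).fetchall()
```

Everything downstream of the view only ever sees `history_all`, so the old/new branching lives in exactly one spot.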
In principle, I think there's always at least one good factoring of the code where this ends up being very clean/readable/maintainable: conditional branches in a minimum of neatly isolated spots, selecting between the logic for each version of our schema, which can even be exposed individually as reusable pieces, helping third parties who also want to write code that works across schema versions.
In practice, I think the biggest challenge is if the code was written in a way that complects things such that this becomes difficult. But I would argue that this is inherently "artificial" difficulty, that how easy or hard this is to do is itself a good measuring stick for certain dimensions of code quality, and that making it easy to do necessarily has significant ripple benefits.
So I don't know exactly how careful you want to be, but we can get very careful if we want.
Has this been abandoned?
It was never worked on, and I wouldn't say it's needed
It's also not going to work so well with:
- The record store
- Replacement search engine
But I appreciate the storage benefits. Generally, I'd rather make the storage trade-off and gain a simpler sync implementation and better search queries.