sist2
specifiable file list for scan and incremental index
sist2 version: master
Platform (Linux or Docker, x86-64 or arm64): linux
Elasticsearch version: unrelated
`sist2 scan --incremental` is a nice feature, but it's still a bit slow on folders with millions of small files. I have one that is roughly 50GB, and `--incremental` isn't helping that much compared to starting from scratch.
I am thinking about using fswatch to aggregate the changes over a specific period of time into something like a TODO list for `sist2 scan`, in addition to the existing `PATH` option.
Unlike `PATH`, which requires sist2 to traverse the tree from the given root, this "TODO" file would let sist2 jump directly to the changed files.
Also, will the current version of sist2 remove a file from Elasticsearch when the file is no longer in the index produced by `sist2 scan`? I didn't see any code related to this, though.
will the current version of sist2 remove a file from elasticsearch when a file is no longer in the index
No, it does not (unless you specify `--force-reset`).
I was trying to push an index scanned with `sist2 scan --incremental` to Elasticsearch, but it seemed to overwrite every existing document on Elasticsearch.
I would suggest making `sist2 scan --incremental` write only the incremental part to a new `__index__xxx` on each run.
Maybe optionally skip the following line, so that unchanged files are simply left out of the new index. This seems to make `sist2 index` send only the latest `__index__xxx` to Elasticsearch after the scan.
https://github.com/simon987/sist2/blob/95bbe39afc4f03281ea5716e7784aa1f1fba2edd/src/parsing/parse.c#L51
The incremental part `__index__xxx` can also be merged into a large lmdb file during the scan, like the one you created for thumbnails, with `hash(filename)` as keys and maybe `(mtime, uuid)` as values.
Then during the next scan, `sist2 scan --incremental` can check the mtime in the lmdb and decide whether to put a file into the scanning list.
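To make the lookup idea above concrete, here is a minimal Python sketch of the decision step. It is not how sist2 works internally: a plain `dict` stands in for the lmdb file, SHA-1 of the path stands in for `hash(filename)`, and the helper names (`path_key`, `needs_scan`, `record`, `files_to_scan`) are hypothetical.

```python
import hashlib
import os


def path_key(path: str) -> str:
    """Stable key for a path; a stand-in for hash(filename)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def needs_scan(path: str, mtime: int, store: dict) -> bool:
    """True if the path is unknown or its stored mtime differs."""
    entry = store.get(path_key(path))
    return entry is None or entry[0] != mtime


def record(path: str, mtime: int, doc_id: str, store: dict) -> None:
    """Remember (mtime, id) for a path that has just been scanned."""
    store[path_key(path)] = (mtime, doc_id)


def files_to_scan(root: str, store: dict) -> list:
    """Walk root and return the sorted paths that are new or modified."""
    todo = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = int(os.stat(path).st_mtime)
            if needs_scan(path, mtime, store):
                todo.append(path)
    return sorted(todo)
```

Note that this sketch still walks the whole tree and stats every file; it only avoids re-parsing unchanged files. Skipping the traversal itself is what the file-list proposal is for.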
~~Also, the thumbnails in the lmdb created by sist2 can be beneficial for many other programs like https://github.com/filebrowser/filebrowser, but right now it's a bit complicated to get the uuid as it requires elasticsearch.~~
~~Creating a lmdb file for mapping the file path to uuids would make this possible, and I am willing to make a fork of filebrowser for sist2. :)~~
See #127
Hello @acc557,
I almost never use incremental scan myself. I noticed that in v2.8.5 the incremental scan is essentially broken. If you could try again with v2.9.0 and tell me whether you see a significant performance improvement, that would be great. I think the new fix should remove the need to implement this feature.
It's still a bit slow on my side.
I have mixed content (.jpg, .doc, etc.), but most of it doesn't change between two scans, and that's why I am interested in incremental scans.
Also, a partial index generated by incremental scans would greatly save time when submitting the index to Elasticsearch.
@acc557 Hey, I would be interested in learning more about how you are planning to use fswatch.
I like the idea of using that for like a TODO list.
@dpieski I would make a script to save the list of newly created files and run sist2 with the list every hour. It would be even better if sist2 could take the list from stdin; then the entire process could be as easy as `fswatch | pathfilter.py | sist2` or `find | pathfilter.py | sist2`.
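For illustration, a `pathfilter.py` along these lines might look like the sketch below. The thread never shows its contents, so the behaviour here is an assumption: it reads one path per line, drops blank lines and duplicates (fswatch can report the same file several times), and keeps only paths that are still regular files.

```python
import os
import sys
from typing import Callable, Iterable, Iterator


def filter_paths(
    lines: Iterable[str],
    is_file: Callable[[str], bool] = os.path.isfile,
) -> Iterator[str]:
    """Yield each distinct path once, in first-seen order, if it is a regular file."""
    seen = set()
    for line in lines:
        path = line.rstrip("\n")
        if not path or path in seen:
            continue  # skip blank lines and repeated events for the same file
        seen.add(path)
        if is_file(path):  # deleted files and directories are dropped
            yield path


def main() -> None:
    """Filter newline-separated paths from stdin to stdout."""
    for path in filter_paths(sys.stdin):
        print(path)
```

Adding the usual `if __name__ == "__main__": main()` guard at the bottom would turn this into a pipeline stage. The `is_file` parameter exists only so the filter can be exercised without touching the filesystem.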
Will this be supported soon?
It's low priority for me but I can bump it up a bit if many people are interested in that feature
I added the changes in 81008d8936945907c3a4e8d195fc523b70a0bdd5. In theory it should work, but I have not yet tested it thoroughly. Use `--list-file -` to read from stdin.