sist2

specifiable file list for scan and incremental index

Open ghost opened this issue 3 years ago • 11 comments

sist2 version: master

Platform (Linux or Docker, x86-64 or arm64): linux

Elasticsearch version: unrelated

sist2 scan --incremental is a nice feature, but it is still slow on folders with millions of small files. I have one that is roughly 50GB, and --incremental isn't helping much compared to starting from scratch.

I am thinking about using fswatch to aggregate the changes over a specific period of time into something like a TODO list for sist2 scan, in addition to the existing PATH option. With PATH, sist2 has to traverse from the given root; with this "TODO" file, it could jump directly to the changed files.
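As a rough illustration of the aggregation step, here is a minimal Python sketch that groups change events into fixed time windows and keeps each path once per window. The function name `batch_events`, the `(timestamp, path)` event shape, and the one-hour default window are assumptions made for illustration; neither fswatch nor sist2 defines them.

```python
from collections import OrderedDict


def batch_events(events, window_seconds=3600):
    """Group (timestamp, path) change events into fixed time windows.

    Hypothetical helper: returns {window_start: [paths]}, where each path
    appears once per window, in first-seen order.
    """
    batches = OrderedDict()
    for timestamp, path in events:
        # Align the timestamp to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        seen = batches.setdefault(window_start, OrderedDict())
        seen.setdefault(path, None)  # dict keys dedupe and keep insertion order
    return {start: list(paths) for start, paths in batches.items()}
```

Each window's list could then be written out, one path per line, as the "TODO" file for that period.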

ghost avatar Nov 12 '20 14:11 ghost

Also, will the current version of sist2 remove a file from elasticsearch when a file is no longer in the index produced by sist2 scan? Didn't see any code related to this, though.

ghost avatar Nov 12 '20 15:11 ghost

> will the current version of sist2 remove a file from elasticsearch when a file is no longer in the index

No, it does not (unless you specify --force-reset)

simon987 avatar Nov 12 '20 16:11 simon987

I was trying to push an index scanned with sist2 scan --incremental to Elasticsearch, but it seemed to overwrite every existing document in Elasticsearch.

I would suggest making sist2 scan --incremental write only the incremental part to a new __index__xxx on each run. Maybe optionally skip the following line, so that it simply ignores the unchanged files in the new index. This should make sist2 index send only the latest __index__xxx to Elasticsearch after the scan. https://github.com/simon987/sist2/blob/95bbe39afc4f03281ea5716e7784aa1f1fba2edd/src/parsing/parse.c#L51

The incremental part __index__xxx can also be merged into a large lmdb file during the scan, like the one you created for thumbnails, with hash(filename) as keys and maybe (mtime, uuid) as values. Then, during the next scan, sist2 scan --incremental can check the mtime in the lmdb and decide whether to put a file into the scan list.
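To make the idea concrete, here is a minimal Python sketch of the mtime check. It is not how sist2 works today: a plain dict stands in for the lmdb file, SHA-1 is just one possible choice for hash(filename), and the names `path_key`, `needs_scan` and `record_scan` are hypothetical.

```python
import hashlib


def path_key(path):
    """Hash the file path to get a fixed-size key (SHA-1 is an assumption)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def needs_scan(store, path, mtime):
    """Return True if the path is new or its mtime differs from the stored one."""
    entry = store.get(path_key(path))
    return entry is None or entry[0] != mtime


def record_scan(store, path, mtime, uuid):
    """Remember the (mtime, uuid) pair for a file that was just scanned."""
    store[path_key(path)] = (mtime, uuid)
```

Only files for which `needs_scan` returns True would go into the scan list; everything else is skipped without being parsed again.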

ghost avatar Nov 13 '20 08:11 ghost

~~Also, the thumbnails in the lmdb created by sist2 can be beneficial for many other programs like https://github.com/filebrowser/filebrowser, but right now it's a bit complicated to get the uuid as it requires elasticsearch.~~

~~Creating a lmdb file for mapping the file path to uuids would make this possible, and I am willing to make a fork of filebrowser for sist2. :)~~

See #127

ghost avatar Nov 13 '20 08:11 ghost

Hello @acc557,

I almost never use incremental scan myself. I noticed that in v2.8.5 the incremental scan is essentially broken. If you could try again with v2.9.0 and tell me whether you see a significant performance improvement, that would be great. I think the new fix should remove the need to implement this feature.

simon987 avatar Dec 31 '20 14:12 simon987

It's still a bit slow on my side.

I have mixed content (.jpg, .doc, etc.), but most of it doesn't change between two scans, which is why I am interested in incremental scans.

A partial index generated by incremental scans would also save a lot of time when submitting the index to Elasticsearch.

ghost avatar Jan 03 '21 06:01 ghost

@acc557 Hey, I would be interested in learning more about how you are planning to use fswatch.

I like the idea of using that for like a TODO list.

dpieski avatar Jan 03 '21 07:01 dpieski

@dpieski I would make a script to save the list of newly created files and run sist2 with the list every hour. It would be even better if sist2 could take the list from stdin; then the entire process could be as easy as fswatch | pathfilter.py | sist2 or find | pathfilter.py | sist2.
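A minimal sketch of what such a pathfilter.py could look like, assuming the upstream command prints one path per line. The function name `filter_paths` and the rule it applies (keep each path once, drop anything that is not a regular file at filter time) are assumptions for illustration, not an existing script.

```python
import os
import sys


def filter_paths(lines, is_file=os.path.isfile):
    """Yield each path once, skipping blanks and paths that are not regular files."""
    seen = set()
    for line in lines:
        path = line.rstrip("\n")
        if not path or path in seen:
            continue
        seen.add(path)
        # Files deleted since the event was emitted are dropped here.
        if is_file(path):
            yield path


if __name__ == "__main__":
    for path in filter_paths(sys.stdin):
        print(path)
```

The `is_file` parameter exists only so the filter can be tested without touching the disk; in the pipeline the default `os.path.isfile` is used.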

ghost avatar Jan 03 '21 08:01 ghost

Will this be supported soon?

ghost avatar Dec 26 '21 04:12 ghost

It's low priority for me but I can bump it up a bit if many people are interested in that feature

simon987 avatar Dec 26 '21 14:12 simon987

I added the changes in 81008d8936945907c3a4e8d195fc523b70a0bdd5. In theory it should work, but I have not yet tested it thoroughly. Use --list-file - to read from stdin.

simon987 avatar Dec 29 '21 23:12 simon987