Add new setting `checksum_as_id`

Open Jasmeet2011 opened this issue 2 years ago • 7 comments

We are indexing several folders, many of which contain duplicate files, and we would like to index only one copy of each duplicate. One option Elasticsearch offers is to run a separate deduplication query through Logstash. Would it be possible instead to set the document id to the file's checksum at the time of writing to the ES index? That would prevent duplicate files from being written in the first place.

Jasmeet2011 avatar May 14 '22 06:05 Jasmeet2011

That's already the case, but the folder is taken into account as well.

So maybe we could add an option to ignore the folder name.

dadoonet avatar May 14 '22 14:05 dadoonet
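To make the proposal concrete, the new option could sit next to the existing id-related settings in the job file. This is only a sketch of the suggested setting, not something FSCrawler supports today; the name `checksum_as_id` is taken from this issue's title.

```yaml
name: "my_job"
fs:
  url: "/tmp/es"
  # Proposed (not yet implemented): derive the Elasticsearch _id from the
  # file's content checksum rather than from the folder + filename, so the
  # same file found in two folders maps to a single document.
  checksum_as_id: true
```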

Could you please tell me whether there is a setting in the YAML file to change the document id to a user-defined field (the checksum, in this case)? I could not locate it in the documentation. Thanks

Jasmeet2011 avatar May 16 '22 05:05 Jasmeet2011

Actually, I think you could use https://fscrawler.readthedocs.io/en/fscrawler-2.9/admin/fs/local-fs.html#using-filename-as-elasticsearch-id

dadoonet avatar May 16 '22 18:05 dadoonet
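For reference, per that documentation page the existing option goes in the job's `_settings.yaml`, along these lines (a sketch, assuming the 2.9 settings layout):

```yaml
name: "my_job"
fs:
  url: "/tmp/es"
  # Use the filename (rather than a hash derived from the full path) as
  # the _id, so the same filename always yields the same document id.
  filename_as_id: true
```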

The same filename will generate only one document.

dadoonet avatar May 16 '22 18:05 dadoonet

I have tried using `filename_as_id: true`. This works for identical filenames, whereas our requirement is something like `checksum_as_id: true`, so that files with identical content are not indexed again. This does not work in the present version. Could you suggest any workaround? Thanks

Jasmeet2011 avatar May 17 '22 16:05 Jasmeet2011

Do you mean a checksum of the binary content itself?

dadoonet avatar May 17 '22 16:05 dadoonet

Yes. That way I can prevent duplicate files from being indexed from the folders which contain duplicates.

Jasmeet2011 avatar May 17 '22 16:05 Jasmeet2011
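A possible workaround with current releases, sketched here without having been tested: have FSCrawler compute a content checksum (the documented `fs.checksum` setting stores it in the `file.checksum` field) and route indexing through an Elasticsearch ingest pipeline that copies that checksum into `_id`. The job name and pipeline name below are illustrative.

```yaml
# _settings.yaml (sketch)
name: "my_job"
fs:
  url: "/tmp/es"
  checksum: "MD5"                   # index an MD5 checksum in file.checksum
elasticsearch:
  pipeline: "dedupe_by_checksum"    # run every document through this pipeline
```

```json
PUT _ingest/pipeline/dedupe_by_checksum
{
  "description": "Use the file content checksum as the document _id",
  "processors": [
    { "set": { "field": "_id", "value": "{{file.checksum}}" } }
  ]
}
```

With the checksum as `_id`, a duplicate file should overwrite rather than duplicate the existing document. Note the checksum covers the file content only, so two files in different folders with identical content would collapse into a single document, which is the behavior requested here.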