Add new setting `checksum_as_id`

Open Jasmeet2011 opened this issue 2 years ago • 7 comments

We are indexing several folders, many of which contain duplicate files, and we would like to index only one copy of each duplicate. One option Elasticsearch offers is to run a separate deduplication query through Logstash. Would it be possible instead to set the document id to the file's checksum at the time of writing to the ES index? That would prevent duplicate files from being written in the first place.

Jasmeet2011 avatar May 14 '22 06:05 Jasmeet2011

That's already the case, but the folder is taken into account as well.

So maybe we could add an option to ignore the folder name.

dadoonet avatar May 14 '22 14:05 dadoonet
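To make the proposal concrete, the new option could sit next to the existing id-related settings in the job file. This is only a sketch of the suggested setting, not something FSCrawler supports today; the name `checksum_as_id` is taken from this issue's title.

```yaml
name: "my_job"
fs:
  url: "/tmp/es"
  # Proposed (not yet implemented): derive the Elasticsearch _id from the
  # file's content checksum rather than from the folder + filename, so the
  # same file found in two folders maps to a single document.
  checksum_as_id: true
```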

Could you please tell me whether there is a setting in the YAML file to change the document id to a user-defined field (the checksum, in this case)? I could not locate it in the documentation. Thanks

Jasmeet2011 avatar May 16 '22 05:05 Jasmeet2011

Actually, I think you could use https://fscrawler.readthedocs.io/en/fscrawler-2.9/admin/fs/local-fs.html#using-filename-as-elasticsearch-id

dadoonet avatar May 16 '22 18:05 dadoonet
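For reference, per that documentation page the existing option goes in the job's `_settings.yaml`, along these lines (a sketch, assuming the 2.9 settings layout):

```yaml
name: "my_job"
fs:
  url: "/tmp/es"
  # Use the filename (rather than a hash derived from the full path) as
  # the _id, so the same filename always yields the same document id.
  filename_as_id: true
```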

The same filename will generate only one document.

dadoonet avatar May 16 '22 18:05 dadoonet

I have tried using `filename_as_id: true`. This works for identical filenames, whereas our requirement is something like `checksum_as_id: true`, so that files with identical content are not indexed again. This does not work in the present version. Could you suggest any workaround? Thanks

Jasmeet2011 avatar May 17 '22 16:05 Jasmeet2011

Do you mean a checksum of the binary content itself?

dadoonet avatar May 17 '22 16:05 dadoonet

Yes. That way I can prevent duplicate files from being indexed from the folders which contain duplicates.

Jasmeet2011 avatar May 17 '22 16:05 Jasmeet2011
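A possible workaround with current releases, sketched here without having been tested: have FSCrawler compute a content checksum (the documented `fs.checksum` setting stores it in the `file.checksum` field) and route indexing through an Elasticsearch ingest pipeline that copies that checksum into `_id`. The job name and pipeline name below are illustrative.

```yaml
# _settings.yaml (sketch)
name: "my_job"
fs:
  url: "/tmp/es"
  checksum: "MD5"                   # index an MD5 checksum in file.checksum
elasticsearch:
  pipeline: "dedupe_by_checksum"    # run every document through this pipeline
```

```json
PUT _ingest/pipeline/dedupe_by_checksum
{
  "description": "Use the file content checksum as the document _id",
  "processors": [
    { "set": { "field": "_id", "value": "{{file.checksum}}" } }
  ]
}
```

With the checksum as `_id`, a duplicate file should overwrite rather than duplicate the existing document. Note the checksum covers the file content only, so two files in different folders with identical content would collapse into a single document, which is the behavior requested here.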