fscrawler
Add new setting `checksum_as_id`
We are indexing several folders, many of which contain duplicate files, and we would like to index only one copy of each duplicate. One option ES provides is to write a separate deduplication query in Logstash. Is it possible to change the document id to the checksum value at the time of writing to the ES index? This would prevent duplicate files from being written to the index.
That's already the case, but the folder is taken into account as well.
So we could maybe add an option to ignore the folder name.
Could you please tell me whether there is a setting in the YAML file that changes the document id to a user-defined field (the checksum, in this case)? I could not locate it in the documentation. Thanks.
Actually, I think you could use https://fscrawler.readthedocs.io/en/fscrawler-2.9/admin/fs/local-fs.html#using-filename-as-elasticsearch-id
The same filename will generate only one document.
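For reference, per the linked docs that option goes in the crawler job's settings file. A minimal sketch (the `name` and `url` values here are illustrative placeholders):

```yaml
name: "my_job"
fs:
  url: "/path/to/data"
  # Use the filename (instead of the full path) to derive the _id,
  # so the same filename maps to one document.
  filename_as_id: true
```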
I have tried using `filename_as_id: true`. This works for identical filenames, whereas our requirement is a `checksum_as_id: true` option, so that identical files are not indexed again. This does not work in the present version. Could you suggest a workaround? Thanks.
Do you mean checksum on the binary content itself?
Yes. That way I can prevent duplicate files from getting indexed from the folders which contain duplicates.
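The idea behind the requested setting can be sketched outside fscrawler: hash the file's binary content and use the digest as the Elasticsearch `_id`, so indexing a duplicate overwrites the existing document instead of creating a second one. A minimal Python sketch (the `checksum_id` helper is hypothetical, not part of fscrawler):

```python
import hashlib


def checksum_id(data: bytes) -> str:
    """Derive a document id from the file's binary content.

    Identical content always produces the same id, regardless of
    which folder the file lives in, so a second copy indexed under
    this id replaces the first rather than duplicating it.
    """
    return hashlib.md5(data).hexdigest()


# Two copies of the same file in different folders share one id...
id_a = checksum_id(b"same report contents")
id_b = checksum_id(b"same report contents")
assert id_a == id_b

# ...while different content still gets a distinct id.
id_c = checksum_id(b"different contents")
assert id_a != id_c
```

Passing this digest as the document `_id` at index time would give the deduplication behavior described above, at the cost of reading each file's full content to hash it.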