zotero-storage-scanner

What are the criteria for "#duplicate_attachments"?

Open z5tron opened this issue 6 years ago • 10 comments

I have lots of items labelled "#duplicate_attachments", but their attachments are in different zotero_storage folders and have different sizes, different names (the title shown in the middle panel), different physical file names, and different modification times.

Is this a feature or a bug?

Thanks.

z5tron avatar Jul 16 '18 19:07 z5tron

I'm not sure what you mean by different zotero_storage (different profiles? different libraries? different folders?), but the logic right now is that an attachment is counted as a duplicate if there are two or more attachments of the same type (doc, docx, pdf, whatnot) under the same reference item.

retorquere avatar Jul 16 '18 19:07 retorquere
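
For readers following along, the rule described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the function names are hypothetical, and it assumes "type" means the lowercased file extension, which the thread does not confirm.

```python
from collections import Counter
from pathlib import PurePath


def file_type(filename):
    """Return the lowercased extension without the dot, or '' if there is none.

    Assumption: "type" is taken from the file extension.
    """
    return PurePath(filename).suffix.lower().lstrip(".")


def has_duplicate_attachments(filenames):
    """Return True if two or more attachments of one item share a file type.

    Hypothetical helper: `filenames` is the list of attachment file names
    that belong to a single reference item.
    """
    counts = Counter(file_type(name) for name in filenames)
    counts.pop("", None)  # ignore entries without a recognizable extension
    return any(n >= 2 for n in counts.values())
```

Under these assumptions, an item with `paper.pdf` and `supplement.pdf` would be flagged, while an item with `book.epub` and `book.mobi` would not.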

I meant the physical folder named "zotero/storage/". But you have answered my question. There is still a problem: I have a book item with a "Google Books Link" (URL link), an epub, and a mobi, three attachments under this book in total, each with a different file type. It is marked as "#duplicate_attachments".

z5tron avatar Jul 18 '18 19:07 z5tron

I'd have to look at a copy of your database to tell why that happens, I don't have an immediate explanation.

retorquere avatar Mar 21 '19 17:03 retorquere

So, if one item has 2 or more attachments with the same file type, they will be treated as duplicates?

JsHuang avatar Oct 20 '21 03:10 JsHuang

Yes.

retorquere avatar Oct 20 '21 06:10 retorquere

Just a comment to think about: When a reference has supplementary material, I often end up with multiple PDF attachments for one reference ... would it be possible to handle this case with file size rather than file type? (This is not frequent enough to be a big deal, for me at least ... but I'm just throwing it out there in case it matters for others).

bnlawrence avatar Dec 29 '21 16:12 bnlawrence

That wouldn't really help for the cases I made this for. I often had merged duplicates where I acquired substantially similar, but not bit-for-bit equal, versions of the same article.

retorquere avatar Dec 29 '21 16:12 retorquere

To me, "#duplicate_attachments" suggests that the flagged items would contain the same attachment multiple times (in particular, my expectation given this wording was that the files would be identical, or at least have identical hashes under something like md5 or stronger). Would it be feasible to rename this tag to something more explanatory / less prone to misunderstanding, like "#multiple_attachments_of_same_type"?

phirsch avatar Mar 31 '22 17:03 phirsch

Yes, this tag says duplicate_attachments, but this is false; they are just attachments of the same type. duplicate_attachments would mean they are byte-for-byte identical (which many are from merging items).

endolith avatar May 22 '22 14:05 endolith
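
For comparison, the stricter reading suggested in the last two comments (files that are byte-for-byte identical) could be checked by grouping files on a content hash. The sketch below is only an illustration of that idea, not something the plugin is known to do; the function names are hypothetical, and SHA-256 stands in for "md5 or stronger".

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large attachments need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_groups(paths):
    """Group paths whose contents hash to the same value.

    Hypothetical helper: returns only groups with two or more files,
    i.e. candidates for byte-identical duplicates.
    """
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[sha256_of(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```

A check like this would catch exact copies left over from merging items, but not the "substantially similar, but not bit-for-bit equal" versions mentioned earlier in the thread.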

Feel free to submit a PR. Personally I'd consider it a duplicate if the article text is substantially the same.

retorquere avatar May 22 '22 14:05 retorquere