zotero-ocr Add configs for DPI, Page Segmentation Mode, and Zotero non-linked attachments

Open danpf opened this issue 2 years ago • 3 comments

Thought I'd take a stab at this.

Added outputDPI and outputAsCopyAttachment as configuration options.

It seems to work, but I can't get it to work with group libraries. Do you have any idea why that might be? Briefly:

It works when I have a PDF selected in a sub-collection of my personal 'My Library', but when I use it on an item selected in a sub-collection of a group library, I get errors like the one below. The same errors happen with the zotero-ocr plugin itself, so maybe my problem is that I'm basing my logic on that plugin.

[JavaScript Error: "Parent item 1/4Q5DY97J not found" {file: "chrome://zotero/content/xpcom/data/item.js" line: 1537}]

My guess is that, for some reason, parent items are mangled in the database for group libraries, but I'm not sure how to check or confirm that. The code appears correct to me, and this line seems to be returning the right values: https://github.com/danpf/zotero-ocr/blob/9eb9a8ec9a5ada40be27d07ca6de847637c14d2b/chrome/content/zoteroocr.js#L105

I made a post in zotero dev about the issue but didn't get a response: https://groups.google.com/g/zotero-dev/c/LVmcjIMqYvA

Jun 29 '22 05:06 danpf

Not sure if you are interested in this, @stweil, but I got a response from the Zotero devs and was able to get this PR working for group libraries with 'hard' (non-linked) attachments. Their API is currently incompatible with linked attachments in group libraries. I think implementing that would only make sense in the context of network drives, so they probably won't address it.
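The constraint above (non-linked attachments work in group libraries, linked files do not) could be guarded with a small check before the OCR output is attached. This is only a minimal sketch: the helper name `chooseAttachmentMode` and the `isGroupLibrary` flag are hypothetical and not taken from the PR; only `outputAsCopyAttachment` is an option named in it.

```javascript
// Hypothetical helper, not part of the PR: decides how the OCR'd PDF
// should be attached, given the library type and the user's setting.
function chooseAttachmentMode(isGroupLibrary, outputAsCopyAttachment) {
  if (outputAsCopyAttachment) {
    // A copied (non-linked) attachment works in personal and group libraries.
    return "imported";
  }
  if (isGroupLibrary) {
    // Linked files are not usable in group libraries, so fail early
    // with a readable message instead of a "Parent item not found" error.
    throw new Error(
      "Linked files are not supported in group libraries; " +
        "enable outputAsCopyAttachment instead."
    );
  }
  return "linked";
}
```

Failing early like this is a design choice; silently falling back to a copied attachment would be another reasonable option.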

Docs: This PR adds 3 new options to ZoteroOCR:

- The ability to modify the output DPI
  - The default is set to 300
- The ability to modify the Tesseract Page Segmentation Mode (PSM)
  - There are many PSM options you may want to utilize when running Tesseract
  - See https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html, section "Page segmentation method" (copied below for convenience); the list is also available via tesseract --help-extra
- The ability to add the new PDFs as attachments rather than 'linked files'
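To illustrate how the DPI and PSM settings could reach Tesseract, here is a minimal sketch that builds a command-line argument list. It assumes the DPI value is passed to Tesseract through its `--dpi` option; the plugin may instead apply it when rendering page images. The function name `buildTesseractArgs` and the option names `pageSegmentationMode` and `language` are hypothetical; only `outputDPI` is named in this PR.

```javascript
// Hypothetical helper, not the plugin's actual code: builds the argument
// list for a call like `tesseract page.png page -l eng --dpi 300 --psm 3 pdf`.
function buildTesseractArgs(imagePath, outputBase, options = {}) {
  const {
    outputDPI = 300,          // default stated in the PR description
    pageSegmentationMode = 3, // Tesseract's default PSM
    language = "eng",         // assumed default language
  } = options;

  if (!Number.isInteger(outputDPI) || outputDPI <= 0) {
    throw new RangeError(`outputDPI must be a positive integer, got ${outputDPI}`);
  }
  // Tesseract documents PSM values 0 through 13.
  if (
    !Number.isInteger(pageSegmentationMode) ||
    pageSegmentationMode < 0 ||
    pageSegmentationMode > 13
  ) {
    throw new RangeError(`PSM must be an integer from 0 to 13, got ${pageSegmentationMode}`);
  }

  return [
    imagePath,
    outputBase,
    "-l", language,
    "--dpi", String(outputDPI),
    "--psm", String(pageSegmentationMode),
    "pdf", // config name that asks Tesseract for searchable PDF output
  ];
}
```

For example, `buildTesseractArgs("page.png", "page", { outputDPI: 600, pageSegmentationMode: 6 })` returns `["page.png", "page", "-l", "eng", "--dpi", "600", "--psm", "6", "pdf"]`. Validating the values before launching the process gives a clearer error than a failed Tesseract run.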

I have confirmed that this PR works on an M1 MacBook, and here is a new screenshot of the settings panel.

If you would be interested in merging, please confirm that it works on your device as well. I don't normally touch JS.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

Mar 11 '23 01:03 danpf
