product-backlog SPIKE - Can we detect image areas in contrast to text areas in a native PDF

"Native" PDF - different from a PDF made from a scanned document. These documents can have different elements (image, text, etc) that are recognized differently while editing the PDF, but it's not clear that this is detectable when viewing the document as a saved PDF.

"Detect" in this case has to do with how we might act with respect to image annotation. If we detect that we're hovering over an image, we might allow a user to create an image annotation by just clicking and dragging without any other initial clicks, much in the same way we allow text selection to create an annotation without the user triggering a text annotation mode. We would not allow this sort of selection when you were not over an image embedded in the native PDF.

Mar 14 '25 20:03 mkdir-washington-edu

I added a note in https://github.com/hypothesis/product-backlog/issues/1648#issuecomment-2728682957 about this. If I understand this issue correctly, it doesn't matter whether the PDF is native or scanned with an OCR-ed text layer added later. What matters is whether there is a PDF text layer (as opposed to image pixels which happen to contain text).

Mar 17 '25 09:03 robertknight

@robertknight I suppose I thought an "image area" in a native PDF could be distinct from a "non-text area", in that when building the PDF you might have places where you add images, and also places that are blank (or contain other things, like text, etc), and the purpose of the Spike is to know if we can detect those image areas.

Adobe has tools for this, such as one that exports all the images from a PDF without just turning the whole PDF into an image. That doesn't necessarily mean we can detect images ourselves, though.

Mar 17 '25 12:03 mkdir-washington-edu

> @robertknight I suppose I thought an "image area" in a native PDF could be distinct from a "non-text area", in that when building the PDF you might have places where you add images, and also places that are blank (or contain other things, like text, etc), and the purpose of the Spike is to know if we can detect those image areas.

Detecting selectable text in a PDF is easy, as PDF.js handles that for us when it builds the text layer. Detecting what a user would consider an "image", and the boundaries of such an image, is more difficult. At the PDF level, "images" can take the form of bitmaps which are placed on a page and transformed, collections of vector graphics, or perhaps a mix of vector graphics and text (e.g. for charts). We could perhaps automatically detect certain kinds of image (e.g. via analysis of PDF objects or rendered image pixels), but I think any such detection would be imperfect and would need a way to override it.
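To make the "analysis of PDF objects" option a bit more concrete, here is a rough sketch of how the bitmap case might be approached. It assumes the operator list that PDF.js exposes via `page.getOperatorList()` (parallel `fnArray` / `argsArray` arrays) and the `save`, `restore`, `transform` and `paintImageXObject` codes from its `OPS` table. The names `findImageBoxes`, `isOverImage`, `ImageOps` and `Box` are made up for illustration. The sketch only handles plain bitmap images; it ignores form XObjects, inline images, image masks, clipping, and anything drawn as vector graphics, so it is nowhere near a complete detector.

```typescript
// Shapes mirroring the result of PDF.js `page.getOperatorList()`.
interface OperatorList {
  fnArray: number[];
  argsArray: unknown[];
}

// The subset of PDF.js `OPS` codes this sketch looks at (hypothetical type).
interface ImageOps {
  save: number;
  restore: number;
  transform: number;
  paintImageXObject: number;
}

// Affine matrix [a, b, c, d, e, f], as used by PDF and canvas.
type Matrix = [number, number, number, number, number, number];

// Axis-aligned box in PDF page units (origin at the bottom-left).
interface Box {
  left: number;
  bottom: number;
  right: number;
  top: number;
}

// Compose matrices so that `m` is applied to a point first, then `ctm`.
function compose(ctm: Matrix, m: Matrix): Matrix {
  return [
    ctm[0] * m[0] + ctm[2] * m[1],
    ctm[1] * m[0] + ctm[3] * m[1],
    ctm[0] * m[2] + ctm[2] * m[3],
    ctm[1] * m[2] + ctm[3] * m[3],
    ctm[0] * m[4] + ctm[2] * m[5] + ctm[4],
    ctm[1] * m[4] + ctm[3] * m[5] + ctm[5],
  ];
}

// Walk the operator list, tracking the current transformation matrix (CTM),
// and record a bounding box each time a bitmap image is painted.
function findImageBoxes(opList: OperatorList, ops: ImageOps): Box[] {
  let ctm: Matrix = [1, 0, 0, 1, 0, 0];
  const stack: Matrix[] = [];
  const boxes: Box[] = [];

  opList.fnArray.forEach((fn, i) => {
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() ?? ctm; // tolerate an unbalanced restore
    } else if (fn === ops.transform) {
      ctm = compose(ctm, opList.argsArray[i] as Matrix);
    } else if (fn === ops.paintImageXObject) {
      // An image fills the unit square; map its four corners through the CTM.
      const [a, b, c, d, e, f] = ctm;
      const xs = [e, a + e, c + e, a + c + e];
      const ys = [f, b + f, d + f, b + d + f];
      boxes.push({
        left: Math.min(...xs),
        bottom: Math.min(...ys),
        right: Math.max(...xs),
        top: Math.max(...ys),
      });
    }
  });
  return boxes;
}

// Hit test: is the PDF-space point (x, y) inside any detected image box?
function isOverImage(x: number, y: number, boxes: Box[]): boolean {
  return boxes.some(
    (b) => x >= b.left && x <= b.right && y >= b.bottom && y <= b.top,
  );
}
```

With a real page this would presumably be called as `findImageBoxes(await page.getOperatorList(), pdfjsLib.OPS)`, with the pointer position converted to PDF coordinates first (e.g. via the viewport's `convertToPdfPoint`) before calling `isOverImage`. Charts and other vector "images" would not be found this way, which is part of why an override would still be needed.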

Mar 17 '25 12:03 robertknight

Closing as not needed.

Jun 12 '25 14:06 mkdir-washington-edu