
Feature Request: Supervised Extraction from Vector .pdf Images

Open billdenney opened this issue 4 years ago • 4 comments

I often work with vector .pdf images. They contain essentially perfect representations of the data, but can be difficult to work with.

Given the integration with pdfjs, it would be interesting as a research project to try to make it easier to get that exact information out. The things that I'm thinking of are:

  • General:
    • Allow a "snap to nearest pdf object" as a part of the automated data extraction panel.
    • Within that "snap", ideally, there would be 3 ways to do the snap:
      • Centroid of object
      • Left end of object (usually to be used with a line)
      • Right end of object (usually to be used with a line)
    • And, even better would be if you could:
      • Visually remove the object after the point is defined (that way, you could extract all objects even when they are overlapping)
        • This would be tricky because you may need to undo the removal if the selection wasn't perfect.
      • Track the linkage between the .pdf object and the data point (I have no idea how that would happen, and it may be fragile to the version of pdfjs used and therefore less useful; or it may necessitate a request to pdfjs to have some form of persistent identifier for the pdf objects)
  • Axis calibration:
    • Tick marks are often drawn at exact x- and y-locations on the axes, so snapping to them could give exact calibration points
  • Data points:
    • Data points could often be represented as the centroid or median location of a .pdf element (or set of elements, for example an "x" for a point could be two lines).
    • Data points could also be represented as the left or right end of a line.
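
To make the three snap modes concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `VectorObject`, `anchor`, and `snap` are hypothetical names, not part of WebPlotDigitizer or pdfjs, and each object is assumed to already be available as a list of vertices in page coordinates. The "centroid" is simply the mean of the vertices, not an area-weighted centroid.

```python
from dataclasses import dataclass
from math import hypot

Point = tuple[float, float]


@dataclass
class VectorObject:
    """Hypothetical stand-in for one drawn object: its vertices in page coordinates."""
    points: list[Point]


def anchor(obj: VectorObject, mode: str) -> Point:
    """Return the point on the object that a click should snap to."""
    if mode == "centroid":
        # Mean of the vertices (an approximation of the true centroid).
        n = len(obj.points)
        return (sum(x for x, _ in obj.points) / n,
                sum(y for _, y in obj.points) / n)
    if mode == "left":
        # Vertex with the smallest x, e.g. the left end of a line.
        return min(obj.points, key=lambda p: p[0])
    if mode == "right":
        # Vertex with the largest x, e.g. the right end of a line.
        return max(obj.points, key=lambda p: p[0])
    raise ValueError(f"unknown snap mode: {mode}")


def snap(click: Point, objects: list[VectorObject], mode: str = "centroid"):
    """Find the object whose anchor is closest to the click.

    Returns (object, anchor point) so the caller can keep the link between
    the data point and its source object, or hide the object afterwards.
    """
    def distance(obj: VectorObject) -> float:
        ax, ay = anchor(obj, mode)
        return hypot(ax - click[0], ay - click[1])

    best = min(objects, key=distance)
    return best, anchor(best, mode)
```

Returning the matched object alongside the snapped point is one way to support the "remove after selection" and "track the linkage" ideas: the caller can take the object out of the candidate list and put it back on undo.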

Overall, this seems likely to be tricky to implement, but it could be very useful.

billdenney avatar Jul 23 '19 13:07 billdenney

An intermediate version of this could be support for .svg files as vector images (they currently appear to be supported, but they are converted to bitmap images on the canvas). While support for .svg may seem unusual (in principle, you already have the data), I can create a .svg file from a .pdf with Inkscape (https://inkscape.org/), but there is still a lot of work to go from that .svg file to data.
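
As a rough illustration of that remaining work, the Python sketch below reads the endpoints of plain `<line>` elements from SVG text and converts a pixel coordinate to a data value using two tick positions. It is a sketch under strong assumptions: the function names `line_endpoints` and `linear_axis` are hypothetical, the axes are assumed linear, and transforms and `<path>` elements (which a converted .pdf is likely to contain) are ignored entirely.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def line_endpoints(svg_text: str):
    """Collect ((x1, y1), (x2, y2)) for every <line> element in the SVG.

    Assumption: coordinates are used as written; any transform attributes
    on the element or its parents are not applied.
    """
    root = ET.fromstring(svg_text)
    segments = []
    for el in root.iter(SVG_NS + "line"):
        x1, y1, x2, y2 = (float(el.get(k, "0")) for k in ("x1", "y1", "x2", "y2"))
        segments.append(((x1, y1), (x2, y2)))
    return segments


def linear_axis(pix_a: float, val_a: float, pix_b: float, val_b: float):
    """Build a pixel-to-data mapping from two tick marks on a linear axis."""
    scale = (val_b - val_a) / (pix_b - pix_a)
    return lambda pix: val_a + (pix - pix_a) * scale
```

Because SVG y-coordinates increase downward, the y-axis mapping typically ends up with a negative scale, which `linear_axis` handles without special treatment as long as the two tick positions are given in pixel coordinates.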

(Or maybe, vector .svg file support is a separate GitHub issue...)

billdenney avatar Jul 23 '19 14:07 billdenney

I have done some experiments on this in the past and might still have some Python code lying around that works with vector graphics in PDFs or SVGs. I'll have to dig that out again to comment further on this.

ankitrohatgi avatar Jul 24 '19 05:07 ankitrohatgi

It looks like pdfjs has experimental support for svg (https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#backends). Perhaps a solution would be to optionally allow visualization using either the canvas or the svg back end, and if svg is used, enable all the above features? (The risk is that the rendering is imperfect since it's an experimental backend.)

billdenney avatar Jul 24 '19 12:07 billdenney

Adding support for vector graphics operations is going to be fairly time-consuming and, unfortunately, not something I want to take up anytime soon.

ankitrohatgi avatar Jul 29 '19 03:07 ankitrohatgi