
PDF to XML converter

Open karthik opened this issue 12 years ago • 2 comments

One great way to enhance scholarly writing would be to convert the output to semantic markup. This tool, http://pdfx.cs.man.ac.uk/, might be super handy for us because we could first export to PDF, then programmatically convert to XML. I'll leave it here as a placeholder.

karthik, Dec 19 '12 21:12

I've written a working Python client for this web service (https://gist.github.com/4351598). It takes some time to get a response from the website, about 30-60 seconds, so I'm not sure how to integrate it into markx.
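One possible way to live with the 30-60 second delay is to run the request in a background thread and pick up the result in a callback. The following is only a rough sketch under that assumption: `convert_pdf_to_xml` is a hypothetical stand-in that sleeps instead of calling the web service (it is not the client from the gist), and `start_conversion` and `on_done` are names made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def convert_pdf_to_xml(pdf_bytes, delay=0.1):
    """Hypothetical stand-in for the slow web-service call (not the real client)."""
    time.sleep(delay)  # simulates waiting for the remote service to answer
    return "<article><!-- %d bytes converted --></article>" % len(pdf_bytes)


# A single worker thread is enough for one pending conversion at a time.
executor = ThreadPoolExecutor(max_workers=1)


def start_conversion(pdf_bytes, on_done, delay=0.1):
    """Submit the conversion and return immediately.

    on_done is called with the XML string from the worker thread
    once the conversion has finished.
    """
    future = executor.submit(convert_pdf_to_xml, pdf_bytes, delay)
    future.add_done_callback(lambda f: on_done(f.result()))
    return future
```

The caller gets a `Future` back right away, so the application could keep responding while the conversion is pending. Error handling (timeouts, a failed request) is left out of this sketch.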

yoavram, Dec 21 '12 09:12

The right way would be to convert it to (X)HTML (or DocBook/OpenDocument XML) via Pandoc and then apply a stylesheet to get the desired XML. Converting from PDF will definitely lose information, especially on two-column layouts, even if the application from http://www.scfbm.org/content/7/1/7 is used.
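To make the Pandoc route concrete, here is a minimal sketch of the two steps. It assumes a Markdown source and a well-formed XHTML fragment as Pandoc output. Instead of a real XSLT stylesheet it uses a plain tag-renaming pass with Python's `xml.etree.ElementTree`, which is a simplification; the target element names in `TAG_MAP` and the `article` root are hypothetical, not a real schema.

```python
import xml.etree.ElementTree as ET


def pandoc_command(source_path, html_path):
    """Build a Pandoc command line that converts Markdown to an HTML fragment."""
    return ["pandoc", "--from", "markdown", "--to", "html",
            "--output", html_path, source_path]


# Hypothetical mapping from XHTML tags to target XML element names.
TAG_MAP = {"h1": "title", "p": "para", "em": "emphasis"}


def _convert(node, parent):
    """Copy node under parent, renaming its tag; attributes are dropped."""
    new = ET.SubElement(parent, TAG_MAP.get(node.tag, node.tag))
    new.text, new.tail = node.text, node.tail
    for child in node:
        _convert(child, new)


def xhtml_to_xml(fragment, root_name="article"):
    """Rename the tags of a well-formed XHTML fragment into the target vocabulary."""
    # Wrap the fragment so that several top-level elements parse as one tree.
    body = ET.fromstring("<body>" + fragment + "</body>")
    root = ET.Element(root_name)
    root.text = body.text
    for child in body:
        _convert(child, root)
    return ET.tostring(root, encoding="unicode")
```

With Pandoc installed, the first step could be run as `subprocess.run(pandoc_command("paper.md", "paper.html"), check=True)`, and the contents of the resulting file passed to `xhtml_to_xml`. A real stylesheet would also need to handle attributes, namespaces, and HTML entities, which this sketch ignores.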

tolot27, Feb 28 '13 00:02