Andrew Baumann

27 comments by Andrew Baumann

@belak Hmm, I think you're right about the exploit being re-introduced; I hadn't noticed that when I did this work. But what about the other change (897b31f) to use $jobstates...

For the benefit of others, here's a quick and dirty script to do this by taking the DOIs from the article bundle XML and merging them into an existing JSON...

Thanks for the suggestion. This sounds like a reasonable idea for a feature, but it's also not something I'm likely to work on soon as it's not directly relevant to...

@thiswillbeyourgithub not really, sorry. pdfminer already has the ability to extract images as bitmaps (see calls to render_image in https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/converter.py), but I'm not sure about capturing an arbitrary section of...

I'm not excited about that approach, sorry -- it would add both pdf2image and PIL as dependencies (and from what I can see pdf2image itself just shells out to poppler...

I looked at py-pdf-parser; if you look [here](https://github.com/jstockwin/py-pdf-parser/blob/9326d92b400a485b513c79fc8828d7ba24acc608/py_pdf_parser/visualise/background.py#L9), it appears to rely on wand (a Python wrapper for ImageMagick) to convert PDF pages to bitmaps. The rest...

Hi, thanks for the PR. I don't have time for a thorough review now (hopefully next week), but one immediate concern: is the jsonl format valid json? At first glance,...
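
For context on the question above: a JSON Lines file as a whole is generally not one valid JSON document, although each individual line is. The following is a minimal sketch using only Python's standard `json` module; the sample records and the helper name `is_single_json_document` are made up for illustration and are not taken from the PR.

```python
import json

# Hypothetical sample data; not from the PR under discussion.
records = [{"page": 1, "text": "first"}, {"page": 2, "text": "second"}]

# JSON Lines: one JSON value per line, separated by newlines.
jsonl_text = "\n".join(json.dumps(r) for r in records)


def is_single_json_document(text):
    """Return True if the whole text parses as one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Parsing the whole text at once fails once there is more than one record,
# because a second value follows the first ("extra data").
whole_is_json = is_single_json_document(jsonl_text)

# Parsing line by line works: each line is a complete JSON value.
parsed_lines = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

So a consumer needs to read such output line by line rather than hand the whole file to a standard JSON parser.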

Thanks for enlightening me. I also dislike the current JSON format, and AFAIK it has no users to worry about back-compat, so maybe it makes more sense that:
* `--format=json`...

Also the fact that your PR already has two different CSV formats is a sign of an extensibility issue with that format :)

FYI I pushed some cleanup to the main branch as dbf9a6f9fe4a2a40767761974e277452974c94c6 that might cause a bit of churn but hopefully makes this PR cleaner/simpler, mainly:
* drop support for multiple...