pdfannots Various new printers (json, jsonl, csv, and todo)

Thank you for the new printer interface. It enabled me to add three new output formats:

jsonl: Similar to the original json output, but with the output for one input file per line. The wrapping list, and especially the closing ] of the regular json output, is not required in this format. Hence, parsing can happen on the fly, several large files can be processed in a pipeline, and the output can be consumed line by line. Furthermore, I adjusted the original json output to make it differ more from this format, but feel free to reject or split these changes.
csv: Larger amounts of data can be handled easily in spreadsheets. This is a very simple but efficient way to work with PDF comments in Excel or LibreOffice Calc. The data structure is the same as in the json printers.
todocsv: This is a special variant of the csv format. When I work with comments in a PDF, they are usually feedback on texts that I have written myself. Depending on the length of the text and the number of comments, I struggled to keep track of them without missing any. Thus, I use this format in a spreadsheet as a starting point to enrich the feedback with my own notes for restructuring, merging, and tracking.
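
To illustrate the line-wise processing that jsonl allows, here is a minimal Python sketch of a consumer. The field names (`file`, `annotations`, `page`, `text`) and the helper `iter_jsonl` are assumptions made up for this example and are not the exact schema produced by the PR.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON document per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical output for two input files, one JSON document per line.
sample = io.StringIO(
    '{"file": "a.pdf", "annotations": [{"page": 1, "text": "typo"}]}\n'
    '{"file": "b.pdf", "annotations": []}\n'
)

# Each document is handled as soon as its line is read; the whole
# output never has to be held in memory or closed by a final "]".
counts = {doc["file"]: len(doc["annotations"]) for doc in iter_jsonl(sample)}
```

The same loop works on `sys.stdin`, which is what makes the format convenient at the end of a pipe.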

I hope you will find these printers beneficial for the codebase. Please let me know if there is anything I should adjust or extend. I have added a test case that highlights the most basic usage. Thanks a lot for your neat helper!

Sep 12 '21 12:09 bithappens

Hi, thanks for the PR. I don't have time for a thorough review now (hopefully next week), but one immediate concern: is the jsonl format valid json? At first glance, it doesn't appear to be. It looks like, for multiple inputs, each line of the output is a json document, but the entire output would be invalid.

I have no particular attachment to or use case for the current json format, so submissions to extend it are most welcome, but I think a format named json* needs to produce a valid json file :) Maybe for the use case you're after it would make more sense to run the script once per input file to produce a separate output? In fact, maybe we could restrict json to that mode.

Sep 13 '21 05:09 0xabu

Thank you for having a closer look at the changes. Just a bit of context about the jsonl format: jsonl is a variant of json designed for streaming data. You are right that a "regular json parser" cannot directly parse this format, but tools for stream processing and use in pipes (such as my favorite, jq) can process this output by default. In short, this is a legitimate format, but maybe with a confusing name. So maybe we can find a better name for it.

About the stream processing itself: this format is just an idea inspired by the original json output format (I hate this big surrounding list), so feel free to drop it. It may also be reasonable to emit one comment per line for stream processing. I have personally never used multi-file processing, but some of the single PDFs I have worked with had several hundred comments. Still, that is not a number that is in any way problematic for modern hardware.

Sep 13 '21 16:09 bithappens

Thanks for enlightening me. I also dislike the current JSON format, and AFAIK it has no users to worry about back-compat, so maybe it makes more sense that:

--format=json supports only a single input file (i.e., drop the printfilename hack) and emits a pure json document (no streaming) containing a list of annotation dicts
--format=jsonl supports multiple input files, similar to your current PR.

The implementation of these would be almost identical, of course.
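
As a rough sketch of how close the two could be, assuming annotations have already been converted to plain dicts: the function names `print_json` and `print_jsonl` and the `file`/`annotations` keys below are hypothetical and not taken from the codebase.

```python
import io
import json
from typing import Any, Dict, Iterable, List, TextIO, Tuple

Annotation = Dict[str, Any]  # stand-in for a converted annotation


def print_json(annots: Iterable[Annotation], out: TextIO) -> None:
    """Single input file: one pure JSON document, a list of annotation dicts."""
    json.dump(list(annots), out, indent=2)
    out.write("\n")


def print_jsonl(files: Iterable[Tuple[str, List[Annotation]]], out: TextIO) -> None:
    """Multiple input files: one compact JSON document per line."""
    for filename, annots in files:
        # No indent, so each document stays on a single line.
        json.dump({"file": filename, "annotations": annots}, out)
        out.write("\n")


buf = io.StringIO()
print_jsonl([("a.pdf", [{"page": 1}]), ("b.pdf", [])], buf)
lines = buf.getvalue().splitlines()
```

Only the outer wrapping differs: the json variant serializes one list, while the jsonl variant serializes one object per input file and terminates each with a newline.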

One other thing I have to think a bit about is the schema. There are already new properties on annotation objects not included here (e.g.: page label), and I'm sure we will have more (people keep asking for highlight colour). With JSON it's easy to add fields, but with CSV I'd be very nervous about doing so.
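
One way to picture the CSV concern, as a sketch with made-up column names (`page`, `type`, `text`, `color` are assumptions, not the PR's schema): with a fixed header, a new property is either silently dropped or forces a column change that every existing spreadsheet has to follow.

```python
import csv
import io

# Hypothetical fixed column list; not the schema used in this PR.
COLUMNS = ["page", "type", "text"]

rows = [
    {"page": 1, "type": "Highlight", "text": "typo"},
    # A newer annotation dict carrying an extra property.
    {"page": 2, "type": "Note", "text": "rephrase", "color": "#ffff00"},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=COLUMNS,
    extrasaction="ignore",  # drop unknown keys instead of raising ValueError
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
output = buf.getvalue()
```

Here the `color` value never reaches the file. Adding it to `COLUMNS` instead would change the header for every consumer, whereas a JSON reader would simply ignore a key it does not know.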

Sep 13 '21 16:09 0xabu

Also the fact that your PR already has two different CSV formats is a sign of an extensibility issue with that format :)

Sep 13 '21 16:09 0xabu

FYI I pushed some cleanup to the main branch as dbf9a6f9fe4a2a40767761974e277452974c94c6 that might cause a bit of churn but hopefully makes this PR cleaner/simpler, mainly:

drop support for multiple input files with JSON output
make annot_to_dict a standalone function

Maybe it makes sense to start with a PR that just adds jsonl as a new output format?

Sep 30 '21 16:09 0xabu

Closing stale PR.

Mar 30 '23 13:03 0xabu
