pdfannots

Feature: CSV output

Open AbelLykens opened this issue 1 year ago • 6 comments

Would love CSV output like this:

page,type,author,created,text
1,Highlight,John,2023-05-17T11:38:17,Text

Sounds like that should be possible but not sure how. Great tool, thanks!

AbelLykens avatar May 17 '23 18:05 AbelLykens

You can certainly write a printer to do that -- take a look at the JSON output for an example: https://github.com/0xabu/pdfannots/blob/658984edebb6bb8409e9ce8bb49ac85ded8f8675/pdfannots/printer/json.py

If you don't want to do that, perhaps take json from pdfannots and convert it to csv: https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq
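As a minimal sketch of that second route in Python instead of jq: the snippet below turns a JSON array of flat objects into CSV with the columns requested at the top of this issue. It assumes the JSON output is a list of objects whose keys include `page`, `type`, `author`, `created` and `text`; those key names are an assumption here and should be checked against the real output. `json_to_csv` and `COLUMNS` are hypothetical names used only for illustration.

```python
import csv
import io
import json

# Columns requested in the original post. That the JSON objects use
# exactly these key names is an assumption, not a documented guarantee.
COLUMNS = ["page", "type", "author", "created", "text"]


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text with a header row."""
    records = json.loads(json_text)
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        extrasaction="ignore",  # silently drop keys that are not in COLUMNS
        restval="",             # leave the cell empty when a key is missing
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hand-written sample record, not real pdfannots output.
sample = (
    '[{"page": 1, "type": "Highlight", "author": "John",'
    ' "created": "2023-05-17T11:38:17", "text": "Text"}]'
)
print(json_to_csv(sample), end="")
# page,type,author,created,text
# 1,Highlight,John,2023-05-17T11:38:17,Text
```

When reading and writing real files, opening them with `encoding="utf-8"` (and `newline=""` for the CSV file) keeps non-ASCII text intact.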

0xabu avatar May 31 '23 07:05 0xabu

Thanks -- yeah took a quick look, seems possible. Might look at it indeed, thanks for the pointer.

AbelLykens avatar May 31 '23 07:05 AbelLykens

[Beginner's level question]

I would like to ask whether there is an option (or rather, how to set it) to use an encoding that supports Polish and German special characters. I want to use your tool for learning German. The problem is that the output .txt (JSON) file does not show any Polish or German special characters.

[Screenshot: notepad_uDvH1LirmZ]

Correct version: text: Lösung, contents: rozwiązanie

I tried to modify the JSON file, but I got stuck. :/

Console line:

pdfannots "path" -f json > directories\json_to_csv.txt

Some additional information:

  1. The PDF file is set in the Goethe FF Clan font. When I copy a word from the file and paste it into, e.g., Notepad++/WordPad/a browser, the special characters are copied too.
  2. Currently I can create the .csv file from the .json output, but the German and Polish characters are still missing.
  3. The same thing happens when I try to create a Markdown (.md) file.

Best regards

Proeliorr avatar Aug 18 '23 20:08 Proeliorr

@Proeliorr this has nothing to do with CSV. Why are you commenting on this issue?

In any case, pdfannots always outputs UTF-8, and indeed 00f6 is the Unicode codepoint for ö (https://codepoints.net/U+00F6) -- I think perhaps you need to tell your text editor to use the UTF-8 encoding.
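To illustrate the codepoint remark with Python's standard `json` module: the sketch below assumes the file showed the word as a JSON escape sequence such as `L\u00f6sung` (a guess based on the "00f6" mentioned above). Any JSON parser turns that escape back into ö, so the data itself is intact.

```python
import json

# JSON text as it might appear in the file: the six characters \u00f6
# stand in for the single letter ö (an assumed example, not real output).
escaped = '"L\\u00f6sung"'

decoded = json.loads(escaped)
print(decoded == "Lösung")   # True: the escape decodes to the real letter
print(hex(ord("ö")))         # 0xf6, i.e. codepoint U+00F6

# Python's json.dumps escapes non-ASCII characters by default;
# ensure_ascii=False keeps them as literal characters instead.
print(json.dumps("Lösung") == '"L\\u00f6sung"')                      # True
print(json.dumps("Lösung", ensure_ascii=False) == '"Lösung"')        # True
```

Both forms are valid JSON for the same string, so a JSON-to-CSV conversion that parses the file (rather than copying raw text) should produce the proper characters.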

0xabu avatar Aug 18 '23 20:08 0xabu

@0xabu After some consideration I agree.

The file was re-saved in UTF-8 encoding, and Notepad++ sees it as a UTF-8 encoded file. That is where the problem lies.

[Screenshot: notepad++_jlfvWWnyN5]

Nevertheless, I will not derail this topic any further. I think this is no longer a pdfannots issue. Cheers

Proeliorr avatar Aug 19 '23 19:08 Proeliorr

@Proeliorr I took another look at this, and there is something fishy going on with output redirection on Windows. I've created #84 to track it. Luckily it has a pretty simple workaround -- use -o to write output to a file, rather than redirection.
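For anyone scripting this, here is a minimal Python sketch of the workaround. It only combines the `-f json` option from the command earlier in this thread with the `-o` option mentioned above; `build_command` and `run_pdfannots` are hypothetical helper names, and the sketch assumes `pdfannots` is available on the PATH.

```python
import subprocess


def build_command(pdf_path: str, out_path: str) -> list[str]:
    """Build a pdfannots call that writes JSON via -o instead of shell redirection."""
    return ["pdfannots", pdf_path, "-f", "json", "-o", out_path]


def run_pdfannots(pdf_path: str, out_path: str) -> None:
    """Run pdfannots; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_command(pdf_path, out_path), check=True)


# Example call (file names are placeholders):
# run_pdfannots("input.pdf", "annotations.json")
```

Passing the arguments as a list avoids the shell entirely, so no `>` redirection is involved.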

0xabu avatar Aug 20 '23 15:08 0xabu