Ability to export annotations

Open ribtoks opened this issue 4 years ago • 42 comments

Hi

Thanks for developing Sumatra PDF reader. I was very excited to finally get PDF annotations released in version 3.3. Thank you for the hard work!

One feature that I'm missing though is the ability to export the annotations - in whatever format possible (e.g. .txt). I'm using this to make notes from the book and save them separately. Later I might make Anki flashcards from the notes or just save them in my notebook. There are examples of other software that can do that under Linux (for example, Foliate that can export annotations as HTML, markdown or plaintext) and I'm missing this from SumatraPDF.

Would be incredible to see this feature!

ribtoks avatar Aug 01 '21 09:08 ribtoks

@ribtoks Without a physical example file, this issue would need to be closed since there have been so many changes in annotation handling since 3.3

[LATER EDIT] My Mistake, wrong issue quoted, this topic is an enhancement request

GitHubRulesOK avatar Aug 01 '21 13:08 GitHubRulesOK

@GitHubRulesOK Thank you for the fast reply, but unfortunately your reply does not correlate with my question.

I have a PDF file. Any of them with text. I create a highlight: I select text, press "a" or right click and create a highlight of the selected text. Then I would like to export all pieces of text that I have highlighted to a file, say "highlights.txt" (exact way/format does not matter, the export function is what matters).

As for your answer, I never mentioned PDF files with an attachment. Would you mind double-checking my question?

Thanks again for answering so fast.

ribtoks avatar Aug 01 '21 13:08 ribtoks

OK, my bad. Annotations are used for holding exportable text files as well as other file types; exporting all annotation comments as text content is an alternate usage, which I forgot about as it was not a prior ability. Saving all comments was not a feature in basic Acrobat Reader 9, but may be found in more recent PDF editors.

GitHubRulesOK avatar Aug 01 '21 14:08 GitHubRulesOK

@GitHubRulesOK Ok, thanks for the clarification. I hope highlights exporting will be implemented somehow. Let me know if I can help.

ribtoks avatar Aug 01 '21 14:08 ribtoks

@kjk as you know all too well, there are multiple ways a user can add textual content:

  • Annotation can carry embedded text files (a separate open issue, which I mistook for this request)
  • Annotation can carry extensive screen text as "tooltips" without a visible object (a recent open issue)
  • Annotation can carry visible "free text" (related to, but not this issue)
  • Annotation, either as icon or highlight, can "pop up" comments either via tooltip or editor box
  • There are others

In this case the requirement is to export at least the latter group to a text file: a feature of collecting and tagging page comments for fresh export that would, as a minimum, require re-collating such objects into page order. The most likely request, if such an ability is built, will be to sort into page order by means of using/showing negative Y offsets.
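
The re-collation into page order could be sketched as follows. This is a minimal Python illustration, not SumatraPDF code: the `Note` record and its field names are hypothetical, and it assumes the default PDF coordinate system, where Y grows upward from the bottom of the page, so a larger Y means higher on the page.

```python
from typing import NamedTuple

class Note(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    page: int     # zero-based page index
    y_top: float  # top edge of the annotation rect, in PDF points
    text: str

def reading_order(notes):
    """Sort by page, then top-to-bottom (negated Y, since PDF Y grows upward)."""
    return sorted(notes, key=lambda n: (n.page, -n.y_top))

notes = [
    Note(1, 300.0, "middle of page 2"),
    Note(0, 100.0, "bottom of page 1"),
    Note(0, 700.0, "top of page 1"),
]
for n in reading_order(notes):
    print(n.page + 1, n.text)
```

Sorting on the negated Y value is what "negative Y offsets" amounts to in practice; rotated pages or unusual page boxes would need extra handling.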

GitHubRulesOK avatar Aug 01 '21 15:08 GitHubRulesOK

I plan on enabling JavaScript bindings for operating on PDF files, like in mutool. This could be implemented as a JavaScript program.

kjk avatar Aug 02 '21 04:08 kjk

@kjk Just a word of caution: if scripting actions are enabled then, unlike MuPDF and older Acrobat (where you need to remember to deactivate auto-running), I feel the default should be OFF, with a manual step provided to activate it on a per-use basis.

GitHubRulesOK avatar Aug 02 '21 10:08 GitHubRulesOK

Is there any plan to develop a feature that saves annotations separately?

AlexShyXie avatar Sep 10 '21 15:09 AlexShyXie

@xh542428798 Not clear from your comment if you are referring to:

  a) exporting annotations that are files, e.g. open issue #1602
  b) a new feature such as reporting a list of annotations with their contents, as described above
  c) saving annotations external to a PDF, as they were in the past and as still done in some other PDF readers (unlikely, as problematic)

GitHubRulesOK avatar Sep 10 '21 16:09 GitHubRulesOK

@ribtoks Could you propose what this format would look like? I assume the output would then be used in some way. I could copy Foliate (but it's an EPUB reader, not a PDF reader, so it might not translate 100%). What information should be included? Just the text of the annotation? The type (highlight, underline etc.)? Should it include page number / position on the page?

I could export in some simple text format, e.g.:

annotation 1
---
second annotation
---
third annotation

Or in json:

[ 
  { "text": "annotation 1", "page": 17, .... },
  { .... }]
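
Both shapes can be produced from the same in-memory list. A minimal Python sketch, assuming each annotation is a dict with hypothetical `text` and `page` keys (the real field names are not settled in this thread):

```python
import json

def to_simple_text(annotations):
    """Join annotation texts with a '---' separator line between them."""
    return "\n---\n".join(a["text"] for a in annotations)

def to_json(annotations):
    """Serialize the full records so other tools can post-process them."""
    return json.dumps(annotations, indent=2)

# Hypothetical sample data, for illustration only.
sample = [
    {"text": "annotation 1", "page": 17},
    {"text": "second annotation", "page": 18},
]
print(to_simple_text(sample))
print(to_json(sample))
```

The text form drops everything except the annotation text, while the JSON form keeps every field, so the text form could always be regenerated from the JSON.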

kjk avatar Sep 10 '21 17:09 kjk

Just to kick off, here is the most basic output from Xchange (I deliberately kept it simple, but it should convey more)

////////////////////////////////////////////////////////////////////////////////////////////////////
// Summary of comments on MyOutput _[note this is a .pdf]_
////////////////////////////////////////////////////////////////////////////////////////////////////

Page: 1
----------------------------------------------------------------------------------------------------
Page: 1
Type: Ink  Author: <None>  Subject: <None>  Date: <None>

Page: 1
Type: FreeText  Author: K  Subject: <None>  Date: 2021-06-20, 04:05:05
Hello World!

Page: 1
Type: Highlight  Author: K  Subject: <None>  Date: 2021-06-20, 22:05:04

	Type: Text  Author: K  Subject: Sticky Note  Date: 2021-09-10, 21:38:00
	what is the context here there is no copied content use SHIFT A next time

So note, as requested by the OP, it has the content for FreeText but not any content from Highlight, which was desired. In no case is there a hint of page position (X,Y,dx,dy) nor colour coding, as may be visually added for author or subject grouping. Those may be exported by means of an [X]FDF file, but that's way more complex as it's similar to the PDF page input.

GitHubRulesOK avatar Sep 10 '21 20:09 GitHubRulesOK

I am sorry I didn't describe it clearly. Most of the time I don't want to save annotations in the original PDF file, so I wonder whether there is any way to save annotations outside the PDF file, so that when I open a PDF it can load the corresponding annotation file at the same time. The annotation file format, I think, could be JSON? I know it is a big change for the software; maybe Sumatra can offer it as a choice. Thanks a lot.

AlexShyXie avatar Sep 11 '21 04:09 AlexShyXie

@kjk Thank you for working on this issue.

Goals

First of all, I'd like to restate why I need this:

  1. exporting highlights to external text editor / notebook (think Joplin, Evernote) to have the gist of the book
  2. Creating Anki cards from some of the highlights from (1)

Format ideas in plain text

  1. As for "simple text format", it can be plain text or Markdown (preferred for me). The only information I need there is the highlights and notes, chapter title, subheading title (if there's one), and maybe page number. There's no need to know the type of annotation or other technical details in this "simple" text format.

  2. Additionally, JSON would also be quite convenient. It can contain more technical information, so that whoever exports can run some sort of jq template over it and make the "simple text format" from (1) themselves. In my eyes this could be a step 2, extending the "simple text format" from (1).

Examples

I'd like to provide an example of the "simple text format" only, because for JSON you will know better which "properties" an annotation has, and you can just dump all of them.

---

#### Chapter 3: How to do XYZ - yellow - p. 123

> Here goes the actual quote: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam

---

ribtoks avatar Sep 11 '21 08:09 ribtoks

@ribtoks Your description is good, but PDF has no such construct as a paragraph or chapter, unless defined by the human eye. The highlight has a highly complex structure that can be considered an overlay above a single page of content; thus it does not know about the underlying text or "chapter", only its co-ordinates on the page. It is the user that inserts the comments (unless auto-copied at the time of highlight). Thus what can be garnered for listing is limited to:

  • PageNumber (first page is 0, i.e. all page numbers need +1)
  • RGB colour Value such as color="#004DE6" (not words like yellow, not sure where the alpha / opacity value is stored separately)
  • Page Relative Co-Ordinates (Rects)
  • Metadata such as Type of annotation and ID, Reviewer and date added (perhaps inreplyto=<coded ID of other comment>)
  • or Reviewer supplied data like subject / comments (May not be the same as page contents in the Rect)
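
The fields listed above could be held in a small record like the one below. This is only a sketch under stated assumptions: the class and field names are hypothetical, and the colour is assumed to be a six-digit hex string such as `#004DE6`, with no alpha component.

```python
from dataclasses import dataclass

@dataclass
class AnnotationInfo:
    # Hypothetical container; names are illustrative only.
    page_index: int    # zero-based, as stored
    color: str         # hex string such as "#004DE6"
    rect: tuple        # page-relative (x0, y0, x1, y1)
    kind: str          # annotation type, e.g. "Highlight"
    contents: str = "" # reviewer-supplied comment, may be empty

    @property
    def page_number(self):
        """Human-facing page number: the stored index plus one."""
        return self.page_index + 1

    def rgb(self):
        """Split the hex colour into integer (red, green, blue) components."""
        h = self.color.lstrip("#")
        return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

note = AnnotationInfo(0, "#004DE6", (327.0, 595.2, 595.4, 771.5), "Ink")
print(note.page_number, note.rgb())
```

Mapping an RGB triple to a word like "yellow" would need a separate lookup table, which is left out here.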

There are two prescribed methods for programmable extraction, but they are not of value to a human reader as they are designed to export the overlay: the norm is FDF, and the XML version is XFDF. They then need conversion into JSON / XML using complex decoding. As an example, the XML one for a few comments would run to pages and look like:

<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<f href="../../../MyData/out4.pdf"/>
<ids original="C15E80D5BDF0828DC94C16477298D2DF" modified="7FFD0396F1FFF203EB9161E04C09935B"/>
<annots>
<ink page="0" flags="print" name="2d66392a-0430-4d28-815f6c00f6404f8a" rect="327.053986,595.205017,595.440002,771.515991" color="#004DE6">
<inklist><gesture>461.785004,761.403992;461.785004,761.531006;461.785004,761.828003;461.785004,762.286987;461.657013,762.877014;461.359985,763.559021;460.901001,764.171021;460.183014,764.771973;459.075012,765.392029;457.705994,765.916992;456.183014,766.432983;454.585999,766.982971;452.970001,767.585022;451.234985,768.111023;449.480011,768.63></inklist>
</ink>
<freetext intent="FreeText" IT="FreeText" title="K" page="0" date="D:20210620030505Z" flags="print" name="a104f142-9195-4156-85844ebed3dc4daa" rect="359.740875,605.969116,559.740845,705.969116" width="0">
<contents>Hello World!</contents><defaultappearance>/Helv 30 Tf 1 0 0 rg</defaultappearance></freetext>
<highlight coords="39,674.669983,198,674.669983,39,661.669983,198,661.669983,18,662.669983,306,662.669983,18,649.669983,306,649.669983,18,650.669983,104,650.669983,18,637.669983,104,637.669983" title="K" page="0" date="D:20210620210504Z" flags="print" name="0631dd8c-bae4-46f5-b1841f9c9e5d16e3" rect="14.935769,636.857483,309.06424,675.482483" color="#FF00FF"/>
<text icon="Comment" inreplyto="2d66392a-0430-4d28-815f6c00f6404f8a" title="K" creationdate="D:20210910212148+01'00'" page="0" date="D:20210910212148+01'00'" flags="hidden,print,nozoom,norotate" name="3e3d1483-529e-4b7c-b6c64eeda95dd864" rect="100,102,120,120" color="#FFFF00"/>
</annots>
</xfdf>

So it is not the human-readable XML one might expect.
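
Even so, a short script can reduce such a file to a readable summary. The sketch below uses only Python's standard `xml.etree.ElementTree`. It assumes the XFDF namespace shown in the example above and reads only the `page` and `rect` attributes and the optional `contents` child; the function name and the trimmed sample are illustrative.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"

def summarize_xfdf(xml_text):
    """Return a list of (type, page_number, rect, contents) per annotation."""
    root = ET.fromstring(xml_text)
    annots = root.find(XFDF_NS + "annots")
    if annots is None:
        return []
    rows = []
    for el in annots:
        kind = el.tag.replace(XFDF_NS, "")     # e.g. "highlight"
        page = int(el.get("page", "0")) + 1    # stored zero-based
        rect = tuple(float(v) for v in el.get("rect", "").split(",") if v)
        contents = el.findtext(XFDF_NS + "contents", default="")
        rows.append((kind, page, rect, contents))
    return rows

# Trimmed-down sample modelled on the XFDF above.
sample = (
    '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">'
    '<annots>'
    '<freetext page="0" rect="359.740875,605.969116,559.740845,705.969116">'
    '<contents>Hello World!</contents></freetext>'
    '<highlight page="0" rect="14.935769,636.857483,309.06424,675.482483" color="#FF00FF"/>'
    '</annots></xfdf>'
)
for row in summarize_xfdf(sample):
    print(row)
```

As in the Xchange summary, the highlight comes back with empty contents, because the highlighted words are not stored in the annotation itself.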

GitHubRulesOK avatar Sep 11 '21 12:09 GitHubRulesOK

@GitHubRulesOK In such case it's best to keep the "simple text format" simple (I mean no need for coordinates, rgb value or annotation type)

Something like

---
> quote here
(p. 123)
---

Will work both for .txt and .md.
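
A minimal Python sketch of a formatter for this layout; the function name is hypothetical, and multi-line quotes are assumed to need a `>` prefix on every line so that they still render as one block quote in Markdown:

```python
def format_highlight(quote, page):
    """Render one highlight as a '---' delimited block with '>' quote lines."""
    quoted = "\n".join("> " + line for line in quote.splitlines())
    return f"---\n{quoted}\n(p. {page})\n---"

print(format_highlight("quote here", 123))
```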

ribtoks avatar Sep 12 '21 09:09 ribtoks

@ribtoks Again I agree with the sentiment: keep it stupidly simple. However, experience of others' desires suggests the XY position within a page of multiple entries may aid in back-searching, such as used by LaTeX SyncTeX or other programmable recall. So "go to highlight on page 10, half way down" is:

SumatraPDF -page 10 -zoom "fit width" -scroll 50,500 -reuse-instance MyFavorite.pdf

so colour export is of less value compared to rect upper left which is desirable
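
If an export did carry the page and the upper-left position, a script could rebuild that command. A hedged Python sketch: the function name is hypothetical, the flags are only those in the command line above, and the scroll values are passed through unchanged, since their units are not specified in this thread.

```python
def sumatra_goto_args(pdf_path, page, scroll_x, scroll_y, exe="SumatraPDF"):
    """Build the argument list for jumping to a position in a PDF."""
    return [
        exe,
        "-page", str(page),
        "-zoom", "fit width",
        "-scroll", f"{scroll_x},{scroll_y}",
        "-reuse-instance",
        pdf_path,
    ]

args = sumatra_goto_args("MyFavorite.pdf", 10, 50, 500)
print(args)
```

The list can be handed to `subprocess.run(args)` on a machine where the executable is on the PATH. Keeping the arguments as a list avoids quoting problems with the space in `fit width`.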

GitHubRulesOK avatar Sep 12 '21 12:09 GitHubRulesOK

@GitHubRulesOK My opinion is that XY coordinates might work well in a structured format like JSON. For simple text (human use) there's no point in providing XY: nobody will calculate on their own where the highlight is.

ribtoks avatar Sep 12 '21 14:09 ribtoks

I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations

To extract annotations:

  • drop a PDF file on the gray area
  • click 'extract annotations'
  • when it's done you can see JSON and text version in a text area below
  • when you switch between the versions, they are also copied to the clipboard so that it's easy to Ctrl-V into a text editor

Current limitations:

  • for highlight / underline etc. annotations there is no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself. It's possible to recover it, so I'll try to add it, but not today
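
One plausible way to recover the highlighted text is to intersect the annotation's rectangle with the boxes of the words on the page. This is an assumption, not necessarily how the tool does or will do it. The Python sketch below takes the word boxes as given by some text extractor; the data shapes and the function name are hypothetical.

```python
def words_in_rect(words, rect):
    """Return the words whose box centre lies inside rect = (x0, y0, x1, y1).

    `words` is a list of (text, (x0, y0, x1, y1)) pairs in reading order.
    """
    x0, y0, x1, y1 = rect
    picked = []
    for text, (wx0, wy0, wx1, wy1) in words:
        cx, cy = (wx0 + wx1) / 2, (wy0 + wy1) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            picked.append(text)
    return " ".join(picked)

# Hypothetical word boxes for two lines of text.
words = [
    ("Lorem", (10, 100, 40, 110)),
    ("ipsum", (45, 100, 80, 110)),
    ("dolor", (10, 80, 40, 90)),
]
print(words_in_rect(words, (5, 95, 90, 115)))
```

A highlight that spans several lines stores one quadrilateral per line, so the test would be repeated per quad and not run once on the overall bounding rect.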

Give it a try and let me know how it can be improved.

It's very easy to build different text formats (see https://github.com/sumatrapdfreader/sumatraonline/blob/master/www/exportpdfannotations.html#L121 for the current) so I'm open to implementing several different versions of text output.

JSON output has all the information that PDF exposes, so is good for processing by code or writing custom transformations to text.

kjk avatar Sep 13 '21 01:09 kjk

Amazing! Love you! Can it be built into the program, so that annotations are displayed once the PDF and the annotation JSON file are loaded?

AlexShyXie avatar Sep 13 '21 10:09 AlexShyXie

no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself

Wasn't that changed from working? By a user request, the text is not included when using A, but is added manually by the user using SHIFT+A then CTRL+V

I suggest SHIFT+A or another key pair could auto include the text

GitHubRulesOK avatar Sep 13 '21 11:09 GitHubRulesOK

I like the extra info in the JSON, but it is too complex to use simply to parse a rect; conversely, the txt output does not give any clue as to an annotation's place on a page, and annotations may often not be in -Y order

GitHubRulesOK avatar Sep 13 '21 11:09 GitHubRulesOK

no text that is highlighted. Turns out that this information is not necessarily recorded in the annotation itself

Wasn't that changed from working? By a user request, the text is not included when using A, but is added manually by the user using SHIFT+A then CTRL+V

This is the whole point of this feature: to have the text that was highlighted by default. Otherwise there is absolutely no value in having only the coordinates of annotations.

ribtoks avatar Sep 16 '21 11:09 ribtoks

I started working on this. Currently it's at https://sumatra-online.onrender.com/exportpdfannotations

@kjk would you consider open sourcing this? I would love to be able to use it locally. Thanks!

kings2u avatar Jun 04 '23 18:06 kings2u

@kings2u As an HTML function it's complex. Google reputedly liberated Sun JS from Oracle, and the core PDF handling is "Copyright 2021 Mozilla Foundation. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0", but then again parts are MIT and parts are not. FOSS is a minefield due to CopyWrongs

GitHubRulesOK avatar Jun 04 '23 19:06 GitHubRulesOK

I’ve skimmed this thread but I’m unsure my use case is covered, so here it is: I have a PDF of an RPG’s rules. Some errata have been issued by the creator. I annotate my PDF with the errata and want to share my annotations with the community, but I’m not allowed to share the PDF, obviously.

So I’d like to export my annotations, with some way for other owners of the same PDF to import them and have the errata in their own copy.

Geobert avatar Sep 25 '23 13:09 Geobert

@Geobert The export of annotations, e.g. comments, between users is called collaborative review or similar. Adobe are masters at providing corporate solutions that cover their products, but Reader is able to do that when backed up by more powerful Adobe suites.

A good editor for exporting comments is Tracker PDFXedit, and even Foxit Reader may have similar abilities, but neither may offer all of Adobe's review features.

For simple edits such as text, you can export an FDF file with just the comments; a user can then edit in the name of their master copy as text, and opening the FDF will overstamp the PDF.

Let me see if I can mock up an example.

GitHubRulesOK avatar Sep 25 '23 14:09 GitHubRulesOK

@GitHubRulesOK Thanks for your answer! I tried PDFXedit, it exports the comments fine into an FDF but when opening this FDF, it wants to open the annotated version. The import button is grayed out.

Geobert avatar Sep 25 '23 14:09 Geobert

@Geobert

So I PDF this book :-) and add annotation to my copy then export as FDF

here is the file with the name of my copy

%FDF-1.4
%âãÏÓ
1 0 obj
<<
/FDF <<
/Annots [2 0 R]
/F (2beExportImported.pdf)
/UF (2beExportImported.pdf)
>>
/Type /Catalog
>>
endobj
2 0 obj
<<
/BS <<
/Type /Border
/W 1
>>
/Contents (This is a text... written in SumatraPDF as a demonstration)
/DA (/Helv 12 Tf 0 0 0 rg)
/F 4
/M (D:20230925140856Z)
/NM (ed50503f-4305-4e5a-acbc1c4266d7fb78)
/Page 0
/Rect [393.37059 559.2384 593.3706 659.2384]
/Subtype /FreeText
/T (lez)
/Type /Annot
>>
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF

So we edit it to name what we expect a user's copy to be, let's say "import.pdf", and send it to you (in reality Acrobat Reader does not change the name, as we would be using the same filename)
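
That hand edit could be scripted. A minimal Python sketch, assuming an FDF laid out like the one above, with `/F` and `/UF` each on their own line followed by a parenthesised name; the function name is hypothetical. It works on bytes because the second line of the file is a binary marker.

```python
import re

def retarget_fdf(fdf_bytes, new_name):
    """Point the /F and /UF entries of an FDF at a different PDF file name.

    Assumes a plain ASCII name without parentheses or backslashes, which
    would need escaping inside a PDF literal string.
    """
    replacement = b"(" + new_name.encode("ascii") + b")"
    # Match only "/F (" or "/UF (" at line start, so "/FDF <<" and the
    # annotation flag entry "/F 4" are left untouched.
    pattern = re.compile(rb"^(/U?F) \([^)]*\)", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + b" " + replacement, fdf_bytes)

src = (b"%FDF-1.4\n/FDF <<\n/F (2beExportImported.pdf)\n"
       b"/UF (2beExportImported.pdf)\n>>\n/F 4\n")
print(retarget_fdf(src, "import.pdf").decode("ascii"))
```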

So I open the FDF by double click or Acrobat file open (SumatraPDF based on MuPDF does not have that ability)

Acrobat Reader says that the FDF wants to write over the name I supplied, "import.pdf"

GitHubRulesOK avatar Sep 25 '23 14:09 GitHubRulesOK

Oh I see, one needs to edit the FDF! Thanks!

EDIT: the PDF is locked for editing, so it seems it can’t be done. I thought it could, because we can annotate such a locked PDF, but the import of annotations doesn’t seem to work :-/

Geobert avatar Sep 25 '23 14:09 Geobert

@Geobert Hmm, locked is a problem in many ways with Adobe (first ensure the file is not simply in use). Such protection is worthless in other readers, but most mainstream FDF apps will often be compatible with Adobe's restrictive DRM practice! So users would need to use an unlocked copy (plenty of web sites offer the service, paid or free). MuPDF and other tools such as qpdf can remove the restrictions easily, but doing so rewrites the source file. That is not a problem for FDF, as it is an overlay on page numbers.

GitHubRulesOK avatar Sep 25 '23 15:09 GitHubRulesOK