Zotero Library Integration

Open dainius-sileika opened this issue 5 years ago • 34 comments

Many students and researchers do most of their academic reading on the computer.

Zotero is an open-source citation manager and library, but it isn't the best for reading; Polar Bookshelf is an open source library manager that is great for reading but isn't built for academic citations.

There is presently no elegant way to satisfy the needs of students and researchers.

Please allow for Polar Bookshelf to read directly from the Zotero pdf library; extracting highlights with correct page numbers would be a big bonus, as presently, there is no way to extract PDF highlights and annotations from PDFs in Zotero.

dainius-sileika avatar Nov 25 '18 11:11 dainius-sileika

Does the Zotero system already do highlights? What if we just imported the whole thing into Polar?

I assume you need 24/7 constant sync with Zotero as you manage your citations?

Maybe this should be one of the first plugins...

burtonator avatar Nov 25 '18 16:11 burtonator

Zotero does not, but there's a very popular zotfile plugin that reads highlights using an old PDF.js version and then extracts them into a text file beside the original PDF, so that one's workflow could be read, highlight, extract citations, and then write your paper or make your Anki flashcards; I don't know anyone that uses Zotero that doesn't use the zotfile plugin.

Where I see Polar Bookshelf coming in is providing a reading and citing platform for one's Zotero library; if Polar Bookshelf could read the Zotero library without moving anything, and furthermore allow for highlighting and extraction of citations, it would make it a student's dream for both paper writing and anki flashcard creation.

The problem with the Zotfile plugin is that the old PDF.js version that it uses doesn't read PDF page labels, which means that once you extract your citations, the page numbers in the text file are off by however many pages the PDF is off, and so you have to always add or subtract before you can cite.
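To illustrate the off-by-N problem, here is a minimal sketch of how a physical page index can be mapped to its printed label. It assumes the label ranges have already been read from the PDF's `/PageLabels` entry by some PDF library; `LabelRange` and `page_label` are hypothetical names, and only the decimal and lowercase roman styles are handled.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRange:
    """One page-label range (hypothetical, simplified model of /PageLabels)."""
    start_index: int       # zero-based physical page where the range begins
    style: str             # "D" = decimal, "r" = lowercase roman, "" = prefix only
    first_number: int = 1  # numeric value of the first label in the range
    prefix: str = ""       # text placed before the number


def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(page_index: int, ranges: list[LabelRange]) -> str:
    """Return the printed label for a zero-based page index.

    `ranges` must be sorted by start_index. Pages before the first
    range fall back to plain 1-based numbering.
    """
    starts = [r.start_index for r in ranges]
    pos = bisect_right(starts, page_index) - 1
    if pos < 0:
        return str(page_index + 1)
    r = ranges[pos]
    number = r.first_number + (page_index - r.start_index)
    if r.style == "D":
        body = str(number)
    elif r.style == "r":
        body = to_roman(number)
    else:
        body = ""
    return r.prefix + body
```

With four roman-numbered front-matter pages followed by the main text, the eleventh physical page (index 10) is labelled `7`, which is the number a citation needs.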

To sum up, the specific features that would be great are:

  1. Interface with the Zotero library;
  2. Extract or otherwise keep track of PDF highlights of a given document, with correct page numbers as per the PDF page labels, so that students can cite or make flashcards.

The Zotero community is huge, and it would be beneficial for the Polar Bookshelf project if it became the go-to reading software for Zotero users.

I haven't tried your Anki implementation yet, but the very idea is exciting; the nuances of that would be a later request, (2) or (3).

Cheers!

-D

dainius-sileika avatar Nov 25 '18 19:11 dainius-sileika

Interesting. Thanks for the feedback. I appreciate it.

The biggest problem with PDF annotations is that the annotation formats are horrible which is why I did something completely custom for Polar.

I don't think I would be able to keep the annotations from 3rd party PDF readers in sync with Polar. The idea is to just make Polar the standard and then go with that. It's a longer-term goal, but the current PDF annotation standards are unworkable.

What if I just built a way to automatically use Polar from within Zotero via Zotfile... I could just write a plugin that would automatically import the PDF or open it if it's already imported.

Zotfile is OSS so I could just submit a PR for that functionality.

So basically if I have the full citation information attached to a PDF with every comment/annotation then we're golden. And I could just interface with that via zotfile, correct?

I don't regularly contribute to academic papers which require citations so I'm trying to understand the workflow here.

burtonator avatar Nov 25 '18 19:11 burtonator

... and actually could you jump on the discord? I'd like to skype voice about this if you're interested... I want to understand the use case 100% and a 20 minute skype call really helps - assuming you have the time. Don't want to burden you. Already appreciate your feedback here.

burtonator avatar Nov 25 '18 19:11 burtonator

Discord link: https://discord.gg/GT8MhA6

burtonator avatar Nov 25 '18 19:11 burtonator

So basically if I have the full citation information attached to a PDF with every comment/annotation then we're golden. And I could just interface with that via zotfile, correct?

This would be very interesting!

mesalas avatar Nov 26 '18 13:11 mesalas

I do not know what has become of this but I would very much like this feature as well.

I've been a fan of Polar, and the only thing holding me back from fully using it is this very issue.

Right now, my personal knowledge base with Zotero has been using the zotfile mentioned by @dainius-sileika to extract annotations from PDFs to Zotero, Foxit Reader for PDF reading and analysis, and Chromium based web browsers to snag webpages of interest.

Happy to discuss features a bit more @burtonator .

TheCedarPrince avatar Jan 11 '19 00:01 TheCedarPrince

I think the major challenge here is which is the single source of truth?

Does Zotero do anything magic when finding the title of the PDFs? The pdf.js library has support for titles and other metadata extraction but it doesn't seem to find many documents with titles? Does it maybe use some API service to look them up online?

Are you planning on still using Zotero?

Lots of questions but you can catch me on the discord and can chat about it via voice if you're game.

burtonator avatar Jan 11 '19 02:01 burtonator

https://www.zotero.org/blog/zotero-5-0-36/

... good review of what Zotero is doing...

burtonator avatar Jan 11 '19 02:01 burtonator

also, it would be helpful if you could explain what metadata fields you need in Polar and maybe a copy of some of the data that you find valuable in Zotero.

Still not sure how we're going to handle the single source of truth problem. If you delete a doc in Polar what happens with it in Zotero?

burtonator avatar Jan 11 '19 02:01 burtonator

https://forums.zotero.org/discussion/26276/retrieving-metadata

burtonator avatar Jan 11 '19 03:01 burtonator

Zotero tries to identify PDFs by looking for a DOI in the first couple pages. If a DOI is not found, it picks a long string of consecutive words and searches Google Scholar. If Google Scholar returns any results, the first result is imported. (You're likely getting stuck here, because Google Scholar thinks that you're a script, which in this case you are, and blocks you) Otherwise, PDF metadata retrieval fails. In an upcoming Zotero release, metadata retrieval will also look for ISBN numbers as well (for books). I'm not sure how else we can go about figuring out what the PDF is.
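The first step described above, looking for a DOI in the first couple of pages, can be sketched roughly as follows. This is not Zotero's code: it assumes the page text has already been extracted by some PDF library, `find_doi` is a hypothetical helper, and the regular expression is a common heuristic rather than a complete DOI grammar. The Google Scholar fallback is not shown.

```python
import re
from typing import Optional, Sequence

# Heuristic DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(page_texts: Sequence[str], max_pages: int = 2) -> Optional[str]:
    """Return the first DOI-like string found in the first `max_pages` pages."""
    for text in page_texts[:max_pages]:
        match = DOI_PATTERN.search(text)
        if match:
            # Trailing punctuation usually belongs to the sentence, not the DOI.
            return match.group(0).rstrip(".,;")
    return None
```

Only the first `max_pages` pages are searched, so a DOI that first appears later in the document is ignored on purpose.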

burtonator avatar Jan 11 '19 03:01 burtonator

Howdy,

So I've been thinking a bit more on this topic. Here are my thoughts:

  1. Both you and Zotero are competing for subscribers. In this sense, it's a zero-sum game, and that's too bad. But, free market, right? One way to integrate with Zotero would be to read the Zotero library, import PDFs and documents into Polar, but leave symbolic links in the gutted Zotero library so that if one wanted to open a file through Zotero, one still could.

  2. What Zotero users are really looking for is probably more of a reader replacement. Once Zotero users start to use Polar exclusively, maybe they'll buy into your cloud, too. Again, here are some points that we discussed, but in one place:

a. Ability to export highlighted text, with correct page numbers, like Skim or other similar readers;
b. Flashcards - amazing, although I'm not using that feature in Polar yet;
c. Incremental reading - this is where Polar could really shine. Imagine:

  1. You read a PDF, which is in focus in the main window, and make highlights;
  2. Those highlights appear in the side window, out of focus;
  3. You click on the side window, with the highlight extracts, to bring it into focus;
  4. The original PDF disappears, and the highlight window becomes the main window;
  5. A new out-of-focus window appears on the right for secondary highlights;
  6. You do a second highlighting pass on the first-pass highlights;
  7. Secondary highlights appear in the out-of-focus right-hand window;
  8. The process continues recursively, allowing for third-level, fourth-level, etc. highlighting passes;
  9. Eventually, the text is pared down to the essentials and is ready for flashcard creation.

Does that make sense? Essentially, I'm arguing for the ability to go higher and higher resolution, with several "layers" of highlights as you read a text.
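One way to picture these layers is a highlight that may point at a parent highlight. The sketch below assumes nothing about Polar's actual data model; `Highlight`, `level_of`, and `highlights_at_level` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Highlight:
    id: str
    text: str
    parent_id: Optional[str] = None  # None means a first-pass highlight on the PDF


def level_of(highlight: Highlight, by_id: dict[str, Highlight]) -> int:
    """Count how many passes deep a highlight is (1 = made on the original text)."""
    level = 1
    while highlight.parent_id is not None:
        highlight = by_id[highlight.parent_id]
        level += 1
    return level


def highlights_at_level(highlights: list[Highlight], level: int) -> list[Highlight]:
    """Return the highlights that belong to a given pass, in their original order."""
    by_id = {h.id: h for h in highlights}
    return [h for h in highlights if level_of(h, by_id) == level]
```

Because each pass only references the previous one, the first-pass highlights stay untouched while later passes narrow them down.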

We can chat on discord sometime if this sounds interesting.

-D

dainius-sileika avatar Jan 11 '19 10:01 dainius-sileika

Addendum:

To clarify, multi-pass incremental reading is what I mean, since I get that Polar's point is to do incremental reading already.

dainius-sileika avatar Jan 11 '19 10:01 dainius-sileika

@dainius-sileika

Following up with your points below:

#1.

One way to integrate with Zotero would be to read the Zotero library, import PDFs and documents into Polar, but leave symbolic links in the gutted Zotero library

It works fine with hard links on Windows so I might use those... it's basically a copy of the same file with no extra space used.
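A minimal sketch of the hard-link idea, using only the Python standard library. The function name and directory layout are hypothetical, and hard links only work when both paths are on the same volume.

```python
import os
from pathlib import Path


def link_into_library(source: Path, library_dir: Path) -> Path:
    """Hard-link `source` into `library_dir` and return the new path.

    Both names point at the same data on disk, so no extra space is used.
    """
    library_dir.mkdir(parents=True, exist_ok=True)
    target = library_dir / source.name
    if target.exists():
        if os.path.samefile(source, target):
            return target  # already linked, nothing to do
        raise FileExistsError(f"{target} exists and is a different file")
    os.link(source, target)
    return target
```

Calling it twice for the same file is harmless, which matters if the import is run periodically.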

The main issue I have is deletes. You would delete it in Polar, but then Polar could see the file again at some future time and re-import it. I guess I could implement a 'tombstone' idea so that once it's deleted from Polar it would never reappear, but then I'd need to implement a way to manage/purge the tombstones.
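The tombstone idea could look something like this sketch. It is an illustration only: `TombstoneImporter` is a hypothetical class, documents are identified by an opaque fingerprint string, and the purge rule (forget tombstones for files no longer present in the source library) is one assumed policy among several possible ones.

```python
class TombstoneImporter:
    """Tracks imported documents and remembers deletions (hypothetical sketch)."""

    def __init__(self) -> None:
        self.library: dict[str, str] = {}  # fingerprint -> file path
        self.tombstones: set[str] = set()  # fingerprints deleted by the user

    def scan(self, found: dict[str, str]) -> list[str]:
        """Import files seen in the source library, skipping deleted or known ones."""
        imported = []
        for fingerprint, path in found.items():
            if fingerprint in self.tombstones or fingerprint in self.library:
                continue
            self.library[fingerprint] = path
            imported.append(fingerprint)
        return imported

    def delete(self, fingerprint: str) -> None:
        """Remove a document and leave a tombstone so it is not re-imported."""
        if self.library.pop(fingerprint, None) is not None:
            self.tombstones.add(fingerprint)

    def purge_tombstones(self, still_present: set[str]) -> None:
        """Keep only tombstones whose files still exist in the source library."""
        self.tombstones &= still_present
```

Under this policy a purged file that later reappears in the source library would be imported again, which is the trade-off for keeping the tombstone set small.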

Zotero supports 3rd party readers, right? Maybe that's just a way to integrate with it... Just double-click it in Zotero and it's opened via Polar and Polar auto-imports it...

#2.. are you talking about highlights on the highlights? Sort of like annotations on annotations? Yo dawg! The annotation manager is sort of designed for that and you can (technically) have comments on highlights.

burtonator avatar Jan 11 '19 21:01 burtonator

I think for Zotero I might support some sort of one way import for now.. The user can just run it periodically and copy all the metadata into Polar.

If anyone could give me a copy of your Zotero data or an export that would be helpful. It might be better to read the live data though.

Maybe auto-detect Zotero on startup.

burtonator avatar Jan 11 '19 21:01 burtonator

re: Incremental reading,

The way I use it in real life is, I highlight what looks important, but I don't always know for certain since I don't have the big picture yet--so essentially I'm trimming the fat on the first pass of highlights.

Second pass highlighting--highlighting on highlights--is where I'm more aggressive, because now I have a better idea of what's actually important.

IRL, I use Skim right now to do first-pass highlighting, extract to text, print to PDF, then use Skim to highlight and do that second pass, at which point I'm almost at flashcard level.

Sometimes, if the text is super dense, I'll have to print to PDF a second time and do level 3 highlights.

Does that make sense?

dainius-sileika avatar Jan 11 '19 22:01 dainius-sileika

What if I supported ranking the highlights... and prioritized view of the highlights?

This way you could view highlights by position on the paper, ranking, and maybe also the time the highlight was created (though not sure the value of that one).

... and we could support export based on the current view. This way you could export sorted by ranking desc.
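A rough sketch of view-based export, assuming a highlight carries a page, an offset on that page, a rank, and a creation time. None of these field names come from Polar; `Highlight`, `VIEWS`, and `export_view` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Highlight:
    text: str
    page: int        # page the highlight appears on
    offset: int      # position within the page
    rank: int        # higher means more important
    created: float   # creation time as a Unix timestamp


# Each view is a sort key plus a flag saying whether to sort descending.
VIEWS = {
    "position": (lambda h: (h.page, h.offset), False),
    "rank": (lambda h: h.rank, True),
    "created": (lambda h: h.created, False),
}


def export_view(highlights: list[Highlight], view: str) -> str:
    """Export highlights as text lines, ordered according to the chosen view."""
    key, descending = VIEWS[view]
    ordered = sorted(highlights, key=key, reverse=descending)
    return "\n".join(f"p. {h.page} [rank {h.rank}] {h.text}" for h in ordered)
```

Exporting with the `"rank"` view gives the "sorted by ranking desc" output, while `"position"` reproduces reading order.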

burtonator avatar Jan 12 '19 18:01 burtonator

... regarding zotero... do you need just the title or do you need the abstract too?

I think I can get a title somewhat easily.

burtonator avatar Jan 12 '19 18:01 burtonator

Ranked highlights would do the trick - however, I guess what would be key then is being able to "over highlight" other highlights without the original ones disappearing?

Re: Zotero, honestly, the most important thing is to just get page numbers in the extracted highlights and maybe document title, I don't see it being that useful to go any deeper, as most Zotero users would be probably using Polar as a reader/highlighter.

dainius-sileika avatar Jan 12 '19 19:01 dainius-sileika

I too quite like the "second pass highlighting" method. On a second pass I'll review only my annotations and then change some of those highlights to a different colour to indicate that these are the "truly essential" highlights.

I'm sure with the flexible JSON format used there could be a way to tag these highlights with a custom label e.g. "first-pass" or "priority 1" or something.

ajvsol avatar Jan 25 '19 14:01 ajvsol

Do you guys know if Zotero mutates the PDFs and adds the annotations directly to them or does the hashcode stay the same?

If the hashcodes never change that will make this easier.
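For reference, a content hash of a file can be computed like this. It is a sketch that assumes SHA-256 as the hash, which is not necessarily what Polar uses, and `file_fingerprint` is a hypothetical name.

```python
import hashlib


def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If a tool writes annotations into the PDF itself, the digest changes; if the file is left untouched, the digest is a stable key for matching documents across both libraries.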

burtonator avatar Jan 26 '19 14:01 burtonator

Would anyone with a large PDF repo of research papers give me a tar.gz of all their PDFs? That would be helpful.

I'm working on a parser to find all the metadata, and I'm going to use the PDFs to extract the metadata and find the most frequently used metadata fields.

Given 1-5GB, this will give me a good sample of what's ACTUALLY used.

I will keep the repo locally but won't share it.. I'm going to build this up over time so I can see what fields I can actually use in production.
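Counting which fields are actually populated could be sketched like this, assuming each PDF's metadata has already been parsed into a plain dictionary by some other tool. `field_coverage` is a hypothetical helper.

```python
from collections import Counter


def field_coverage(documents: list[dict[str, str]]) -> dict[str, float]:
    """Return the fraction of documents with a non-empty value for each field.

    Fields are ordered from most to least frequently populated.
    """
    total = len(documents)
    if total == 0:
        return {}
    counts: Counter[str] = Counter()
    for metadata in documents:
        # Empty strings count as missing.
        counts.update(field for field, value in metadata.items() if value)
    return {field: n / total for field, n in counts.most_common()}
```

The result gives a quick picture of which fields are common enough to rely on in production.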

burtonator avatar Jan 26 '19 15:01 burtonator

There's quite a few websites with free access to research paper PDFs, here's one search engine.

ajvsol avatar Jan 26 '19 15:01 ajvsol

Zotero itself doesn't change the PDFs at all, it's just a library without a reader. We want to use your reader with Zotero :)

dainius-sileika avatar Jan 26 '19 17:01 dainius-sileika

@burtonator I could share my dataset with you (but no earlier than Monday). As far as I know, Zotero doesn't modify the PDF metadata. Zotero doesn't only use PDF metadata; it also relies on a web service to retrieve the author and title when the file doesn't contain them: https://www.zotero.org/blog/zotero-5-0-36/.

danieltomasz avatar Jan 26 '19 18:01 danieltomasz

Thanks. Monday is fine.

I might add support for Unpaywall and other services which support metadata retrieval via DOI, so I want to see how many documents have a DOI in the metadata.

burtonator avatar Jan 26 '19 22:01 burtonator

link to my dropbox library [deleted] I will delete it when you download it

danieltomasz avatar Jan 27 '19 14:01 danieltomasz

@danieltomasz thanks! Downloaded!

burtonator avatar Jan 27 '19 15:01 burtonator

Based on your metadata, it looks like 38% of your docs have DOIs and 82% have titles, which is pretty good. Better than I thought. However, the DOI information is very valuable, so I'll probably add support at some point to resolve them.

burtonator avatar Jan 27 '19 16:01 burtonator

Just to get an update on your discussion. What are the prospects?

bepolymathe avatar May 11 '20 09:05 bepolymathe

So, what is the progress regarding this feature?

abdalazizrashid avatar May 22 '20 09:05 abdalazizrashid

Any news here?

fishhead108 avatar Jan 10 '21 20:01 fishhead108

It looks like we can integrate this but I think we might have to fork the library to get this done. We also have to push it as a module into our own repo so it's a lot of busy work.

Then we need to test it and integrate it. I'm going to reassess because there might be a faster way but want to think about it first.

burtonator avatar Jan 10 '21 21:01 burtonator