Request: endpoint to find parent item key given an attachment key

Open mhucka opened this issue 3 years ago • 7 comments

This is a follow-up to @retorquere's suggestion to open an issue related to the discussion linked below.

A basic use case is the following: given a path to an attachment stored on disk in a user's local Zotero database, how can I find the Zotero record to which the attachment belongs? According to information from the Zotero developers, the folder containing the PDF file is named after the attachment key, meaning that in a path of the form ..../storage/PPMJQGRI/file.pdf, the folder name PPMJQGRI is the "attachment key". This attachment key is not the same as the item key for the record, but since BBT has the information about the attachments associated with each record, it may be possible to do a reverse lookup to find the item given an attachment key.
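
To make the path convention above concrete, here is a minimal Python sketch that pulls the attachment key out of a storage path. The function name `attachment_key_from_path` is hypothetical, and the pattern for a key (eight uppercase letters or digits) is an assumption based on the examples in this thread, not something taken from Zotero or BBT.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumption: keys look like the examples in this thread (8 uppercase letters/digits).
KEY_PATTERN = re.compile(r"[A-Z0-9]{8}")


def attachment_key_from_path(path: str) -> Optional[str]:
    """Return the folder name that follows 'storage' in the path, if it looks like a key."""
    parts = PurePosixPath(path).parts
    # Stop two short of the end: a key folder must be followed by a file name.
    for i in range(len(parts) - 2):
        if parts[i] == "storage" and KEY_PATTERN.fullmatch(parts[i + 1]):
            return parts[i + 1]
    return None
```

This sketch only handles POSIX-style paths; Windows paths would need `PureWindowsPath` or similar handling.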

In case it's relevant, I want to mention that my ultimate goal is to update Zowie to use BBT when possible, instead of what Zowie does now, which is to contact the Zotero servers to do the lookup. (This is implemented using Pyzotero.) The current process is quite slow. For users who already have BBT, it would probably be a massive speedup to do the lookup using BBT instead, and would have the additional advantage of working when the user isn't connected to the internet.

Discussed in https://github.com/retorquere/zotero-better-bibtex/discussions/1953

Originally posted by mhucka October 11, 2021

I have successfully used the incredibly useful BBT json-rpc interface for a number of operations, but cannot figure out how to do the following: given an attachment key, can I find out the item key for the Zotero record that contains that attachment?

More specifically, following the naming scheme used by the Zotero devs, if I'm looking at a PDF file on disk in my Zotero storage area, and this file has a path of the form ..../storage/PPMJQGRI/file.pdf, the folder name PPMJQGRI is the "attachment key". Using Pyzotero, it's possible to look up the parent record key, which I do in a program I wrote called Zowie. Now I'm trying to see if I can use BBT to get this information instead of using Pyzotero. It seems like BBT has everything needed, but I can't find a suitable rpc endpoint to make it work. The closest I can come is to export the entire database in (say) "jzon" format and then look for the file in the result, but this is much too slow. Is there a more direct way?

mhucka avatar Oct 13 '21 18:10 mhucka

Hi – I was trying to think of what would be useful as a method name for this, based on the examples on the JSON-RPC page, and also the code in json-rpc.ts. To go from attachment key to parent item key, it seems like it wouldn't be a method on item/NSItem. Perhaps it's something that would be an extension of collection/NSCollection? Or else, perhaps the start of a new class for attachment-oriented operations, something like NSAttachments and a method along the lines of attachments.itemkey(attachmentKey).

mhucka avatar Oct 20 '21 04:10 mhucka

Attachments are items, so that would be appropriate. Sorry for the long silence -- swamped at the moment. I'd be OK with a get that would get an item + attachments when it's a top-level item, or item + parent when it's not. It could just be the itemToExportFormat result for the item with an attachments array (if itemToExportFormat doesn't already add it) when it's a top-level item, and with a parent property holding the itemToExportFormat result for the parent otherwise. The itemToExportFormat result will have the citekey patched in automatically.
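
As a rough illustration of the result shape proposed above, the following Python sketch builds "item + attachments" for a top-level item and "item + parent" for a child item. Everything here is hypothetical: the function name `get_with_relatives`, the plain-dict item model, and the `parentKey` and `itemType` fields are stand-ins for illustration, not BBT's code or the actual output of itemToExportFormat.

```python
from typing import Any, Dict

Item = Dict[str, Any]


def get_with_relatives(items: Dict[str, Item], key: str) -> Item:
    """Return a copy of the item, plus its attachments (top-level) or its parent (child)."""
    item = dict(items[key])
    parent_key = item.get("parentKey")
    if parent_key is None:
        # Top-level item: collect the attachments that point back at it.
        item["attachments"] = [
            dict(child)
            for child in items.values()
            if child.get("parentKey") == key and child.get("itemType") == "attachment"
        ]
    else:
        # Child item (e.g. an attachment): embed the parent record.
        item["parent"] = dict(items[parent_key])
    return item
```

With a shape like this, a caller holding only an attachment key could read the parent record from the `parent` property of the result.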

retorquere avatar Oct 22 '21 21:10 retorquere

Thanks for the feedback. I'm not sure if you're envisioning the get function would take an attachment key as a value. In the situation I'm grappling with, all I have is a file path (and as a consequence, an attachment key, which is part of the file path), and not item keys or parent keys. The following part of the previous comment,

a get that would get an item + attachments when it's a top-level item, or item + parent when it's not

sounds like it's referring to using item keys as the argument to the get, which is not what I'm hoping it will be. But maybe I'm misunderstanding what's intended?

(PS. No need to apologize for delays – I have projects for which my response time can be measured on the scale of geological epochs. By comparison, your work on this project is faster than lightning, and I'm quite thankful!)

mhucka avatar Oct 25 '21 03:10 mhucka

Thanks for the feedback. I'm not sure if you're envisioning the get function would take an attachment key as a value. In the situation I'm grappling with, all I have is a file path (and as a consequence, an attachment key, which is part of the file path), and not item keys or parent keys. The following part of the previous comment,

Sorry for being unclear -- in Zotero, attachments (and notes) are just a specific type of item, and all items have an ID and a key. If you have the key, you can get the item (sort of -- see the explanation of item keys below), and once you have the item you can find out what type it is.

sounds like it's referring to using item keys as the argument to the get, which is not what I'm hoping it will be. But maybe I'm misunderstanding what's intended?

No, this part of the Zotero terminology isn't widely known or used; you have to know the internals of Zotero fairly well to know this.

There's another thing that you need to know when dealing with itemIDs/itemKeys. Sorry for the convoluted text, this is the best I can explain it.

Zotero is primarily a local, non-networked tool that also syncs. You can use Zotero perfectly fine without ever syncing. The reason that this is important is for understanding how Zotero identifies libraries and items.

Your private library and the groups you own or are a member of are all collectively known as "libraries" to your local install. All of these have a local libraryID, and if you have never synced, your personal library has a well-known ID that is the same across all users. If you sync your personal library, you are assigned a second ID, which is used in sync; let's call this the groupID. (Sort of. I'm simplifying here to sketch how Zotero works.) When you later sync in more groups, these are libraries to your local install, and they will have a local libraryID and a groupID used for sync. If you sync, those synced libraries/groups will share their groupID, but crucially not their (local) libraryID. When you sync in a group for the first time, it is assigned a new, sort of random, local libraryID, which is unique within that installation.

Items have a similar setup. Each has an itemID, which is only meaningful locally. This itemID identifies an item across all items in your local install. If you sync your own data to another PC, they will have different itemIDs. ItemIDs are only meaningful locally.

Additionally, primarily for the purposes of syncing your items, each item also has an itemKey (that's the thing you are talking about above) which is unique within the library that the item belongs to. Item keys are not necessarily unique across your libraries. Internally, Zotero addresses items by either itemID or libraryID + itemKey. It doesn't happen often, but I have seen instances where two items, each in a different library, had the same itemKey. Group IDs are globally unique, so groupID + itemKey is globally unique.

So, the point I was building towards: if you have just the itemKey, that isn't technically enough to identify an item, even if it can usually be used that way.
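
The addressing scheme described above can be sketched in Python as an index keyed by the pair (libraryID, itemKey). This is an illustration only: the `ItemAddress` alias, the `libraries_containing` helper, and the sample IDs used with it are hypothetical and say nothing about how Zotero stores items internally.

```python
from typing import Dict, List, Tuple

# An item is addressed by (local libraryID, itemKey); the key alone may be ambiguous.
ItemAddress = Tuple[int, str]


def libraries_containing(index: Dict[ItemAddress, str], item_key: str) -> List[int]:
    """Return every libraryID in the index that holds an item with this key."""
    return sorted(library_id for (library_id, key) in index if key == item_key)
```

If this helper returns more than one libraryID for a key, the key by itself does not identify a single item, which is the rare case described above.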

The part that's now suddenly fuzzy to me is how Zotero lays this out on disk. I don't recall the Zotero client adding anything to the path that would point to the library/groupID. I don't currently have synced groups to test with.

retorquere avatar Oct 25 '21 09:10 retorquere

Thanks for the detailed explanation! I understood some of that already, but not all of it. This will be useful as a reference.

This is disturbing news, though:

Additionally, primarily for the purposes of syncing your items, each item also has an itemKey (that's the thing you are talking about above) which is unique within the library that the item belongs to. Item keys are not necessarily unique across your libraries. Internally, Zotero addresses items by either itemID or libraryID + itemKey. It doesn't happen often, but I have seen instances where two items, each in a different library, had the same itemKey.

I did not realize that given an attachment key/item key like 22YG7R8A extracted from a file pathname like ~/Zotero/storage/22YG7R8A/The best paper ever.pdf, there might be more than one library to which it belongs. Currently, what I do in Zowie (using Pyzotero) is iterate over a list of the user's libraries, looking for the combination that returns a record. The algorithm is pretty trivial and is embodied in this function (the code is a bit verbose because of error checks and debug logging):

https://github.com/mhucka/zowie/blob/0e785541e6381a6ef4c5dee73bb17da0570ecbce/zowie/zotero.py#L124-L171

The implication of what you're saying is that there might be more than one library containing the same key, which means my simplistic approach is nondeterministic and will return the first one it finds at random without informing the user there might be another. I will need to document this, and maybe add an option to control the behavior somehow (perhaps by adding rules for the precedence of libraries chosen).

Back to the BBT API: with all this in mind, I now understand that it's necessary to give it both a library id and an item/attachment key. If the new endpoint could return something that identifies the parent record, that would be great. Since there is already a BBT API endpoint to get the list of the user's libraries, a caller could take an item key and iterate over the libraries, trying out different combinations until it finds one that works, or fails to find any matches. That's the approach I'm already taking in the context of Zowie, and it's been working (notwithstanding the just-discovered problem of nonuniqueness).
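
The iterate-over-libraries approach described above could be sketched in Python as follows, with one change: it checks every library and reports a collision instead of silently returning the first match. This is not Zowie's actual code or a real BBT call. The `lookup` callable is a hypothetical stand-in for whatever performs the per-library query, and `find_parent` and `AmbiguousKeyError` are names invented for this example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical per-library query: returns the parent item key,
# or None if the library has no attachment with this key.
Lookup = Callable[[int, str], Optional[str]]


class AmbiguousKeyError(LookupError):
    """Raised when the same attachment key is found in more than one library."""


def find_parent(
    library_ids: Iterable[int], attachment_key: str, lookup: Lookup
) -> Optional[Tuple[int, str]]:
    """Try each library; return (libraryID, parentKey), or None if nothing matches."""
    matches: List[Tuple[int, str]] = []
    for library_id in library_ids:
        parent_key = lookup(library_id, attachment_key)
        if parent_key is not None:
            matches.append((library_id, parent_key))
    if len(matches) > 1:
        libraries = [library_id for library_id, _ in matches]
        raise AmbiguousKeyError(f"{attachment_key} found in libraries {libraries}")
    return matches[0] if matches else None
```

Checking all libraries costs a few extra lookups, but it makes the nonuniqueness case visible to the caller, who can then apply a precedence rule or ask the user.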

mhucka avatar Oct 25 '21 14:10 mhucka

I did not realize that given an attachment key/item key like 22YG7R8A extracted from a file pathname like ~/Zotero/storage/22YG7R8A/The best paper ever.pdf, there might be more than one library to which it belongs.

This is not going to be the case. I don't know how Zotero lays out synced groups, but it cannot be the case that that path belongs to more than one library. If you have synced groups, the attachments must be identifiable in some way.

The implication of what you're saying is that there might be more than one library containing the same key, which means my simplistic approach is nondeterministic and will return the first one it finds at random without informing the user there might be another.

The implication is that I'm pretty sure it's possible to find out what library an attachment belongs to.

retorquere avatar Oct 25 '21 18:10 retorquere

The problem exists, technically, but it's deemed so unlikely to occur that we ought not worry about it: https://groups.google.com/g/zotero-dev/c/Wx86Jf3AL9s/m/HXs3wrCtBQAJ

retorquere avatar Nov 15 '21 14:11 retorquere