Add `papis tag` command
This is a draft of a new papis tag command that allows adding/removing/clearing tags. There are some docs inside to show you what's possible.
This is mostly just a first working draft to get some feedback. Here are some questions/considerations I'm unsure about:
- Is this a useful thing to ship with papis? We already have
papis updatethat can do some of whatpapis tagcan do. Is there too much overlap? Should this maybe rather be an additional script? - What functionality would you like to see? Maybe it could be useful to apply all tags from one doc to another for instance...
- Better idea for the names/structure of the options? For instance I would have liked to use
--addfor adding tags, but the short version of that is-awhich is already used by--all. I've also considered using+TAGand-TAG, but haven't found a straightforward way of implementing this in python click (and it isn't consistent with other papis commands). - I'm not all that well versed with python, so any suggestions for improvements are very welcome.
- I've added a new config option for the
tag-format. I've made some assumptions regarding how people list their tags, namely that they're either a list, or a string separated by,,;or. I've considered allowing any arbitrary separators, but that's not very easy to set in the config because of the white space. I've also considered trying to detect the tag-format automatically (which I actually do inpapis.nvim), but opted against it as it isn't foolproof (which is more acceptable forpapis.nvimas it doesn't edit any documents).
Note that there might be bugs in here, so don't use with your actual library. Also, it requires explicitly setting the tag-format in your config.
EDIT: Bunch of errors in the CI, looks like my winging of the typing system hasn't quite gone right (and I also introduced a bunch of trailing white space when I wrote the explanations). :sweat_smile: I'll get to these!
Thanks a lot for the feedback, it's super useful!
It would also be helpful to list all the existing tags and maybe do some deduplication of some sort. For example, I have some tags with fluid-mechanics and fluid-dynamics (depending on how I felt at the time) and would love to merge them!
Yes, that's something I've been wanting as well. Do you think it might make sense to maybe implement it in papis doctor (but use the functionality we'll get from papis tag)?
Maybe having some subcommands like papis bibtex?
papis tag QUERY add TAG1 TAG1 remove TAG3 TAG4
Yes, this seems like a better way to do it. Thanks!
I think a more flexible option would be to have a tag-format-separator and use that if the tags field is a string. Maybe even just standardize on the tags being a list in the info.yaml file already? Then the separator would be just a fallback thing..
I think it would make sense to standardise things to lists, but then I'm using lists and I'm possibly biased :sweat_smile:. In any case, that would probably be something to do at a later point and with some conversion path. I've added some comments in the responses regarding why I've required explicitly setting the tag-format option. If there's a way to assume lists (except if tag-format-separator is set) without potentially leading to inconsistencies, I'm happy to do it. Let me know what you think!
Yes, that's something I've been wanting as well. Do you think it might make sense to maybe implement it in
papis doctor(but use the functionality we'll get frompapis tag)?
Yeah, that's a good idea! The duplicate-refs check is already very similar, so it should fit just fine.
I think it would make sense to standardise things to lists, but then I'm using lists and I'm possibly biased 😅. In any case, that would probably be something to do at a later point and with some conversion path. I've added some comments in the responses regarding why I've required explicitly setting the
tag-formatoption. If there's a way to assume lists (except iftag-format-separatoris set) without potentially leading to inconsistencies, I'm happy to do it. Let me know what you think!
Yeah, I think it also makes sense to standardize on lists. Then we wouldn't have to worry about the tag having whitespace in it, or having the separator as part of the tag for some reason, or any other issue like that.
Maybe we can have support for reading the tags in a variety of formats, but when we write them back it's always a list? That way users will slowly migrate to that format in all their documents.
It would also be helpful to list all the existing tags and maybe do some deduplication of some sort. For example, I have some tags with fluid-mechanics and fluid-dynamics (depending on how I felt at the time) and would love to merge them!
Yes, that's something I've been wanting as well. Do you think it might make sense to maybe implement it in
papis doctor(but use the functionality we'll get frompapis tag)?
Actually this should just work with
papis tag --add fluid-dynamics --remove fluid-mechanics QUERY
so maybe not a great suggestion :)) Guess this command is awesome already! :heart:
Actually this should just work with
papis tag --add fluid-dynamics --remove fluid-mechanics QUERYso maybe not a great suggestion :)) Guess this command is awesome already! ❤️
Actually, I was thinking to somehow search the library for suspiciously similar tags and then propose to merge them. Not sure that would work with your example, but I've got lots that just differ in capitalisation etc.
Yeah, I think it also makes sense to standardize on lists. Then we wouldn't have to worry about the tag having whitespace in it, or having the separator as part of the tag for some reason, or any other issue like that.
Maybe we can have support for reading the tags in a variety of formats, but when we write them back it's always a list? That way users will slowly migrate to that format in all their documents.
Or we could also tell people that papis tag only supports lists and give them the option of converting everything to lists. That way papis tag can be simple and we don't even need a tag-format option. We could have a papis tag --convert-from "SEPARATOR" (or so) command that converts everything...
@alejandrogallo, what do you think about standardising tags to lists? I'm asking because my suspicion is that you're using strings (as existing docs use them). Are there any use cases we're not considering for which lists aren't well suited?
Great new feature than this add/remove tag options. I support it.
@jghauser, what do you mean by "standardising tags to lists" ? Writing a list format "[tag1", "tag2 with space", ...] in the tag field of the yaml file (or formated like the author_list or citations fields...) ? I putting here a python list format but I don't know what it is the best way to store a list in a yaml file... I support the idea as soon as it can be easily parsed directly from the yaml file (with a tool like python json/yq/jq...).
BUT, what about the databases? I use the whoosh database personally, which let you perform complex searches like year:2020 AND tags:neolithique australia (this syntax here search for docs with tag 'neolithique' AND 'australia'), and I am wondering about the easiness of dealing with list instead of strings in the database. I think than if this change is implemented, it is important to modify the databases accordingly if needed to be sure it doesn't disable any feature the databases previously provided.
Thanks for the input @Twix53791!
@jghauser, what do you mean by "standardising tags to lists" ? Writing a list format
"[tag1", "tag2 with space", ...]in the tag field of the yaml file (or formated like the author_list or citations fields...) ? I putting here a python list format but I don't know what it is the best way to store a list in a yaml file... I support the idea as soon as it can be easily parsed directly from the yaml file (with a tool like python json/yq/jq...).
it would be like this:
tags:
- shared intentionality
- cognitive science
- developmental psychology
BUT, what about the databases? I use the whoosh database personally, which let you perform complex searches like
year:2020 AND (tags:neolithique OR tags:australia), and I am wondering about the easiness of dealing with list instead of strings in the database. I think than if this change is implemented, it is important to modify the databases accordingly if needed to be sure it doesn't disable any feature the databases previously provided.
I'm not entirely sure how this is handled (maybe @alexfikl can chime in), but it seems to work as expected. I'm also using the whooh database, and when i search for e.g. tags:"shared intentionality" it seems to give me all documents containing that tag.
I'm not entirely sure how this is handled (maybe @alexfikl can chime in), but it seems to work as expected. I'm also using the whooh database, and when i search for e.g.
tags:"shared intentionality"it seems to give me all documents containing that tag.
I think it's just stringifying the list, so it does a substring search in the string '["shared intentionality", "cognitive science", "developmental psychology"]' and it finds stuff as expected.
Well the format you propose is fine for me (I am telling that because I use yq to parse the papis yaml file in another program I am writing, a "fzf browser" for papis papis-fzf).
But I just tried a bit the whoosh search and I think the database, yes, stringifying the list like that (I checked directly the whoosh database) : ['cognitive science', 'intentionality']. But, about the search, it does not consider the list elements as one unique entity.
I tried with a document:
tags:
- cognitive science
- intentionality
- developmental psychology
All these searches match it:
papis list "tags:cognitive AND tags:intentionality"
papis list "tags:science AND tags:intentionality"
`papis list "tags:science AND tags:psychology"
It is not because it matches a part of the string, papis list "tags:scie doesn't match.
More problematic. Two entries:
tags:
- cognitive science
- intentionality
tags:
- science
- intentionality
- cognitive
Are both matched by "tags:science AND intentionality", worse by tags:cognitive science and even tags:science cognitive... I am not sure we can change that as it is probably a feature from whoosh?
Maybe not a big deal, but a bit dirty. At least people should be aware than multi-words tags are not safe. Personaly, I prefer at the end use separate tags "cognitive", "science", etc. or write with dashes "-" or "_" for spaces... Just to be sure than I am not gonna cite by error a reference while building a bibliography from a tag.
PS: Ah, I forgot there is not such a thing yet in papis, a citation styles manager like in zotero and the possibility to just select a tag and "exporting" (but with the citation style formatted) all the entries in less than a tenth of a second... Papis-fzf will fix that before we implement it to papis!
I'm not 100% sure, but it seems to me like all the queries you listed work as expected. I'll try go give some explanations below. If you're still not sure about it, maybe open a Discussion instead so we don't clutter this PR too much?
But I just tried a bit the whoosh search and I think the database, yes, stringifying the list like that (I checked directly the whoosh database) :
['cognitive science', 'intentionality']. But, about the search, it does not consider the list elements as one unique entity.
Yeah, that's correct, in stringifying the list, it should consider the elements all together, i.e. it no longer considers them separate tags, for better or worse.
I tried with a document:
tags: - cognitive science - intentionality - developmental psychologyAll these searches match it:
papis list "tags:cognitive AND tags:intentionality"papis list "tags:science AND tags:intentionality"papis list "tags:science AND tags:psychology"
Those are all expected to me, because the tags list contains all of those terms (although in different tags).
It is not because it matches a part of the string,
papis list "tags:sciedoesn't match.
I said "substring search" before, but that was just a shorthand on my side, I'm sorry. The whoosh backend uses the query language defined by the database here: https://whoosh.readthedocs.io/en/latest/querylang.html. You can also control what gets put in the database with whoosh-schema-prototype.
I'm not very familiar with it, but it seems like you need to say tags:"scie*" to get partial matches. The way that works may be a bit weird, but I'm not sure we have much control over it..
More problematic. Two entries:
tags: - cognitive science - intentionality tags: - science - intentionality - cognitiveAre both matched by
"tags:science AND intentionality", worse bytags:cognitive scienceand eventags:science cognitive... I am not sure we can change that as it is probably a feature from whoosh?
This might be because whoosh considers the space a separator and those are actually separate queries. Have you tried if tags:"cognitive science" vs tags:"science cognitive" gives the desired result?
Well, yes I think it is the normal whoosh way to handles such things. Personally, I don't like it, as in fact it create a difference with only-word tags. I terms of design, when an algorithm behaves differently according to non explicit parameters like the typo here, I think it is failed.
tags:
- cognitive science
- developmental psychology
tags:
- cognitive_science
- developmental_psychology
Both tags are the same for a human mind, but whoosh will behave differently in its search....
Anyway, it is also true than the list way doesn't interference with the one-word tags syntax. So it changes absolutly nothing for someone who doesn't use multiple words tags. It is just a positive adding of a new functionality, whitout interferences with previous fonctionality, so I think it is OK for everyone. You tested it with the native papis database also?
I'm finally getting some time to finish this off. I've got two last questions before I implement everything mentioned here:
- I'm thinking of entirely ditching handling different tag formats. I think standardising to lists makes sense as it will simplify things, avoid more options, and doesn't really have any drawbacks. We could then, when releasing the next version, announce this and offer an update path (which we've already implemented in
papis doctor, see #656 ). I would still leave a check to see that the type found in theinfo.yamlis indeed a list, and if not warn the user. Do you think we should additionally have a warning saying that the command assumes lists (so people don't inadvertently add list tag items when their library is using strings)? Does papis have some functionality for displaying a warning only when called for the first time? - I kind of like the
papis QUERY add TAG1 remove TAG2synax, but I'm also a bit wary as it feels like going against the syntax used in most current papis commands. In fact, the situation is a bit confusing in papis:papis cacheandpapis bibtexuse grouped subcommands, but most other things do not. Is there some system behind when to use what or is it mostly about what's needed to most easily implement the functionality?
As always I apologize for the delay in my reply. There are really great ideas in this feed, these are my comments/thoughts
- I'm thinking of entirely ditching ...
You're right that I use space separated tags. In fact for me tags are just a string that I'll match agains, and I use it
for topics or projects, like papis open tags:proj-lala.
Downsides of tags as a string
- In order to have spaces in the tag, it is not really possible at the moment. This might be very upsetting for some people.
- Inherently it is then not clear with separator to use, I have an unconscious and innate hatred of commas and I guess that is why papis is more geared to spaces.
Downsides of tags as a list
- When editing manually documents, users must know that the format of the tags are such of a list. But the upside is that with your command we can have a more lean and short way of dealing with tags. X Apparently whoosh and most databases can just store the stringified python list in the database, so this is unlikely to bite us in the future. X Tags are non-greppable. (we can discuss if this is a downside) I think it is not valid since yaml is not supposed to be grepped for.
I would say that I support just having a way of storing tags, namely as a list. It will make the implementation and documentation clearer. The implementation is not really super complicated, in the sense that we could affort having this logic here, but I think if we are going to officially support tags like citations or notes, then we might as well choose a single format.
- I propose writing a function that checks for the formatting of the tags and offers to convert tags to their list format.
- I kind of like the
papis QUERY add TAG1 remove TAG2synax...
This kind of way by now is kind of typical to papis, for better or for worse.
papis explore also has this kind of behaviour.
This kind of behaviour is handy whenever you need a kind of pipe.
In the case of explore, you want to search for documents
online and you might pick one, or export them or add one to the
library.
In the case of tags, I don't see this rationale, therefore I would
be rather in favour of --foo flags.
Other commands
I would like to have a papis tag -al or
papis tag --all --list command. I think it should be straightforward.
In the future we can also implement a cache for tags.
External API
I would suggest creating a file à la citations.py
in papis/tag.py
in order to have there two functions to do the tag
managment. This can help me then to use the functions
from the web application. Soemthing lke
papis/tag.py
def get(doc: papis.document.Document):
tags = papis.get("tags", [])
if not isinstance(tags, list):
raise deprecation warning and parse according to space separation
return re.split(r"\s+", tags)
return tags
def add(doc: papis.document.Document, tag: str) -> None:
pass
def remove(doc: papis.document.Document, tag: str) -> None:
pass
def remove(doc: papis.document.Document, tag: str) -> None:
pass
Thanks a lot for the feedback @alejandrogallo, this is very useful!
- I propose writing a function that checks for the formatting of the tags and offers to convert tags to their list format.
The idea is to tell users (with that message that is shown on first use) that they can convert their tags to the required format with something like papis --set doctor-key-type-separator ', ' doctor --check key-type --fix --all.
This kind of way by now is kind of typical to papis, for better or for worse.
papis explorealso has this kind of behaviour. This kind of behaviour is handy whenever you need a kind of pipe. In the case ofexplore, you want to search for documents online and you might pick one, or export them or add one to the library.
In the case of tags, I don't see this rationale, therefore I would be rather in favour of
--fooflags.
Thanks for the explanation, that makes a lot of sense.
Other commands
I would like to have a
papis tag -alorpapis tag --all --listcommand. I think it should be straightforward.In the future we can also implement a cache for tags.
I think this is already implemented in the draft. There's an --all flag that will apply the tagging to all matching documents. Is that what you had in mind?
I would suggest creating a file à la
citations.pyinpapis/tag.py
Ok!
Ok, I have had an entirely different idea :sweat_smile:. Instead of adding a papis tag command, we could also improve the papis update command so it can handle lists (i.e. adding and removing items to/from a list). Then changing tags would just involve that command. It would also mean we could use papis update for the proposed related_to field, even if it is a list. As far as I can tell, all list items in the info.yaml only include unique items, so making a command that that can handle everything doesn't seem impossible.
Ok, I have had an entirely different idea 😅. Instead of adding a
papis tagcommand, we could also improve thepapis updatecommand so it can handle lists (i.e. adding and removing items to/from a list). Then changing tags would just involve that command. It would also mean we could usepapis updatefor the proposedrelated_tofield, even if it is a list. As far as I can tell, all list items in theinfo.yamlonly include unique items, so making a command that that can handle everything doesn't seem impossible.
I like that! papis update should be able to handle most of this with a few more flags (--append at least) and could nicely generalize to other keys. If it turns out that we need some special handling for tags, we can always make a separate papis tag command that reuses some of the logic from papis update.
hmmm I don't know, so right now this logic can hold.
We currently support kind of 3 things, papis_id, notes and citations.
For only one of these there is an ad-hoc command, papis citations.
We can update the papis update command nevertheless, IMHO this would be
-
papis update --append <thing> <key> -
papis update --list-append/--push <thing> <key> -
papis update --remove <key> -
papis update --list-remove <thing> <key>
Only, how do we do for listing all tags? The above approach fails to deliver this. I think from the point of view of user friendliness, it is much clearer that we explicitly support tags if we have a tag command.
I am therefore for both, tags command and update of update ;)
I think personnaly also than a tag command would be clearer.
Thanks for all of the input. I'm gonna improve the papis update command (along the lines of what @alejandrogallo suggested) and then implement papis tag mostly just re-using the logic of papis update :)
@alexfikl, with all the work done in papis update, adding a papis tag is now quite easy. I hacked this together quickly, but it seems like everything just works as expected :laughing:. I've had one issue with typing, if you could take a look at the NOTEs in tag.py, that would be very helpful. Also, any other comments you have are of course also appreciated!
EDIT: Just FYI because it's been a while. I've implemented the list-only version of the command as we discussed previously. There's a warning to this effect, and a config option with which one can make the warning go away :).
EDIT2: One more question: I'm thinking to remove the --batch flag, because all the ways in which papis update could fail (and which --batch is meant to ignore) can't happen with papis tag (I think). Now, do you think I should still check the success values returned by the various functions from papis update used in papis tag, to catch errors just in case of something unforeseen (and maybe have a message à la "If you encounter this error, please open a bug report"? Or should I just ignore all of these return values?
@alexfikl thanks a lot for all the comments -- they're very helpful! Before i implement these changes and add some tests, what's your view on this (originally an edit to an above message):
EDIT2: One more question: I'm thinking to remove the
--batchflag, because all the ways in whichpapisupdate could fail (and which--batchis meant to ignore) can't happen withpapis tag(I think). Now, do you think I should still check the success values returned by the various functions frompapis updateused inpapis tag, to catch errors just in case of something unforeseen (and maybe have a message à la "If you encounter this error, please open a bug report"? Or should I just ignore all of these return values?
@alexfikl thanks a lot for all the comments -- they're very helpful! Before i implement these changes and add some tests, what's your view on this (originally an edit to an above message):
EDIT2: One more question: I'm thinking to remove the
--batchflag, because all the ways in whichpapisupdate could fail (and which--batchis meant to ignore) can't happen withpapis tag(I think).
Hm, it seems like it serves to purpose here, so I'm fine with removing it.
Now, do you think I should still check the success values returned by the various functions from
papis updateused inpapis tag, to catch errors just in case of something unforeseen (and maybe have a message à la "If you encounter this error, please open a bug report"? Or should I just ignore all of these return values?
If it's not too confusing, it might be worth to check the success status and complain if something unexpected happens. The main issue with these things is, like you said, avoiding to mess with the user's library..
Ok perfect, that's how I see it too. Thanks for the feedback :blush:
I've removed the --batch flag and now papis tag just checks if we find non-lists and aborts if it does. I've added some tests too (copied and adapted from papis update). So, this should be getting ready :).
Two questions though:
- In the tests, we always pass
TemporaryLibraryinto the functions, even if it's not actually used (at least that's what I saw in some random other files I looked at). I realised that both in the tests forpapis updateandpapis tag, not a single test uses this. Should I still leave it in? - Messages given when running
papis tagwill sometimes come frompapis update, which I think is a bit confusing. This reminded me of something @alejandrogallo said when we discussed this command a while ago, namely that he'd like the core functionality ofpapis tagto be accessible with an API in antag.pyor so in the basepapisdirectory. His reasoning was to then use this from the webapp, but I guess it could also help with the messages. My question is if there's a policy/custom on how e.g. commands may refer to each other: is it ok? should it always be via an 'API' module? Should I create aupdate.pyand atag.py? And should we merge this first and then fix both of them (given that we've already merged the changes topapis update?
In the tests, we always pass
TemporaryLibraryinto the functions, even if it's not actually used (at least that's what I saw in some random other files I looked at). I realised that both in the tests forpapis updateandpapis tag, not a single test uses this. Should I still leave it in?
Yes! Those are pytest fixtures that set up a temporary configuration and library somewhere in /tmp/ for use in the test. This actually gets used even if you don't use the tmp_library variable because it points to a different config file and a small test library. For example,
- Any time the test calls a function that calls
papis.config.get, it looks in/tmp/some-folder/papis/configinstead of~/.config/papis/configwhere the global user config is. - Any query for documents will return documents in the small test library instead of some user library.
This is only relevant when running the tests locally (not the CI). Plus, I don't think this is relevant for you, since you run the tests in a docker, but I just use a virtual environment and it would pick up my global config.
Messages given when running
papis tagwill sometimes come frompapis update, which I think is a bit confusing. This reminded me of something @alejandrogallo said when we discussed this command a while ago, namely that he'd like the core functionality ofpapis tagto be accessible with an API in antag.pyor so in the basepapisdirectory. His reasoning was to then use this from the webapp, but I guess it could also help with the messages. My question is if there's a policy/custom on how e.g. commands may refer to each other: is it ok? should it always be via an 'API' module? Should I create aupdate.pyand atag.py? And should we merge this first and then fix both of them (given that we've already merged the changes topapis update?
I don't think there's any particular policy (?) for that sort of stuff. There are some commands that call papis edit here and there, so it should be fine.
Not sure what @alejandrogallo had in mind, but it would be nice to have a top level tags.py file to help with handling tags. We can always refactor the command and move some things around though, so it should be fine to merge this now and do that when we have some time!
One issue that comes to mind with an API for this sort of stuff is that most of the difficulty here comes from the fact that we get a string from the command line and need to know what to do with it when adding it into the document. However, in code this should "just work" because we can pass in a list or an int and not do the try_parsing_str and isinstance(value, key_types[key]) stuff..
Messages given when running papis tag will sometimes come from papis update, which I think is a bit confusing.
This is mostly just a small design issue. If instead of returning success flags, the functions raised an exception, the caller could handle the message and we could write out something specific for update vs tag.
I guess for now, it should be sufficient to just remove very explicit call "papis update ..." sort of messages from those functions? Most of them were ok from what I recall..
Yes! Those are pytest fixtures that set up a temporary configuration and library somewhere in
/tmp/for use in the test. This actually gets used even if you don't use thetmp_libraryvariable because it points to a different config file and a small test library. For example,
Oh, I see! Thanks for the explanation :)
Not sure what @alejandrogallo had in mind, but it would be nice to have a top level
tags.pyfile to help with handling tags. We can always refactor the command and move some things around though, so it should be fine to merge this now and do that when we have some time!One issue that comes to mind with an API for this sort of stuff is that most of the difficulty here comes from the fact that we get a string from the command line and need to know what to do with it when adding it into the document. However, in code this should "just work" because we can pass in a list or an int and not do the
try_parsing_strandisinstance(value, key_types[key])stuff..
Ok, I'll make a note of that and sort this out later.
This is mostly just a small design issue. If instead of returning
successflags, the functions raised an exception, the caller could handle the message and we could write out something specific for update vs tag.I guess for now, it should be sufficient to just remove very explicit
call "papis update ..."sort of messages from those functions? Most of them were ok from what I recall..
I just went through all the messages and only some of those that papis tag can't reach (because they have to do with type mismatches) contain references to papis update, so we should be fine. I hadn't thought of using exceptions -- that makes a lot of sense! I'll make a note of that too, and improve this later (it would make sense to do all that in one fell swoop when factoring out things to toplevel tags.py and update.py files).
@alexfikl, do you think this is ready to merge? :)
I've implemented all the changes and force pushed them to my fork but somehow they don't show up here :thinking:
EDIT: ok, now it shows up!
Thanks for all the help @alexfikl! :heart: