Remove OPDS "notag" parameter

Open kelson42 opened this issue 3 years ago • 14 comments

The problem with this notag parameter is that it complicates the requests and does not solve the full problem. For example, it is impossible to request ZIM files having tag1 OR tag2, and this seems to be a pretty legitimate and useful request.

Considering that we plan to use an in-memory Xapian DB to search in description/title (see https://github.com/kiwix/kiwix-lib/issues/106), I wonder whether we should also put the tags/categories in it and allow searching via simple Xapian operators (https://xapian.org/docs/queryparser.html)?

kelson42 avatar Mar 03 '21 12:03 kelson42

"notag" argument of a query on the opds service is independent of how we implement the library database (xml, xapian or whatever).

The "notag" query is used in kiwix-desktop to get the "other" category. kiwix-desktop doesn't know the exact list of available categories and uses a static list. So when we want to display all ZIM files that are not in the "known categories", we have no option other than excluding the ZIM files in the categories we know. If we had a way to know all categories (#318), we could instead filter to include only the ZIM files in the categories we don't display to the user.
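To make the workaround concrete, here is a minimal Python sketch of how a client could build such an "everything else" request. The function name and the short category list are hypothetical; only the lang and notag parameters and the _category: tag prefix come from the requests discussed in this issue.

```python
from urllib.parse import urlencode

# Hypothetical static list of categories the client knows about.
KNOWN_CATEGORIES = ["gutenberg", "mooc", "wikipedia"]


def other_category_query(lang, known_categories):
    """Build a query string excluding every known category through notag."""
    # notag takes a ';'-separated list of tags to exclude.
    notag = ";".join(f"_category:{name}" for name in known_categories)
    # Keep ':' and ';' readable instead of percent-encoding them.
    return urlencode({"lang": lang, "notag": notag}, safe=":;")


print(other_category_query("eng", KNOWN_CATEGORIES))
```

The weakness is visible here: the result is only as good as the static list, so any category missing from it silently ends up in "other".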

mgautierfr avatar Mar 03 '21 17:03 mgautierfr

@mgautierfr Thx for clarifying the usage of notag. To me:

  • Whatever way we "phrase the request", it seems that "not in this category_list" is the right approach, and we should be able to do that just by providing the NOT Xapian keyword in the search pattern.
  • Where the category_list comes from is another, somewhat unrelated problem (but your description of the problem seems correct to me), and #318 should allow us to solve it.

kelson42 avatar Mar 04 '21 07:03 kelson42

@veloman-yunkan Can we move on with this ticket? Maybe we can keep the "notag" parameter for a while for backward compatibility, but behind the scenes this should work with Xapian operators.

kelson42 avatar Mar 28 '21 09:03 kelson42

@kelson42 I propose to open a new ticket for utilizing Xapian for all other book fields too (currently, only title and description are indexed). This ticket can be closed together with that one.

veloman-yunkan avatar Mar 31 '21 19:03 veloman-yunkan

kiwix/kiwix-lib#484 was opened as proposed above

veloman-yunkan avatar Apr 10 '21 19:04 veloman-yunkan

@veloman-yunkan my bad, forgotten about your comment. perfect.

kelson42 avatar Apr 11 '21 05:04 kelson42

@veloman-yunkan https://github.com/kiwix/libkiwix/issues/484 and https://github.com/kiwix/kiwix-tools/issues/318 have been implemented, but it seems that the notag parameter is still there, and it is not obvious to me how to get a result similar to the following request with the tag= and/or category= filters (is that possible at all?!).

https://library.kiwix.org/catalog/search?lang=eng&count=-1&notag=_category:gutenberg;_category:mooc;_category:phet;_category:psiram;_category:stack_exchange;_category:ted;_category:vikidia;_category:wikibooks;_category:wikihow;_category:wikinews;_category:wikipedia;_category:wikiquote;_category:wikisource;_category:wikiversity;_category:wikivoyage;_category:wiktionary

kelson42 avatar Feb 06 '22 14:02 kelson42

@kelson42 Preserving only two notag values from your example, the query should be

https://library.kiwix.org/catalog/search?q=lang:eng%20-tag:_category:gutenberg%20-tag:_category:mooc

(the value of the q parameter with URL encoding removed is lang:eng -tag:_category:gutenberg -tag:_category:mooc).

However the : symbol in the tag values confuses the Xapian query parser, which utilizes that punctuation mark for separating the field name from the field value. For tags not containing the colon character, that query works OK.
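As an illustration, the following Python sketch assembles that kind of q value and URL-encodes it. The helper name is hypothetical; the endpoint and the lang: and -tag: terms are taken from the example above. It only builds the URL and does not work around the colon problem just described.

```python
from urllib.parse import quote


def catalog_search_url(lang, excluded_tags):
    """Build a catalog search URL whose q parameter excludes the given tags."""
    # One "-tag:<value>" term per excluded tag, separated by spaces.
    terms = [f"lang:{lang}"] + [f"-tag:{tag}" for tag in excluded_tags]
    q = " ".join(terms)
    # Encode spaces as %20 and leave ':' as-is, as in the URL above.
    return "https://library.kiwix.org/catalog/search?q=" + quote(q, safe=":")


print(catalog_search_url("eng", ["_category:gutenberg", "_category:mooc"]))
```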

veloman-yunkan avatar Feb 06 '22 17:02 veloman-yunkan

I somewhat disagree with the idea of exposing the Xapian request format in the API. Xapian is an implementation detail. Filtering was done without Xapian before, and that may change again in the future. If we allow the user to pass a plain Xapian query, then it becomes part of the API and is no longer an implementation choice.

We must define a search API and provide it to the user/client. If we use Xapian internally, we will have to transform the search query (from our API) into a Xapian query. If we decide to use something else, the API will not change.

mgautierfr avatar Feb 07 '22 16:02 mgautierfr

@mgautierfr OK in principle, but pragmatically this does not seem realistic. I don't want us to mock up or reimplement the full operator parsing... and we need these operators, right?

kelson42 avatar Feb 07 '22 17:02 kelson42

I don't know if we need them. It would be good to define what we need to support first.

The OPDS stream is mainly used by clients (programs) to get the list of available ZIM files and (potentially) filter them to display a subset to the user. There are three use cases:

  • Do a request to get all ZIM files (without filtering) and then filter the results locally (I think this is what is done in the iOS/macOS application).
  • Do a "filtered" request to get only the ZIM files to display to the user.
  • A mix of the two above. This is what is done in kiwix-desktop: we do a filtered request for lang and category, but we do local filtering for all other filters.

On the OPDS side, we apply filtering on the Xapian request (q), maximum size (maxsize), name (name), category (category), language (lang), tag (tag) and excluded tag (notag). The q parameter is passed as-is to Xapian, which allows searching on title, description, name, category, lang, publisher, creator and tag (IF the client developer knows how the Xapian DB is constructed and queried). This is available because the server parses the OPDS request and uses a "local" Library and Filter to implement the filtering. The Library/Filter is also what is used in local filtering (and it allows more than what is done in OPDS requests).

We extended the search parameters with kiwix/libkiwix#459 and provided a way to construct an OPDS request with kiwix/libkiwix#527, but we never updated the kiwix-desktop side, so we still construct the request ourselves and we still use tag=_category:Foo instead of category=Foo.

It seems to me that all those features are more an organic evolution and a combination of different features, not especially designed as a whole.

I would propose :

  • An API using the "functional" filters we need (name, description, lang, category, creator, publisher, video, image, details, size, date, tag ...). How we store the information (in a Xapian DB, in tags or in specific attributes) should not be relevant for the API.
  • Different (or duplicated) keys act as an AND: lang=en&category=wikipedia&tag=foo&tag=bar means ZIM files in English, in the wikipedia category, and with the tags foo and bar.
  • An OR is possible inside a key with a |: lang=en|fr&category=wikipedia&tag=foo&tag=bar|baz means ZIM files in English or French, in the wikipedia category, with the tag foo and with the tag bar or baz.
  • A NOT is possible with a ~ before the key (do we want a NOT?): lang=en&~category=wikipedia&tag=foo&~tag=bar&~tag=baz means ZIM files in English, not in the wikipedia category, with the tag foo but with neither the tag bar nor the tag baz. lang=en&~category=wikipedia&tag=foo&~tag=bar|baz means ZIM files in English, not in the wikipedia category, with the tag foo, and without the tag bar or without the tag baz.
  • It is not possible to search for technical (hidden) tags using tag (_category:wikipedia, _video:yes). Use the corresponding key instead. tag is used to search in the leftover tags (real tags).

It should not be so complex to implement, just check for the ~ at the beginning of the key and split the value with |. The ~key=foo|bar is a bit more complex but we can remove it from the API (I'm not sure we need it, I cannot find a use case)
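A minimal Python sketch of that parsing step, under the syntax proposed above (keys ANDed together, ~ prefix for NOT, | for OR inside a value). The function name and the tuple layout are hypothetical and this is not the existing libkiwix Filter code.

```python
from urllib.parse import parse_qsl


def parse_filters(query_string):
    """Parse the proposed syntax into (key, negated, alternatives) clauses.

    The clauses are meant to be ANDed together; the alternatives inside
    one clause are ORed.
    """
    clauses = []
    for raw_key, value in parse_qsl(query_string):
        negated = raw_key.startswith("~")  # "~key" means NOT
        key = raw_key[1:] if negated else raw_key
        clauses.append((key, negated, value.split("|")))  # "|" means OR
    return clauses


for clause in parse_filters("lang=en|fr&~category=wikipedia&tag=foo&tag=bar|baz"):
    print(clause)
```

What a negated clause with several alternatives (~tag=bar|baz) should mean is left open here, as it is in the proposal.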

mgautierfr avatar Feb 09 '22 15:02 mgautierfr

@mgautierfr I think your proposal covers current needs, but I'm not 100% supportive because:

  • It's not super flexible
    • Tends to create a lot of URL parameters (all the NOTs, and multiple of the same key if we want an AND operation between them)
    • Understanding the overall request (to review or build) is easy because to a large extent implicit and hidden (in the code)
    • Does not allow mixing of parameters with logical operator
    • Does not allow complex operation with other operators, like parenthesis for example
    • Does not really scale easily if in the future we want to allow more complex requests
    • We need a bit of code to fully reassemble the multiple URL query string values

I propose something alternative:

  • To a large extent, we reuse the Xapian query string syntax to express logical operations: AND, OR, NOT, groups, ...
  • First we allow such Xapian query strings in URL query string values (for the moment we only allow constant values). For example, the lang URL query string value could be (fr OR de) (or anything like this that Xapian can be fed with).
  • That would probably imply that our code recreates a Xapian query string and then feeds it to the Xapian query parser.
  • The code to write would be minimal: just concatenate and add "AND" between the URL values.
  • This would be pretty future proof, because there is almost no logic between the URL representation of the query and the Xapian representation.
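A rough Python sketch of the concatenation step, assuming each URL value is already a valid Xapian sub-expression (a single term or a parenthesised group) and that the URL key is used directly as the Xapian field prefix. Both assumptions and the function name are illustrative only.

```python
from urllib.parse import parse_qsl


def url_values_to_xapian(query_string):
    """Join "key=<xapian expression>" pairs into one Xapian query string."""
    # Each value is passed through untouched; only the prefix and the
    # "AND" between parameters are added.
    parts = [f"{key}:{value}" for key, value in parse_qsl(query_string)]
    return " AND ".join(parts)


print(url_values_to_xapian("lang=(fr%20OR%20de)&category=wikipedia"))
```

The values are not validated at all, so whatever the client sends reaches the Xapian query parser as-is.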

kelson42 avatar Feb 06 '23 13:02 kelson42

Tends to create a lot of URL parameters (all the NOTs, and multiple of the same key if we want an AND operation between them)

Your solution indeed has only one parameter. But that single parameter will contain the whole query and will itself include &/AND and such. I don't see a real improvement here.

Understanding the overall request (to review or build) is easy because to a large extent implicit and hidden (in the code)

(I suppose a "not" is missing here.) I disagree. The current implementation has a lot of implicit (and inconsistent) behavior ~~(see https://github.com/kiwix/kiwix-desktop/pull/965#issuecomment-1715316561: we have lang=foo,bar which is an OR and category=foo,bar which is an AND)~~ (this has been fixed). And ?lang=en|fr&~category=wikipedia&tag=video&tag=foo|bar&~tag=baz is not less readable than ?xapian_query=lang:(en OR fr) AND NOT category:wikipedia AND tag:video AND tag:(foo OR bar) AND NOT tag:baz

Does not allow mixing of parameters with logical operator

I'm not sure I understand what the need is here. My solution provides logical operators, so you must be thinking of something specific.

Does not allow complex operation with other operators, like parenthesis for example

Indeed. Do we really need it ?

Does not really scale easily if in the future we want to allow more complex requests

Difficult to say which future complex requests, I suppose. But I'm not sure we will support or need more than querying a subset of the available books by applying a set of filters.

We need a bit of code to fully reassemble the multiple URL query string values

You will need a bit of code to fully reassemble the multiple filter in a xapian query string.

This would be pretty future proof, because almost no logic between URL representation of the query and Xapian representation

Indeed. But I really dislike exposing an API from our dependency as our main API. If for some reason we need/want to move out of xapian, we are stuck.

The mapping from my solution to the xapian query is pretty easy :

  • lang=en|fr&~category=wikipedia&tag=video&tag=foo|bar&~tag=baz :
  • Replace | by OR in the value. If there is a | in the value, enclose it in (). => lang=(en OR fr)&~category=wikipedia&tag=video&tag=(foo OR bar)&~tag=baz
  • Replace key= by key: and ~key= by NOT key: => lang:(en OR fr)&NOT category:wikipedia&tag:video&tag:(foo OR bar)&NOT tag:baz
  • Replace & by AND => lang:(en OR fr) AND NOT category:wikipedia AND tag:video AND tag:(foo OR bar) AND NOT tag:baz

(Of course, we would probably not use simple string replacement but a higher-level object to do the conversion; we already have a Filter class which parses a query string and creates a Xapian query.)
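The three steps above can be sketched in Python as follows. This is an illustrative string-level conversion with a hypothetical function name, not the Filter class mentioned above, and it assumes the URL keys match the Xapian field prefixes.

```python
from urllib.parse import parse_qsl


def filters_to_xapian(query_string):
    """Convert the proposed URL filter syntax into a Xapian query string."""
    clauses = []
    for raw_key, value in parse_qsl(query_string):
        negated = raw_key.startswith("~")
        key = raw_key[1:] if negated else raw_key
        alternatives = value.split("|")
        if len(alternatives) > 1:
            # Step 1: "|" becomes OR, and the group is parenthesised.
            value = "(" + " OR ".join(alternatives) + ")"
        # Step 2: "key=" becomes "key:" and "~key=" becomes "NOT key:".
        clause = f"{key}:{value}"
        clauses.append(f"NOT {clause}" if negated else clause)
    # Step 3: "&" becomes AND.
    return " AND ".join(clauses)


print(filters_to_xapian("lang=en|fr&~category=wikipedia&tag=video&tag=foo|bar&~tag=baz"))
```

A query made only of negated clauses would start with NOT, which may need special handling on the Xapian side; the sketch does not address that case.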

mgautierfr avatar Sep 12 '23 15:09 mgautierfr

I like this proposal because it covers a lot with minimal effort (I think!), acknowledging that it has known limitations. It's a good way to bring clarity, readability and flexibility to existing use cases (should they all be met).

Maybe all current OPDS API users could list their current and expected use cases (and query format). That would help ensure there's no dead spot, and could serve to write unit tests.

If all are met with this and it's as easy as it sounds to implement, it could be a good solution for now (my opinion!)

rgaudin avatar Sep 12 '23 16:09 rgaudin