kiwix-serve and url encoding
In PR #764, @veloman-yunkan raised an important question (https://github.com/kiwix/libkiwix/pull/764#discussion_r866235011). As the issue title suggests, it is about url encoding.
In libkiwix, we have a function to urlEncode a string: https://github.com/kiwix/libkiwix/blob/master/src/tools/stringTools.cpp#L146-L230
This function has a boolean parameter `encodeReserved` and does the following (a rough sketch in code is given after the list):
- keep alphanumerical chars as they are
- keep chars in `-_.!~*'()` as they are
- if `encodeReserved` is false: keep chars in `;,/?:@&=+$` as they are (else encode them)
- encode every other char
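
A minimal sketch of that behaviour, assuming a standalone helper (this is not the actual libkiwix implementation, see the link above for the real code):

```cpp
#include <cctype>
#include <iomanip>
#include <sstream>
#include <string>

// Sketch of the behaviour described above, not the real libkiwix code.
std::string urlEncode(const std::string& value, bool encodeReserved)
{
    const std::string mark     = "-_.!~*'()";   // always kept as-is
    const std::string reserved = ";,/?:@&=+$";  // kept only when encodeReserved is false
    std::ostringstream os;
    for (const unsigned char c : value) {
        if (std::isalnum(c)
            || mark.find(c) != std::string::npos
            || (!encodeReserved && reserved.find(c) != std::string::npos)) {
            os << c;  // keep the char as-is
        } else {
            // percent-encode every other char
            os << '%' << std::uppercase << std::hex
               << std::setw(2) << std::setfill('0') << int(c)
               << std::nouppercase << std::dec;
        }
    }
    return os.str();
}
```

With this sketch, `urlEncode("A/Foo?x=1", false)` keeps `/`, `?` and `=` untouched, while `urlEncode("A/Foo?x=1", true)` gives `A%2FFoo%3Fx%3D1`.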
However, it doesn't correspond to the standard specified in RFC 3986 (see section 2.2). Reserved chars can be (at best) put in two sets:
- gen-delims = `:/?#[]@`
- sub-delims = `!$&'()*+,;=`
As the name of the sets suggests, they are used to delimit components in the url. Readers (url parsers) must separate the components first using the delimiters and then url decode each component. On the opposite side, writers must url encode the components and then compose them using the delimiters. This makes the delimiter chars sometimes encoded, sometimes not.
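
To illustrate the reading side of that rule, here is a minimal sketch (hypothetical helpers, not libkiwix APIs) that splits a query string on its delimiters first and only then percent-decodes each component:

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Minimal percent-decoder, just for this sketch.
std::string urlDecode(const std::string& s)
{
    std::string out;
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size()) {
            out += static_cast<char>(std::strtol(s.substr(i + 1, 2).c_str(), nullptr, 16));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}

// Split on the delimiters FIRST, then decode each key and value separately.
std::map<std::string, std::string> parseQuery(const std::string& queryString)
{
    std::map<std::string, std::string> result;
    size_t start = 0;
    while (start <= queryString.size()) {
        size_t end = queryString.find('&', start);
        if (end == std::string::npos)
            end = queryString.size();
        const std::string pair = queryString.substr(start, end - start);
        const size_t eq = pair.find('=');
        if (eq != std::string::npos)
            result[urlDecode(pair.substr(0, eq))] = urlDecode(pair.substr(eq + 1));
        start = end + 1;
    }
    return result;
}
```

Because the writer percent-encoded the value, `parseQuery("pattern=a%26b")` correctly yields `a&b` for the key `pattern`; decoding before splitting would have produced two bogus components instead.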
From what I understand of the spec, we should always build urls this way:
```cpp
std::string url = scheme + "://" + urlEncode(host);
for (auto& part: path_parts) {
  url += "/" + urlEncode(part);  // each path component encoded separately
}
char sep = '?';
for (auto& query_part: query) {
  url += sep + urlEncode(query_part.key) + "=" + urlEncode(query_part.value);
  sep = '&';  // only the first pair is prefixed with '?'
}
```
Here, `urlEncode` encodes the reserved chars. Of course, this is the simple version; a complete version should handle different schemes, authentication (`user@host`), anchors (`#title`), ....
However, things are not so easy.
Per the zim spec, the url of an article (the path of the item) must NOT be url encoded. So it can contain things like `/`, `?`, `&` and `#`.
And this is where it becomes funny:
- `/` is "normal". `/` must be interpreted by the server as a component separator. libmicrohttpd allows us to access the whole path (including `/`) and so we can find an entry with a `/`. We must not url encode `/` as it is interpreted by the browser to build absolute paths from relative paths (css styles, images, ...). By encoding `/` (`A%2FFoo` instead of `A/Foo`) we can get the entry from kiwix-serve but the styling is broken (the relative link `img.png` will resolve to `img.png` instead of `A/img.png`).
- `?` and `&` are interpreted by libmicrohttpd as a query string. We can access the query keys/values, but we cannot get the exact string. So we must urlencode `?` and `&` to prevent libmicrohttpd from interpreting them.
- `#` is interpreted by the browser as the anchor. The anchor is not passed to the server. So we must urlencode it to have it in the url.
So it seems we need several encoding functions, depending on the context:
- an `encodeComponent()` which encodes "everything" (everything but "normal" chars)
- an `encodeEntryPath()` which encodes everything but `/` (these first two are sketched below)
- an `encodeUrlSafe()` which encodes only the chars that can be misinterpreted by the html parser (`"`) but doesn't change the parsing of the url itself?
- a helper method to build a url from its different parts (`buildUrl(string host, vector<string> path, map<string, string> query, ...)`)?
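
For the first two, a rough sketch built on the same kind of percent-encoder as above (the names come from this proposal, they are not existing libkiwix APIs):

```cpp
#include <cctype>
#include <iomanip>
#include <sstream>
#include <string>

// Percent-encode everything except alphanumerics and the chars in `keep`.
static std::string percentEncode(const std::string& value, const std::string& keep)
{
    std::ostringstream os;
    for (const unsigned char c : value) {
        if (std::isalnum(c) || keep.find(c) != std::string::npos) {
            os << c;
        } else {
            os << '%' << std::uppercase << std::hex
               << std::setw(2) << std::setfill('0') << int(c)
               << std::nouppercase << std::dec;
        }
    }
    return os.str();
}

// Encode a single url component: everything but unreserved chars is escaped.
std::string encodeComponent(const std::string& component)
{
    return percentEncode(component, "-_.~");
}

// Encode a zim entry path: like encodeComponent() but keep '/' so that the
// browser can still resolve relative links against the path.
std::string encodeEntryPath(const std::string& path)
{
    return percentEncode(path, "-_.~/");
}
```

With these, `encodeEntryPath("A/file?1.html")` gives `A/file%3F1.html`: `/` stays usable as a separator while `?` can no longer be mistaken for the start of a query string.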
It seems that each kind of component may have its own specific separators left unencoded. However, it seems it is not a problem to encode them anyway. So for simplicity we can url encode "everything": no need for a specific `encodeFoo` for each `Foo`.
Did I miss something?
Question about scrapers (@rgaudin @kelson42): if we want articles containing `?`, `&` or `#` in their path, the links pointing to them must be properly urlencoded on the scraper side. I let you check that :)
As long as libkiwix operates as a webserver would, I think scrapers would be marginally impacted. We have two kinds of scrapers:
- built-from-scratch ones. Those usually don't make use of query parameters and use plain (UTF-8) links and paths. Use of fragments is frequent in links but not in paths.
- generic ones. Those tend to reuse links from original websites, so query strings and fragments are to be expected. Paths are transformed so none of those characters should be present. Paths are decoded (`urllib.parse.unquote`) UTF-8 strings. We'd generally use `urllib.parse.quote` for links, which is RFC 3986 compliant.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
I strongly believe the time has come to fix these kinds of encoding problems once and for all. @veloman-yunkan has already started, so I would suggest he tackles the problem as a whole. Automated testing is the key here.
@veloman-yunkan Any chance to get the final PR to close this ticket based on https://github.com/kiwix/libkiwix/compare/main...robust_uri_encoding ?
@kelson42 I think that the recent PRs #866, #870 and #890 have reasonably improved the situation. As noted in the description of #870, the solution proposed in this ticket looked overdesigned and I can't justify spending further effort on it. We'd better close this ticket now. We can reopen it later if new use-cases for URL encoding are discovered for which the current implementation falls short.