libkiwix

Provide article abstract

Open automactic opened this issue 7 years ago • 6 comments

Currently, we use Xapian::MSet::snippet to generate snippets. The result is often less than optimal.

Take the result "Secure shell" when searching for "ssh" as an example. The snippet we provide is:

...19] In 2006, a revised version of the protocol, SSH-2, was adopted as a standard. This version is incompatible with SSH-1. SSH-2 features both security and feature improvements over SSH-1. Better security, for example, comes through Diffie–Hellman key exchange and strong integrity checking via message authentication codes. New features of SSH-2 include the ability to run any number of shell sessions over a single SSH connection.[20] Due to SSH-2's superiority and popularity over SSH-1, some...

A much better snippet would simply be:

Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network.

The purpose of a snippet is to provide a quick overview of the article, so it is important for it to be glanceable and scannable. Ideally it is a simple, relatively short statement or definition rather than a long sentence with many clauses.

It might be a good idea, for all MediaWiki-based ZIM files, to simply use the first sentence as the snippet. (In fact, that is what the second snippet is. It is also what Google provides when you search for "ssh".)
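To make the "first sentence" idea concrete, here is a minimal sketch in Python using only the standard library. It is not libkiwix or libzim code: the names `FirstParagraphText` and `first_sentence` are hypothetical, and it assumes the article lead lives in the first non-empty `<p>` element, which may not hold for every ZIM file.

```python
import re
from html.parser import HTMLParser


class FirstParagraphText(HTMLParser):
    """Collects the text of the first non-empty <p> element (assumed to be the lead)."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            if "".join(self.chunks).strip():
                self.done = True   # keep the first paragraph that has text
            else:
                self.chunks = []   # skip empty paragraphs

    def handle_data(self, data):
        if self.in_p and not self.done:
            self.chunks.append(data)


def first_sentence(html: str) -> str:
    """Return the first sentence of the first non-empty paragraph."""
    parser = FirstParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())  # normalise whitespace
    text = re.sub(r"\[\d+\]", "", text)              # drop footnote markers like [19]
    # Naive boundary: sentence punctuation, whitespace, then a capital letter.
    match = re.search(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return text[:match.start()] if match else text
```

The sentence boundary rule is deliberately naive and would split too early on abbreviations followed by a capitalised word, so a real implementation would need something more robust.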


Xapian::MSet::snippet provides snippets based on where the query matches. As you can see, the Xapian snippet contains a lot of "ssh".

However, I would argue that users are more interested in quickly determining whether the article interests them than in how their query matches the article content.

As a side effect, the snippet becomes dependent on the search term. If I search for "Secure shell" instead, the snippet for the same article would be totally different, which is not really how it should behave.
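The query dependence can be illustrated with a toy stand-in. This is not Xapian's algorithm: `query_window_snippet` is a hypothetical helper that just returns a window of text around the first match, which is enough to show why two queries give two different snippets for the same article.

```python
def query_window_snippet(text: str, query: str, width: int = 60) -> str:
    """Toy stand-in: return a window of `width` characters around the first match."""
    pos = text.lower().find(query.lower())
    if pos < 0:
        return text[:width]            # no match: fall back to the start of the text
    start = max(0, pos - width // 2)   # centre the window roughly on the match
    return text[start:start + width]


article = (
    "Secure Shell (SSH) is a cryptographic network protocol. "
    "In 2006, a revised version of the protocol, SSH-2, was adopted as a standard."
)

# Same article, two queries, two different snippets.
print(query_window_snippet(article, "secure shell"))
print(query_window_snippet(article, "ssh-2"))
```

An abstract taken from the article lead would be the same string for both queries.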

Another benefit would be the ability to implement "knowledge glances" when the user hovers over or long-presses a link. (This feature already exists on Wikipedia websites.) The snippet could then exist without a search text and provide a basic description of the article.

automactic avatar Jun 13 '18 18:06 automactic

@automactic Would be great to have more details about what should be done here.

kelson42 avatar Jun 13 '18 18:06 kelson42

@kelson42 I haven't finished editing yet

automactic avatar Jun 13 '18 18:06 automactic

Yes, you're right @automactic. When I implemented it, it seemed like a good idea to have the snippet generated by Xapian based on the query itself (with search terms highlighted in the search result), but the gains are not so obvious:

  • It takes a long time to generate, as libzim has to extract all the content of the article, parse the HTML to remove every HTML element and give the content to Xapian so that it can generate the snippet.
  • The generated snippet may not be representative of the article content.
  • Snippets for the same article may (will) be different depending on the query.

What @automactic suggests is to have some kind of extract/summary/snippet/introduction/abstract available to display to the user (in search results or elsewhere). I mostly agree with the idea; however, it would not be so easy to implement.

One solution would be to generate it dynamically when needed. There is no extra data to store, but we would have to get all the content, parse it and generate the abstract ourselves (and so it would not greatly help with #146).

Another solution is to store the abstract in the ZIM file. Then comes the question of where. If we want it to be independent of the search (to implement "knowledge glances"), we cannot store it in the Xapian database, so we have to look at the ZIM file format to see where we can store it.

I will spare you my crazy ideas about where we could store it and only explain what seems to be the best solution: we could store the abstract in blobs/clusters like any other article content, and use the article parameter to point to the correct blob/cluster containing the abstract. This would add a little overhead (9 bytes per article that has an abstract, not counting the abstract itself), be fully compatible with older implementations, and store the abstract in an efficient (compressed) manner.
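To give a feel for the pointer overhead, here is a back-of-the-envelope sketch. The 9 bytes per article is the figure quoted above; the article count is a made-up example, not a measurement of any real ZIM file.

```python
POINTER_OVERHEAD_BYTES = 9  # per-article figure quoted in the comment above


def abstract_pointer_overhead(articles_with_abstract: int) -> int:
    """Total pointer overhead in bytes, not counting the abstracts themselves."""
    return articles_with_abstract * POINTER_OVERHEAD_BYTES


# Hypothetical archive in which 6 million articles carry an abstract.
print(abstract_pointer_overhead(6_000_000) / 1_000_000, "MB")  # 54.0 MB
```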

mgautierfr avatar Jul 12 '18 07:07 mgautierfr

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Nov 21 '19 00:11 stale[bot]

This ticket is pretty old, but I guess the need is still here. It is time to refresh it.

I do not support the idea that the current snippet approach is wrong. We have mostly solved the performance problem and it does what it should do, following the right technical approach. If the snippet has to be improved, a ticket should be opened at the Xapian level first.

The need expressed here is different: it asks for an abstract in place of the snippet, arguing that this would be more valuable. The problem is that, for the moment, we have no solution to provide such an abstract. It is also not clear to me how it should work (first sentence, first section, ...).

kelson42 avatar May 02 '21 04:05 kelson42

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Jul 13 '21 00:07 stale[bot]