REF: Add tags into context

Open ankostis opened this issue 4 years ago • 5 comments

I see that hypothesis source does not harvest tags, correct? Although this is fixable, it made me wonder whether tags should be a new column in promnesia db?

ankostis avatar Feb 15 '21 23:02 ankostis

Yeah, good suggestion! Just adding to context is a quick & easy workaround.

In general there are all kinds of metadata that could be useful, e.g.:

  • org-mode has tags
  • reddit has users/subreddits/etc
  • github (https://github.com/karlicoss/promnesia/issues/194) has tags/repos/users
  • instant messaging is associated with a person
  • tweets have authors

I need to think a bit more about the best way to handle it, because adding stuff to the db is easy. The interesting question is what to do with it in the UI, although just displaying it could be a good start. In the future it would also be interesting to filter/search by metadata, but that's the hard part since it involves frontend changes :)

karlicoss avatar Feb 16 '21 02:02 karlicoss

Just made PR #199 to include a new context line with (hash-)tags in Hypothesis rows; the same can be done with github (labels) and gmail (labels).
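As a rough illustration of the idea (not the code from the PR), appending tags to the context could look like the sketch below. The function name `context_with_tags` and the one-line, space-separated tag format are assumptions made for this example only.

```python
from typing import Optional, Sequence


def context_with_tags(context: Optional[str], tags: Sequence[str]) -> Optional[str]:
    """Append one extra line of '#'-prefixed tags to the context text.

    Hypothetical helper: the name and the tag-line format are assumptions.
    """
    if not tags:
        # Nothing to add, keep the context exactly as it was.
        return context
    tag_line = " ".join(f"#{tag}" for tag in tags)
    if not context:
        # No annotation text at all: the tag line becomes the whole context.
        return tag_line
    return f"{context}\n{tag_line}"


example = context_with_tags("my annotation", ["freedom", "privacy"])
```

Because the tags end up in the context column, the existing substring search over context would pick them up without any schema change.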

BUT when trying to search by a hash-prefixed tag e.g. #freedom, the extension takes too much time to respond, and this could be an escaping issue (and probably a security vulnerability?).

These are the logs when searching freedom:

[INFO    2021-02-16 13:36:42 promnesia.server server.py:270] /search freedom
[INFO    2021-02-16 13:36:42 promnesia.server server.py:164] url: freedom
[INFO    2021-02-16 13:36:42 promnesia.server server.py:167] normalised url: freedom
[DEBUG   2021-02-16 13:36:42 promnesia.server server.py:180] query: SELECT visits.norm_url, visits.orig_url, visits.dt, visits.locator_title, visits.locator_href, visits.src, visits.context, visits.duration 
    FROM visits 
    WHERE (visits.norm_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.orig_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.context LIKE '%' || ? || '%' ESCAPE '/') OR (visits.locator_title LIKE '%' || ? || '%' ESCAPE '/')
[DEBUG   2021-02-16 13:36:43 promnesia.server server.py:193] got 57 visits from db
[DEBUG   2021-02-16 13:36:43 promnesia.server server.py:204] responding with 57 visits
127.0.0.1 - - [16/Feb/2021 13:36:43] "POST /search HTTP/1.1" 200 30282

While this is when prefixed with hash(#), where all my visits are returned:

[INFO    2021-02-16 13:38:46 promnesia.server server.py:270] /search #freedom
[INFO    2021-02-16 13:38:46 promnesia.server server.py:164] url: #freedom
[INFO    2021-02-16 13:38:46 promnesia.server server.py:167] normalised url: 
[DEBUG   2021-02-16 13:38:46 promnesia.server server.py:180] query: SELECT visits.norm_url, visits.orig_url, visits.dt, visits.locator_title, visits.locator_href, visits.src, visits.context, visits.duration 
    FROM visits 
    WHERE (visits.norm_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.orig_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.context LIKE '%' || ? || '%' ESCAPE '/') OR (visits.locator_title LIKE '%' || ? || '%' ESCAPE '/')
[DEBUG   2021-02-16 13:38:46 promnesia.server server.py:193] got 13035 visits from db
[DEBUG   2021-02-16 13:38:46 promnesia.server server.py:204] responding with 13035 visits
127.0.0.1 - - [16/Feb/2021 13:38:46] "POST /search HTTP/1.1" 200 7632748

ankostis avatar Feb 16 '21 11:02 ankostis

Thanks, the change looks good!

So I think I know why this happens. Here's the bit of code responsible for searching stuff: https://github.com/karlicoss/promnesia/blob/aded41c5271c57d704f721d48670ee5135cc5f7c/src/promnesia/server.py#L271-L280

As you can see, it searches in several different columns (and, somewhat confusingly, the parameter name is 'url' even though it really means anything you typed in the search box).

But it delegates to search_common to avoid code duplication between endpoints: https://github.com/karlicoss/promnesia/blob/master/src/promnesia/server.py#L160

However, this function also normalizes the url passed to it: https://github.com/karlicoss/promnesia/blob/aded41c5271c57d704f721d48670ee5135cc5f7c/src/promnesia/server.py#L164-L167

In particular, at the moment that means stripping out the 'fragment' part of the URL (there is a plan to do something smarter, but it is still a work in progress: https://github.com/karlicoss/promnesia/blob/aded41c5271c57d704f721d48670ee5135cc5f7c/tests/cannon.py#L127-L142 ). So as a result, when you search for #freedom, it ends up normalizing this to an empty string, and this results in matching against the whole database.
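The effect can be reproduced outside promnesia. The sketch below is not the project's code: it uses the standard library's `urllib.parse.urlsplit` as a stand-in for the real normaliser, and an in-memory SQLite table with a made-up two-column `visits` schema. It only illustrates why an empty search term matches every row.

```python
import sqlite3
from urllib.parse import urlsplit

# '#freedom' parses as a URL that consists of nothing but a fragment.
parts = urlsplit("#freedom")
# Stand-in for normalisation (assumption): drop the fragment and rebuild the URL.
without_fragment = parts._replace(fragment="").geturl()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (norm_url TEXT, context TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("example.com/a", "tags: #freedom"),
        ("example.com/b", "unrelated note"),
        ("example.org/c", ""),
    ],
)


def count_matches(term: str) -> int:
    # Same LIKE shape as the logged query, reduced to a single column.
    query = "SELECT COUNT(*) FROM visits WHERE norm_url LIKE '%' || ? || '%' ESCAPE '/'"
    return conn.execute(query, (term,)).fetchone()[0]


empty_term_matches = count_matches(without_fragment)
total_rows = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
```

With an empty term the pattern collapses to `'%%'`, which matches any non-NULL value, so `empty_term_matches` equals `total_rows`.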

So I guess there are several things we could do here

  • non-fix: if you search for freedom without the hash, it should work straightaway :) However, the downside is that it would also find all URLs that contain freedom as a substring, for example. And yeah, it's not intuitive that freedom normalizes to freedom even though it's not a valid URL, but I'm not sure that was intended; it's a good test case to think about, though.

More reasonable solutions:

  • maybe if normalisation results in empty string, it shouldn't try searching url. However one might want to retrieve their whole database (especially when some kind of pagination is implemented?).
  • maybe the search_common function should guess that the thing you're passing is not a proper URL and not try searching in it. However there are valid cases when you might type some approximate domain name and try to find.. so not sure

This would probably be easier if we had a richer search interface (so e.g. you could tick whether you want to search context/url/etc).

Either way, I'm happy to merge your PR if you are, the issues should be worked around separately.

karlicoss avatar Feb 16 '21 18:02 karlicoss

I would prefer that the PR is self-contained and works as expected. If you don't mind, I would add a commit to try workaround (1): if the original query string is not empty and the canonized one is, use the original.
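A minimal sketch of that fallback, under stated assumptions: `effective_query` and `drop_fragment` are hypothetical names, and `drop_fragment` only imitates the fragment-stripping behaviour described earlier in the thread, not the real normaliser.

```python
from typing import Callable


def effective_query(original: str, normalise: Callable[[str], str]) -> str:
    """Return the normalised query, unless normalisation erased a non-empty input."""
    normalised = normalise(original)
    if original and not normalised:
        # Normalisation swallowed the whole input (e.g. '#freedom'),
        # so fall back to what the user actually typed.
        return original
    return normalised


def drop_fragment(url: str) -> str:
    # Hypothetical stand-in for the real normaliser: drops everything from '#' onwards.
    return url.split("#", 1)[0]


tag_query = effective_query("#freedom", drop_fragment)
url_query = effective_query("example.com/page#top", drop_fragment)
```

Passing the normaliser in as a parameter keeps the fallback independent of whichever canonisation the server ends up using; an empty original query still stays empty, so retrieving the whole database remains possible.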

ankostis avatar Feb 17 '21 09:02 ankostis

Actually, the same has to happen in the extension's JS code [edit:] for bookmarks & history: https://github.com/karlicoss/promnesia/blob/ae0ce944476dcf9fd51aecf95670c8ac9038691b/extension/src/api.js#L38

We also need to trim the URL before searching; would there be any problem with that?

ankostis avatar Feb 17 '21 10:02 ankostis