remark42 Strip query params in passed locator.URL

as #411 indicated, URL params in locator may lead to ambiguity. We can strip it on backed or frontend level. Such action should be optional, but I believe we may want to do it by default. It also should be configurable (i.e. enabled/disabled) per site.

@Reeywhaar - what do you think about the idea? Should we deal with this on your side or better to translate on the backend? To me, it sounds like some param in comment widget (ON by default) can do the trick.

Aug 16 '19 17:08 umputun

theoretically, I can think of use cases with query params essential for the locator, but in practice, I don't see many examples like this

Aug 16 '19 17:08 umputun

Yes, agree that such thing should be optional and on by default. Though sometimes query is significant (however should be not).
I think this should be done on server side, as making it on frontend means that I'll have to trim query, thus, we lose some info that may be useful later.

Aug 16 '19 18:08 Reeywhaar

I'll have to trim query, thus, we lose some info that may be useful later.

Not sure I get it. The goal is to lose some info. Even if UI sets location.url with the full query it won't exist anywhere after server trims it. I can't imagine why we would like to do smth with the full url later. I also think if we want to strip it, I'd rather strip it on the highest level possible, which is the input request UI makes.

Another reason why I think UI is a better side for this - to do it selectively on server-side, i.e. strip for some sites and allow for others, we will need a whole new set of parameters per site. But if this just UI's level "allow_query_keys" (or smth like this) server doesn't need to do anything and will continue to treat incoming locator as a valid/final one.

this solution is somewhat close to the one I proposed in #411 as a hotfix on nginx level. Probably we can leave it on nginx level and properly document the "fix", but in case if remark42 exposed directly it won't work. So, nginx's solution doesn't feel right to me

Aug 16 '19 19:08 umputun

I mean I suggest to store full url, and only trim it on comparison.

I think this option should be per instance of remark, as I think it's rarely used on multiple sites.

How I see it:

User visits https://somesite/somepost?smth=1
JS requests api/v1/find?url=https://somesite/somepost?smth=1
if option enabled backend trim both incoming and stored urls and compares them, if not compares them raw
backend returns result

...

Ok, after some thinking, maybe we should do nothing but add some notes to readme regarding url in remark_config that it should contain canonicalized url?

There is some tick in my head that something is not right, and if we do allow query params then something would inevitably go wrong (like with utm, fbclid params we saw above).

What do you think?

Aug 16 '19 20:08 Reeywhaar

Or maybe we should change default value of url from window.location.href to window.location.origin + window.location.pathname

Aug 16 '19 20:08 Reeywhaar

the idea of storing the full url but matching optionally on trimmed is hard (I'd even say unnecessary hard) to implement. For each storage type, it will be a pain - for bolt instead of direct matching by key, it will have to pull all keys and check one by one or run some prefix scan, for mongo this means some partial match or text search with very problematic indexing.

maybe we should do nothing but add some notes to readme regarding url in remark_config that it should contain canonicalized url?

I don't know if this something users can always do. For example, if they use a third-party site generator or blog engine where query params used for the real navigation to posts.

There is some tick in my head that something is not right, and if we do allow query params then something would inevitably go wrong (like with utm, fbclid params we saw above).

Not sure I understand. Currently, we do allow utm, fbclid and we do see how it may go wrong. I don't think changing the default and stripping all this noise will lead to some unexpected problems, but rather will make users happy ;)

Aug 16 '19 20:08 umputun

Or maybe we should change default value of url from window.location.href to window.location.origin + window.location.pathname

If this will give us "clean" url - sure

Aug 16 '19 20:08 umputun

Ok, let's change it this way. If user want to have query then he will have to parse location manually.

I don't know if this something users can always do. For example, if they use a third-party site generator or blog engine where query params used for the real navigation to posts.

Not sure I understand. I think user that setups remark also has access to blog's template, or how else he will put remark init script? By canonicalized I mean url that has all differentiative properties (in right order, if we talk about query) and nothing else (like trackers params). I mean If user has query scheme for posts (e.g https://smth.com/?id=1) then it's up to him to configure remark initialization url. Anyway, let's change default.

Aug 16 '19 21:08 Reeywhaar

setups remark also has access to blog's template, or how else he will put remark init script?

I see what you mean, but you expect some really smart users :) For instance, I have no clue how to deal with this in case of hugo, probably somehow to pass head's rel=canonical href, not sure. As a simple man, I would just set nginx rewrite for this to convert https://radio-t.com/p/2019/08/13/prep-663/?blah=foo to https://radio-t.com/p/2019/08/13/prep-663/ ;)

Aug 16 '19 21:08 umputun

Haha, I thought of that, and radio-t already handles this well via js, as well as news.radio-t.com

https://github.com/radio-t/radio-t-site/blob/282afc4a946d5aee88b243eb40905e2a85108fae/hugo/src/js/controllers/comments_embed_controller.jsx#L15-L17

Aug 16 '19 21:08 Reeywhaar

to pass head's rel=canonical href

... hmm, good variant by the way! First time I didn't pay attention. May be we should consider to add this to lookup chain so we would have link[rel="canonical"] first and then window.location.origin + window.location.pathname? It starts to be a bit magical, but still very convenient. Or at least add to readme somewhere.

Aug 16 '19 22:08 Reeywhaar

up to you

Aug 17 '19 19:08 umputun

Analyzing how remark API works I noticed one thing that could be related https://radio-t.com/p/2018/09/22/podcast-616/ - original page with comments (can't comment, readonly) https://radio-t.com/p/2018/09/22//podcast-616/ and https://radio-t.com/p/2018/09/22///podcast-616/ - another comments (can't comment, readonly) https://radio-t.com/p/2018/09/22////podcast-616/ - here I can comment (not readonly)

I'm not sure is it remark related problem. Maybe nginx could prevent this, doing redirect from url with several slashes to single

Aug 24 '19 13:08 akosourov

looks like using canonical link will solve it.

Another thing we can do is to handle such multiple slashes on the backend side and replace by a single slash.

btw, it seems to be a problem in almost every commenting system working with sites allowing such multiple slashes in the url.

Aug 24 '19 16:08 umputun

@Reeywhaar & @akosourov - can you guys see a valid use of such urls? I'm thinking of replacing such urls unconditionally, but if you think we need to support it, I could make this a part of the same, optional opt-in parameter we may use for locator's sanitizing on the backend side.

Aug 24 '19 16:08 umputun

pls ignore the part about "we can do is to handle such multiple slashes on the backend side". As discussed earlier it makes sense to handle location urls on the frontend only. So, looks like @Reeywhaar suggestion of " we should consider to add this to lookup chain so we would have link[rel="canonical"] first and then window.location.origin + window.location.pathname" should be sufficient.

Aug 24 '19 17:08 umputun

can you guys see a valid use of such urls?

No, I think urls with multiple slashes are treat as url with single slash so that remark shouldn't distinguish them by default. I mean I've never seen sites where requests to url with single and multiple slashes would return different pages.

Aug 25 '19 05:08 akosourov

FYI: if you can't reproduce the issue with urls like https://radio-t.com/p/2018/09/22////podcast-616/ anymore, this is because of the nginx rewrite I have added

  if ($request_uri ~ "^[^?]*?//") {
      rewrite "^" $scheme://$host$uri permanent;
  }

Aug 25 '19 07:08 umputun

We have yet to receive new reports after the v1.5 release, resolving this issue. Please comment if I'm wrong and you still see it.

Jan 08 '23 09:01 paskal

remark42 remark42 copied to clipboard

Strip query params in passed locator.URL

remark42
remark42 copied to clipboard