
Link duplication: should have the possibility to ignore some websites in the dup check

Open · julien-vu opened this issue 2 years ago · 6 comments

When I add multiple videos from YouTube, for example, I get a duplicate warning even though they are different videos.

[Screenshot from 2021-07-07 showing the duplicate link warning]

It should be possible to ignore some websites via an ignore list.

julien-vu avatar Jul 07 '21 12:07 julien-vu

The problem with YouTube links is that they have the video ID as a query parameter, and query parameters are currently ignored when links are checked for duplicates. I have to check a few URLs and think about a proper solution. Some sort of whitelist might be a good idea.
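For illustration only (the helper and the host list below are hypothetical, not existing LinkAce code), such a whitelist could boil down to a host check that decides whether the query string has to be kept for the duplicate comparison:

// Hosts whose query string identifies the actual content and therefore
// must not be dropped when checking for duplicates.
private const QUERY_SENSITIVE_HOSTS = ['youtube.com', 'www.youtube.com', 'm.youtube.com'];

private function keepQueryForDuplicateCheck(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);

    return is_string($host)
        && in_array(strtolower($host), self::QUERY_SENSITIVE_HOSTS, true);
}

The list itself could live in the settings so users can extend it.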

Kovah avatar Aug 08 '21 21:08 Kovah

An editable whitelist of sites/domains that rely on query params would make sense (pre-populated with YouTube and others).

oscarfroberg avatar Dec 27 '21 23:12 oscarfroberg

Quite a lot of websites use query parameters for page differentiation, including some of my own, but also popular third-party ones like Hacker News and, uh, a niche little place called YouTube indeed :). I haven't run into this myself yet, but I am surprised to see that the query string is ignored entirely.

Might it make more sense to instead do the duplicate check on the whole URL, except perhaps a pre-populated but configurable list of dummy parameters such as utm_*?
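A minimal sketch of that approach (the helper name and the parameter patterns are made up for illustration; the real list would be pre-populated but configurable):

function stripIgnoredParams(string $url): string
{
    // Parameters that never change the page content.
    $ignoredPatterns = ['/^utm_/', '/^fbclid$/', '/^gclid$/'];

    $parts = parse_url($url);
    if (!isset($parts['query'])) {
        return $url;
    }

    parse_str($parts['query'], $params);
    foreach (array_keys($params) as $name) {
        foreach ($ignoredPatterns as $pattern) {
            if (preg_match($pattern, (string) $name)) {
                unset($params[$name]);
                break;
            }
        }
    }

    // Rebuild the URL with the cleaned query string (fragments left out for brevity).
    // Note that http_build_query() re-encodes the parameters, so both sides of the
    // comparison need to go through the same normalisation.
    $base = strtok($url, '?');
    $query = http_build_query($params);

    return $query === '' ? $base : $base . '?' . $query;
}

The duplicate check would then compare stripIgnoredParams($newUrl) against the stripped form of each stored URL, so ?v=VIDEO_ID survives while utm_* and friends do not.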

chrissawyerfan4 avatar May 19 '23 01:05 chrissawyerfan4

I am using this in my fork: I strip out all utm parameters and do a full comparison:

public function searchDuplicateUrls(): Collection
{
    $parsed = parse_url($this->url);

    if (!isset($parsed['host'])) {
        return new Collection();
    }

    $auth = $parsed['user'] ?? '';
    $auth .= isset($parsed['pass']) ? ':' . $parsed['pass'] : '';

    // Rebuild the URL from its components, including the query string, so
    // that the utm stripping and the full comparison below can act on it.
    $uri = $parsed['scheme'] . '://';
    $uri .= $auth ? $auth . '@' : '';
    $uri .= $parsed['host'] ?? '';
    $uri .= isset($parsed['port']) ? ':' . $parsed['port'] : '';
    $uri .= $parsed['path'] ?? '';
    $uri .= isset($parsed['query']) ? '?' . $parsed['query'] : '';

    // Strip all utm_* tracking parameters before comparing.
    $pattern = "/utm_.*?=.*?(&|$)/";
    $replacement = "";
    $uri_utm_free = preg_replace($pattern, $replacement, $uri);

    // Trim a dangling ? or & that may be left over after stripping.
    return self::where('id', '<>', $this->id)
        ->where('url', '=', trim($uri_utm_free, '?&'))
        ->get();
}
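For illustration, with the query string included in the rebuilt $uri, the regex behaves like this (a quick standalone check, not part of the fork; the URL is made up):

$uri = 'https://www.youtube.com/watch?utm_source=newsletter&v=abc123';
echo preg_replace("/utm_.*?=.*?(&|$)/", "", $uri);
// prints: https://www.youtube.com/watch?v=abc123

So the video ID survives the comparison while the tracking parameter is dropped.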

sergiorgiraldo avatar Sep 25 '23 13:09 sergiorgiraldo

Thanks for that @sergiorgiraldo, I was going to incorporate that into my local custom version as well but then noticed that this might not be the optimal solution. You triggered me to look into it further :)

I started wondering: why is this function parsing and rebuilding the URL at all? We want the query part to be considered; the ?v=VIDEO_ID needs to be included in the comparison. We could add the query part to the rebuilding, but at that point we would end up with the same URL that we started with! I think we can just remove all that code. So I did:

  • app/Models/Link.php patch
     public function searchDuplicateUrls(): Collection
     {
-        $parsed = parse_url($this->url);
-
-        if (!isset($parsed['host'])) {
-            return new Collection();
-        }
-
-        $auth = $parsed['user'] ?? '';
-        $auth .= isset($parsed['pass']) ? ':' . $parsed['pass'] : '';
-
-        $uri = $auth ? $auth . '@' : '';
-        $uri .= $parsed['host'] ?? '';
-        $uri .= isset($parsed['port']) ? ':' . $parsed['port'] : '';
-        $uri .= $parsed['path'] ?? '';
-
         return self::where('id', '<>', $this->id)
-            ->where('url', 'like', '%' . trim($uri, '/') . '%')
+            ->where('url', 'like', trim($this->url, '?/') . '%')
             ->get();
     }

Personally, I would already strip off any utm_whatevers manually. That's not the only tracking parameter that exists, or the only parameter that can be removed. In fact, I would also remove the index.htm from a link like example.org/index.htm because that's the default anyway and it might change in the future (maybe they start using PHP and move the code to index.php), breaking the link even if the homepage is still there! There are a lot of canonicalisation rules that we could come up with. (I realise that I am kind of contradicting my previous comment here. I wouldn't be opposed to a configurable list, either, but a hardcoded regex seems a bit too complicated.)

However, we do want to check for small variations that are likely duplicates. Typically, a trailing slash or question mark makes no difference. The path example.org/cookies/ is usually the same as example.org/cookies, so a duplication warning is in order, but you often cannot just use either variant, so we cannot blindly trim the URL and store the trimmed version in the database. This means that, to find duplicates, the SQL query would need to apply the same transformation to all values in the database and then do the comparison, which makes the query O(n) instead of using an index at O(log n) or so. I'm sure it's premature optimisation to take this into consideration when my database will not have more than a thousand records any time soon, but beyond being bad practice, it also makes the code more complicated (probably we'd need custom SQL? I don't know Laravel well enough off the cuff).

So my solution is to query for the prefix, with the characters stripped and a wildcard afterwards, and then do the full comparison in PHP. Instead of scanning a thousand records at O(n), we use an indexed lookup and only loop over the records that match. For my use case, that is a handful of links at most. (However, this approach might pose a problem if adversaries can add links, making this result huge and causing resource exhaustion for other tenants on your system.)

Essentially, we need to store example.org/cookies/ if that is what the user input (so including the trailing slash), but at the same time, the code needs to warn for example.org/cookies yet not for example.org/cookies/chocolate.

This is the patch that I came up with:

  • app/Http/Controllers/Models/LinkController.php (Maybe I should actually check more characters like &, #, and =. For now, it just does / and ?.)
         $duplicates = $link->searchDuplicateUrls();
         if ($duplicates->isNotEmpty()) {
             $msg = trans('link.duplicates_found');
 
+            $anyDuplicates = false;
             foreach ($duplicates as $duplicateLink) {
+                // the query locally trims off any /? and then does a LIKE "$trimmedurl%" search
+                // so any results in the database might be "$trimmedurl/anotherpage/"
+                // so check here if the links are actually *equal* after trimming both and ignore prefix matches
+                if (trim($duplicateLink->url, '/?') !== trim($link->url, '/?')) {
+                    continue;
+                }
+                $anyDuplicates = true;
                 $msg .= sprintf(
                     ' <a href="%s">%s</a>,',
                     route('links.show', [$duplicateLink->id]),
                     $duplicateLink->shortUrl()
                 );
             }
 
-            flash(trim($msg, ','), 'warning');
+            if ($anyDuplicates) {
+                flash(trim($msg, ','), 'warning');
+            }
         }

While testing whether this works, I noticed a bug where you can insert data with leading or trailing whitespace into the database, which is just plain ugly and not a correct URL. There is no unique key in the database (imo that would be a good addition), but the unique rule applied in Laravel ignores the whitespace, so Laravel still notices the constraint violation and this bug does not affect duplicate detection. Still, to keep that whitespace from ending up in the database, this patch can be used:

  • app/Http/Requests/Models/LinkStoreRequest.php
+    public function all($keys = null) {
+        // from https://laradevtips.com/2016/06/26/how-to-trim-request-input-data-in-laravel/
+        // but it seems our newer(?) Laravel needs a $keys argument to be supported, defaulting to null
+        $data = parent::all($keys);
+
+        // guard against requests where no url field is present at all
+        if (isset($data['url'])) {
+            $data['url'] = trim($data['url']);
+        }
+
+        return $data;
+    }

This function gets called when app/Http/Controllers/Models/LinkController.php:store() calls LinkRepository::create($request->all(), true). Apparently, Laravel calls the store() method in a controller when you POST to a route assigned to that controller, so that's how the code gets triggered. Come to think of it, this might not get triggered when editing a link; I have not tried that yet.

Summing up, these patches:

  • strip leading and trailing whitespace from URLs before they are added to the database or used for duplicate detection
  • should detect, e.g.:
    • example.org/cookies as a duplicate of example.org/cookies/
    • example.org/cookies/ as a duplicate of example.org/cookies
    • example.org/cookies/? as a duplicate of example.org/cookies
  • should ignore, that is, not issue warnings for:
    • example.org/cookies/?flavor=chocolate as a duplicate of example.org/cookies/?flavor=triplechoc
    • example.org/cookies/2 as a duplicate of example.org/cookies/

I should probably add tests for this...
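For what it's worth, a first test might look roughly like this (a sketch only, assuming a Link model factory and a standard Laravel database test setup; not taken from LinkAce's actual test suite):

public function testTrailingSlashVariantIsReportedAsDuplicate(): void
{
    // An existing link with a trailing slash...
    $existing = Link::factory()->create(['url' => 'https://example.org/cookies/']);

    // ...should be found when the same URL without the slash is checked.
    $candidate = Link::factory()->create(['url' => 'https://example.org/cookies']);

    $duplicateIds = $candidate->searchDuplicateUrls()->pluck('id');

    $this->assertTrue($duplicateIds->contains($existing->id));
}

The "should ignore" cases would need a test at the controller level, since the PHP-side exact comparison that filters out prefix matches lives there.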

Another nice-to-have would be detecting when you add the http: version of a link while you already have the https: version. I guess that could be done by replacing @^https://@ with http:// (and vice versa) and adding that as an OR clause to the query in searchDuplicateUrls()?
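Something along these lines, maybe (a sketch against the patched searchDuplicateUrls() above, untested; the controller-side exact comparison would then also have to ignore the scheme difference):

public function searchDuplicateUrls(): Collection
{
    $trimmed = trim($this->url, '?/');

    // Build the same URL with the other scheme, so that the http:// variant
    // of an already stored https:// link (and vice versa) is also found.
    if (str_starts_with($trimmed, 'https://')) {
        $otherScheme = 'http://' . substr($trimmed, strlen('https://'));
    } else {
        $otherScheme = preg_replace('@^http://@', 'https://', $trimmed);
    }

    return self::where('id', '<>', $this->id)
        ->where(function ($query) use ($trimmed, $otherScheme) {
            $query->where('url', 'like', $trimmed . '%')
                ->orWhere('url', 'like', $otherScheme . '%');
        })
        ->get();
}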

chrissawyerfan4 avatar Oct 01 '23 01:10 chrissawyerfan4

As you can both see, it's a bit more complicated to get this right. 😅 The big problem is that query parameters can be critical to identify unique pages, but they can also be useless trash like utm parameters or stuff like fbclid. While I like the solution you suggested, sites like YouTube would drown in missed duplicates because they need query params. Not knowing which query params are needed and which are trash makes this nearly impossible to get right with this solution.

The current solution ignores query params, which is obviously not correct. But I can't think of any better solution at the moment.

Kovah avatar Oct 03 '23 08:10 Kovah