
Remove garbage fields matching these... (from Pure-URL addon)

Open rieje opened this issue 9 years ago • 14 comments

Pure-URL addon lets you remove garbage fields simply by specifying the strings you would like to remove from the link, separated by commas. By default, it removes the following garbage fields:

utm_source, utm_medium, utm_term, utm_content, utm_campaign, utm_reader, utm_place, ga_source, ga_medium, ga_term, ga_content, ga_campaign, ga_place, yclid, _openstat, , fb_action_ids, fb_action_types, fb_ref, fb_source, action_object_map, action_type_map, action_ref_map, , , , ,

How can I configure Clean Links to block all of these as well? Do I have to use regex rules and if so, has anyone been able to convert the garbage fields from Pure-URL to rules for Clean Links?
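One way to approach the conversion is to join the field names into a single regex alternation. The snippet below is a minimal sketch, assuming Clean Links accepts a pattern whose alternatives are separated by `|`; the helper name `fields_to_pattern` is made up for illustration, and the empty entries in the list above are simply skipped.

```python
import re

# Field names copied from the Pure-URL default list quoted above
# (empty entries in that list are skipped).
PURE_URL_FIELDS = """utm_source, utm_medium, utm_term, utm_content, utm_campaign,
utm_reader, utm_place, ga_source, ga_medium, ga_term, ga_content, ga_campaign,
ga_place, yclid, _openstat, fb_action_ids, fb_action_types, fb_ref, fb_source,
action_object_map, action_type_map, action_ref_map"""


def fields_to_pattern(fields: str) -> str:
    """Turn a comma-separated field list into a regex alternation (hypothetical helper)."""
    names = [name.strip() for name in fields.split(",") if name.strip()]
    # re.escape guards against names that contain regex metacharacters.
    return "|".join(re.escape(name) for name in names)


pattern = fields_to_pattern(PURE_URL_FIELDS)
print(pattern)
```

The result is a long but literal list; it could probably be shortened with wildcards such as `utm_\w+`, at the cost of matching more than the original names.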

Thanks.

P.S. I noticed Clean Links whitelists a lot of domains, including www.facebook.com. I was wondering whether these sites are likely to break if I removed them from the whitelist (i.e. had their links cleaned as well), because apparently sites like Facebook are massive offenders when it comes to "dirty" links. If they do break, in what way?

rieje avatar Jun 26 '16 02:06 rieje

I do not understand regex at all.

I came up with this string, testing with http://regexr.com/

(?:ref|aff)\w*|ga_\w+|fb_\w+|\w*utm_\w+|(?:merchant|programme|media|)yclid|_openstat|action_object_map|action_type_map|action_ref_map|ID

Which is likely to be completely wrong ! :)

although it did appear to catch the extra PL tags. I wish it was easier to do.

GitCurious avatar Jun 30 '16 11:06 GitCurious

Thanks, I will give it a try for a while and see how it goes--hope others can do the same and report back or improve on it if necessary :)

rieje avatar Jul 06 '16 06:07 rieje

I think this should work:

(?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID|ga_\w*|fb_\w*|yclid\w*|action_\w*|_openstat\w*

or

(?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID|(?:ga_|fb_|yclid|action_|_openstat)\w*

regex     short explanation

(?:xxx)   non-capturing group
xxx       literal match > remove
\w*       zero or more word characters (letters, digits, underscore) > remove
\w+       one or more word characters > remove
\w{x}     exactly x word characters > remove
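The difference between the quantifiers is easiest to see by trying them on a few names. This is a small sketch in Python, whose `re` syntax agrees with these patterns; the parameter names are made up, and `re.fullmatch` is used only for illustration, since the thread does not say how Clean Links anchors the pattern internally.

```python
import re

# Hypothetical parameter names, used only to show what each quantifier accepts.
names = ["ref", "ref_src", "utm_", "utm_source", "abc", "abcd"]

checks = {
    r"(?:ref|aff)\w*": "prefix plus zero or more word characters",
    r"utm_\w+": "prefix plus at least one word character",
    r"\w{3}": "exactly three word characters",
}

for pattern, meaning in checks.items():
    matched = [n for n in names if re.fullmatch(pattern, n)]
    print(f"{pattern:<16} {meaning}: {matched}")
```

Note that `utm_` on its own is not matched by `utm_\w+`, because `+` requires at least one more character after the prefix.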

P.S. @diegocr Please give us bigger text fields in the settings!


edit Some additional tracking keywords:

Amazon : ascsubtag|~~qid|~~bbn|tag|pf_|SubscriptionId|linkCode|camp|creative|creativeASIN

I noticed that Yahoo uses qid in Yahoo! Answers to identify the individual questions, but it also uses its own tracking ids gprid|pvid. Does anyone know what Amazon uses qid for?

Answer from stackoverflow

qid=1387193124 is the Unix timestamp at which the URL was generated, in this case December 16th, 2013 at 11:25:24 GMT
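To check which date a given qid corresponds to, the value can be converted with a few lines of Python. This is only a sketch, and it assumes the value really is a Unix timestamp in seconds, as the quoted answer says.

```python
from datetime import datetime, timezone

qid = 1387193124  # value from the quoted answer

# Interpret the number as seconds since 1970-01-01 00:00:00 UTC.
generated = datetime.fromtimestamp(qid, tz=timezone.utc)
print(generated.isoformat())
```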

I think in this case qid is not relevant.

YouTube : feature

geokis avatar Jul 28 '16 22:07 geokis

Hi guys. Could you please point out why CleanLinks does not remove "?ws_ab_test..." and what can be done to fix this? Thank you

codeshark1 avatar Aug 09 '16 14:08 codeshark1

An example link would be useful, e.g.:

https://de.aliexpress.com/item/TK1327/32409702785.html?ws_ab_test=searchweb201556_10,searchweb201602_5_10057_10056_10055_10049_10017_405_404_10059_10058_10040_10060,searchweb201603_7&btsid=190ca3d2-af2b-411f-abd2-2d035219b767

Try: ws_\w* To clean the link completely, you need to add btsid\w* as well.

Don't forget the delimiter |, e.g. ws_\w*|btsid\w*
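The effect of the two rules on the example link can be tried outside the browser. The sketch below is not how Clean Links is implemented; it assumes a rule removes every query parameter whose name matches the pattern, and `strip_params` is a hypothetical helper.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, pattern: str) -> str:
    """Drop every query parameter whose name fully matches `pattern`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not re.fullmatch(pattern, name)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = (
    "https://de.aliexpress.com/item/TK1327/32409702785.html"
    "?ws_ab_test=searchweb201556_10,searchweb201602_5_10057_10056_10055"
    "_10049_10017_405_404_10059_10058_10040_10060,searchweb201603_7"
    "&btsid=190ca3d2-af2b-411f-abd2-2d035219b767"
)

print(strip_params(url, r"ws_\w*"))           # btsid is still present
print(strip_params(url, r"ws_\w*|btsid\w*"))  # the query string is gone entirely
```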


edit I would appreciate a user-maintained database of trash/tracking link parameters. I think this thread is not a bad start; maybe @rieje can edit the opening post with the new links.

geokis avatar Aug 10 '16 14:08 geokis

We do need a trash/tracking database +1

GitCurious avatar Aug 10 '16 15:08 GitCurious

@geokis Thanks, but I'm a total noob when it comes to regular expressions and stuff like that... My "remove from links" field reads like this (it's the add-on default):

(?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID

I added ws_\w*|btsid\w* to the end, like this:

(?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID|ws_\w*|btsid\w*

And apparently it did nothing.

codeshark1 avatar Aug 14 '16 23:08 codeshark1

Did you check the "Use HTTP Observer" option in the settings panel ?

If I UNcheck that - then the example link above is not cleaned

GitCurious avatar Aug 15 '16 06:08 GitCurious

@GitCurious I tried with both checked and unchecked HTTP Observer, and still no luck...

codeshark1 avatar Aug 15 '16 12:08 codeshark1

@codeshark1:

I added ws_\w*|btsid\w* to the end, like this (?:ref|aff)\w*|utm_\w+|(?:merchant|programme|media)ID|ws_\w*|btsid\w* ...

This should work.

Alternatively, you can try this:

(?:ref|aff|ws_|btsid)\w*|utm_\w+|(?:merchant|programme|media)ID

but it is pretty much the same.

Both work for me. In the settings all checkboxes are active, except:

ignore no-https links

Maybe you have an entry under [Skip Links matching with:] that contains item? If so, remove it and try again.


edit

From: #140

Yahoo Search Image: _yl\w*\w*|gprid\w*|pvid\w*

or

(?:_yl\w*|\w*id)\w*

\w*id would match every parameter ending in id, like xxxid (xxx = any number of characters). This caused problems with other sites too often. Stick with this instead:

(?:_yl\w*|gprid|pvid)\w*
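Why the broad version causes trouble becomes clear when both patterns are tried against a handful of names. This is a sketch only: the names are made up (two Yahoo trackers plus a few that a site might genuinely need), and `re.fullmatch` stands in for however Clean Links applies the rule.

```python
import re

# Hypothetical parameter names for illustration.
names = ["gprid", "pvid", "_ylt", "id", "videoid", "sessionid"]

broad = r"(?:_yl\w*|\w*id)\w*"
narrow = r"(?:_yl\w*|gprid|pvid)\w*"

for label, pattern in (("broad", broad), ("narrow", narrow)):
    print(label, [n for n in names if re.fullmatch(pattern, n)])
```

The broad pattern also catches `id`, `videoid` and `sessionid`, which are the kind of parameters a site may depend on.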


edit2

Google Search: (?:gs_l|gclid|ei)\w*

Google SearchImage: ved

ebay: (?:_qi|clk_rvr_id|_trk\w*)\w*

affilinet: subid

geokis avatar Aug 15 '16 13:08 geokis

@geokis

Can you help me understand why the "groups" are not written like this:

(?:ref|aff|merchant|programme|media|ga_|fb_|yclid|action_|_openstat) ......

In just one single group...rather than separate groups ?

GitCurious avatar Aug 16 '16 06:08 GitCurious

Hi @GitCurious

(?:merchant|programme|media)ID

This expression will only match this kind of structure:

  • merchantID
  • programmeID
  • mediaID

Caution: CL is case-sensitive, so id ≠ ID

If you use this kind of expression:

(?:merchant|programme|media)\w*

It will match:

  • merchantxxx
  • programmexxx
  • mediaxxx

(xxx = any characters/numbers)

So the difference is that the ID expression is more specific and less likely to interfere with similarly named parameters in the URL. Both would work.

E.g.: My suggestion in the post above: gprid|pvid would also work as (?:gpr|pv)id and it is more specific than \w*id
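A quick comparison of the specific and the loose form, again as a sketch with made-up parameter names. It relies on Python's `re` being case-sensitive by default, which mirrors the case-sensitivity mentioned above.

```python
import re

specific = r"(?:merchant|programme|media)ID"
loose = r"(?:merchant|programme|media)\w*"

# Hypothetical parameter names; columns are: name, specific match, loose match.
for name in ["merchantID", "merchantid", "mediaType"]:
    print(name, bool(re.fullmatch(specific, name)), bool(re.fullmatch(loose, name)))

# The two spellings from the post accept exactly the same names.
for name in ["gprid", "pvid", "gpr", "userid"]:
    assert bool(re.fullmatch(r"gprid|pvid", name)) == bool(re.fullmatch(r"(?:gpr|pv)id", name))
```

Only `merchantID` passes the specific pattern, while the loose one accepts all three names.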

geokis avatar Aug 16 '16 19:08 geokis

Thank you geokis

I must be as dumb as a bag of rocks.

I did not see there was no delimiter between the closing brace and the "ID" so that explains it exactly....

so I have learned a little bit more about regex formatting :)

GitCurious avatar Aug 16 '16 20:08 GitCurious

@geokis Have you found a rule that you've stuck with that is more restrictive than the default settings and is a good balance between "cleanliness" and usability? I actually know some regex to make my own rules but the issue is I have almost zero knowledge of what kind of attributes can be blocked with no issues and what kind are essential which when removed will break functionality of the site (like shopping cart for amazon.com, for example).

If you have a more restrictive rule you can share then it would simply be a matter of using it and removing problematic ones if an issue is encountered.

rieje avatar Jan 31 '17 01:01 rieje