
Move configuration out of source into a data file

Open wumpus opened this issue 6 years ago • 4 comments

Right now your configuration list of CGI args is expressed as source. This certainly gets the job done, but it might be a lot cleaner to move the list into a data file. It would then be easier to share the list among projects, like my crawler in Python (hidden agenda alert :-))

For a format, it could be YAML, or this is a simpler way:

# name
# example
prefix name
prefix name

so

# Adobe ColdFusion
# https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155
CF ID
CF TOKEN

In this example 'CF' is what your code calls the ROOT.

This format also makes diffs more useful, in that each line carries the full meaning. So if I see this line in a diff

+utm_ expid

I know at a glance the full name of the CGI arg that will be matched.
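
A minimal sketch of how a consumer such as a Python crawler might read this format, assuming whitespace-separated `prefix name` lines and `#` comment lines as described above. The names `parse_tracker_list` and `full_names` are hypothetical and not part of url-tracking-stripper.

```python
from typing import Dict, List

# Sample data in the proposed format, taken from the example above.
SAMPLE = """\
# Adobe ColdFusion
# https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155
CF ID
CF TOKEN
"""


def parse_tracker_list(text: str) -> Dict[str, List[str]]:
    """Group names by prefix; blank lines and '#' comment lines are skipped."""
    groups: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'prefix name', got: {raw!r}")
        prefix, name = parts
        groups.setdefault(prefix, []).append(name)
    return groups


def full_names(groups: Dict[str, List[str]]) -> List[str]:
    """Join each prefix with each of its names to get the full CGI arg names."""
    return [prefix + name for prefix, names in groups.items() for name in names]


groups = parse_tracker_list(SAMPLE)
print(groups)              # {'CF': ['ID', 'TOKEN']}
print(full_names(groups))  # ['CFID', 'CFTOKEN']
```

Keeping the prefix and the name on the same line means the grouped form (prefix plus a list of names) can still be rebuilt at load time, so the data file does not force any particular matching algorithm.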

wumpus avatar May 16 '18 20:05 wumpus

I'm very likely going to move away from the regular expression param replacement approach to something else shortly, so that would be a good time not only to move away from this format, but also to pull the list out of the code and into its own home. I will keep that in mind during the change. Thanks!

newhouse avatar May 16 '18 20:05 newhouse

So I tested moving from a RegEx method of query param removal to a URL parse method that uses searchParams.delete(), but it was roughly 10x slower, so I don't think that's a good option. 😢
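
For readers unfamiliar with the URL parse approach, here is a rough Python analogue using the standard `urllib.parse` module. It is only a sketch of the idea, not the extension's JavaScript code, and says nothing about the timings measured above; `strip_params` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, tracked: set) -> str:
    """Parse the URL, drop query params whose name is in `tracked`, rebuild it."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracked
    ]
    # urlencode re-encodes the surviving pairs, so their percent-encoding
    # may differ from the input even though the values are the same.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155&page=2"
print(strip_params(url, {"CFID", "CFTOKEN"}))  # https://techcrunch.com/?page=2
```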

Flattening out the regular expressions (and not using root + suffixes.join('|')) looks to be about 3x slower. I don't want to give up any performance, but in the name of supporting custom user-entered trackers in the future, I may still have to go that route. Stay tuned.
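
To make the two regex shapes concrete, here is a small Python sketch, assuming the list is held as a mapping from root to suffixes. It is an illustration rather than the extension's actual code; `grouped_pattern` and `flat_pattern` are hypothetical names, and the sample entries come from the examples earlier in this thread.

```python
import re
from typing import Dict, List

# Root -> suffixes, using names mentioned earlier in this thread.
GROUPS: Dict[str, List[str]] = {"CF": ["ID", "TOKEN"], "utm_": ["expid"]}


def grouped_pattern(groups: Dict[str, List[str]]) -> "re.Pattern[str]":
    """One alternative per root: root + (suffix1|suffix2|...)."""
    alternatives = [
        re.escape(root) + "(?:" + "|".join(re.escape(s) for s in suffixes) + ")"
        for root, suffixes in groups.items()
    ]
    return re.compile("(?:" + "|".join(alternatives) + ")")


def flat_pattern(groups: Dict[str, List[str]]) -> "re.Pattern[str]":
    """One alternative per full parameter name, with no shared roots."""
    names = [root + s for root, suffixes in groups.items() for s in suffixes]
    return re.compile("(?:" + "|".join(re.escape(n) for n in names) + ")")


grouped = grouped_pattern(GROUPS)
flat = flat_pattern(GROUPS)

# Both patterns accept exactly the same parameter names.
for name in ["CFID", "CFTOKEN", "utm_expid", "page", "CFIDX"]:
    assert bool(grouped.fullmatch(name)) == bool(flat.fullmatch(name))
```

The flat form is the one that maps directly onto a one-name-per-line data file or user-entered trackers; the grouped form can still be derived from the same data by collecting names under their shared root.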

newhouse avatar May 19 '18 00:05 newhouse

You don't need to change the algorithm to have the config file; I just wanted to be sure we agree the issues are separate.

For the algorithm, does it run once per click or does it run against every url on every page viewed? It seems to be the first, and if so it doesn't run that often and I don't think performance is that important.

wumpus avatar May 20 '18 18:05 wumpus

The huge pullreq https://github.com/newhouse/url-tracking-stripper/pull/71 and the request to let users add custom things to strip https://github.com/newhouse/url-tracking-stripper/issues/54 both cry out for this change to be done.

wumpus avatar Sep 03 '18 17:09 wumpus