
Automate "bad includes" handling

Open skrul opened this issue 14 years ago • 3 comments

I would love to have a way to automate updating the "bad includes" list (see https://github.com/skrul/greasefire/blob/master/java/greasefire-scraper/src/com/skrul/greasefire/Generate.java#L49) rather than playing whack-a-mole.

When building the index, I could imagine getting a few thousand popular URLs, running all the @includes against them, and dropping any pattern that matches too many of those URLs.
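A rough sketch of the idea, not the actual Generate.java logic. It assumes @include patterns are plain globs where only `*` is a wildcard; the function names, the sample URLs, and the 50% threshold are all made up for illustration.

```python
import re


def include_to_regex(pattern):
    """Translate an @include glob into a compiled regex.

    Assumption: only `*` is a wildcard; everything else matches literally.
    """
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)


def find_overbroad_includes(includes, sample_urls, max_fraction=0.5):
    """Return the includes that match more than max_fraction of sample_urls."""
    overbroad = []
    for pattern in includes:
        regex = include_to_regex(pattern)
        hits = sum(1 for url in sample_urls if regex.match(url))
        if hits > max_fraction * len(sample_urls):
            overbroad.append(pattern)
    return overbroad


# Illustrative stand-ins for "a few thousand popular URLs".
sample_urls = [
    "http://www.google.com/",
    "http://www.youtube.com/watch?v=abc",
    "http://www.facebook.com/home.php",
    "https://github.com/skrul/greasefire",
]
includes = ["*", "http://*", "http://www.youtube.com/*", "http*://github.com/*"]

print(find_overbroad_includes(includes, sample_urls))
```

With these four sample URLs, `*` and `http://*` are flagged and the two site-specific patterns are kept. The threshold would need tuning against a real URL sample.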

Thoughts?

skrul avatar May 19 '11 22:05 skrul

This sounds like a good test to me. We could generate random URLs and test against those too.

erikvold avatar May 21 '11 01:05 erikvold

Possible sources:

  • http://www.google.com/adplanner/static/top1000/
  • http://www.alexa.com/topsites

supahgreg avatar May 22 '11 21:05 supahgreg

I don't think the problem is scripts that appear on many popular pages. For popular pages there are usually a lot of scripts, so the odd script that wants to appear on all of them is usually buried somewhere deep within the list. The problem is scripts like this, just like the ones covered by the badIncludesList. As was already suggested here, I think a good RegEx can help.

I modified the Generate file and ran it against all available scripts (2011-10-09), printing out every script and include that was blocked by the RegEx from comment 6 but not by the current badIncludesList. Here is a list of all the include patterns (383) and here is the comprehensive list of all the scripts (249) (both links partially NSFW) that have been blocked, with all their includes and the includes that caused them to be blocked in bold.

There are a few cases where legitimate scripts have been blocked as well, but this is not much of an issue. The index I generated using the RegEx works great for me. I got rid of a lot of scripts that had been annoying me for quite some time. Of course, it could still be fine-tuned. So my suggestion is to implement this RegEx, use it for a while, and see which problems still persist and need to be taken care of separately.

Frog23
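For anyone who wants to repeat the comparison, it can be sketched roughly like this. This is not my modified Generate file: the regex below is a hypothetical placeholder, not the RegEx from comment 6, and the script names and list entries are invented sample data.

```python
import re

# Hypothetical placeholder, NOT the RegEx from comment 6: flags includes that
# consist only of an optional scheme plus wildcards, dots and slashes.
CANDIDATE_REGEX = re.compile(r"^(https?|http\*|\*)?(://)?[*./]*$")

# Invented stand-in for the current badIncludesList.
bad_includes = {"*", "http://*"}

# Invented sample data: script name -> its @include patterns.
scripts = {
    "Script A": ["http://*/*", "http://www.youtube.com/*"],
    "Script B": ["*"],
    "Script C": ["http*://*", "http://example.com/*"],
}


def newly_blocked(scripts, bad_includes, regex):
    """Map each script to the includes caught by regex but not in bad_includes."""
    report = {}
    for name, includes in scripts.items():
        caught = [
            inc for inc in includes
            if regex.match(inc) and inc not in bad_includes
        ]
        if caught:
            report[name] = caught
    return report


for name, caught in newly_blocked(scripts, bad_includes, CANDIDATE_REGEX).items():
    print(name, caught)
```

Here "Script B" is not reported because its only include is already on the list, while "Script A" and "Script C" are reported for `http://*/*` and `http*://*` respectively.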

frog23 avatar Oct 15 '11 11:10 frog23