greasefire
Automate "bad includes" handling
I would love to have a way to automate updating the "bad includes" list (see https://github.com/skrul/greasefire/blob/master/java/greasefire-scraper/src/com/skrul/greasefire/Generate.java#L49) rather than playing whack-a-mole.
When building the index, I could imagine fetching a few thousand popular URLs, running all the @includes against them, and simply excluding any pattern that matches too many URLs.
Thoughts?
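The check described above could be sketched roughly like this. This is only a sketch, not greasefire's actual code: the class name, the threshold, and the simplified glob handling (only `*` wildcards, ignoring Greasemonkey's `.tld` and `/regex/` forms) are all assumptions for illustration.

```java
import java.util.*;
import java.util.regex.*;

public class IncludeChecker {
    // Convert a Greasemonkey @include glob into a Java regex.
    // Simplified assumption: '*' matches anything, everything else is literal.
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') sb.append(".*");
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Count how many sample URLs the pattern matches; if it exceeds
    // maxMatches (a tunable threshold), treat it as a "bad include".
    static boolean isBadInclude(String glob, List<String> urls, int maxMatches) {
        Pattern p = globToRegex(glob);
        int hits = 0;
        for (String url : urls) {
            if (p.matcher(url).matches() && ++hits > maxMatches) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.google.com/",
            "http://www.example.com/page",
            "http://en.wikipedia.org/wiki/Main_Page");
        System.out.println(isBadInclude("http://*", urls, 1));               // matches all three -> true
        System.out.println(isBadInclude("http://www.example.com/*", urls, 1)); // matches one -> false
    }
}
```

Running this once per @include while building the index would let over-broad patterns be dropped automatically instead of maintained by hand.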
This sounds like a good test to me. We could also generate random URLs and test against those.
Possible sources:
- http://www.google.com/adplanner/static/top1000/
- http://www.alexa.com/topsites
I don't think the problem is scripts that appear on many popular pages. Popular pages usually attract a lot of scripts, so the odd script that wants to appear on all of them ends up buried somewhere deep in the list. The problem is scripts like this, just like the ones covered by the badIncludesList. As was already suggested here, I think a good RegEx can help.

I modified the Generate file, ran it against all available scripts (2011-10-09), and printed out every script and include that was blocked by the RegEx from comment 6 but not by the current badIncludesList. Here is a list of all the include patterns (383), and here is the comprehensive list of all the blocked scripts (249) with all their includes, the includes that caused the block in bold (both links partially NSFW). There are a few cases where legitimate scripts were blocked as well, but that is not much of an issue.

The index I generated using the RegEx works great for me: I got rid of a lot of scripts that had been annoying me for quite some time. Of course it could still be fine-tuned. So my suggestion is to implement this RegEx, use it for a while, and see what problems still persist that need to be taken care of separately.

Frog23
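The filtering step described above could look something like the sketch below. The actual RegEx from comment 6 is not reproduced here, so the pattern used is a hypothetical stand-in that only flags bare catch-all includes such as `*` or `http://*`; the class and method names are likewise invented for illustration.

```java
import java.util.*;
import java.util.regex.*;

public class BadIncludeFilter {
    // Hypothetical stand-in for the RegEx from comment 6: flags includes
    // that are bare catch-alls like "*", "http://*", or "http*://*".
    static final Pattern BAD_INCLUDE =
        Pattern.compile("^\\*$|^https?\\*?://\\*/?\\*?$");

    // A script is excluded from the index if any of its @include
    // lines matches the bad-include pattern.
    static boolean isBlocked(List<String> includes) {
        for (String inc : includes) {
            if (BAD_INCLUDE.matcher(inc.trim()).matches()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isBlocked(Arrays.asList("http://*")));             // true
        System.out.println(isBlocked(Arrays.asList("http://example.com/*"))); // false
    }
}
```

Hooking a check like this into the scraper's generate loop would block the catch-all scripts while leaving domain-scoped includes alone.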