greek-adblockplus-filter icon indicating copy to clipboard operation
greek-adblockplus-filter copied to clipboard

List spring cleaning

Open pappasadrian opened this issue 4 years ago • 4 comments

Hello,

I was taking a brief look through the filter list. Considering that this is a project ongoing for more than 10 years, it might be worth performing some sort of spring cleaning, to identify and remove:

  1. domains no longer in operation
  2. site-specific rules blocking items that no longer exist on the site
  3. check for duplicates/redundant stuff

I can't think of a good methodology for this, other than manually checking.

Any thoughts and ideas? Is this even necessary? I was thinking that reducing the rules might help with performance of adblockers? IDK.

pappasadrian avatar Feb 23 '21 18:02 pappasadrian

good idea! even if it's not strictly necessary, it's always good to clean up things every now and then

  1. domains not in operation should be easy to spot. just parse the domain name out of a rule and see if it resolves. if it does, keep it. if it doesn't remove it.
  2. + 3. far more difficult, especially the element hiding rules

kargig avatar Feb 25 '21 16:02 kargig

for #92, I got (most of) the domains out of the rules file with:

cat void-gr-filters.txt| grep -Po '.*?(//|\|\||@@\|\||@@|\~)\K.*?(?=/|#)' | sort | uniq > domain-list.txt
cat void-gr-filters.txt| grep -Po '^[0-9a-zA-Z].*?(?=/|#)' | sort | uniq >> domain-list.txt
sort domain-list.txt | uniq > domain-list-final.txt

and then ran this:

#!/bin/bash

DOMAIN_LIST="domain-list-final.txt"
#DOMAIN_LIST="testme.txt"
RESOLVER="1.1.1.1"
BAD_DOMAINS="bad_domains"
SUB_NO_RECORD="no_record"
WWW_EXISTS="www_exists"

rm -f "${BAD_DOMAINS}" "${SUB_NO_RECORD}" "${WWW_EXISTS}"

while read -r line; do
  # cleanups
  myline=$(echo "${line}" | awk -F':' '{ print $1 }')
  line=$(echo "${myline}" | grep -Ev '/|\|' | grep -Ev '^[0-9]')
  if [ "x${line}" = "x" ]; then
    continue
  fi
  echo "Working on: ${line}"
  # Check if the subdomain exists
  if [ "$(dig "${line}" @${RESOLVER} +short)" = "" ]; then
  # Check if the subdomain with www prepended exists
    if [ "$(dig "www.${line}" @${RESOLVER} +short)" = "" ]; then
      domain=$(echo "${line}" | awk -F. '{ print $(NF-1) "." $NF }')
      # if the domain doesn't have NS records, the domain does not exist any more
      if [ "$(dig NS "${domain}" @${RESOLVER} +short)" = "" ]; then
        echo "${domain}" | tee -a "${BAD_DOMAINS}"
      # if the entry is a subdomain we already know it doesn't have A record
      elif [ "$(echo "${line}" | grep -o '\.' | wc -l)" -gt "1" ]; then
          echo "${line}" | tee -a "${SUB_NO_RECORD}"
      fi
    else
      echo "${line}" | tee -a "${WWW_EXISTS}"
    fi
  fi
done < "${DOMAIN_LIST}"


double checked all "bad_domains" manually

kargig avatar Feb 27 '21 08:02 kargig

@kargig Good stuff!

aldi avatar Feb 28 '21 16:02 aldi

redundant stuff

Cosmetic filter have network start characters: https://github.com/kargig/greek-adblockplus-filter/commit/044bc9ff72118bd2b585606d21ae8afcc9251226 (made in 2016)

https://github.com/kargig/greek-adblockplus-filter/blob/72bccd07ccfc3b469fec2a47b1f2aec073c79277/void-gr-filters.txt#L432-L433

AdGuard disabled use in 2018: https://github.com/AdguardTeam/FiltersRegistry/commit/a452d4dcefdecaf4710f8056e11ecacd23fc73e1#diff-6472c0fcd53f81660278097de5b81a5a1cd70c38b8a5068d02039207a61d5726R93-R95

krystian3w avatar Jun 17 '22 05:06 krystian3w