Ads hidden but not collected

Open dhowe opened this issue 8 years ago • 33 comments

These need to be addressed (note: this is a good exercise for new developers wishing to understand how ad-parsing works):

  • [x] nytimes.com
  • [x] thepiratebay.org

A checklist for ad collection on Chinese sites:

  • [x] 163.com
  • [x] sohu.com
  • [x] sina.com
  • [ ] zol.com.cn
  • [ ] weibo.com
  • [ ] qq.com (background ads)
  • [ ] baidu.com (text ads)

Popular Hong Kong site:

  • [ ] wmoov.com

dhowe avatar Aug 22 '16 01:08 dhowe

thepiratebay.org finished with #363

dhowe avatar Sep 05 '16 08:09 dhowe

Ads on nytimes.com are not captured because the script generating them is blocked by EasyPrivacy. Partially solved by adding a dynamicFilteringString to allow the request: https://github.com/dhowe/AdNauseam/issues/399

cqx931 avatar Sep 14 '16 09:09 cqx931

Now AdNauseam can collect ad images from iframes with an external src (in this case, iframe[src*="googlesyndication.com"]). However, for iframes whose src looks like this: [Screenshot, 2016-09-20 5:07:28 pm], I can only see the process message for the iframe, but not any content inside. Could this be a problem with injecting the content script into those iframes?

cqx931 avatar Sep 20 '16 09:09 cqx931

is this still relevant?

dhowe avatar Sep 20 '16 12:09 dhowe

yes, this issue still exists.

cqx931 avatar Sep 20 '16 12:09 cqx931

can you test whether our content scripts are being (dynamically) injected?

dhowe avatar Sep 20 '16 12:09 dhowe

I can see a few inject messages from background.html but I don't know which iframe it is referring to. How can I know the exact element from the frameId?

[Screenshot, 2016-09-20 9:00:39 pm]

cqx931 avatar Sep 20 '16 13:09 cqx931

I'm not exactly sure. As you've guessed, those last 2 numbers are the tabId and frameId, and the URL is for the page itself. You can also get more info by printing/debugging the pageStore:

[Screenshot, 2016-09-21 12:16:43 pm]

dhowe avatar Sep 21 '16 03:09 dhowe

Then the content scripts are only (dynamically) injected into those iframes with an external src, not into those with src="javascript:'<html><body></body></html>'".
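To make the distinction above easier to check in the logs, here is a minimal sketch of a helper that labels an iframe's src value. The function name `classifyFrameSrc` and its category labels are hypothetical and are not taken from the AdNauseam source; it only illustrates the two kinds of frame being compared.

```javascript
// Hypothetical helper (not part of AdNauseam): labels an iframe src value
// so log output can show which kind of frame a content script reached.
function classifyFrameSrc(src, pageUrl) {
  if (src === null || src === undefined || src.trim() === '') {
    return 'empty'; // iframe without a usable src attribute
  }
  const value = src.trim();
  if (/^javascript:/i.test(value)) return 'javascript';
  if (/^about:/i.test(value)) return 'about';
  if (/^data:/i.test(value)) return 'data';
  try {
    // Resolve relative and protocol-relative values against the page URL.
    const frameUrl = new URL(value, pageUrl);
    const sameOrigin = frameUrl.origin === new URL(pageUrl).origin;
    return sameOrigin ? 'same-origin' : 'external';
  } catch (e) {
    return 'invalid';
  }
}

const page = 'http://www.nytimes.com/';
console.log(classifyFrameSrc("javascript:'<html><body></body></html>'", page));
console.log(classifyFrameSrc('//tpc.googlesyndication.com/safeframe/x.html', page));
```

Logging this label next to the tabId / frameId of each inject message would show whether the `javascript:` frames ever appear among the injected ones.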

cqx931 avatar Sep 21 '16 09:09 cqx931

Is that true? It should be any iframes that don't exist originally on the page, but are created (regardless of source)

dhowe avatar Sep 21 '16 10:09 dhowe

On http://www.si.com/, the ad iframe won't be dynamically created when the outer ad-container is invisible. Therefore, no ads can be found when AdNauseam hides all the ad-containers. Once the cosmetic filter is toggled on, ads can be collected.

cqx931 avatar Sep 22 '16 03:09 cqx931

so we need to not block the outer ad-container, correct?

we can do this either by adding the rule to the disabledRules list, or by creating an exception rule in adnauseam.txt

dhowe avatar Sep 22 '16 03:09 dhowe

Yes. I think an exception rule for ad-container on si.com would be good, as this is quite a broad selector.
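As an illustration of what such an exception looks like, the sketch below turns a generic element-hiding filter into a site-specific exception using the standard `domain#@#selector` syntax. The helper name `toCosmeticException` is hypothetical, and the selector `.ad-container` is an assumption (the thread does not say whether it is a class or an id); the rule actually added to adnauseam.txt may differ.

```javascript
// Hypothetical helper: builds a site-specific cosmetic exception
// ("domain#@#selector") from a generic hiding filter ("##selector").
function toCosmeticException(genericFilter, domain) {
  if (!genericFilter.startsWith('##')) {
    throw new Error('expected a generic hiding filter starting with "##"');
  }
  const selector = genericFilter.slice(2);
  if (selector === '' || !domain) {
    throw new Error('both a selector and a domain are required');
  }
  return domain + '#@#' + selector;
}

// '.ad-container' is assumed here purely for illustration.
console.log(toCosmeticException('##.ad-container', 'si.com'));
```

Scoping the exception to one domain leaves the broad selector hidden everywhere else.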

cqx931 avatar Sep 22 '16 03:09 cqx931

great, make it so..

dhowe avatar Sep 22 '16 04:09 dhowe

fixed si.com with https://github.com/dhowe/AdNauseam/pull/433

cqx931 avatar Sep 22 '16 07:09 cqx931

 👍

dhowe avatar Sep 22 '16 08:09 dhowe

(list in progress) ENGLISH checklist of ads hidden but not collected

  • [ ] si.com
  • [ ] cnn.com (2 hidden, 0 collected)
  • [ ] http://www.forbes.com/ (0 collected)
  • [ ] http://www.sfgate.com/ (only 'today's deals' collected)
  • [ ] http://www.theatlantic.com/ (unreliable, collects between 0-4/4)
  • [ ] http://nypost.com/

leoneckert avatar Sep 24 '16 22:09 leoneckert

A fresh update on nytimes: at least for today, some of their ad images (e.g. TopLeft) don't have a parent tag. Instead, they have this interesting way of writing the onclick attribute for their ad images, and when you click one, it doesn't lead you anywhere.

[Screenshot, 2016-09-27 12:54:14 am]

cqx931 avatar Sep 26 '16 17:09 cqx931

Not sure if I understand what you are saying: when I click, I end up at the ad site (or do you mean when adn clicks?)

[Screenshot, 2016-09-27 1:14:42 am]

dhowe avatar Sep 26 '16 17:09 dhowe

Interesting... if I click the ad without any blocker on, I also end up at the ad site. But when I have AdNauseam on, it goes through a quick process of opening a new page and closing it instantly. So AdNauseam must be blocking something necessary for this to run.

If this is the case, are we going to parse this onClick? It seems like NYTimes is calling their own objects and functions to do this, at least for this ad.

Or can we ignore the target URL for special cases?

cqx931 avatar Sep 26 '16 17:09 cqx931

Do you mean that you disable cosmetic filters for the page, then view and click the ads?

If so, I notice that adding the single dynamic filter below solves the problem:

nytimes.com serving-sys.com * allow
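For readers unfamiliar with the rule format, a dynamic filtering rule has four whitespace-separated fields: source hostname, destination hostname, request type, and action. The sketch below splits the rule above into those fields; `parseDynamicRule` is a hypothetical name used only for illustration and is not the extension's own parser.

```javascript
// Hypothetical helper: splits a dynamic filtering rule of the form
// "source destination type action" into named fields.
function parseDynamicRule(line) {
  const parts = line.trim().split(/\s+/);
  if (parts.length !== 4) {
    throw new Error('expected 4 fields, got ' + parts.length);
  }
  const [source, destination, type, action] = parts;
  if (!['allow', 'block', 'noop'].includes(action)) {
    throw new Error('unknown action: ' + action);
  }
  return { source, destination, type, action };
}

const rule = parseDynamicRule('nytimes.com serving-sys.com * allow');
console.log(rule);
```

Read this way, the rule allows every request type (`*`) from pages on nytimes.com to serving-sys.com.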

dhowe avatar Sep 26 '16 18:09 dhowe

Note I've also updated the parseOnClick() code to handle this case, see 54ae3fc7
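For context, one common way to recover a target from an onclick attribute is to look for the first quoted URL in the handler text. The sketch below shows that idea only; it is not the code from 54ae3fc7, and the function name `extractUrlFromOnClick` is hypothetical. It returns null for handlers that only call site-specific functions, as described for this NYTimes ad.

```javascript
// Hypothetical sketch (not the actual parseOnClick() implementation):
// returns the first quoted http(s) or protocol-relative URL found in an
// onclick attribute string, or null if there is none.
function extractUrlFromOnClick(onclick) {
  if (typeof onclick !== 'string') return null;
  const match = onclick.match(/["']((?:https?:)?\/\/[^"'\s]+)["']/i);
  return match ? match[1] : null;
}

console.log(extractUrlFromOnClick("window.open('http://example.com/ad?id=1')"));
console.log(extractUrlFromOnClick('SomeSite.ads.handleClick(this)'));
```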

dhowe avatar Sep 26 '16 22:09 dhowe

Question: Do we need a text-ad parser for Chinese search engines? I can work on that.

cqx931 avatar Sep 28 '16 07:09 cqx931

yeah, that would be good, at least for the 2 or 3 most popular...

dhowe avatar Sep 28 '16 08:09 dhowe

  • [x] Yahoo ads were hidden but not collected because of an internal link. Fixed in https://github.com/dhowe/AdNauseam/pull/464

leoneckert avatar Sep 28 '16 23:09 leoneckert

  • [ ] http://www.accuweather.com/ has a huge number of ads all over the page. They get hidden, but they neither show up in the parser.js console nor get collected. Are those the so-called dynamically created iframes?

leoneckert avatar Sep 29 '16 17:09 leoneckert

They show up in the parser console for me (and match exactly what I see in the logger):

[Screenshot, 2016-09-30 6:32:53 am]

[Screenshot, 2016-09-30 6:33:32 am]

Did you enable the debugging flag?

dhowe avatar Sep 29 '16 22:09 dhowe

And when I disable blocking via a firewall rule (below), I see many ads being collected, which tells me that one of the blocking rules is breaking the ad collection (probably because elements we need are never making it to the page).

[Screenshot, 2016-09-30 7:22:24 am]

[Screenshot, 2016-09-30 7:22:35 am]

dhowe avatar Sep 29 '16 23:09 dhowe

Pages analysed in https://github.com/dhowe/AdNauseam/issues/427 on which ads were hidden but not collected. Working on these next.

  • [x] http://www.nytimes.com/
  • [x] http://www.nbcnews.com/
  • [x] https://www.yahoo.com/news
  • [ ] http://www.dailymail.co.uk/ushome/
  • [ ] https://www.washingtonpost.com
  • [ ] https://www.theguardian.com/us
  • [ ] http://www.wsj.com
  • [ ] http://abcnews.go.com
  • [ ] http://www.bbc.com/news
  • [ ] http://www.cbsnews.com
  • [ ] http://www.reuters.com
  • [ ] http://www.msnbc.com
  • [ ] http://www.cbc.ca/news
  • [ ] http://www.news.com.au
  • [ ] http://www.cnn.com

leoneckert avatar Oct 04 '16 23:10 leoneckert

^ Going through the websites above. Is there a proven method to say definitively which ads are not collected? I often find that an ad is not collected 5 times in a row and then, on the 6th refresh, it suddenly is. As a result, I write filters and then notice during testing that they are redundant. I am aware that events fire irregularly and that things might depend on the cache, or on whether I toggle cosmetic filtering (which might trigger the generation of dynamic iframes), etc. I wondered if you have had similar experiences or have a rule of thumb for me.

leoneckert avatar Oct 12 '16 21:10 leoneckert