Add reason: newly posted YouTube videos and Medium posts

user12986714 opened this issue 4 years ago • 10 comments

This PR tries to catch newly posted YouTube videos and Medium posts.

user12986714 · Sep 02 '20

Nice idea, have you tested the code?

ghost · Sep 03 '20

@Daniil-M-beep I would be happy if someone could help me test the code. The problem is that I am not capable of creating useful test accounts.

user12986714 · Sep 03 '20

@user12986714 Ok. I'll try and organise some testing later today.

ghost · Sep 03 '20

@user12986714 Apologies but it might not be today but I'll try and get it done relatively soon.

ghost · Sep 03 '20

This has been tested by me, and neither of the two rules works.

ghost · Sep 06 '20

I really like the idea, but I'm skeptical of rolling our own YouTube scraper, just because it is very much subject to change and might end up being troublesome to maintain down the road. There may be a couple of more stable alternatives:

  • The YouTube Data API -- looks like we can make 10k requests/day for free. (Of course Google isn't exactly known for keeping their APIs operational long-term either.)
  • youtube-dl is a popular and frequently updated Python command-line tool/library for scraping YouTube. I haven't tried it and there doesn't seem to be much API documentation, but it looks like you should be able to import the library and call extract_info(url, download=False), which will return a dictionary with an upload_date key (rough sketch below).

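A rough, untested sketch of the youtube-dl route (the extract_info call and the upload_date field are from youtube-dl's README; the freshness check and the option names here are just assumptions for illustration):

    import datetime

    import youtube_dl  # pip install youtube-dl

    def youtube_upload_date(url):
        # Fetch metadata only; don't download the video itself.
        opts = {"quiet": True, "skip_download": True}
        with youtube_dl.YoutubeDL(opts) as ydl:
            info = ydl.extract_info(url, download=False)
        # upload_date is a "YYYYMMDD" string, e.g. "20200902".
        return datetime.datetime.strptime(info["upload_date"], "%Y%m%d")

    def is_newly_posted(url, max_age_days=7):
        # Hypothetical threshold; the rule would pick whatever cutoff works.
        age = datetime.datetime.utcnow() - youtube_upload_date(url)
        return age.days <= max_age_days

That way the scraping cat-and-mouse stays youtube-dl's problem rather than ours.
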
NobodyNada · Sep 12 '20

I agree it would be a great idea if we could use the API; however, it requires coordination w.r.t. API keys and limits the extensibility of such detection mechanisms. For example, it would be great if we later expanded to other blog sites like Blogspot or something else.

user12986714 · Sep 13 '20

@user12986714 Those are both true. However, we've had to do API key deployment in the past (e.g. for Perspective), and it's pretty simple:

  1. Add a new config entry for the API key, but make sure Smokey still runs fine without it (just with that rule disabled). That way, test instances and instances that haven't added the key yet will still work.
  2. Update config.sample with a placeholder, and update the config on Keybase with the real key. That way, all future instances will include the key.
  3. Send a message in the runner Keybase chat reminding the runners to add the key to their existing instances.

As far as extensibility goes: yes, we're writing special code to use the YouTube API, but that doesn't stop us from using regexes on Medium or Blogspot. I'm worried about using regex specifically on YouTube because YT is not scraper-friendly and I would prefer to stay out of that cat-and-mouse game.
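
Roughly, I'd picture step 1 plus the lookup looking something like this (the env-var stand-in and helper name are made up; the videos.list endpoint and snippet.publishedAt field are from the YouTube Data API v3 docs):

    import datetime
    import os

    import requests

    # Hypothetical stand-in for a real config entry; Smokey would read this from
    # its config file the same way it reads the Perspective key.
    YOUTUBE_API_KEY = os.environ.get("YOUTUBE_API_KEY", "")

    def youtube_published_at(video_id):
        if not YOUTUBE_API_KEY:
            return None  # key not configured: the rule stays disabled
        resp = requests.get(
            "https://www.googleapis.com/youtube/v3/videos",
            params={"part": "snippet", "id": video_id, "key": YOUTUBE_API_KEY},
            timeout=10,
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            return None
        # publishedAt is an ISO 8601 timestamp, e.g. "2020-09-02T04:00:00Z".
        published = items[0]["snippet"]["publishedAt"]
        return datetime.datetime.strptime(published, "%Y-%m-%dT%H:%M:%SZ")

Each post with a YouTube link would cost one videos.list request against the 10k/day quota mentioned above.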

NobodyNada · Oct 04 '20

I've just taken a closer look at the Medium one, too. I'm concerned by the class="bh bi at au av aw ax ay az ba fu bd bl bm", as obfuscated classes like that are usually anti-adblock measures and are therefore periodically randomized. (Also, I'm just a little bit concerned by the regex; what if someone requests the page from Europe and gets a date of 4 Oct instead of Oct 4?)

However, there's a MUCH easier way to get the date out of a Medium post. Every Medium post has the following meta-tag:

<meta data-rh="true" property="article:published_time" content="[an ISO 8601 timestamp]">

Finally, since we already have BeautifulSoup, I'd suggest using that instead of regex to parse HTML, as it will be easier and more reliable.
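
A minimal sketch of that approach (assuming we already have the post's HTML; the helper name is made up):

    import datetime

    from bs4 import BeautifulSoup

    def medium_published_at(html):
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find("meta", attrs={"property": "article:published_time"})
        if tag is None or not tag.get("content"):
            return None
        # The content attribute is an ISO 8601 timestamp; fromisoformat() on
        # older Pythons doesn't accept a trailing "Z", so normalise it first.
        return datetime.datetime.fromisoformat(tag["content"].replace("Z", "+00:00"))

That sidesteps both the obfuscated class names and the locale-dependent date format.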

NobodyNada · Oct 04 '20

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

stale[bot] · Nov 08 '20