Jonathan Keller
I *really* like the idea, but I'm skeptical of rolling our own YouTube scraper, since it is very much subject to change and might end up being troublesome to maintain down the road. There may...
> I agree it would be a great idea if we can use the API, however it requires coordination w.r.t. API keys and limits the extensibility of such detection mechanisms....
I've just taken a closer look at the Medium one, too. I'm concerned by the `class="bh bi at au av aw ax ay az ba fu bd bl bm"`, as...
Just for reference: Thomas Ward is also working on an alternative implementation of this using BeautifulSoup to parse the HTML.
> A stat with 12405 fp posts on MS

That's a lot... unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.
@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?
> W.r.t. result on metasmoke dataset, fp rate is very low. However, since the samples on MS is biased, we cannot really conclude anything. Not necessarily in this case —...
I ran some more tests today. It's looking a lot better, but we still have problems with: 1) Code. We probably should strip code blocks, but then we'll still have...
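To illustrate the "strip code blocks" idea, here is a minimal sketch using only the standard library. The helper names (`strip_code_blocks`, `visible_text_length`, `image_count`) are hypothetical and are not taken from the actual detection code; a regex pass is also cruder than a real HTML parser, so treat this as a rough approximation.

```python
import re

# Hypothetical helpers for illustration only; not the project's real code.
# Matches whole <pre>...</pre> or <code>...</code> elements (non-greedy).
CODE_BLOCK_RE = re.compile(
    r"<pre\b[^>]*>.*?</pre>|<code\b[^>]*>.*?</code>",
    re.DOTALL | re.IGNORECASE,
)
TAG_RE = re.compile(r"<[^>]+>")
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_code_blocks(body_html: str) -> str:
    """Remove code elements so they do not count as ordinary post text."""
    return CODE_BLOCK_RE.sub("", body_html)


def visible_text_length(body_html: str) -> int:
    """Length of the remaining text once code blocks and tags are removed."""
    text = TAG_RE.sub("", strip_code_blocks(body_html))
    return len(text.strip())


def image_count(body_html: str) -> int:
    """Number of <img> tags outside code blocks."""
    return len(IMG_RE.findall(strip_code_blocks(body_html)))
```

For example, a body made of a short paragraph, a code block, and one image would be measured on the paragraph and the image alone.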
> Excludes StackOverflow, Maths, Mathoverflow and Cross Validated from the "post is mostly images" reason (since StackOverflow can have code counted and the other 3 have a lot of MathJax...
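A per-site exclusion like the one quoted above could be sketched roughly as follows. The constant and function names are hypothetical, and the hostnames are assumptions about how the four sites would be identified; the actual change may key on something else.

```python
# Hypothetical names; the hostnames are assumed identifiers for the
# four sites mentioned in the quote, not values taken from the real code.
IMAGE_REASON_EXCLUDED_SITES = frozenset({
    "stackoverflow.com",        # code may otherwise be miscounted
    "math.stackexchange.com",   # heavy MathJax usage
    "mathoverflow.net",         # heavy MathJax usage
    "stats.stackexchange.com",  # Cross Validated, heavy MathJax usage
})


def should_check_mostly_images(site: str) -> bool:
    """Return False for sites excluded from the 'mostly images' reason."""
    return site.strip().lower() not in IMAGE_REASON_EXCLUDED_SITES
```

Keeping the exclusions in one set makes it easy to add or remove a site without touching the detection logic itself.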
I use Safari, and this happens to me as well.