Phamhilator
Please scrap Mathjax from all posts
Pham does not contain any filters specific to mathjax blocks. The only issue I can see this causing is false positives for some regexes such as {0,80}.
Remove completely, or convert it to text? The former only requires pairing $ on sites that support mathjax, but...
How do you tell if a site supports mathjax? Hardcoding it is an option, but the same information is available through the API as well.
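A rough sketch of both ideas, in Python rather than Pham's own code: detect MathJax support from a site object, then strip `$`-delimited spans by pairing delimiters. It assumes the Stack Exchange API's site objects expose a `markdown_extensions` list that includes "MathJax" on sites such as math.stackexchange.com (worth verifying against the API docs). `supports_mathjax` and `strip_mathjax` are made-up names, and an escaped `\$` is not handled.

```python
import re

# Display math ($$...$$) is listed first so "$$" is never read as the
# start of an inline span. DOTALL lets a block span several lines.
MATHJAX = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def supports_mathjax(site: dict) -> bool:
    """site: one parsed item from the API's /sites response (assumed shape)."""
    return "MathJax" in site.get("markdown_extensions", [])


def strip_mathjax(text: str) -> str:
    """Remove paired $...$ and $$...$$ spans; an unpaired $ is left alone."""
    return MATHJAX.sub("", text)


body = r"If $\alpha_1||y_1||>\alpha_2||y_2||$, then $x=-y_1$."
print(strip_mathjax(body))  # If , then .
```

On sites without MathJax the stripping should probably be skipped, since text like "$5 and $10" would otherwise lose everything between the two dollar signs.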
On Wed, Jan 28, 2015 at 6:40 PM, Mooseman [email protected] wrote:
Pham does not contain any filters specific to mathjax blocks. The only issue I can see this causing is false positives for some regexes such as {0,80}.
— Reply to this email directly or view it on GitHub https://github.com/ArcticEcho/Phamhilator/issues/61.
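Just some data of the post that prompted this. Rendered output:
https://camo.githubusercontent.com/bf4437ab6ef39b3a226ca95d34d777fb8d3bd342/687474703a2f2f692e737461636b2e696d6775722e636f6d2f5939724b552e706e67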
Log entry:
{
"ReportLink" : "http://chat.meta.stackexchange.com/transcript/message/2970223",
"PostUrl" : "http://math.stackexchange.com/a/1123742",
"Site" : "math.stackexchange.com",
"Title" : "If $\\alpha_1||y_1||\\alpha_2||y_2||$, then $x=-y_1$.",
"Body" : "<p>If $\\alpha_1||y_1||>\\alpha_2||y_2||$, then $x=-y_1$.</p>",
"TimeStamp" : "2015-01-28T17:34:00.918Z",
"ReportType" : "LowQuality",
"BlackTerms" : [
{
"Type" : "AnswerLQ",
"Regex" : "^(?i).{0,80}$",
"IsAuto" : false,
"Site" : "",
"Score" : 89,
"TPCount" : 486,
"FPCount" : 119,
"CaughtCount" : 3010
}
],
"WhiteTerms" : []
}
... which raises the question: do we want to classify these sorts of posts as LQ? If yes, then case closed. Otherwise, just let Pham do his thing and lower that term's weight for MathJax-supporting sites (and optionally add another term for posts with a lower char count). Or...?
If the question or answer doesn't have enough content besides the mathjax, I think it will generally be LQ.
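One way to make "enough content besides the mathjax" concrete is to measure what is left once the math and the HTML tags are removed. A minimal sketch follows; the function name and the crude tag regex are illustrative assumptions, not how Pham measures posts.

```python
import re

MATHJAX = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)
TAG = re.compile(r"<[^>]+>")  # crude tag stripper, good enough for a sketch


def non_math_length(html: str) -> int:
    """Count characters left after dropping MathJax spans, then HTML tags."""
    # Math goes first: formulas may contain '<' or '>' that would confuse TAG.
    without_math = MATHJAX.sub("", html)
    return len(TAG.sub("", without_math).strip())


body = r"<p>If $\alpha_1||y_1||>\alpha_2||y_2||$, then $x=-y_1$.</p>"
print(non_math_length(body))  # 11, i.e. "If , then ."
```

For the logged answer, only the connective words survive, which fits the view that such a post would generally be LQ.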
Seems LQ to me, but I'm not sure it needs our handling. The auto-whitelist should be able to handle that if we don't. I've never been a fan of that regex, actually.
So... it looks like just a simple matter of adjusting the current terms. Should I continue to add mathjax scrapping then?
Please do. Not sure if it's strictly necessary, but it should be helpful.
Sure, ok. Shall I just remove all mathjax or (somehow) convert it to plain text?
I'd remove it so we don't match phone numbers or other filters.
If you feel like parsing mathjax... sure, go ahead. Be sure to keep the main code clean, though.
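To illustrate the phone-number concern: arithmetic inside a formula can look like a phone number to a digit-based filter. The `PHONE` pattern below is a hypothetical stand-in, not one of Pham's actual terms, and the post text is invented.

```python
import re

MATHJAX = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)
# Hypothetical stand-in for a phone-number term; not taken from Pham.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

post = "Simplify $800-555-0199$ before substituting."

print(bool(PHONE.search(post)))                   # True: the formula looks like a number
print(bool(PHONE.search(MATHJAX.sub("", post))))  # False once the math is gone
```

Converting the math to plain text instead of removing it would keep the digits in the post, so the false match would remain.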
Alright, well I'm sure there's a library for that (I hope). Will do (I'm gonna put this as low priority until Pham's stable after the switch over to a CLI).