Please scrap Mathjax from all posts

Open ghost opened this issue 10 years ago • 10 comments

Pham does not contain any filters specific to mathjax blocks. The only issue I can see this causing is false positives for some regexes such as {0,80}.

ghost avatar Jan 28 '15 17:01 ghost

Remove completely, or convert it to text? The former only requires pairing $ on sites that support mathjax, but...

How do you tell if a site supports mathjax? Hardcoding it is an option, but the same information is available through the API as well.
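A minimal sketch of the "remove completely" option, assuming a hardcoded site list and plain `$...$` / `$$...$$` delimiters. `MATHJAX_SITES`, `MATHJAX_SPAN`, and `strip_mathjax` are hypothetical names for illustration, not part of Pham:

```python
import re

# Assumption: a hand-maintained, non-exhaustive set of MathJax-enabled sites.
MATHJAX_SITES = {
    "math.stackexchange.com",
    "physics.stackexchange.com",
    "stats.stackexchange.com",
}

# Display math ($$...$$) is tried first so it is not read as two inline spans.
# A backslash-escaped dollar (\$) is not treated as a delimiter.
MATHJAX_SPAN = re.compile(
    r"(?<!\\)\$\$.+?(?<!\\)\$\$"   # $$ ... $$
    r"|(?<!\\)\$.+?(?<!\\)\$",     # $ ... $
    re.DOTALL,
)


def strip_mathjax(text: str, site: str) -> str:
    """Remove paired-dollar MathJax spans, but only on MathJax-enabled sites."""
    if site not in MATHJAX_SITES:
        return text
    return MATHJAX_SPAN.sub("", text)


if __name__ == "__main__":
    body = r"<p>If $\alpha_1||y_1||>\alpha_2||y_2||$, then $x=-y_1$.</p>"
    print(strip_mathjax(body, "math.stackexchange.com"))
```

Gating on the site matters because a post such as "costs $5 and $10" on a non-MathJax site would otherwise lose the text between the two dollar signs. The same false match can still happen on a MathJax site, which this sketch does not try to solve.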

honnza avatar Jan 28 '15 17:01 honnza

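Some data on the post that prompted this. Rendered output:

https://camo.githubusercontent.com/bf4437ab6ef39b3a226ca95d34d777fb8d3bd342/687474703a2f2f692e737461636b2e696d6775722e636f6d2f5939724b552e706e67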

Log entry:

{
    "ReportLink" : "http://chat.meta.stackexchange.com/transcript/message/2970223",
    "PostUrl" : "http://math.stackexchange.com/a/1123742",
    "Site" : "math.stackexchange.com",
    "Title" : "If $\\alpha_1||y_1||\\alpha_2||y_2||$, then $x=-y_1$.",
    "Body" : "<p>If $\\alpha_1||y_1||>\\alpha_2||y_2||$, then $x=-y_1$.</p>",
    "TimeStamp" : "2015-01-28T17:34:00.918Z",
    "ReportType" : "LowQuality",
    "BlackTerms" : [
        {
            "Type" : "AnswerLQ",
            "Regex" : "^(?i).{0,80}$",
            "IsAuto" : false,
            "Site" : "",
            "Score" : 89,
            "TPCount" : 486,
            "FPCount" : 119,
            "CaughtCount" : 3010
        }
    ],
    "WhiteTerms" : []
}

... which raises the question: do we want to classify these sorts of posts as LQ? If yes, then case closed. Otherwise, just let Pham do his thing and lower that term's weight for MathJax-supporting sites (and optionally add another term for posts with a lower character count). Or...?

ArcticEcho avatar Jan 28 '15 18:01 ArcticEcho

If the question or answer doesn't have enough content besides the mathjax, I think it will generally be LQ.
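One way to make "enough content besides the mathjax" concrete is to measure the prose that remains once MathJax spans and HTML tags are gone. This is only a sketch: `prose_length`, `is_low_content`, and the 30-character threshold are made-up illustrations, not anything Pham implements.

```python
import html
import re

# Paired-dollar MathJax spans; display math first, escaped \$ ignored.
MATHJAX_SPAN = re.compile(
    r"(?<!\\)\$\$.+?(?<!\\)\$\$|(?<!\\)\$.+?(?<!\\)\$",
    re.DOTALL,
)
# Very rough HTML tag matcher; adequate for simple post bodies only.
HTML_TAG = re.compile(r"<[^>]+>")


def prose_length(body_html: str) -> int:
    """Count characters left after dropping MathJax, tags, and extra whitespace."""
    text = MATHJAX_SPAN.sub("", body_html)   # math first: it may contain < or >
    text = html.unescape(HTML_TAG.sub("", text))
    return len(" ".join(text.split()))


def is_low_content(body_html: str, min_chars: int = 30) -> bool:
    """Hypothetical check: too little prose outside the math."""
    return prose_length(body_html) < min_chars


if __name__ == "__main__":
    body = r"<p>If $\alpha_1||y_1||>\alpha_2||y_2||$, then $x=-y_1$.</p>"
    print(prose_length(body), is_low_content(body))
```

For the logged answer, almost nothing but connective words survives the stripping, which matches the intuition that the post is LQ regardless of how the math is handled.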

ghost avatar Jan 28 '15 18:01 ghost

Seems LQ to me, but I'm not sure it needs our handling. The auto-whitelist should be able to handle that if we don't. I've never been a fan of that regex, actually.

honnza avatar Jan 28 '15 18:01 honnza

So... it looks like just a simple matter of adjusting the current terms. Should I continue to add mathjax scrapping then?

ArcticEcho avatar Jan 28 '15 18:01 ArcticEcho

Please do. Not sure if it's strictly necessary, but it should be helpful.

honnza avatar Jan 28 '15 19:01 honnza

Sure, ok. Shall I just remove all mathjax or (somehow) convert it to plain text?

ArcticEcho avatar Jan 28 '15 19:01 ArcticEcho

I'd remove it so we don't match phone numbers or other filters.
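To illustrate the concern, here is a small sketch in which a phone-number pattern fires on digits inside a formula and stops firing once the MathJax is removed. The `PHONE` regex is a simplified stand-in invented for this example, not one of Pham's actual terms.

```python
import re

# Hypothetical phone filter: 3-3-4 digits with optional separators.
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

# Paired-dollar MathJax spans; display math first, escaped \$ ignored.
MATHJAX_SPAN = re.compile(
    r"(?<!\\)\$\$.+?(?<!\\)\$\$|(?<!\\)\$.+?(?<!\\)\$",
    re.DOTALL,
)


def has_phone(text: str, strip_math: bool) -> bool:
    """Run the phone filter, optionally after removing MathJax spans."""
    if strip_math:
        text = MATHJAX_SPAN.sub("", text)
    return PHONE.search(text) is not None


if __name__ == "__main__":
    body = r"The result is $555-123-4567 + 1$, as claimed."
    print(has_phone(body, strip_math=False))  # filter sees the digits in the formula
    print(has_phone(body, strip_math=True))   # formula removed before filtering
```

Converting MathJax to plain text instead would keep those digits in the filtered text, so removal is the simpler way to avoid this class of false positive.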

ghost avatar Jan 28 '15 19:01 ghost

If you feel like parsing mathjax... sure, go ahead. Be sure to keep the main code clean, though.

honnza avatar Jan 28 '15 19:01 honnza

Alright, well I'm sure there's a library for that (I hope). Will do (I'm gonna put this as low priority until Pham's stable after the switch over to a CLI).

ArcticEcho avatar Jan 28 '15 19:01 ArcticEcho