Unicode parsing
Math 'proofing' data might be good to have, although I'm not sure it needs to be done with ML.
This could be at least partially implemented just by replacing each Unicode character with the name of the character.
Example: ∀ -> For all
But it could produce bad English, such as "For all element in" rather than "For all elements in".
https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode
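A minimal sketch of the naive replacement described above, assuming the standard unicodedata names are used as-is. The function name spell_out_symbols is hypothetical and not part of any existing code in this project.

```python
import unicodedata


def spell_out_symbols(text):
    """Replace each non-ASCII character with its lowercased Unicode name."""
    pieces = []
    for ch in text:
        # Only look up names for non-ASCII characters; plain text passes through.
        name = unicodedata.name(ch, "") if ord(ch) > 127 else ""
        # Pad names with spaces so "∀x" does not become "for allx".
        pieces.append(f" {name.lower()} " if name else ch)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


print(spell_out_symbols("∀x ∈ S"))
```

This gives "for all x element of S", which shows the grammar problem: the names are readable but do not form natural sentences on their own.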
Unicode as a whole needs this honestly
Do you mean something like the python unicodedata module?
import unicodedata
print(unicodedata.name("∀"))
prints
FOR ALL
Yeah, but as I said, there is still the issue of transforming it into proper grammar, and also the cases where a symbol has a descriptive name rather than a use-case name (e.g., ⋰ is UP RIGHT DIAGONAL ELLIPSIS, which doesn't help).
This is very interesting. Is there a library package we can use to parse math to Unicode? We can use a grammar fixer afterwards.
If the math is in LaTeX, I see there is flatlatex.
Since the grammar from unicodedata.name feels stilted, and the unicodedata module is essentially a glorified dictionary, there is no reason we could not build a better dictionary.
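One way the "better dictionary" idea could look, as a rough sketch: a hand-written table of use-case phrases that is consulted first, with unicodedata.name as the fallback. The table name PHRASES, its entries, and the function describe are all hypothetical and only illustrate the shape of the approach.

```python
import unicodedata

# Hypothetical hand-curated phrases; these entries are illustrative assumptions.
PHRASES = {
    "∀": "for all",
    "∃": "there exists",
    "∈": "in",
}


def describe(ch):
    """Return a curated phrase if one exists, else the lowercased Unicode name."""
    if ch in PHRASES:
        return PHRASES[ch]
    # Fall back to the standard name; return the character itself if unnamed.
    name = unicodedata.name(ch, "")
    return name.lower() if name else ch


print(describe("∈"))  # curated phrase
print(describe("⋰"))  # falls back to the descriptive Unicode name
```

Symbols with purely descriptive names, such as ⋰, still fall through to the standard name until someone adds a curated entry for them.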
General flow for the notebook, from what I am seeing here (very basic):
INPUT "what does this say:" + question
-> detect LaTeX
-> if LaTeX, decode it to Unicode (which becomes the input to the next step)
-> detect non-ASCII characters
-> use the Unicode decoder
-> try to use simple logic for grammar corrections
-> VAR parsed
OUTPUT answer = "this says " + parsed
(note: "this says " and "what does this say:" are placeholders for things that should be varied)
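The flow above could be sketched roughly as follows. Everything here is an assumption for illustration: the tiny LATEX_TO_UNICODE table stands in for a real LaTeX decoder (a library such as flatlatex could be swapped in), GRAMMAR_RULES is a placeholder for the "simple logic", and all function names are hypothetical.

```python
import re
import unicodedata

# Hypothetical stand-in for a real LaTeX-to-Unicode decoder.
LATEX_TO_UNICODE = {"forall": "∀", "in": "∈", "exists": "∃"}
# Hypothetical grammar fixes applied after the Unicode names are spelled out.
GRAMMAR_RULES = [(r"\belement of\b", "in")]


def looks_like_latex(text):
    # Heuristic: any backslash command such as \forall counts as LaTeX.
    return re.search(r"\\[A-Za-z]+", text) is not None


def decode_latex(text):
    # Replace known commands; leave unknown ones untouched.
    text = re.sub(r"\\([A-Za-z]+)",
                  lambda m: LATEX_TO_UNICODE.get(m.group(1), m.group(0)),
                  text)
    return text.replace("$", "")


def decode_unicode(text):
    pieces = []
    for ch in text:
        name = unicodedata.name(ch, "") if ord(ch) > 127 else ""
        pieces.append(f" {name.lower()} " if name else ch)
    return " ".join("".join(pieces).split())


def parse(question):
    text = question
    if looks_like_latex(text):
        text = decode_latex(text)
    if not text.isascii():
        text = decode_unicode(text)
    for pattern, replacement in GRAMMAR_RULES:
        text = re.sub(pattern, replacement, text)
    return text


def answer(question, prefix="this says "):
    # The prefix is a placeholder that should be varied.
    return prefix + parse(question)


print(answer(r"$\forall x \in S$"))
```

Keeping the LaTeX step, the Unicode step, and the grammar step as separate functions means each can be replaced independently, for example by a curated phrase dictionary or a real grammar fixer.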
Makes sense.
Will the input text be a token from spaCy or just the raw text?
I have a small mock Python version working, as well as a mock Cython compile if needed. I'd like to understand better where it will live in the pipeline and, more specifically, where I should place the code. Also, if speed is required, I think this should be easy enough to write in C, with a .h file for the replacement phrases instead of a dictionary.