Open-Assistant

Unicode parsing

Open mm12 opened this issue 2 years ago • 9 comments

Math 'proofing' data might be good to have, although I'm not sure it needs to be done with the ML.

This could be at least partially implemented just by replacing each Unicode symbol with the name of the character. Example: ∀ -> "For all". But this could produce bad English, such as "For all element in" rather than "For all elements in".

https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode

mm12 avatar Feb 17 '23 02:02 mm12

Unicode as a whole needs this honestly

mm12 avatar Feb 17 '23 03:02 mm12

Do you mean something like Python's unicodedata module?

import unicodedata
print(unicodedata.name("∀"))

returns

'FOR ALL'

BrianArbuckle avatar Feb 17 '23 04:02 BrianArbuckle

Yeah, but as I said, there is still the issue of transforming it into proper grammar, and also the cases where symbols have a descriptive name rather than a use-case name (e.g., ⋰ is UP RIGHT DIAGONAL ELLIPSIS, which doesn't help).

mm12 avatar Feb 17 '23 16:02 mm12

This is very interesting. Is there a library package we can use to parse math to Unicode? We can use a grammar fixer afterwards.

huu4ontocord avatar Feb 20 '23 13:02 huu4ontocord

If the math is in LaTeX, I see there is flatlatex.

Since the output of unicodedata.name reads as stilted, and the unicodedata module is essentially a glorified dictionary, there is no reason we could not use a better dictionary.
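
A minimal sketch of the "better dictionary" idea, assuming a small hand-written override table that is consulted before falling back to unicodedata.name. The names SPOKEN_NAMES and spoken_name and the phrases in the table are hypothetical, chosen only for illustration:

```python
import unicodedata

# Hypothetical overrides: use-case phrases for symbols whose official
# Unicode names are descriptive or read awkwardly in a sentence.
SPOKEN_NAMES = {
    "\u2208": "in",                                          # ELEMENT OF
    "\u2264": "is less than or equal to",                    # LESS-THAN OR EQUAL TO
    "\u22f0": "continuing diagonally up and to the right",   # UP RIGHT DIAGONAL ELLIPSIS
}


def spoken_name(char: str) -> str:
    """Return a readable phrase for a single character."""
    if char in SPOKEN_NAMES:
        return SPOKEN_NAMES[char]
    # Fall back to the official Unicode name, lowercased. Passing the
    # character itself as the default avoids a ValueError for characters
    # that have no name (for example, control characters).
    return unicodedata.name(char, char).lower()


print(spoken_name("\u2200"))  # falls back to the official name
print(spoken_name("\u22f0"))  # uses the override
```

The override table only needs entries for symbols where the official name is unhelpful; everything else still goes through unicodedata.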

BrianArbuckle avatar Feb 20 '23 15:02 BrianArbuckle

General flow for the notebook, from what I am seeing here (very basic):

INPUT "what does this say:" + question
-> detect LaTeX
-> if LaTeX, decode it to Unicode (which becomes the input to the next step)
-> detect non-ASCII
-> use the Unicode decoder
-> try simple logic for grammar corrections
-> VAR parsed
OUTPUT answer = "this says " + parsed

(note: "this says " and "what does this say:" are placeholders for things that should be varied)
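
The flow above could be sketched roughly as follows. This is only an illustration under stated assumptions: the LaTeX detection regex is a guess, the LaTeX decoder is a pluggable callable that defaults to doing nothing (a converter such as flatlatex could presumably be passed in, but its API is not checked here), and the "grammar correction" step only tidies whitespace. All function names are hypothetical.

```python
import re
import unicodedata
from typing import Callable

# Assumed heuristic: a backslash command or a $...$ span suggests LaTeX.
LATEX_HINT = re.compile(r"\\[A-Za-z]+|\$[^$]+\$")


def looks_like_latex(text: str) -> bool:
    return bool(LATEX_HINT.search(text))


def describe_symbols(text: str) -> str:
    """Replace each named non-ASCII character with its lowercased Unicode name."""
    parts = []
    for ch in text:
        if ch.isascii():
            parts.append(ch)
        else:
            name = unicodedata.name(ch, "")
            # Keep the character unchanged if Unicode has no name for it.
            parts.append(f" {name.lower()} " if name else ch)
    return "".join(parts)


def tidy(text: str) -> str:
    """Placeholder for grammar fixes: only collapses runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def parse(question: str,
          latex_to_unicode: Callable[[str], str] = lambda s: s) -> str:
    text = question
    if looks_like_latex(text):
        text = latex_to_unicode(text)   # LaTeX -> Unicode
    if not text.isascii():
        text = describe_symbols(text)   # Unicode -> names
    return "this says " + tidy(text)    # placeholder answer prefix


print(parse("\u2200x \u2208 S"))
```

Keeping the LaTeX decoder as a parameter lets the rest of the flow be tested without any third-party package installed.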

mm12 avatar Feb 20 '23 16:02 mm12

Makes sense.

BrianArbuckle avatar Feb 20 '23 17:02 BrianArbuckle

Will the input text be a token from spaCy or just the raw text?

BrianArbuckle avatar Feb 20 '23 21:02 BrianArbuckle

I have a small mock Python version working, as well as a mock Cython build if needed. I'd like to understand better where it will live in the pipeline and, more specifically, where I should place the code. Also, if speed is required, I think it should be easy enough to write this in C, with a .h file holding the replacement phrases instead of a dictionary.

BrianArbuckle avatar Feb 21 '23 05:02 BrianArbuckle