
Increasing range of terms parsable into MathExpr

Open sundararajan-s opened this issue 5 years ago • 9 comments

The following issues need to be fixed or improved in the parser for MathExpr:

  • [ ] Issues relating to the verb of the clause being present in the TeX expression, e.g. "a divides b" being written as a|b. These result in the entire tree being wrongly parsed.
  • [ ] Handling inferences, i.e. sentences starting with 'hence', 'thus', 'therefore', etc. Only the part containing the specific word is left unparsed; the rest of the tree parses correctly.
  • [ ] Parsing 'There exists' and 'For all' statements. Both of these result in a part that is unparsed.
  • [ ] Parsing statements which include phrases such as 'consider the', i.e. sentences which construct some element. Here the main part is parsed and the 'consider' part does not parse, similar to sub-issue 1.
  • [ ] Handling sentences which reference the prover, such as those starting with "We can see that". Here most of the tree is unparsed; however, this might be because the sentence is not a logical assertion.
  • [x] A specific instance of verbs being inside the TeX expression, usually in 'if' statements but sometimes in other places.
  • [x] Exceptions raised by using the words 'each', 'those', 'these' or 'both'. These raise a scala.MatchError at provingground.translation.MathExpr$Determiner$.apply. This is fixed simply by adding the cases in the corresponding unapply function or by reformatting input sentences.
  • [ ] Certain issues with adverbs, e.g. 'is strictly smaller than' not parsing while 'is smaller than' parses. One part fails to parse, which leads to the entire tree not parsing. Upon further investigation, this seems to be a problem only with the words 'strictly' and 'also', which are not parsed in the same manner as other adverbs.
  • [x] Add capability to handle conjunct and disjunct adjectives. Currently the adjective is not parsed correctly, though the rest of the expression parses well.
  • [ ] Add support for the specific types of sentences mentioned, i.e. simple declarative sentences, assumptions, assertions, variable type specifications and alternate notation specifications. Alternatively, this may be achieved through the implementation of blocks.
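
For the determiner sub-issue, a minimal sketch of how a total match avoids the exception. `Det`, its cases and `fromWord` are hypothetical names for illustration and are not the actual `MathExpr.Determiner` definition; mapping 'each' to the same case as 'every' is likewise only an assumption.

```scala
// Hypothetical, simplified stand-in for a determiner type; this is NOT the
// real provingground.translation.MathExpr.Determiner.
object DeterminerSketch {
  sealed trait Det
  case object A     extends Det
  case object The   extends Det
  case object Every extends Det
  case object These extends Det
  case object Both  extends Det
  // Fallback case, so that an unexpected word never throws a MatchError.
  final case class Other(word: String) extends Det

  def fromWord(word: String): Det = word.toLowerCase match {
    case "a" | "an"        => A
    case "the"             => The
    case "every" | "each"  => Every // assumption: treat 'each' like 'every'
    case "these" | "those" => These
    case "both"            => Both
    case other             => Other(other)
  }
}
```

With the catch-all case, an unlisted determiner is carried along as `Other(...)` and can be reported later instead of aborting the whole parse.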

sundararajan-s avatar Aug 27 '20 07:08 sundararajan-s

  1. Does your first point mean we extend the MathExpr language (to allow contexts)?
  2. At present TeX expressions are always assigned the part of speech "proper noun". For a|b etc. we need a different part of speech (I do not know if a single word type is possible here).
  3. As you may have noticed, many cases are handled by pre-processing, ideally by merging tokens and assigning a correct part-of-speech tag.
  4. As you go along, please give examples and precise errors for the cases; e.g. unparsed, wrongly parsed, or part wrongly parsed so the whole is unparsed.

siddhartha-gadgil avatar Aug 27 '20 10:08 siddhartha-gadgil

  1. That is one possibility. I was thinking it would be simpler to just drop the specific words. I think this will be needed in the subsequent step, the conversion from MathExpr to HoTT.
  2. I do not think so either. There are certain simple cases for which a fix is possible; I shall experiment with those. If they do not cause any issues I may temporarily add them.
  3. Actually I did not notice much preprocessing. The preprocessing in the TeXParsed class is commented out, and besides that I did not find any preprocessing.
  4. I shall do that. I shall edit the original issue with those.

sundararajan-s avatar Aug 27 '20 10:08 sundararajan-s

  • The language should be extended if and only if the meaning of the sentence cannot be expressed. Otherwise one changes the parsing.
  • The POS tags are modified in a few cases. I think "such that" is replaced with "where". There isn't much preprocessing because there isn't much of anything specific.
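
A minimal sketch of this kind of pre-processing: merging a multi-word phrase into one token with a chosen part-of-speech tag. `Tagged`, `mergeRules` and `merge` are hypothetical names, and the tag `WRB` for "where" is an assumption based on Penn Treebank conventions, not taken from the ProvingGround code.

```scala
object PreprocessSketch {
  // A token: a word paired with a Penn Treebank style part-of-speech tag.
  final case class Tagged(word: String, tag: String)

  // Hypothetical rule table: a phrase (lower-case words) and the single
  // token that replaces it.
  val mergeRules: List[(List[String], Tagged)] =
    List(List("such", "that") -> Tagged("where", "WRB"))

  // Scan left to right, replacing the first matching phrase at each position.
  def merge(tokens: List[Tagged]): List[Tagged] = tokens match {
    case Nil => Nil
    case head :: tail =>
      val words = tokens.map(_.word.toLowerCase)
      mergeRules.collectFirst {
        case (phrase, replacement) if words.startsWith(phrase) =>
          replacement :: merge(tokens.drop(phrase.length))
      }.getOrElse(head :: merge(tail))
  }
}
```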

siddhartha-gadgil avatar Aug 27 '20 10:08 siddhartha-gadgil

  • In that case I don't think the language will need to be extended for that issue. However, the adverb issue will require an extension to the language.

  • The substitution was commented out, I shall re-enable it and see the results.

sundararajan-s avatar Aug 27 '20 11:08 sundararajan-s

If it was commented out, it is probably unnecessary due to a change somewhere, either in my code or in the Stanford parser.

siddhartha-gadgil avatar Aug 27 '20 12:08 siddhartha-gadgil

Added a new sub-issue regarding conjunct adjectives.

sundararajan-s avatar Sep 02 '20 12:09 sundararajan-s

The sub-issue regarding verbs inside TeX expressions has been solved by replacing the specific TeX expression, for example, $a > b$ with "$a > b$ is true". The correct TeX expression is selected by iterating over all possible swaps and checking which one parses.
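
A minimal sketch of such an exhaustive search, under stated assumptions: TeX expressions are found with a simple `$...$` regex, and `parses` is a hypothetical stand-in for whatever check decides that a sentence parses. None of these names come from the ProvingGround code.

```scala
import scala.util.matching.Regex

object TexSwapSketch {
  // Matches inline TeX such as $a > b$ (no nested dollar signs).
  private val tex: Regex = """\$[^$]+\$""".r

  // All rewrites of `sentence` in which some subset of the TeX expressions
  // is followed by " is true"; the unchanged sentence comes first.
  def candidates(sentence: String): Seq[String] = {
    val n = tex.findAllIn(sentence).length
    (0 until (1 << n)).map { mask =>
      var i = -1 // index of the TeX expression currently being visited
      tex.replaceAllIn(sentence, m => {
        i += 1
        val text =
          if ((mask & (1 << i)) != 0) m.matched + " is true" else m.matched
        // Escape '$' and '\' so they are not read as group references.
        Regex.quoteReplacement(text)
      })
    }
  }

  // First candidate accepted by the (hypothetical) parser check, if any.
  def firstParsing(sentence: String, parses: String => Boolean): Option[String] =
    candidates(sentence).find(parses)
}
```

A sentence with n TeX expressions yields 2^n candidates, so this is only practical for short sentences.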

sundararajan-s avatar Feb 15 '21 05:02 sundararajan-s

Nice. So every LaTeX expression is a noun. We need rules for adding "is true", but these should be simple to some extent, and amenable to machine learning.


siddhartha-gadgil avatar Feb 15 '21 10:02 siddhartha-gadgil

For now, I'm doing an exhaustive search, but I do think that in the future we could speed it up with some NLP methods.

sundararajan-s avatar Feb 16 '21 05:02 sundararajan-s