java-probabilistic-earley-parser icon indicating copy to clipboard operation
java-probabilistic-earley-parser copied to clipboard

ERROR CHECKING IMPLEMENTATION

Open ghost opened this issue 8 years ago • 31 comments

Can you implement an error handling?

In case of error, we can be insert correct token and using synchronizing token method. 7-parsing-error.pdf

ghost avatar Jan 25 '17 21:01 ghost

Can you help me to implement two methods for error recovery?

I need to:

insert function (insert token pag. 3) synchronizing token (pag. 4). Here you can find what I'm saying. 7-parsing-error.pdf

ghost avatar Jan 25 '17 21:01 ghost

Eh. Are you a spam bot?

digitalheir avatar Jan 25 '17 21:01 digitalheir

Oh OK my Android device wouldn't open the PDF so I thought it was a virus, but I can open it now. Let me see.

digitalheir avatar Jan 25 '17 22:01 digitalheir

This is a very nice idea. A bit what I was getting at in #1 except I didn't put any thought in it. We could definitely implement this. You may be ableto hack ittogether using the postScan function.

The slides are about shift-reduce parsers. I can't offhand say how they compare to earley error handling and probabilities

digitalheir avatar Jan 25 '17 22:01 digitalheir

Idea: implement a ParseIterator that performs the predict, scan, complete cycle based on the user-passed iterator and additionally contains a stack for tokens to insert or "delete" dynamically at runtime

digitalheir avatar Jan 25 '17 22:01 digitalheir

A very nice idea! Congratulations!

ghost avatar Jan 26 '17 08:01 ghost

I don't really understand yet how error handling should work for Earley parsers (and I don't really know about shift reduce parsers in general). Do you have any idea how this kind of error handling fits in with Earley parsing?

Like, is it when a certain token is scanned, but it can't advance any active states? That was my interpretation of it. But it can happen that the parser happily predicts, scans and completes, but only at the very end find out it can't actually make a sentence out of the states (ie. it can't complete the goal state). I don't really know how to take care of this except telling the user: hey, I tried to do this but it's not a sentence.

Is the synchronizing token idea something where you add rules to your grammar say: I know this is an error, and you deal with it in this way?

I'll need to read more about how to do this.

digitalheir avatar Jan 26 '17 22:01 digitalheir

Based on what he says to the PDF, the error handling in the Earley parser can take place in this manner. Given a context-free grammar and an input string, the parser performs its prediction cycle, scanning and completion (the classic Earley parser). Error handling is a qualcos most advanced in the sense that:

The user enters a string, the parser must say if the input string is properly generated by the grammar (if so it's all ok). If the input string has some words (tokens) that do not belong to the grammar, the parser can be taken:

  1. eliminate the wrong token (for example by replacing it with a label "#" to indicate that it is not recognized and therefore eliminated)

  2. insert in place of the wrong token, another token more likely (symbol of production of the same head with the highest probability). If every symbol has the same probability, we must use directly synchonizing token method.

ghost avatar Jan 26 '17 22:01 ghost

The token synchronizing technique is used when we can not handle the error situation. A synchronizing token may be for example a sentence recognizer end as the dot ( "."). If we can not correct the token, we arrive at a fixed point (synchronizing token), we throw it all away and start over. If we able to correct the input string, print the correct string.

ghost avatar Jan 26 '17 23:01 ghost

Do you understand what I mean?

ghost avatar Jan 30 '17 15:01 ghost

If I understand correctly: when the parser reads a sentence left to right it might encounter a token E that it cannot understand. It will go into a kind of panic mode and back up until it get to a point where it can say "ignore all tokens from here, past E, up to the synchronizing token", and resume parsing.

It is difficult to determine where the point is from where the parser can resume parsing, but the presentation that you sent suggests that the user makes add rules to the grammar that specify what these points are.

So I will add a special type of <error> terminal which allows you to write "backup" rules in case something goes wrong. The parser will then act as if the predict phase predicted the backup rule.

Sum -> Num Plus Num
Plus -> +
Num -> 1
Num -> <error>

A sentence like 1 + 1 will then generate the following parse tree:

       Sum
 /      |     \
Num    Plus   Num
 |      |     |
1       +     1

And a sentence like 1 + 2 will then generate the following parse tree:

       Sum
 /      |     \
Num    Plus   Num
 |      |     |
1       +     <error>

I'm making some progress on this.

I've already implemented two kinds of simpler error handling in #7, in case of unknown tokens: either just drop the unknown token or make it have any terminal type, and I have a pretty good idea of how to implement these synchronizing tokens.

digitalheir avatar Feb 11 '17 16:02 digitalheir

Then another point: how to insert? At what point do you decide you want to insert a token, and at what position will you insert the token?

Your source says

Burke-Fisher error repair tries every possible single-token insertion, deletion or replacement at every point in the input up to K tokens before the error is detected

That's pretty spicy, I don't know how useful this will be.

digitalheir avatar Feb 11 '17 17:02 digitalheir

A synchonizing token is for example a dot "." in a sentence. If my input string, for example, is:

The pen is not on the table. It's in the pencil case.

When parser reads it, from left to right, encounters the dot at end string and, if there is an error ("not" doesn't belong to CFG), parser ignores thr sentence until the dot and STARTS from the word after dot (starts from "it's in...). This method is very useful when isn't possible handle errors with method like insert o delete unknown tokens.

ghost avatar Feb 11 '17 19:02 ghost

Yeah, sorry my example didn't really illustrate that, but I understand. I was conceptualising how this works for Earley because it's a different kind of parser than shift-reduce parsers.

But it can probably work beautifully. The parser just needs a new category for nonlexical tokens.

digitalheir avatar Feb 11 '17 21:02 digitalheir

The other comment btw was not about synchronizing tokens but inserting tokens on error, which you said you need in your first comment. I'm thinking this may be a more application specific function? Since you can insert anything, anywhere. It's not really obvious what to do.

digitalheir avatar Feb 11 '17 22:02 digitalheir

This application can be use, for example, for text analisys in which user may correct errors and handling it by different modalities. If my CFG grammar is:

PHRASE --> SUBJECT VERB COMPLEMENT SUBJECT --> i (0.16) SUBJECT --> you (0.16) SUBJECT --> he (0.16) SUBJECT --> she (0.16) SUBJECT --> it (0.16) SUBJECT --> we (0.16) SUBJECT --> you (0.16) SUBJECT --> they (0.16) VERB --> am (0.5) VERB --> are (0.5) COMPLEMENT --> at home (0.33) COMPLEMENT --> at school (0.33) COMPLEMENT --> at the cinema (0.33)

And my input string is:

  1. We are at home

Parser doesn't detect any error because there is no one and all string words (symbols) belong to grammar written at the top.

  1. You are at the disco

Parser detects that word "disco" don't belong to grammar and can decide to substiture "disco" with another word which has high probability (according to our grammar structure). In case of equiprobable words, we can chose one of these. We can correct this string as:

You are AT THE CINEMA.

  1. You are at the cinema. You aren't at home. (the dot is synchronizing token)

If the parser can't correct errors, it can ignore all words until "." and continue analisys after it. Supposing that "You are at the cinema" can't be correct, we continue with "You aren't at home", which can be correct inserting high probably word (if has errors).

Synchronizing token can be used as "recognizer end string" like ("." , ";", etc...), symbol that can help to handle errors in input string using probabilistic earley parser.

ghost avatar Feb 14 '17 15:02 ghost

I will implement this portion hopefully this week:

image

digitalheir avatar Feb 14 '17 15:02 digitalheir

Some good references about synchronizing token:

http://people.cs.vt.edu/ryder/415/lectures/parsing4.pdf http://www.diit.unict.it/users/carchiol/traduttori1314/L09.pdf https://www.cs.clemson.edu/course/cpsc827/material/LLk/LL%20Error%20Recovery.pdf http://teaching.idallen.com/cst8152/98w/panic_mode.html http://staff.cs.upt.ro/~chirila/teaching/upt/c51-pt/aamcij/7113/Fly0024.html http://sisinflab.poliba.it/scioscia/teaching/flc/FLC_14_Error_handling.pdf http://www.site.uottawa.ca/~bochmann/SEG-2106-2506/Notes/M2-3-SyntaxAnalysis/Error-recovery.ppt

Algorithms for Compiler Design.pdf

ghost avatar Feb 15 '17 17:02 ghost

Did you read these references? Did them help you?

ghost avatar Feb 20 '17 22:02 ghost

Hi! Are there any news about implementation?

ghost avatar Feb 22 '17 19:02 ghost

Hi @digitalheir, did you forget about me? :-)

ghost avatar Feb 27 '17 14:02 ghost

I have not forgotten about you. I work a full time job and only work on this project as one of multiple side projects in my free time.

I have made a first success yesterday with synchronizing tokens. I only need to think of a good way of flagging a substring as errored for the backwards part of Viterbi.

I also started reading about Marpa because I was interested in its "ruby slippers" way of error handling but I haven't read far enough yet.

http://dinhe.net/~aredridel/.notmine/PDFs/Parsing/KEGLER,%20Jeffrey%20-%20Marpa,%20a%20practical%20general%20parser:%20the%20recognizer.pdf

digitalheir avatar Feb 28 '17 08:02 digitalheir

I feel like I already half-implemented Jeffrey Kegler's ruby slippers error handling: when the parser is stumped, we know exactly what we expect.

A full description of the parse was also available in Marpa::XS's progress reports, which gave this information in text form. This was wonderful for tracing parsing issues, but it was not a convenient form for post-processing.

As of its latest release, Marpa::R2 makes available the "raw" data behind the progress reports. This is ideal for post-processing. My hope is that users will use this to invent their own advanced parsing tricks -- customized versions of the Ruby Slippers.

I'm steadily moving towards this, so that users can determine themselves whether to add, remove or replace tokens in a callback, based on what the error was and what the parser expected.

digitalheir avatar Feb 28 '17 12:02 digitalheir

Nice idea! Thanks for your answers and time! Can you upload the current project version? I would like to do some test about a particular grammar to verify synchronizing token correctness. I'm going to write a cfg for a java statement, for example while or for loop. Thanks again!

ghost avatar Feb 28 '17 14:02 ghost

Can you upload it, please?

ghost avatar Mar 06 '17 15:03 ghost

Hi @digitalheir, do you use IntelliJ IDEA as Java IDE for this project?

ghost avatar Mar 09 '17 17:03 ghost

Hi,

Yes, I use IDEA. The project gives it away because I forgot to ignore the .iml file in Git.

I have pushed my experimenting so far with error handling, but it is not ready to use. Namely, the tree generation algorithm doesn't yet receive the correct reference to the token for words leading up to an a non-lexical token. This is an easy fix.

More fundamentally, I cannot implement panic mode plainly because shift-reduce parsers only work on non-ambiguous grammars.

My main gripe is this. Consider the following grammar:

S -> AB C; (0.5)
S -> A B ; (0.5)

A -> a a ; (0.5)
B -> b b ; (0.5)
C -> c c ; (0.5)

AB -> a a ; b b ; (0.5)

A -> <error> ; (0.5)
B -> <error> ; (0.5)
AB -> <error> ; (0.5)

On the input a a ; z b ; c c ; with goal S. When encountering z, the parser could apply both B -> <error> ; and AB -> <error> ;. We would like to consider AB because that's the only way to make S. So we need to consider both error rules.

Now consider the same grammar, except the first rule is S -> AB; on the same input. We have established that we want to consider both error rules, and in this case both make a valid parse. However, we want the probability of AB -> <error> ; being applied to be lower than that of B -> <error> ;, because we want to keep the number of ignored tokens minimal.

That's why I have decided to make the error rule probability multiply every time a token is ignored. The inner score of applying B -> <error> ; is then be 0.5^2, and the inner score of AB -> <error> ; is 0.5^4.

digitalheir avatar Mar 10 '17 09:03 digitalheir

Thanks! I' ll wait the final project ☺

ghost avatar Mar 10 '17 10:03 ghost

Can you upload the command line with these new features? Thanks!

ghost avatar Mar 10 '17 17:03 ghost

IntelliJ IDEA give me these errors:

schermata 2017-03-10 alle 19 03 08

ghost avatar Mar 10 '17 18:03 ghost

Hey, I've had another success. I need to clean up some stuff to make sure probability is calculated correctly and the chart is filled with all relevant states, along with adding some tests to verify ecerything works correctly. But the weighty theoretical part is about done.

Consider grammar

S -> a S a (0.9)
S -> c
S -> a <error> a (0.5)
S -> b <error> b (0.5)

On input a b c b a s b b a.

Parser calculates:

└── <start>
    └── S
        ├── a (a)
        ├── S
        │   ├── b (b)
        │   ├── <error> (c)
        │   ├── <error> (b)
        │   ├── <error> (a)
        │   ├── <error> (s)
        │   ├── <error> (b)
        │   └── b (b)
        └── a (a)

with p = .9 * .5^5

If you can't wait to get your hands dirty with the experimental code, you can pull now. It should compile and run.

digitalheir avatar Mar 14 '17 21:03 digitalheir