
I think the training is set up the wrong way and the accuracy calculation is incorrect

Open MC-E opened this issue 5 years ago • 4 comments

Hi, I'm a beginner and I have a lot of confusion about your model. When you calculate accuracy and loss, why do you keep the "pad" tokens in the result? They are not part of the ground truth (e.g. result = "1+2+x+3 pad pad pad pad ...", label = "a-b-c=6 pad pad pad pad ..."). So, with the wrong labels and this wrong calculation method, "pad" contributes most of the accuracy; when I remove "pad" from the end of the output, the accuracy only reaches about 0.08.

MC-E avatar Aug 12 '19 14:08 MC-E

The padding is only used to be able to batch sequences of different lengths, and it is very easy for the model to learn that once it has seen all characters, the rest should be padding.
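To illustrate what the padding is for, here is a minimal sketch of batching variable-length sequences with PyTorch's pad_sequence (the PAD_IDX value and the token ids are made-up examples, not this repository's actual code):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # hypothetical index of the <PAD> token

# Three token sequences of different lengths (hypothetical vocabulary ids).
sequences = [
    torch.tensor([5, 7, 2]),
    torch.tensor([4, 1]),
    torch.tensor([9, 3, 8, 6]),
]

# pad_sequence right-pads every sequence to the length of the longest one,
# so they can be stacked into a single (batch, max_len) tensor.
batch = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
print(batch)
# tensor([[5, 7, 2, 0],
#         [4, 1, 0, 0],
#         [9, 3, 8, 6]])
```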

The paper didn't use CTC and at the time PyTorch didn't have an implementation of CTC, but I would definitely use the CTC loss for any sort of variable-length text output, although it's not strictly necessary.
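A minimal sketch of what that could look like with torch.nn.CTCLoss (the shapes and values are purely illustrative, not code from this repository):

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 20   # input length, batch size, classes (including the blank)
S = 12                # maximum target length in this batch

ctc_loss = nn.CTCLoss(blank=0)  # class 0 is reserved as the CTC blank

# Per-timestep log-probabilities (T, N, C), e.g. from log_softmax on the output.
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()

# Padded targets (N, S) with their true lengths; the blank (0) never appears.
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # CTC marginalises over alignments, so padding never enters the loss
```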

For the accuracy, there are different metrics I've used:

Excerpt from: notes/evaluate-first-implementation.ipynb

Evaluation of the model

Different error rates are used to compare the results. The Token error rate is the length-normalised edit distance of the whole sequence, where each element is one LaTeX token; that means that for example \sum is a single token. The tokens also include the special tokens <SOS>, <EOS> and <PAD>. Since these are not relevant for the LaTeX string, they are excluded in the Token error rate (no special tokens) metric. Finally, the Symbol error rate is the length-normalised edit distance of the mathematical symbols in the LaTeX string. This means that all tokens that are purely for formatting or grouping (e.g. \left, {, \mbox) are excluded. Sub- and superscripts (i.e. ^ and _) are also excluded, even though they technically change the semantics of the expression.

Besides the error rates, there is also the expression recognition rate, which is how many expressions have been recognised correctly, that is without a single error in them.
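A minimal sketch of how such a length-normalised error rate can be computed (the SPECIAL set and the helper names are illustrative assumptions, not the repository's actual metric code):

```python
SPECIAL = {"<SOS>", "<EOS>", "<PAD>"}  # assumed names of the special tokens

def edit_distance(a, b):
    """Classic Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def token_error_rate(prediction, reference, strip_special=True):
    """Edit distance normalised by the reference length."""
    if strip_special:
        prediction = [t for t in prediction if t not in SPECIAL]
        reference = [t for t in reference if t not in SPECIAL]
    return edit_distance(prediction, reference) / max(len(reference), 1)

pred = ["<SOS>", "\\cos", "(", "x", ")", "<EOS>", "<PAD>"]
ref = ["<SOS>", "\\cos", "(", "x", "+", "y", ")", "<EOS>"]
print(token_error_rate(pred, ref))  # 2 edits / 6 reference tokens ≈ 0.33
```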

Sequence example

`\mbox{\cos} \left( x + y \right)`

| All tokens | No special tokens | Symbols |
| --- | --- | --- |
| `<SOS>` | `\mbox` | `\cos` |
| `\mbox` | `{` | `(` |
| `{` | `\cos` | `x` |
| `\cos` | `\left` | `+` |
| `\left` | `(` | `y` |
| `(` | `x` | `)` |
| `x` | `+` | |
| `+` | `y` | |
| `y` | `\right` | |
| `\right` | `)` | |
| `)` | | |
| `<EOS>` | | |
| `<PAD>` | | |
| `<PAD>` | | |

During training only the all tokens accuracy has been used, but evaluate.py reports all three metrics mentioned above. Those should also be reported during training, but for simplicity the all tokens accuracy was good enough, since the padding is easy to learn, and from then on the improvements are purely on the actual symbols.

As for the incorrect labels, I am not aware of any being wrong; I've extracted them from the official data with the data_tools/extract_groundtruth.py script.

jungomi avatar Aug 12 '19 15:08 jungomi

I know you want to use "pad" to keep the same length, but "pad" is not part of the ground truth, so you need to remove it when you calculate loss and accuracy. The ground truth you get from "data_tools/extract_groundtruth.py" is ok, but you add too many "pad" tokens to the end of the ground truth when you train the model. If you print the output of the network, the ground truth part is almost all wrong and the "pad" part is almost all right. That's what I'm talking about: "pad" contributes most of the accuracy.
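For example, a minimal sketch of what I mean by removing "pad" from the loss and accuracy (the PAD_IDX value and the tensor shapes are assumptions, not the actual code):

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # hypothetical index of the <PAD> token

# Hypothetical decoder output (batch, seq_len, vocab) and padded labels.
logits = torch.randn(2, 6, 10)
labels = torch.tensor([[3, 5, 2, 0, 0, 0],
                       [7, 1, 4, 8, 0, 0]])

# ignore_index makes the loss skip every padded position entirely.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss = criterion(logits.reshape(-1, 10), labels.reshape(-1))

# Accuracy over the real tokens only, so <PAD> cannot inflate it.
predictions = logits.argmax(dim=-1)
mask = labels != PAD_IDX
accuracy = (predictions[mask] == labels[mask]).float().mean()
```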

MC-E avatar Aug 12 '19 16:08 MC-E

PyTorch now has a CTC loss. Would you like to add it to your model? May I ask what the difference is between this repository and https://github.com/JianshuZhang/WAP? Which repo should I choose if I want to use it in my work? Thanks a lot and I look forward to your reply.

Zhang-O avatar Aug 13 '19 01:08 Zhang-O

I will not change it to use CTC loss as I will no longer modify anything on this model, since it was from a paper that I found interesting and I wanted to explore the attention mechanism. My goal was never to optimise the accuracy of the model.

The biggest difference to https://github.com/JianshuZhang/WAP is that this has been implemented in PyTorch which makes it much easier to explore and modify the architecture to your liking, at least from my experience. If you want to build a model around this architecture, I would say you should have a look at the implementation of this model in https://github.com/jungomi/math-formula-recognition/blob/master/model.py, which is modelled after the figure found in the paper: https://github.com/jungomi/math-formula-recognition/blob/master/notes/figures/dense-encoder.png.

Personally, I don't think I would use this architecture, because I think the complexity is too high for the actual benefit it gives. Any decent convolutional encoder with some stacked BiLSTMs as decoder will probably do just as well.
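To make that concrete, here is a rough sketch of the kind of baseline I mean (all layer sizes and the pooling strategy are arbitrary assumptions, not a tuned or tested architecture):

```python
import torch
import torch.nn as nn

# A small convolutional encoder followed by stacked BiLSTMs.
class ConvBiLSTM(nn.Module):
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64, hidden_size, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, images):
        # images: (batch, 1, height, width) greyscale formula images
        features = self.encoder(images)       # (batch, 64, h', w')
        # Collapse the height and treat the width as the sequence dimension.
        features = features.mean(dim=2)       # (batch, 64, w')
        features = features.permute(0, 2, 1)  # (batch, w', 64)
        outputs, _ = self.lstm(features)      # (batch, w', 2 * hidden)
        return self.classifier(outputs)       # per-step class scores

model = ConvBiLSTM(num_classes=100)
scores = model(torch.randn(2, 1, 64, 256))  # -> (2, 64, 100) after pooling
```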

If you just want a working model with good accuracies, this is definitely not it. Maybe https://github.com/JianshuZhang/WAP will work for that, but I cannot affirm that, since I have never used it myself. Another interesting approach I've seen recently, which also outperforms the paper that the model of this repository is based on, is Stroke extraction for offline handwritten mathematical expression recognition, which extracts the strokes from the images and uses MyScript to recognise the text. The code is available at https://github.com/chungkwong/mathocr-myscript, and that seems much more like an application that could be used by an end user, instead of just being for research.

jungomi avatar Aug 13 '19 10:08 jungomi