normcap Add support for Latex/Math/Equations

Add support for Latex/Math/Equations

Open Vilhelm-Ian opened this issue 2 years ago • 1 comments

Describe your problem:

While studying math from PDFs, it would be nice to be able to copy equations.

Solution you'd like to see:

Train the model on math equations.

Alternatives you considered:

No response

Additional information or remarks:

No response

Oct 03 '23 05:10 Vilhelm-Ian

Hi @Vilhelm-Ian, thanks for your feature request!

TL;DR

I would love to see this integrated into NormCap, but due to its complexity, I probably won't have enough time to work on this on my own. I'm definitely open to contributions here, though.

Some background

Tesseract, the OCR framework I leverage in NormCap, initially had some support for detecting equations. But its results were quite weak, so that support was abandoned. I doubt that it is now feasible to train a Tesseract model for decent math detection.

But it definitely would be possible to integrate an additional OCR framework into NormCap, which is optimized for LaTeX/Equations. Some open source frameworks actually deliver quite promising results, e.g. pix2text or LaTeX-OCR.

However, the difficulty is finding one that satisfies NormCap's non-functional requirements:

- Feasible packaging for all systems/platforms (macOS/Linux/Windows, x64/M1)
- Few dependencies (in terms of number and file size)
- 100% offline (except maybe for model downloading)

Unfortunately, this probably rules out all torch- or tensorflow-based solutions, as packaging and dependencies are likely a nightmare. With online services also ruled out, I'm not aware of any framework satisfying those requirements. However, in theory it should be possible to convert a torch/tensorflow model into a framework-agnostic format like ONNX and use a much leaner runtime for inference. I'm just not aware of any maintained project that does this.
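As a rough illustration of the integration idea (not NormCap's actual code), an equation backend could be treated as optional: it is only built when a lean inference runtime is installed, and everything else keeps going through the regular text OCR. All names below (`Recognizer`, `OcrDispatcher`, `build_dispatcher`, `runtime_available`) are hypothetical and only serve the sketch; the only real API used is `importlib.util.find_spec` from the standard library, and `onnxruntime` as the default module name is an assumption about which runtime one might pick.

```python
from __future__ import annotations

import importlib.util
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type: a backend takes raw image bytes and returns recognized text.
Recognizer = Callable[[bytes], str]


def runtime_available(module_name: str = "onnxruntime") -> bool:
    """Return True if the optional top-level runtime module is installed."""
    return importlib.util.find_spec(module_name) is not None


@dataclass
class OcrDispatcher:
    """Route a capture to the equation backend if one is set, else to text OCR."""

    text_backend: Recognizer
    equation_backend: Optional[Recognizer] = None

    def recognize(self, image: bytes, want_equation: bool = False) -> str:
        if want_equation and self.equation_backend is not None:
            return self.equation_backend(image)
        # Fall back to plain text OCR when no equation backend exists.
        return self.text_backend(image)


def build_dispatcher(
    text_backend: Recognizer,
    equation_backend_factory: Callable[[], Recognizer],
    runtime_module: str = "onnxruntime",
) -> OcrDispatcher:
    """Construct the equation backend only when its runtime is installed."""
    if runtime_available(runtime_module):
        return OcrDispatcher(text_backend, equation_backend_factory())
    return OcrDispatcher(text_backend)
```

With a layout like this, the heavy dependency stays optional: users without the runtime would still get the existing text OCR, and the equation model would only be loaded on machines where it is actually available.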

Those are just some initial thoughts; I'm interested to read opinions from others! :slightly_smiling_face:

Oct 03 '23 12:10 dynobo
