Sambit Kumar Dash

Results 65 comments of Sambit Kumar Dash

> You can initialize `:clipping_rect` in `pdPageExtractText`
>
> You can go to this location:
> https://github.com/sambitdash/PDFIO.jl/blob/95000b69625cfbd51cf7825470def0d4df9192aa/src/PDPageElement.jl#L653
>
> This code will for example exclude all Italic fonts.
>
> ...

I think this is one of the authoritative models in this domain: https://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485 There may be later ones, but the Punkt tokenizer in NLTK is a similar implementation.

@hhaensel I like the idea of what you are saying, but I do not think it will work for all use cases. It may, however, work for the files...

I really like the MNIST models, so please do not remove them if you can. The reason is that they are probably the only models that can run reasonably on a CPU....

https://github.com/sambitdash/PDFIO.jl/commit/6367aa667fa37b1cb653a165e3957bd5e1b1b6d9 fixes it, but no test cases were added as the file is no longer accessible.

Add test cases for AGL.

[isle-of-man-inflation-report-november-2021.pdf](https://github.com/sambitdash/PDFIO.jl/files/10070475/isle-of-man-inflation-report-november-2021.pdf) Adding a copy of the file, which I found by Googling. However, this version does not have an AGL code. The suggested file is no longer on the site. We...

@vargonis you can use `pdPageEvalContent` and get the content tree. The content tree has all the bounding box information at a text run level.

@bdeonovic Sorry for my delay in looking into the file. The CMap file in the PDF does not conform to the spec. See Figure 6 in the attached spec: [5014.CIDFont_Spec.pdf](https://github.com/sambitdash/PDFIO.jl/files/10013195/5014.CIDFont_Spec.pdf) That's the...