BobLd comments

Results: 143 comments of BobLd

hocr line combination

It's not really clear what the issue is, but I'd say you have several `word` entries in the hocr export that correspond to a single word in the document. This might...

hocr line combination

Can you provide the pdf document? Also, can you detail how you did the extraction (provide the code)? Without that, it will be extremely difficult to understand your issue and...

testing page.text

Hi Eliot, I just did a pull request with 2 Document Layout Analysis tools:

- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm

Might be a solution to this...

testing page.text

Another way to get the lines would be to use document analysis tools. Here is an example of what it could look like: ```csharp using (PdfDocument document = PdfDocument.Open(pdfPath)) {...

testing page.text

@Martin005, if you want to see some work in progress, you can have a look at my fork's branches.

Get captions from Images

Hi @famda, could you give more details about what you mean by *labels*?

Get captions from Images

@famda thanks for the explanation. The first warning is that what you're trying to achieve is far from straightforward... Maybe a starting point would be to have a look here...

Get captions from Images

## Caption candidate

Something that might be a good start is to check if a line is a caption candidate. This is how they do it: ```scala // Words that...

A mistake that potentially become a "performance booster" for WhitespaceCoverExtractor in DocumentLayoutAnalysis

Hi huzhiguan, thanks for creating the issue and taking the time to run the analysis. Results are very interesting. I need to run some tests to understand these empty intersections...

PDFTextStripper class

Hi @ivanicin, thanks for the feedback! Can you give an example of how you'd use `PDFTextStripper`? Is it to extract text? If that's the case, what would be the difference with...