BobLd

Results: 143 comments by BobLd

It's not really clear what the issue is, but I'd say you have several `word` elements in the hOCR export that correspond to a single word in the document. This might...

Can you provide the PDF document? Also, can you detail how you did the extraction (provide the code)? Without that, it will be extremely difficult to understand your issue and...

Hi Eliot, I just did a pull request with 2 Document Layout Analysis tools:
- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm

Might be a solution to this...

Another way to get the lines would be to use document analysis tools. Here is an example of what it could look like:
```csharp
using (PdfDocument document = PdfDocument.Open(pdfPath)) {...
```

@Martin005, if you want to see some works in progress, you can have a look at my fork's branches

Hi @famda, could you give more details about what you mean by *labels*?

@famda thanks for the explanation. The first warning is that what you're trying to achieve is far from straightforward... Maybe a starting point would be to have a look here...

## Caption candidate
Something that might be a good start is to check if a line is a caption candidate. This is how they do it:
```scala
// Words that...
```

Hi huzhiguan, thanks for creating the issue and taking the time to run the analysis. Results are very interesting. I need to run some tests to understand these empty intersections...

Hi @ivanicin, thanks for the feedback! Can you give an example of how you'd use `PDFTextStripper`? Is it to extract text? If so, what would be the difference with...