Cesar Berrospi Ramis
Thanks @remod for submitting this issue. Formatted text in HTML is indeed skipped unless it is part of a paragraph or another supported tag. This will be addressed soon together...
@remod please note that this request is still on our radar. There is a similar PR that should be finalized and merged soon, https://github.com/docling-project/docling/pull/1411. After that, we will ensure that...
@Ra5hidIslam you are very welcome to help with this issue. You could start with the current version on `main`. Some aspects have already been addressed, like the formulas, but the...
Some additional comments:
- Rename the input format objects from `PUBMED` to `XML_PUBMED` for clarity (first the type, then the collection)
- Rename test files from `nxml` to `xml`...
> This is good. My last question on this topic, do we have already a way for a different XML format?

@dolfim-ibm This is addressed in #606, which should...
Thanks @kush-gupt for your contribution. Regarding the AI-generated code in the test file, while we do not generally prohibit AI-generated code, we would require you to:
- Ensure that the terms...
This would be a great feature. Please feel free to contribute.
For more background on this issue, I am attaching links to PDF files where I see the same behavior:
- File `2382400.pdf` within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/2000-2999/2382.zip
- File `2028148.pdf` within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/2000-2999/2028.zip
- File...
> @ceberam Thanks for the detailed review. I will wait for your response until next week. Meanwhile, do you have any other issues which I can work on?

@akanshajain231999 you...
@miohtama thanks for reporting this issue. We are currently checking the performance of the HTML table parsing, which was updated in recent releases and could be the root cause...