PDF to Markdown doesn't preserve text relationship or indentation

Open NikhilVerma opened this issue 1 year ago • 7 comments

Here's a sample PDF - https://www.bnm.gov.my/documents/20124/963937/Risk+Management+in+Technology+(RMiT).pdf/810b088e-6f4f-aa35-b603-1208ace33619?t=1592866162078

Converting it with markitdown produces several parsing errors, which I will try to highlight below.

Line elements aren't preserved

Image

Output

G  8.5

To  promote  effective  technology  discussions  at  the  board  level,  the
composition  of  the  board  and  the  designated  board-level  committee  should
include at least a member with technology experience and competencies.

Indentations are ignored and text ordering is altered

Image
The TRMF must include the following:
(a)  clear definition of technology risk;
(b)  clear responsibilities assigned for the management of technology risk at
different  levels  and  across  functions,  with  appropriate  governance  and
reporting arrangements;
the  identification  of  technology  risks  to  which  the  financial  institution  is
exposed,  including  risks  from  the  adoption  of  new  or  emerging
technology;

(c)

(d)  risk classification of all information assets/systems based on its criticality;
(e)  risk measurement and assessment approaches and methodologies;
(f)
(g)  continuous monitoring to timely detect and address any material risks.

List elements are ignored and presented in entirely new lines

Image
1.

2.

3.

4.

5.

6.

The assurance shall be conducted by an independent external service provider
(ESP) engaged by the financial institution.

The independent ESP must understand the proposed services, the data flows,
system architecture, connectivity as well as its dependencies.

The  independent  ESP  shall  review  the  comprehensiveness  of  the  risk
assessment performed by the financial institution and validate the adequacy of
the control measures implemented or to be implemented.

The Risk Assessment Report (as per Part D in Appendix 7) shall state among
others, the scope of review, risk assessment methodology, summary of findings
and remedial actions (if any).
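Both the detached `(c)` label and the orphaned `1.` to `6.` markers above look like cases where a marker was extracted as its own block, separate from the item text it sits beside. A minimal sketch of one possible fix, assuming each extracted block carries its vertical extent on the page: attach every marker-only block to the body block it overlaps most. The `Block` type, the `attach_markers` helper, and the coordinates are hypothetical illustrations, not part of markitdown or of any PDF library.

```python
import re
from dataclasses import dataclass

# Matches a block that is only a list marker, e.g. "1.", "(c)", "a)".
MARKER = re.compile(r"^\(?[A-Za-z0-9]{1,3}[.)]$")


@dataclass
class Block:
    """Hypothetical extracted text block with its vertical extent (points, growing downward)."""
    text: str
    top: float
    bottom: float


def attach_markers(blocks):
    """Prefix each marker-only block onto the body block it vertically overlaps most."""
    markers = [b for b in blocks if MARKER.match(b.text.strip())]
    bodies = [b for b in blocks if not MARKER.match(b.text.strip())]
    prefix = {}  # index into bodies -> marker text
    loose = []   # markers that overlap no body are kept as their own entries
    for m in markers:
        scored = [
            (min(m.bottom, b.bottom) - max(m.top, b.top), i)
            for i, b in enumerate(bodies)
        ]
        score, i = max(scored, default=(0.0, -1))
        if score > 0:
            prefix[i] = m.text.strip()
        else:
            loose.append(m)
    items = [
        (b.top, f"{prefix[i]} {b.text}" if i in prefix else b.text)
        for i, b in enumerate(bodies)
    ]
    items += [(m.top, m.text.strip()) for m in loose]
    # Emit everything in top-to-bottom page order.
    return [text for _, text in sorted(items)]


# Made-up coordinates: "(c)" is vertically centred next to a three-line item.
blocks = [
    Block("the identification of technology risks to which the financial "
          "institution is exposed, including risks from the adoption of new "
          "or emerging technology;", top=140, bottom=176),
    Block("(c)", top=152, bottom=164),
    Block("(d) risk classification of all information assets/systems based "
          "on its criticality;", top=182, bottom=194),
]

for line in attach_markers(blocks):
    print(line)
```

Matching on vertical overlap rather than on extraction order is the point of the sketch: a marker centred against a multi-line item would still land on the right paragraph even if the extractor emits it before or after the text.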

PDF parsing will always be a PITA, but I think these issues can be addressed by tracking the locations of the elements. Right now, I feel it simply loops over the textual elements and uses simple algorithms to merge them together.
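To make the "track the locations" idea concrete, here is a minimal sketch, assuming the PDF backend can report a left and top coordinate for each text fragment. It groups fragments into visual lines by their `y` position and derives indentation from the `x` offset. The `Span` type, the `rebuild_lines` helper, the tolerance values, and the coordinates are all hypothetical and not taken from markitdown's code.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text fragment with the top-left corner of its bounding box."""
    text: str
    x: float  # left edge, in points
    y: float  # top edge, in points, growing down the page


def rebuild_lines(spans, y_tol=3.0, indent_step=18.0):
    """Group spans into visual lines by y, then indent each line by its left edge."""
    if not spans:
        return []
    rows = []  # each row holds the spans that share roughly the same y
    for span in sorted(spans, key=lambda s: s.y):
        if rows and abs(span.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(span)
        else:
            rows.append([span])
    left_margin = min(s.x for s in spans)
    lines = []
    for row in rows:
        row.sort(key=lambda s: s.x)  # left-to-right reading order within a line
        depth = round((row[0].x - left_margin) / indent_step)
        lines.append("    " * depth + " ".join(s.text for s in row))
    return lines


# Made-up coordinates; the fragments arrive out of reading order on purpose.
spans = [
    Span("clear definition of technology risk;", x=108, y=114.4),
    Span("The TRMF must include the following:", x=72, y=100),
    Span("(a)", x=90, y=114),
]

lines = rebuild_lines(spans)
print("\n".join(lines))
```

With these inputs the label `(a)` and its text end up on one line, indented one level under the introductory sentence. A real document would need the tolerance and indent step tuned to its font size and layout.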

NikhilVerma Dec 17 '24 04:12

Is anyone working on this?

Utsav-Mehta Feb 17 '25 04:02

@gagb is anyone working on this? Feel free to assign this to me!

Utsav-Mehta Feb 19 '25 15:02

Assigned! Thanks for taking a look! We are a small team and this is an OSS project, so contributions are very welcome. Thanks again!

gagb Feb 19 '25 18:02

Thanks for assigning, @gagb. It's a pleasure. I am considering integrating vision models to handle text relationships and indentation. Let me know if this sounds like a good starting point and if you have any suggestions!

Utsav-Mehta Feb 19 '25 18:02

Sounds like a good experiment. Creating a plugin would be a good first step. I recommend using open-source vision models.

gagb Feb 19 '25 21:02

Thanks for the insights. I will create a plugin first, as you suggested.

Utsav-Mehta Feb 20 '25 23:02

Hey @Utsav-Mehta, if you are not actively working on it, I would like to pick up this issue. @gagb, you can assign it to me if possible.

iamskp11 Sep 16 '25 13:09