markitdown PDF to Markdown doesn't preserve text relationship or indentation

Here's a sample PDF - https://www.bnm.gov.my/documents/20124/963937/Risk+Management+in+Technology+(RMiT).pdf/810b088e-6f4f-aa35-b603-1208ace33619?t=1592866162078

However, the converted output has several parsing errors, which I will try to highlight below.

Line elements aren't preserved

Output

G  8.5

To  promote  effective  technology  discussions  at  the  board  level,  the
composition  of  the  board  and  the  designated  board-level  committee  should
include at least a member with technology experience and competencies.

Indentations are ignored and text ordering is altered

The TRMF must include the following:
(a)  clear definition of technology risk;
(b)  clear responsibilities assigned for the management of technology risk at
different  levels  and  across  functions,  with  appropriate  governance  and
reporting arrangements;
the  identification  of  technology  risks  to  which  the  financial  institution  is
exposed,  including  risks  from  the  adoption  of  new  or  emerging
technology;

(c)

(d)  risk classification of all information assets/systems based on its criticality;
(e)  risk measurement and assessment approaches and methodologies;
(f)
(g)  continuous monitoring to timely detect and address any material risks.

List markers are detached from their items and emitted on lines of their own

1.

2.

3.

4.

5.

6.

The assurance shall be conducted by an independent external service provider
(ESP) engaged by the financial institution.

The independent ESP must understand the proposed services, the data flows,
system architecture, connectivity as well as its dependencies.

The  independent  ESP  shall  review  the  comprehensiveness  of  the  risk
assessment performed by the financial institution and validate the adequacy of
the control measures implemented or to be implemented.

The Risk Assessment Report (as per Part D in Appendix 7) shall state among
others, the scope of review, risk assessment methodology, summary of findings
and remedial actions (if any).

PDF parsing will always be a PITA, but I think these issues can be addressed by tracking the locations of the elements. Right now it seems to simply loop over the text elements and merge them together with simple algorithms.
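As a rough sketch of the position-tracking idea above (this is not markitdown's actual code): assume each text fragment comes with its left edge and its distance from the top of the page, for example from the bounding boxes a PDF extractor exposes. The `Fragment` class, the `rebuild_lines` helper, the tolerance values, and the coordinates below are all hypothetical and chosen only for illustration; the sample strings are shortened from the output above.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text plus its position on the page (hypothetical structure)."""
    text: str
    x0: float   # left edge, in points
    top: float  # distance from the top of the page, in points


def rebuild_lines(fragments, y_tol=3.0, indent_width=18.0):
    """Group fragments into visual lines and restore indentation.

    A fragment joins the current line when its `top` is within `y_tol`
    of the first fragment placed on that line. Indentation is derived
    from how far the line starts to the right of the leftmost fragment.
    """
    if not fragments:
        return []

    # Walk the page top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (f.top, f.x0))

    rows = []
    for frag in ordered:
        if rows and abs(frag.top - rows[-1][0].top) <= y_tol:
            rows[-1].append(frag)   # same visual line
        else:
            rows.append([frag])     # start a new visual line

    left_margin = min(f.x0 for f in fragments)
    lines = []
    for row in rows:
        row.sort(key=lambda f: f.x0)  # restore reading order within the line
        level = round((row[0].x0 - left_margin) / indent_width)
        lines.append("    " * level + " ".join(f.text for f in row))
    return lines


# Made-up coordinates: the "(c)" marker sits 1.5pt lower than its text,
# which is the kind of offset that can split a marker from its item.
fragments = [
    Fragment("The TRMF must include the following:", 72, 100),
    Fragment("(a)", 90, 120),
    Fragment("clear definition of technology risk;", 108, 120),
    Fragment("the identification of technology risks;", 108, 140),
    Fragment("(c)", 90, 141.5),
]

lines = rebuild_lines(fragments)
print("\n".join(lines))
```

With a vertical tolerance, the `(c)` marker is placed back in front of its item instead of on a separate line, and the list items come out indented relative to the introductory sentence. Real documents would need more than this (multiple columns, wrapped lines, varying font sizes), so treat the thresholds as starting points.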

Dec 17 '24 04:12 NikhilVerma

Is anyone working on this?

Feb 17 '25 04:02 Utsav-Mehta

@gagb is anyone working on this? Feel free to assign this to me!

Feb 19 '25 15:02 Utsav-Mehta

Assigned! Thanks for taking a look! We are a small team and this is an OSS project, so contributions are very welcome. Thanks again!

Feb 19 '25 18:02 gagb

Thanks for assigning, @gagb. It's a pleasure. I am considering integrating vision models to handle text relationships and indentation. Let me know if this sounds like a good starting point and if you have any suggestions!

Feb 19 '25 18:02 Utsav-Mehta

Sounds like a good experiment. Creating a plugin would be a good first step. I'd recommend using open-source vision models.

Feb 19 '25 21:02 gagb

Thanks for the insights. I will create a plugin first, as you suggested.

Feb 20 '25 23:02 Utsav-Mehta

Hey @Utsav-Mehta, if you are not actively working on it, I would like to pick up this issue. @gagb, please assign it to me if possible.

Sep 16 '25 13:09 iamskp11