PDF to Markdown doesn't preserve text relationship or indentation
Here's a sample PDF - https://www.bnm.gov.my/documents/20124/963937/Risk+Management+in+Technology+(RMiT).pdf/810b088e-6f4f-aa35-b603-1208ace33619?t=1592866162078
However, the converted output has several parsing errors, which I highlight below.
Line elements aren't preserved
Output
G 8.5
To promote effective technology discussions at the board level, the
composition of the board and the designated board-level committee should
include at least a member with technology experience and competencies.
Indentations are ignored and text ordering is altered
The TRMF must include the following:
(a) clear definition of technology risk;
(b) clear responsibilities assigned for the management of technology risk at
different levels and across functions, with appropriate governance and
reporting arrangements;
the identification of technology risks to which the financial institution is
exposed, including risks from the adoption of new or emerging
technology;
(c)
(d) risk classification of all information assets/systems based on its criticality;
(e) risk measurement and assessment approaches and methodologies;
(f)
(g) continuous monitoring to timely detect and address any material risks.
List elements are ignored and presented in entirely new lines
1.
2.
3.
4.
5.
6.
The assurance shall be conducted by an independent external service provider
(ESP) engaged by the financial institution.
The independent ESP must understand the proposed services, the data flows,
system architecture, connectivity as well as its dependencies.
The independent ESP shall review the comprehensiveness of the risk
assessment performed by the financial institution and validate the adequacy of
the control measures implemented or to be implemented.
The Risk Assessment Report (as per Part D in Appendix 7) shall state among
others, the scope of review, risk assessment methodology, summary of findings
and remedial actions (if any).
PDF parsing will always be a PITA, but I think these issues can be addressed by tracking the locations of the elements. Right now I feel it simply loops over the text elements and uses simple algorithms to merge them together.
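To make the idea concrete, here is a rough sketch of location-based merging. It is not how the current converter works and not a proposed final implementation. The `Span` class, the `y_tol` and `indent_step` values, and the sample coordinates are all hypothetical; real coordinates would come from whichever layout-aware PDF extractor is used.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A positioned text fragment (hypothetical type, not part of any library)."""
    x0: float   # left edge, in points
    top: float  # distance from the top of the page, in points
    text: str


def group_into_lines(spans, y_tol=3.0):
    """Cluster spans into visual lines.

    A span joins the current line when its top is within y_tol of the
    first span placed on that line; otherwise it starts a new line.
    """
    lines = []
    for span in sorted(spans, key=lambda s: (s.top, s.x0)):
        if lines and abs(span.top - lines[-1][0].top) <= y_tol:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda s: s.x0) for line in lines]


def render(spans, indent_step=18.0, y_tol=3.0):
    """Join the spans of each visual line and indent by left offset."""
    lines = group_into_lines(spans, y_tol)
    if not lines:
        return ""
    left_margin = min(line[0].x0 for line in lines)
    out = []
    for line in lines:
        # Indent level is the offset from the margin in units of indent_step.
        level = round((line[0].x0 - left_margin) / indent_step)
        out.append("  " * level + " ".join(s.text for s in line))
    return "\n".join(out)


# Made-up coordinates: the "(c)" label is emitted after its body text and
# sits on a slightly different baseline, as in the TRMF example above.
sample = [
    Span(72.0, 180.0, "The TRMF must include the following:"),
    Span(108.0, 198.5, "the identification of technology risks"),
    Span(90.0, 200.0, "(c)"),
    Span(108.0, 212.0, "exposed, including risks"),
]
print(render(sample))
```

With positions available, the `(c)` label is reattached to its text and the continuation line keeps its indentation. Real documents would need more than this (multi-column pages, tables, font-size-dependent tolerances), so treat the thresholds as placeholders.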
Is anyone working on this?
@gagb is anyone working on this? Feel free to assign this to me!
Assigned! Thanks for taking a look! We are a small team and this is an OSS project, so contributions are very welcome. Thanks again!
Thanks for assigning, @gagb. It's a pleasure. I am considering integrating vision models to handle relationships and indentation. Let me know if this sounds like a good starting point and if you have any suggestions!
Sounds like a good experiment. Creating a plugin would be a good first step. Recommend using open source vision models.
Thanks for the insights. I will create a plugin first, as you suggested.
Hey @Utsav-Mehta, if you are not actively working on this, I would like to pick up this issue. @gagb, please assign it to me if possible.