
Misclassification of element types on ADV forms

Open lavish2210 opened this issue 4 months ago • 3 comments

I am using the hi_res model locally and have tried it both with and without chunking. I also tried the chipper model via the API and ran into similar issues.

These are the major issues we faced while trying it on ADV brochures:

  1. Classification Issue - In some cases a title and its corresponding text are classified as a single element, and this whole element has its parent pointing to the header of the page. For example, the following image is a snippet from page 2 of the BlackRock PDF (https://files.adviserinfo.sec.gov/IAPD/Content/Common/crd_iapd_Brochure.aspx?BRCHR_VRSN_ID=848663).

image

In the above snippet, the text "Item 2. Material Changes Since the last annual update to the Form ADV Part 2A (the “Brochure”) on March 31, 2022, material changes to this Brochure include amendments to the following items:" is classified as a single narrative text element. Ideally, the heading "Item 2. Material Changes" should have been classified separately as a title.
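As a stopgap, a fused heading like the one above could be split in post-processing. The following is a minimal sketch, not part of the library: `Element` is a hypothetical stand-in for a partitioned element, `KNOWN_HEADINGS` is a short illustrative list, and the approach assumes the brochure's item headings are known in advance.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


# Assumption: a short, illustrative list of item headings known in advance.
KNOWN_HEADINGS = ["Material Changes", "Advisory Business", "Fees and Compensation"]

# Matches "Item <n>. <known heading>" at the start, followed by body text.
HEADING_RE = re.compile(
    r"^(Item\s+\d+\.?\s+(?:"
    + "|".join(re.escape(h) for h in KNOWN_HEADINGS)
    + r"))\s+(.+)$",
    re.DOTALL,
)


def split_fused_heading(element):
    """Split a narrative element that starts with a known item heading.

    Returns [title, body] when a heading is found, else the element unchanged.
    """
    if element.category != "NarrativeText":
        return [element]
    match = HEADING_RE.match(element.text)
    if match is None:
        return [element]
    return [
        Element("Title", match.group(1)),
        Element("NarrativeText", match.group(2)),
    ]


fused = Element(
    "NarrativeText",
    "Item 2. Material Changes Since the last annual update to the Form ADV Part 2A",
)
parts = split_fused_heading(fused)
```

This only works around the symptom for headings in the list. It does not repair the parent reference mentioned above.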

  2. Table Extraction Issue - The following snippet is taken from page 24 of the BlackRock PDF (linked in Issue 1).

image

We didn't receive the correct table structure for the above table.

  3. Multicolumn documents - We are not able to get the correct structure for multicolumn PDFs. The right column is recognized first, and then the left column (and even that is read row-wise). Ideally, the whole left column should be recognized first, and then the whole right column. Example: https://files.adviserinfo.sec.gov/IAPD/Content/Common/crd_iapd_Brochure.aspx?BRCHR_VRSN_ID=821958

  4. Chunking issue - Following on from Issue 1, if the text is not correctly classified as a title, then chunking does not work correctly either.
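To make the expected reading order for the multicolumn case concrete, here is a minimal sketch, not the library's implementation. It assumes a two-column page, a top-left coordinate origin, and a hypothetical `Block` type carrying each element's horizontal extent and top edge.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block with a bounding box (top-left origin assumed)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge


def reading_order(blocks, page_width):
    """Order blocks column by column: whole left column first, then the right.

    Assumes exactly two columns split at the page midpoint. Elements that span
    both columns (e.g. a full-width heading) are not handled specially.
    """
    midpoint = page_width / 2

    def sort_key(block):
        center_x = (block.x0 + block.x1) / 2
        column = 0 if center_x < midpoint else 1
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


# Blocks arrive in the order described above: right column before left.
blocks = [
    Block("R1", 320, 100, 580),
    Block("L1", 30, 100, 290),
    Block("R2", 320, 200, 580),
    Block("L2", 30, 200, 290),
]
ordered = [b.text for b in reading_order(blocks, page_width=612)]
```

With these sample coordinates, the left-column blocks come out top to bottom, followed by the right-column blocks.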

Please provide support on these issues.

lavish2210, Feb 12 '24 09:02

@lavish2210 - Thanks for reporting this. We're currently doing data annotation to improve our partitioning models and will include this in the data set.

MthwRobinson, Feb 12 '24 13:02

It would be great if you could share the timeline by which all the above-listed issues will be solved.

lavish2210, Feb 13 '24 06:02

We'll post timelines for model-related updates in our Slack channel.

MthwRobinson, Feb 13 '24 13:02