Make HighlightedTextClassifier work with `<b>` tags

Open Elijas opened this issue 6 months ago • 4 comments

Discussed in https://github.com/orgs/alphanome-ai/discussions/56

Originally posted by Elijas November 24, 2023

Example document

https://www.sec.gov/Archives/edgar/data/1675149/000119312518236766/d828236d10q.htm

 <p style="margin-top:9pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman">
  Options to purchase 1 million shares of common stock at a weighted average exercise price of $36.28 were
outstanding as of June 30, 2017, but were not included in the computation of diluted EPS because they were anti-dilutive, as the exercise prices of the options were greater than the average market price of Alcoa Corporation’s common stock.
 </p>
 <p style="margin-top:13pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">
  <b>
   G. Accumulated Other Comprehensive Loss
  </b>
 </p>
 <p style="margin-top:6pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman">
  The following table details the activity of the three components that comprise Accumulated other comprehensive loss for both Alcoa
Corporation’s shareholders and Noncontrolling interest:
 </p>

Goal

The "G. Accumulated Other Comprehensive Loss" line should be recognized as a HighlightedTextElement (and therefore as a TitleElement).

Most likely, you will need to compute the percentage of the text that sits inside the `<b>` tag, reusing the parts already implemented for HighlightedTextElement. This helps avoid cases where something like `text text text <b>bold</b> text text` is recognized as highlighted.
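To make the coverage idea concrete, here is a minimal standalone sketch. It does not use sec-parser's actual internals: the names `BoldCoverage`, `bold_percentage`, and `is_highlighted`, as well as the 80% threshold, are assumptions made up for illustration, and it relies only on Python's standard `html.parser`. A real fix would presumably plug into the existing HighlightedTextClassifier logic instead.

```python
from html.parser import HTMLParser


class BoldCoverage(HTMLParser):
    """Counts non-whitespace characters inside and outside <b> tags.

    Hypothetical helper for illustration only, not part of sec-parser.
    """

    BOLD_TAGS = {"b"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # current nesting depth of <b> tags
        self.bold_chars = 0   # non-whitespace characters inside <b>
        self.total_chars = 0  # all non-whitespace characters

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.BOLD_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Ignore whitespace so HTML indentation does not skew the ratio.
        n = len("".join(data.split()))
        self.total_chars += n
        if self.depth > 0:
            self.bold_chars += n


def bold_percentage(html: str) -> float:
    """Return the percentage (0-100) of text covered by <b> tags."""
    parser = BoldCoverage()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    if parser.total_chars == 0:
        return 0.0
    return 100.0 * parser.bold_chars / parser.total_chars


def is_highlighted(html: str, threshold: float = 80.0) -> bool:
    """Assumed rule: highlighted if at least `threshold` percent is bold."""
    return bold_percentage(html) >= threshold


heading = "<p><b>\n   G. Accumulated Other Comprehensive Loss\n  </b></p>"
mixed = "text text text <b>bold</b> text text"
print(is_highlighted(heading))  # fully bold paragraph
print(is_highlighted(mixed))    # only one bold word among plain text
```

With this rule, the fully bold heading from the example document passes the check, while a paragraph containing a single bold word does not.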

Elijas, Dec 22 '23 16:12