sec-parser
Fix the TopSectionTitle being split in MSFT filing
Context
MSFT accuracy-test (permalink at the time of posting)
Problem
The top section title comes out as two separate title elements:
{
"text_content": "PART I. FINANCI"
},
{
"text_content": "AL INFORMATION"
},
This happens because the MSFT filing splits the section title into two pieces, for reasons that are unclear.
Ideas about a possible solution
Maybe use line information in the solution: if two adjacent elements of the same type (and level) are on the same line, they should probably be merged into a single element.
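A rough sketch of that idea, to make the merge rule concrete. It does not use the sec-parser API: `Element`, its fields (`kind`, `level`, `line`, `text`), and `merge_same_line` are all hypothetical names, and it assumes a line number is available for each element, which may not be true in the current pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    # Hypothetical stand-in for a parsed element, not the sec-parser class.
    kind: str   # element type, e.g. "TopSectionTitle"
    level: int  # nesting level of the title
    line: int   # rendered line the element sits on (assumed to be known)
    text: str


def merge_same_line(elements: List[Element]) -> List[Element]:
    """Merge adjacent elements that share type, level, and line."""
    merged: List[Element] = []
    for el in elements:
        prev = merged[-1] if merged else None
        same_slot = prev is not None and (
            (prev.kind, prev.level, prev.line) == (el.kind, el.level, el.line)
        )
        if same_slot:
            # Concatenate without a separator: the split can fall mid-word.
            merged[-1] = Element(prev.kind, prev.level, prev.line, prev.text + el.text)
        else:
            merged.append(el)
    return merged


# The line number 12 is made up for illustration.
parts = [
    Element("TopSectionTitle", 0, 12, "PART I. FINANCI"),
    Element("TopSectionTitle", 0, 12, "AL INFORMATION"),
]
result = merge_same_line(parts)
print(result[0].text)  # PART I. FINANCIAL INFORMATION
```

Joining with an empty string works here because the split falls inside a word. If other filings split titles at word boundaries and drop the whitespace, the join would need to be smarter.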