grobid_client_python
Add formula serialization to JSON and Markdown output
TEI XML <formula> elements were being silently dropped during conversion, leaving gaps where equations should appear in both JSON and Markdown output.
Changes
Markdown converter (TEI2Markdown.py)
- Added `_formula_to_markdown()`, which renders formulas as code blocks when labeled and as inline code otherwise
- Modified `_extract_fulltext()` and `_process_paragraph()` to process formula elements alongside paragraphs
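The labeled/unlabeled rule can be sketched roughly as follows. This is a minimal illustration, not the code in `TEI2Markdown.py`: the name `formula_to_markdown_sketch`, the assumption that the label sits in a child `<label>` element of `<formula>`, and the choice to append the label to the formula text are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence in the source


def formula_to_markdown_sketch(formula):
    """Render a TEI <formula> element as Markdown (hypothetical helper)."""
    # Assumption: the equation number, if any, lives in a child <label>.
    label_el = formula.find(f"{{{TEI_NS}}}label")
    label = (label_el.text or "").strip() if label_el is not None else ""

    # Collect the formula text without the label's own text.
    parts = [formula.text or ""]
    for child in formula:
        if child is not label_el:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    text = " ".join("".join(parts).split())

    if not text:
        return ""  # nothing to render
    if label:
        # Labeled formula: fenced code block, label kept after the text.
        return f"{FENCE}\n{text} {label}\n{FENCE}"
    # Unlabeled formula: inline code.
    return f"`{text}`"
```

Returning an empty string for an empty element lets a caller skip it without special-casing; whether the real converter does the same is not stated here.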
JSON converter (TEI2LossyJSON.py)
- Added `get_formatted_formula()`, which creates formula entries with metadata (text, label, xml_id, coords)
- Modified `_process_div_with_nested_content()` to yield formulas as `type: 'formula'` entries in `body_text`, maintaining document order
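A rough sketch of how formula entries might be built and interleaved with paragraphs in document order. The function names, the `"paragraph"` entry shape, and the reading of `xml:id` and `coords` from attributes on `<formula>` are assumptions made for illustration; the generated `id` field seen in the real output is left out because how it is derived is not described here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id


def formula_entry_sketch(formula, head_section=None):
    """Build a dict for one <formula> element (hypothetical helper)."""
    label_el = formula.find(f"{{{TEI_NS}}}label")
    label = (label_el.text or "").strip() if label_el is not None else ""
    # Simplification: only the text before the first child element is used.
    text = " ".join((formula.text or "").split())
    return {
        "type": "formula",
        "text": text,
        "label": label,
        "xml_id": formula.get(XML_ID),
        "coords": formula.get("coords"),
        "head_section": head_section,
    }


def iter_div_entries_sketch(div):
    """Yield paragraph and formula entries in the order they appear."""
    head = div.find(f"{{{TEI_NS}}}head")
    head_section = head.text if head is not None else None
    for child in div:
        tag = child.tag.split("}")[-1]  # strip the namespace
        if tag == "p":
            # Placeholder paragraph shape, not the converter's real one.
            text = " ".join("".join(child.itertext()).split())
            yield {"type": "paragraph", "text": text, "head_section": head_section}
        elif tag == "formula":
            yield formula_entry_sketch(child, head_section)
```

Walking the children of the `<div>` once, rather than collecting paragraphs and formulas in separate passes, is what keeps an equation between the sentences that introduce and explain it.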
Tests (test_equation_serialization.py)
- Added 8 test cases covering formula serialization, ordering, metadata preservation, and edge cases
Examples
Markdown output:
### Data analysis
Percentage of fingers extensions... as indicated in the following equation:
Fext i ¼ 100 FE i T FEi ð1Þ
Where Fext i denotes the metric...
JSON output:
{
  "id": "formula_4084a724",
  "type": "formula",
  "text": "Fext i ¼ 100 FE i T FEi",
  "label": "ð1Þ",
  "xml_id": "formula_0",
  "head_section": "Data analysis"
}
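On the consumer side, formula entries could be picked out of the converted document along these lines. This sketch assumes the JSON document exposes a top-level `body_text` list of entries; the function name `formulas_by_section` and the paragraph entry in the sample are hypothetical, while the formula entry reuses the example above.

```python
def formulas_by_section(doc):
    """Group formula entries from body_text by their head_section."""
    grouped = {}
    for entry in doc.get("body_text", []):
        if entry.get("type") == "formula":
            grouped.setdefault(entry.get("head_section"), []).append(entry)
    return grouped


# Sample document: the paragraph entry is a made-up placeholder,
# the formula entry is copied from the JSON example above.
sample = {
    "body_text": [
        {"type": "paragraph", "text": "...", "head_section": "Data analysis"},
        {
            "id": "formula_4084a724",
            "type": "formula",
            "text": "Fext i ¼ 100 FE i T FEi",
            "label": "ð1Þ",
            "xml_id": "formula_0",
            "head_section": "Data analysis",
        },
    ]
}

for section, formulas in formulas_by_section(sample).items():
    for formula in formulas:
        print(section, formula["label"], formula["text"])
```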
Original prompt
This section details the original issue you should resolve
<issue_title>Add equations in Json and markdown output </issue_title> <issue_description>Equations are not serialized in the Json and MD output </issue_description>
Comments on the Issue (you are @copilot in this section)
- Fixes kermitt2/grobid-client-python#96