
Add formula serialization to JSON and Markdown output

Open Copilot opened this issue 3 months ago • 0 comments

TEI XML <formula> elements were silently dropped during conversion, leaving gaps where equations should appear in both the JSON and the Markdown output.

Changes

Markdown converter (TEI2Markdown.py)

  • Added _formula_to_markdown() - renders formulas as code blocks when labeled, inline code otherwise
  • Modified _extract_fulltext() and _process_paragraph() to process formula elements alongside paragraphs
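The labeled/unlabeled rule can be sketched as follows. This is a minimal illustration, not the code in TEI2Markdown.py: the function name formula_to_markdown, the use of xml.etree.ElementTree, and the assumption that the label sits in a <label> child of <formula> are all choices made for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
LABEL_TAG = f"{{{TEI_NS}}}label"
FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block


def formula_to_markdown(formula):
    """Hypothetical stand-in for _formula_to_markdown(); not the project's code."""
    label_el = formula.find(LABEL_TAG)
    label = "".join(label_el.itertext()).strip() if label_el is not None else ""

    # Gather the formula text while skipping the <label> child's own text.
    parts = [formula.text or ""]
    for child in formula:
        if child.tag != LABEL_TAG:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    text = " ".join("".join(parts).split())  # collapse runs of whitespace

    if not text:
        return ""  # nothing to render for an empty formula
    if label:
        # Labeled (numbered) formula: its own code block, label kept at the end.
        return f"{FENCE}\n{text} {label}\n{FENCE}"
    # Unlabeled formula: inline code.
    return f"`{text}`"


labeled = ET.fromstring(
    f'<formula xmlns="{TEI_NS}" xml:id="formula_0">'
    "Fext i ¼ 100 FE i T FEi <label>ð1Þ</label></formula>"
)
unlabeled = ET.fromstring(f'<formula xmlns="{TEI_NS}">x = y + z</formula>')

print(formula_to_markdown(labeled))
print(formula_to_markdown(unlabeled))
```

Keeping the label out of the collected text and appending it once avoids printing the equation number twice when the label is nested inside the formula element.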

JSON converter (TEI2LossyJSON.py)

  • Added get_formatted_formula() - creates formula entries with metadata (text, label, xml_id, coords)
  • Modified _process_div_with_nested_content() to yield formulas as type: 'formula' entries in body_text, maintaining document order
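A rough sketch of how formula entries could be built and interleaved with paragraphs in document order is shown below. It is an assumption-laden illustration rather than the code in TEI2LossyJSON.py: the helper names formatted_formula and iter_div_entries, the uuid-based id scheme, the 'paragraph' entry type, and the choice to omit empty metadata fields are all hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id


def tei(tag):
    """Return the namespace-qualified name of a TEI tag."""
    return f"{{{TEI_NS}}}{tag}"


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def formatted_formula(formula, head_section=None):
    """Hypothetical stand-in for get_formatted_formula()."""
    label_el = formula.find(tei("label"))
    label = clean("".join(label_el.itertext())) if label_el is not None else None

    # Formula text without the label's own text.
    parts = [formula.text or ""]
    for child in formula:
        if child.tag != tei("label"):
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")

    entry = {
        "id": f"formula_{uuid.uuid4().hex[:8]}",  # assumed id scheme
        "type": "formula",
        "text": clean("".join(parts)),
    }
    optional = {
        "label": label,
        "xml_id": formula.get(XML_ID),
        "coords": formula.get("coords"),
        "head_section": head_section,
    }
    # Assumption: metadata fields that are missing are simply left out.
    entry.update({key: value for key, value in optional.items() if value})
    return entry


def iter_div_entries(div):
    """Yield paragraph and formula entries in the order they appear in the div."""
    head = div.find(tei("head"))
    head_section = clean("".join(head.itertext())) if head is not None else None
    for child in div:
        if child.tag == tei("p"):
            yield {
                "type": "paragraph",  # assumed type name for this sketch
                "text": clean("".join(child.itertext())),
                "head_section": head_section,
            }
        elif child.tag == tei("formula"):
            yield formatted_formula(child, head_section)


sample = ET.fromstring(
    f'<div xmlns="{TEI_NS}"><head>Data analysis</head>'
    "<p>as indicated in the following equation:</p>"
    '<formula xml:id="formula_0">Fext i ¼ 100 FE i T FEi <label>ð1Þ</label></formula>'
    "<p>Where Fext i denotes the metric...</p></div>"
)
entries = list(iter_div_entries(sample))
print([entry["type"] for entry in entries])
```

Walking the direct children of the div in a single pass is what keeps each formula between the paragraphs that surround it, instead of collecting formulas separately and appending them afterwards.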

Tests (test_equation_serialization.py)

  • Added 8 test cases covering formula serialization, ordering, metadata preservation, and edge cases

Examples

Markdown output:

### Data analysis
Percentage of fingers extensions... as indicated in the following equation:

Fext i ¼ 100 FE i T FEi ð1Þ


Where Fext i denotes the metric...

JSON output:

{
  "id": "formula_4084a724",
  "type": "formula",
  "text": "Fext i ¼ 100 FE i T FEi",
  "label": "ð1Þ",
  "xml_id": "formula_0",
  "head_section": "Data analysis"
}
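On the consuming side, formula entries can be picked out of body_text by their type field. The snippet below is a small sketch under the assumption that the converted document is a JSON object with a top-level body_text list; the helper name formulas_by_section and the inline sample document are made up for illustration.

```python
import json


def formulas_by_section(document):
    """Group 'formula' entries from body_text by their head_section."""
    grouped = {}
    for entry in document.get("body_text", []):
        if entry.get("type") != "formula":
            continue  # skip paragraphs and any other entry types
        grouped.setdefault(entry.get("head_section"), []).append(entry)
    return grouped


# Illustrative document; only the formula entry mirrors the example above.
document = json.loads(
    """
    {
      "body_text": [
        {"type": "paragraph", "text": "as indicated in the following equation:",
         "head_section": "Data analysis"},
        {"id": "formula_4084a724", "type": "formula",
         "text": "Fext i ¼ 100 FE i T FEi", "label": "ð1Þ",
         "xml_id": "formula_0", "head_section": "Data analysis"}
      ]
    }
    """
)

for section, formulas in formulas_by_section(document).items():
    for formula in formulas:
        print(section, formula.get("label", ""), formula["text"])
```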
Original prompt

This section details the original issue you should resolve.

<issue_title>Add equations in Json and markdown output </issue_title> <issue_description>Equations are not serialized in the Json and MD output </issue_description>

Comments on the Issue (you are @copilot in this section)

  • Fixes kermitt2/grobid-client-python#96


Copilot avatar Nov 16 '25 22:11 Copilot