
Numbers converted to scientific notation in metadata.text_as_html

Open sahil0094 opened this issue 11 months ago • 0 comments

Problem Description

When using partition_html() and reading table markup via chunk.metadata.text_as_html, numeric values in the table are automatically converted to scientific (exponential) notation.

Example

  • Input Number: 478923
  • Converted Output: 4.7e+05

Steps to Reproduce

  1. Use partition_html() on an HTML file
  2. Chunk the resulting elements with chunk_by_title() and extract the tabular data
  3. Access chunk.metadata.text_as_html
  4. Observe numeric value conversion

Expected Behavior

  • Numeric values should be preserved in their original format
  • No automatic scientific notation conversion
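
The expectation above can be checked mechanically. The following is a minimal sketch using only the standard library; the helper name `find_scientific_notation` and the sample strings are hypothetical, and `reported_html` merely stands in for the value of chunk.metadata.text_as_html described in this report.

```python
import re

# Matches tokens such as 4.7e+05 or 1E-3 (a mantissa followed by an exponent).
SCI_NOTATION = re.compile(r"(?<![\w.])[+-]?\d+(?:\.\d+)?[eE][+-]?\d+(?![\w.])")


def find_scientific_notation(html: str) -> list[str]:
    """Return every token in `html` that looks like scientific notation."""
    return SCI_NOTATION.findall(html)


# Source table cell as it appears in the input HTML.
source_html = "<table><tr><td>478923</td></tr></table>"
# Hypothetical stand-in for chunk.metadata.text_as_html, mirroring the report.
reported_html = "<table><tr><td>4.7e+05</td></tr></table>"

print(find_scientific_notation(source_html))    # no matches expected
print(find_scientific_notation(reported_html))  # the converted value is flagged
```

Running a check like this over every table chunk would show which values were rewritten, without needing to compare the tables by hand.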

Environment Details

  • Unstructured Library Version: 0.10.28
  • Python Version: 3.11.0rc1
  • Environment: databricks runtime 15.4 LTS ML

Potential Impact

This automatic conversion silently drops digits (478923 becomes 4.7e+05, i.e. 470000), which can cause data integrity issues, especially in financial or scientific data processing.

Suggested Investigation

  • Review number parsing/serialization logic
  • Check type conversion mechanisms in metadata handling
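
As a generic illustration of what to look for, the sketch below shows one way this kind of output can arise in plain Python: a numeric string is parsed to a float and then formatted with a low-precision general format. This is an assumption for illustration only, not a diagnosis of the library's actual code path; note that it yields 4.8e+05 rather than the 4.7e+05 reported above, so the real cause may differ.

```python
cell_text = "478923"

# Parsing the cell as a float and formatting with two significant digits
# switches to exponent notation because the exponent exceeds the precision.
as_float = float(cell_text)
low_precision = format(as_float, ".2g")
print(low_precision)  # 4.8e+05

# Keeping the value as text (or as an int) preserves every digit.
preserved = str(int(cell_text))
print(preserved)  # 478923
```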

sahil0094 • Jan 17 '25 11:01