Fix invalid evaluation doctype deduction
There was a bug in evaluation.py that caused the extensions of certain files to be detected incorrectly.
Evaluation files are expected to have two extensions, e.g. foobar.pdf.json,
because they were partitioned first. The code mishandled file names that contain more than 3 dots.
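
To illustrate the two-extension convention described above, here is a minimal sketch of one way to read the doctype off such a file name. The helper name `extract_doctype` is hypothetical and this is not the actual code in evaluation.py; it only assumes that the doctype is the second-to-last dot-separated part of the name.

```python
from typing import Optional


def extract_doctype(filename: str) -> Optional[str]:
    """Return the source document type from a partitioned output file name.

    Hypothetical helper for illustration only, not the code in evaluation.py.
    Assumes names of the form <stem>.<doctype>.<output extension>.
    """
    # Split from the right so that dots inside the stem are left untouched.
    parts = filename.rsplit(".", 2)
    if len(parts) < 3:
        # Fewer than two extensions: the doctype cannot be determined.
        return None
    return parts[-2]


print(extract_doctype("foobar.pdf.json"))
print(extract_doctype("report.v1.2.final.docx.json"))
print(extract_doctype("foobar.json"))
```

Splitting from the right means the result does not depend on how many dots appear earlier in the name, whereas indexing from the left (for example `filename.split(".")[1]`) would return the wrong part for dotted names.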
- [x] adjust doctype extraction for:
  - [x] TextExtractionMetricsCalculator
  - [x] TableStructureMetricsCalculator
  - [x] ElementTypeMetricsCalculator
- [x] unit test