framework icon indicating copy to clipboard operation
framework copied to clipboard

ExcelControl(stringified=True) documentation could clarify interaction with schemas

Open richardt-engineb opened this issue 1 year ago • 1 comments

Overview

formats.excelControl.stringified is documented as "Stringifies all the cell values.".

However it appears that it doesn't actually do that, even in the test cases that were added as part of the PR that added the stringified option.

e.g. in def test_xls_parser_cast_int_to_string_1251_xlsx() the test is:

resource = Resource(descriptor, control=formats.ExcelControl(stringified=True))
assert resource.read_rows() == [
    {"A": "001", "B": "b", "C": "1", "D": "a", "E": 1},
    {"A": "002", "B": "c", "C": "1", "D": "1", "E": 1},
]

where the "E": 1 is a number not a string.

Or an even simpler reproduction:

from frictionless import Resource, formats

res = Resource(path="data/table-infer-boolean.xlsx", control=formats.ExcelControl(stringified=True))
print(res.read_rows())

prints:

[{'number': Decimal('1.0'), 'string': 'two', 'boolean': True}, {'number': Decimal('3.0'), 'string': 'four', 'boolean': False}]

Is this the intended behaviour and I am misunderstanding the purpose of stringified=True or is this a bug (which I am happy to look at fixing).

To be honest, the code looks like it should convert everything to strings, but it doesn't seem to according to the unit tests (and our tests). Perhaps @roll or @shashigharti might remember from their work on the PR that created this capability?

richardt-engineb avatar Jun 12 '24 16:06 richardt-engineb

With a bit more research, I think I understand what it is doing (though still counter-intuitive).

The stringified=True only stringifies the lowest level return from the Excel parser. Before you get to read_rows() or similar, the schema is used to convert whatever is returned by the parser. If you don't provide a schema, then detect_schema() (in detector.py) will detect a schema and that will likely pick types other than string.

So to actually get an output that is all strings you need both stringified=True and a schema with all columns set to string types.

So using table_infer_boolean.xlsx which looks like:

/ a b c
1 number string boolean
2 1 two true
3 3 four false

depending on the setting for stringified and schema we get different output:

stringified schema output notes
False None [{'number': 1, 'string': 'two', 'boolean': True}, {'number': 3, 'string': 'four', 'boolean': False}] Inferred schema, typed output
True None [{'number': 1, 'string': 'two', 'boolean': True}, {'number': 3, 'string': 'four', 'boolean': False}] Inferred schema overrides stringified so still typed output
False {"fields": [{"name": "number", "type": "string"},{"name": "string", "type": "string"},{"name": "boolean", "type": "string"}]} [{'number': None, 'string': 'two', 'boolean': None}, {'number': None, 'string': 'four', 'boolean': None}] Without stringified, but with an all-string schema, non-string cells are returned as None
True {"fields": [{"name": "number", "type": "string"},{"name": "string", "type": "string"},{"name": "boolean", "type": "string"}]} [{'number': '1.0', 'string': 'two', 'boolean': 'True'}, {'number': '3.0', 'string': 'four', 'boolean': 'False'}] Both stringified and all-string schema is the only way to get all-string output. Note though, that the 1 has been converted to "1.0" which is not same, so there has clearly been some type conversions even in this case.

This could probably do with a documentation update to clarify what stringified does (and doesn't) do.

richardt-engineb avatar Jun 12 '24 17:06 richardt-engineb