Schema handling and optimization improvements in from_csv()
Description
There are a number of improvements that we need to implement at some point:
- Currently, the code does not validate the data against the automatically created schema. The data is lazily loaded onto the chain and blows up later in chain operations, when it is picked up by collect(), which triggers the Pydantic checks. For example, the last example in json-csv-reader.py looks great until one calls iterate() or collect() on it. (See the validation sketch after this list.)
- Currently, the code automatically creates the schema in "strict" mode. We should default to a "lax" mode that marks the fields as optional instead of mandatory, and give the user an option to change this behavior. (See the strict-vs-lax sketch after this list.)
- Currently, the code does not allow passing a static schema. We should permit that, because users may want to enforce which CSV columns they expect and what types those columns should have. This is also related to future support for serializing/deserializing hierarchical Pydantic objects to and from CSV using the "object__field" column-name convention. (See the nested-schema sketch after this list.)
  Also see the "spec" argument in from_json().
- Currently, the code does not allow limiting the number of entries generated. This prevents optimizations like from_csv(blah).limit(N) from working effectively. (See the row-limit sketch after this list.)
  Also see the "nrows" argument in from_json().