json2parquet
Support for optional fields
I am relatively new to Python, so apologies if I'm interpreting the code incorrectly. I'm trying to read a schema from an existing Parquet file and then populate a new file from a JSON file. The JSON was generated with parquet-tools and then modified to match the conditions for the test I am creating. If I dump the schema with parquet-tools, it shows that all of the fields are optional. For example:
last_failure_date: OPTIONAL INT32 O:DATE R:0 D:1
As with most JSON serializers, if a value is not present in the data it is not serialized in the JSON (i.e. the field is NULL).
I think what's happening is that when it gets to _convert_data_with_schema() it does _col = column_data.get(column.name), but in this case last_failure_date is None, and then it tries to convert None to a date, which gets me:
Traceback (most recent call last):
  File "sigma_data_gen.py", line 24, in <module>
    convert_json(sys.argv[2], sys.argv[3], schema=schema, use_deprecated_int96_timestamps=True)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 180, in convert_json
    data = load_json(input, schema=schema, date_format=date_format, field_aliases=field_aliases)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 146, in load_json
    return ingest_data(json_data, schema=schema, date_format=date_format, field_aliases=field_aliases)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 32, in ingest_data
    return _convert_data_with_schema(data, schema, date_format=date_format, field_aliases=field_aliases)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 101, in _convert_data_with_schema
    array_data.append(pa.array(_converted_col, type=pa.date32()))
  File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 133, in _date_converter
    dt = ciso8601.parse_datetime(date_str)
TypeError: argument 1 must be str, not None
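To illustrate what I think is going on, here is a minimal sketch that uses only the standard library. The rows and the parse_date helper are made up for illustration; parse_date is a stand-in for the library's date converter and uses datetime.fromisoformat rather than ciso8601, so the exact error message will differ from the one above.

```python
from datetime import datetime

# Hypothetical rows: the second one omits the optional field entirely,
# the way a JSON serializer would for a NULL value.
rows = [
    {"id": 1, "last_failure_date": "2020-03-01"},
    {"id": 2},
]

# dict.get returns None for a missing key, so the column ends up with a None.
column = [row.get("last_failure_date") for row in rows]


def parse_date(date_str):
    # Stand-in for the library's converter (assumption: stdlib parser,
    # not ciso8601). It expects a string and does not check for None.
    return datetime.fromisoformat(date_str).date()


def try_convert(values):
    # Return the converted list, or the exception raised by a bad value.
    try:
        return [parse_date(v) for v in values]
    except TypeError as exc:
        return exc


result = try_convert(column)
print(column)
print(type(result).__name__, result)
```

The point is only that a missing key becomes None before the converter ever sees it, and a parser that expects a string then raises a TypeError.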
So it seems like there should be a check there to make sure _col has a value, and to skip importing that field if it doesn't (or at least skip the casting / conversion, whichever would make more sense for PyArrow).
Does that sound correct?
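A rough sketch of the kind of check I have in mind, again using only the standard library. The name parse_optional_date and the sample rows are hypothetical, and this is not the actual json2parquet code; it just passes None through instead of trying to parse it.

```python
from datetime import date, datetime
from typing import Optional


def parse_optional_date(date_str: Optional[str]) -> Optional[date]:
    """Return None for a missing value instead of trying to parse it."""
    if date_str is None:
        return None
    # Assumption: ISO 8601 strings; stdlib parser used here, not ciso8601.
    return datetime.fromisoformat(date_str).date()


# Hypothetical rows: the second one has no last_failure_date key.
rows = [
    {"id": 1, "last_failure_date": "2020-03-01"},
    {"id": 2},
]

converted = [parse_optional_date(row.get("last_failure_date")) for row in rows]
print(converted)
```

As far as I can tell, pa.array() treats None entries in a Python list as nulls, so a list like this could be passed to pa.array(converted, type=pa.date32()) and the missing values would end up as NULL in the optional column. That would mean skipping only the conversion rather than dropping the field.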