json2parquet

Support for optional fields

Open reyes-c1 opened this issue 3 years ago • 0 comments

I am relatively new to Python, so apologies if I'm interpreting the code incorrectly. What I'm trying to do is read a schema from an existing Parquet file and then populate a new file from a JSON file that was generated with parquet-tools and then modified to match the conditions for the test I am creating. If I dump the schema with parquet-tools, it shows that all of the fields are optional. For example:

last_failure_date:    OPTIONAL INT32 O:DATE R:0 D:1

As with most JSON serializers, a value that is not present in the data (i.e. the field is NULL) is not serialized in the JSON.

I think what's happening is that when it gets to _convert_data_with_schema() it does _col = column_data.get(column.name), but in this case last_failure_date is None, and it then tries to convert None to a date, which gives me:

Traceback (most recent call last):
  File "sigma_data_gen.py", line 24, in <module>
    convert_json(sys.argv[2], sys.argv[3], schema=schema, use_deprecated_int96_timestamps=True)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 180, in convert_json
    data = load_json(input, schema=schema, date_format=date_format, field_aliases=field_aliases)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 146, in load_json
    return ingest_data(json_data, schema=schema, date_format=date_format, field_aliases=field_aliases)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 32, in ingest_data
    return _convert_data_with_schema(data, schema, date_format=date_format, field_aliases=field_aliases)
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 101, in _convert_data_with_schema
    array_data.append(pa.array(_converted_col, type=pa.date32()))
  File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
  File "/Users/Reyes/.local/share/virtualenvs/data-gen/lib/python3.8/site-packages/json2parquet/client.py", line 133, in _date_converter
    dt = ciso8601.parse_datetime(date_str)
TypeError: argument 1 must be str, not None

So it seems like there should be a check there to make sure _col has a value and skip importing that field if it doesn't (or at least skip the casting / conversion, whichever would make more sense for PyArrow).
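A minimal sketch of the kind of check I mean, using only the standard library instead of the ciso8601 call in client.py. The helper name `date_or_none` and the sample rows are hypothetical and not part of json2parquet; the idea is just to pass None through untouched so that it can end up as a null in the column.

```python
from datetime import date, datetime


def date_or_none(value):
    """Hypothetical helper: parse an ISO 8601 string into a date, keeping None as None."""
    if value is None:
        # Optional field missing from the JSON: skip the conversion entirely.
        return None
    return datetime.fromisoformat(value).date()


# Sample rows (made up for illustration); the second one omits the optional field.
rows = [{"last_failure_date": "2021-04-15"}, {}]

# dict.get() returns None for the missing key, which is the value that
# currently reaches the date converter and raises the TypeError.
column = [row.get("last_failure_date") for row in rows]

converted = [date_or_none(v) for v in column]
```

As far as I know, pa.array() accepts None entries and stores them as nulls, so a list like `converted` should be usable with type=pa.date32() without dropping the field.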

Does that sound correct?

reyes-c1 · Apr 15 '21 15:04