parquet conversion failed,Bool column has NA values in column boolean__v
I'm using the code below. The input data has a boolean column with both null and non-null values, but it fails at the Parquet conversion step with "parquet conversion failed, Bool column has NA values in column boolean__v". Kindly let me know what the issue could be.
for chunk_number, chunk in enumerate(pd.read_csv(**read_csv_args), 1):
    fields = []
    for col, dtypes in sessionSchema.items():
        # nullable=True; yet when a DataFrame that does have nulls is passed, the schema appears to be ignored
        fields.append(pa.field(col, dtypes, True))
    glue_schema = pa.schema(fields)
    table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
    if chunk_number == 1:
        schema = table.schema
        # Open a Parquet file for writing
        pq_writer = pq.ParquetWriter(targetKey, schema, compression='snappy')
    # Write the CSV chunk to the parquet file
    pq_writer.write_table(table)
What version of PyArrow are you using?
I created this minimal reproducible example; can you run it and check whether it works for you?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
chunk = pd.DataFrame([True, False, None], columns=['col1'])
field = pa.field("col1", pa.bool_())
glue_schema = pa.schema([field])
table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
# Open a Parquet file for writing
pq_writer = pq.ParquetWriter('example.parquet',
                             schema=glue_schema,
                             compression='snappy')
# Write CSV chunk to the parquet file
pq_writer.write_table(table)
pq_writer.close()
# Read the chunk
pq.read_table('example.parquet').to_pandas()
It should create this output:
>>> pq.read_table('example.parquet').to_pandas()
    col1
0   True
1  False
2   None
Technically, here are the sample data and the read_csv parameters:
from io import StringIO
csv_file = StringIO("""int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
4|||||||
""")
params = {
    'filepath_or_buffer': csv_file,
    'chunksize': 10,
    'encoding': 'UTF-8',
    'sep': '|',
    'low_memory': True,
    'engine': 'python',
    'skip_blank_lines': True
}
Next, I loop through the chunks, convert each one to Parquet, and write it to S3. Notice that the fourth column, Boolean__v, is of boolean datatype and the last row has no value for it, i.e. null data. That is where I get "parquet conversion failed, Bool column has NA values in column boolean__v".
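For illustration, here is a minimal sketch of how the pieces fit together; the sessionSchema mapping, the simplified read_csv call, and the local output path are assumptions for the example, not the exact values from our project:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed column types; the real sessionSchema may differ
sessionSchema = {
    'int__v': pa.int64(),
    'Decimal__v': pa.float64(),
    'Float__v': pa.float64(),
    'Boolean__v': pa.bool_(),
    'String__v': pa.string(),
    'Null__v': pa.int64(),
    'Date__v': pa.string(),
    'Timestamp__v': pa.string(),
}
glue_schema = pa.schema([pa.field(col, dtype, True) for col, dtype in sessionSchema.items()])

pq_writer = None
# Simplified stand-in for pd.read_csv(**params); 'example.parquet' stands in for the real S3 target key
for chunk_number, chunk in enumerate(pd.read_csv(csv_file, sep='|', chunksize=10), 1):
    table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
    if chunk_number == 1:
        pq_writer = pq.ParquetWriter('example.parquet', table.schema, compression='snappy')
    pq_writer.write_table(table)
pq_writer.close()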
The pyarrow version used in our project is pinned as: pyarrow==5.0.0; python_full_version >= "3.6.2" and python_version < "3.10" and python_version >= "3.6"
Have you tried using pyarrow.csv.read_csv to read an Arrow table from the CSV and then write it to Parquet?
Hope this helps:
>>> import io
>>> import pyarrow.csv as csv
>>> s = """int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
... 1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
... 2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 4|||||||
... """
>>> source = io.BytesIO(s.encode())
# Read with pyarrow.csv.read_csv
>>> parse_options = csv.ParseOptions(delimiter="|")
>>> table = csv.read_csv(source, parse_options=parse_options)
# Write to parquet
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet', compression='snappy')
# Check the result
>>> pq.read_table('example.parquet')["Boolean__v"]
<pyarrow.lib.ChunkedArray object at 0x139459450>
[
[
true,
false,
true,
false,
true,
...
true,
false,
true,
false,
null
]
]
You can also set pyarrow.csv.ReadOptions such as block_size and encoding, and pyarrow.csv.ParseOptions such as ignore_empty_lines.
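For example, a small sketch combining those options (the sample data here is made up):
import io
import pyarrow.csv as csv

# Hypothetical pipe-delimited sample with an empty line and a missing boolean value
source = io.BytesIO("Int__v|Boolean__v\n1|True\n\n2|\n".encode("utf-8"))

read_options = csv.ReadOptions(block_size=1 << 20, encoding="utf-8")   # read in ~1 MB blocks
parse_options = csv.ParseOptions(delimiter="|", ignore_empty_lines=True)

table = csv.read_csv(source, read_options=read_options, parse_options=parse_options)
print(table["Boolean__v"])   # true, then null; the empty line is skipped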
A couple of things I should explain here so that you have visibility and can suggest a solution that fits: (1) The source is a CSV file of roughly 6 GB (compressed or uncompressed), so we don't read the whole file into memory with pandas; we read it in chunks and pass each chunk to pyarrow to convert to Parquet and write to S3 until all chunks are done.
(2) This approach keeps memory consumption under control and avoids high memory usage, which is why chunks are used. However, while writing a chunk to the s3:// folder we get the error below: Python Error: <>, exitCode: <139>
Have you come across this scenario, or do you know how to overcome it or where it is happening? Please let me know.
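For reference, here is a minimal sketch of this chunked flow using pyarrow.csv.open_csv, which reads the CSV in blocks instead of loading it whole; the local paths below are placeholders for the real 6 GB input and the s3:// target:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

source_path = 'input.csv'        # placeholder for the 6 GB source file
target_path = 'output.parquet'   # placeholder for the s3:// target key

read_options = csv.ReadOptions(block_size=64 << 20)   # read roughly 64 MB per block
parse_options = csv.ParseOptions(delimiter='|')
# Column types can be pinned with csv.ConvertOptions(column_types=...) if
# type inference should not vary from block to block.

reader = csv.open_csv(source_path, read_options=read_options, parse_options=parse_options)
writer = pq.ParquetWriter(target_path, reader.schema, compression='snappy')
for batch in reader:
    writer.write_table(pa.Table.from_batches([batch], schema=reader.schema))
writer.close()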
There is not enough information for us to reproduce your issue and help solve it. I will close it as is.