parquet conversion failed,Bool column has NA values in column boolean__v
I'm using the code below. The input data has a boolean column with both null and non-null values, but it fails at the Parquet conversion step with "parquet conversion failed, Bool column has NA values in column boolean__v". Kindly let me know what the issue could be.
for chunk_number, chunk in enumerate(pd.read_csv(**read_csv_args), 1):
    fields = []
    for col, dtypes in sessionSchema.items():
        # nullable=True; yet when a DataFrame that does have nulls is passed, the schema appears to be ignored
        fields.append(pa.field(col, dtypes, True))
    glue_schema = pa.schema(fields)
    table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
    if chunk_number == 1:
        schema = table.schema
        # Open a Parquet file for writing
        pq_writer = pq.ParquetWriter(targetKey, schema, compression='snappy')
    # Write the CSV chunk to the parquet file
    pq_writer.write_table(table)
What version of PyArrow are you using?
I created this minimal reproducible example; can you run it and check whether it works for you?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
chunk = pd.DataFrame([True, False, None], columns=['col1'])
field = pa.field("col1", pa.bool_())
glue_schema = pa.schema([field])
table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
# Open a Parquet file for writing
pq_writer = pq.ParquetWriter('example.parquet',
                             schema=glue_schema,
                             compression='snappy')
# Write CSV chunk to the parquet file
pq_writer.write_table(table)
pq_writer.close()
# Read the chunk
pq.read_table('example.parquet').to_pandas()
It should create this output:
>>> pq.read_table('example.parquet').to_pandas()
    col1
0   True
1  False
2   None
Technically, here are the sample data and the read_csv parameters:
from io import StringIO
csv_file = StringIO("""int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
4|||||||
""")
params = {
    'filepath_or_buffer': csv_file,
    'chunksize': 10,
    'encoding': 'UTF-8',
    'sep': '|',
    'low_memory': True,
    'engine': 'python',
    'skip_blank_lines': True
}
Next, I loop through the chunks, convert each one to Parquet, and write it to S3. Notice that the fourth column, Boolean__v, is of boolean datatype and the last row has no value for it, i.e. null data. That is where I get "parquet conversion failed, Bool column has NA values in column boolean__v".
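For illustration, here is a minimal sketch of how the pieces fit together; the sessionSchema mapping, the simplified read_csv call, and the local output path are assumptions for the example, not the exact values from our project:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed column types; the real sessionSchema may differ
sessionSchema = {
    'int__v': pa.int64(),
    'Decimal__v': pa.float64(),
    'Float__v': pa.float64(),
    'Boolean__v': pa.bool_(),
    'String__v': pa.string(),
    'Null__v': pa.int64(),
    'Date__v': pa.string(),
    'Timestamp__v': pa.string(),
}
glue_schema = pa.schema([pa.field(col, dtype, True) for col, dtype in sessionSchema.items()])

pq_writer = None
# Simplified stand-in for pd.read_csv(**params); 'example.parquet' stands in for the real S3 target key
for chunk_number, chunk in enumerate(pd.read_csv(csv_file, sep='|', chunksize=10), 1):
    table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
    if chunk_number == 1:
        pq_writer = pq.ParquetWriter('example.parquet', table.schema, compression='snappy')
    pq_writer.write_table(table)
pq_writer.close()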
The pyarrow version used in our project is pinned as: pyarrow==5.0.0; python_full_version >= "3.6.2" and python_version < "3.10" and python_version >= "3.6"
Have you tried using pyarrow.csv.read_csv to read an Arrow table from the CSV and then write it to Parquet?
Hope this helps:
>>> import io
>>> import pyarrow.csv as csv
>>> s = """int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
... 1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
... 2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 4|||||||
... """
>>> source = io.BytesIO(s.encode())
# Read with pyarrow.csv.read_csv
>>> parse_options = csv.ParseOptions(delimiter="|")
>>> table = csv.read_csv(source, parse_options=parse_options)
# Write to parquet
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet', compression='snappy')
# Check the result
>>> pq.read_table('example.parquet')["Boolean__v"]
<pyarrow.lib.ChunkedArray object at 0x139459450>
[
[
true,
false,
true,
false,
true,
...
true,
false,
true,
false,
null
]
]
You can also set pyarrow.csv.ReadOptions such as block_size and encoding, and pyarrow.csv.ParseOptions such as ignore_empty_lines.
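For example, a small sketch combining those options (the sample data here is made up):
import io
import pyarrow.csv as csv

# Hypothetical pipe-delimited sample with an empty line and a missing boolean value
source = io.BytesIO("Int__v|Boolean__v\n1|True\n\n2|\n".encode("utf-8"))

read_options = csv.ReadOptions(block_size=1 << 20, encoding="utf-8")   # read in ~1 MB blocks
parse_options = csv.ParseOptions(delimiter="|", ignore_empty_lines=True)

table = csv.read_csv(source, read_options=read_options, parse_options=parse_options)
print(table["Boolean__v"])   # true, then null; the empty line is skipped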
A couple of things I should explain here so that you have visibility and can suggest a solution that fits: (1) The source is a CSV file of roughly 6 GB (compressed or uncompressed), so we don't read the whole file into memory with pandas; we read it in chunks and pass each chunk to pyarrow to convert to Parquet and write to S3 until all chunks are done.
(2) This approach keeps memory consumption under control and avoids high memory usage, which is why chunks are used. However, while writing a chunk to the s3:// folder we get the error below: Python Error: <>, exitCode: <139>
Have you come across this scenario, or do you know how to overcome it or where it is happening? Please let me know.
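For reference, here is a minimal sketch of this chunked flow using pyarrow.csv.open_csv, which reads the CSV in blocks instead of loading it whole; the local paths below are placeholders for the real 6 GB input and the s3:// target:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

source_path = 'input.csv'        # placeholder for the 6 GB source file
target_path = 'output.parquet'   # placeholder for the s3:// target key

read_options = csv.ReadOptions(block_size=64 << 20)   # read roughly 64 MB per block
parse_options = csv.ParseOptions(delimiter='|')
# Column types can be pinned with csv.ConvertOptions(column_types=...) if
# type inference should not vary from block to block.

reader = csv.open_csv(source_path, read_options=read_options, parse_options=parse_options)
writer = pq.ParquetWriter(target_path, reader.schema, compression='snappy')
for batch in reader:
    writer.write_table(pa.Table.from_batches([batch], schema=reader.schema))
writer.close()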
There is not enough information for us to reproduce your issue and help solve it. I will close it as is.