[Bug] Test framework failing with "Schema of the two tables is not the same" while columns match

Open agoyal26 opened this issue 1 year ago • 3 comments
Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Library/core

What happened + What you expected to happen

I expected the test to pass when the column names are the same for the output and expected tables, but it fails with the error: "Schema of the two tables is not the same"

Reproduction script

>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.read_parquet("test-data/input/sample_1.parquet")
>>> table = pa.Table.from_pandas(df)
>>> table.schema
document_id: string
contents: string
document: string
title: string
date_acquired: string
repo_name: string
license: string
language: string
ext: string
size: int64
dataset: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1523
>>> type(table.schema)
<class 'pyarrow.lib.Schema'>
>>> type(dir(table.schema))
<class 'list'>
>>> df2 = pd.read_parquet("test-data/expected/sample_1.parquet")
>>> table2 = pa.Table.from_pandas(df2)
>>> table2.schema
document_id: string
contents: string
document: string
title: string
date_acquired: string
repo_name: string
license: string
language: string
ext: string
size: int64
dataset: string
line_mean: double
line_max: int64
total_num_lines: int64
avg_longest_lines: double
alphanum_frac: double
char_token_ratio: double
autogenerated: bool
config_or_test: bool
has_no_keywords: bool
has_few_assignments: bool
is_xml: bool
is_html: bool
has_ast: bool
code_complexity: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 3234
>>> type(table.schema)
<class 'pyarrow.lib.Schema'>

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

agoyal26 avatar Jul 09 '24 09:07 agoyal26

Thanks for the submission. However, I don't understand your reproduction script. It does not seem to use any of the data-prep-kit library. Can you provide a set of CLI or other commands and the python file you're running and where you're running it, etc?

daw3rd avatar Jul 11 '24 22:07 daw3rd

This issue is seen when a transform uses pandas and the pandas DataFrame is converted back to a pyarrow table.

# pyarrow table to pandas
df = table.to_pandas()
new_df = process_with_some_func(df)


out_table = pa.Table.from_pandas(new_df)
# With the default preserve_index setting, this step can add the pandas index
# to the table (for example as an __index_level_0__ column), which changes the
# schema, so the default option is the culprit here.

How to solve this:

Don't use the default option when converting from pandas; pass preserve_index=False:

out_table = pa.Table.from_pandas(new_df, preserve_index=False)

The transform will then produce a schema that the test framework can compare correctly.

shivdeep-singh-ibm avatar Jul 12 '24 06:07 shivdeep-singh-ibm

So it is not really a bug; it is a feature of PyArrow. Can we document a solution and close it? The library does not support pandas. If you use pandas for local transforms, it's your responsibility to build the table correctly.

blublinsky avatar Jul 12 '24 12:07 blublinsky

Got it, resolved. Thanks

agoyal26 avatar Sep 10 '24 04:09 agoyal26