data-prep-kit
[Bug] Test framework failing with "Schema of the two tables is not the same" while columns match
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Library/core
What happened + What you expected to happen
I expected the test to pass when the column names are the same for the output and expected tables, but it fails with the error "Schema of the two tables is not the same".
Reproduction script
import pyarrow as pa
import pandas as pd
df = pd.read_parquet("test-data/input/sample_1.parquet")
table = pa.Table.from_pandas(df)
table.schema
document_id: string
contents: string
document: string
title: string
date_acquired: string
repo_name: string
license: string
language: string
ext: string
size: int64
dataset: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1523
type(table.schema)
<class 'pyarrow.lib.Schema'>
type(dir(table.schema))
<class 'list'>
df2 = pd.read_parquet("test-data/expected/sample_1.parquet")
table2 = pa.Table.from_pandas(df2)
table2.schema
document_id: string
contents: string
document: string
title: string
date_acquired: string
repo_name: string
license: string
language: string
ext: string
size: int64
dataset: string
line_mean: double
line_max: int64
total_num_lines: int64
avg_longest_lines: double
alphanum_frac: double
char_token_ratio: double
autogenerated: bool
config_or_test: bool
has_no_keywords: bool
has_few_assignments: bool
is_xml: bool
is_html: bool
has_ast: bool
code_complexity: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 3234
type(table.schema)
<class 'pyarrow.lib.Schema'>
Anything else
No response
OS
MacOS (limited support)
Python
3.10.x
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Thanks for the submission. However, I don't understand your reproduction script. It does not seem to use any of the data-prep-kit library. Can you provide a set of CLI or other commands and the python file you're running and where you're running it, etc?
This issue occurs when a transform uses pandas and the pandas DataFrame is then converted back to a PyArrow table:
# pyarrow table to pandas
df = table.to_pandas()
new_df = process_with_some_func(df)
out_table = pa.Table.from_pandas(new_df)
# With the default options, this step can add an index column (for example __index_level_0__) to the table, which changes the schema, so the default option is the culprit here.
How to solve this:
Don't rely on the default option when converting from pandas; pass preserve_index=False:
out_table = pa.Table.from_pandas(new_df, preserve_index=False)
The transform will then produce a schema that the test framework can compare correctly.
So it is not really a bug; it is a feature of PyArrow. Can we document the solution and close this? The library does not support pandas. If you use pandas for local transforms, it's your responsibility to build the table correctly.
Got it, resolved. Thanks