chdb icon indicating copy to clipboard operation
chdb copied to clipboard

Support complex type in Python Table Engine like Array and Map, Object

Open auxten opened this issue 1 year ago • 2 comments

Code snippet to reproduce:

df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_community_reports.parquet',
                     columns=["id","community","level","title","summary", "findings","rank","rank_explanation"])

batched_import(statement, df)
df.head(2)

df.dtypes

import chdb

df=df.astype({'id': 'string'})
df2=df[["id","level","rank"]]
chdb.query("FROM Python(df2) SELECT max(id),sum(level),avg(rank)  group by all").show()
df.dtypes

parquet-test-data.zip

auxten avatar Jul 24 '24 11:07 auxten

Data seems contain dict type, debuging

                                      id community  level  ...                                           findings rank                                   rank_explanation
0   e7822326-4da8-4954-afa9-be7f4f5791a5        42      2  ...  [{'explanation': 'Marley's Ghost plays a cruci...  8.0  The impact severity rating is high due to the ...
1   8a5afac1-99ef-4f01-a1b1-f044ce392ff9        43      2  ...  [{'explanation': 'The Ghost, identified at tim...  8.5  The impact severity rating is high due to the ...
2   83fc3bf3-ce93-4feb-8bb1-b278b062c672        44      2  ...  [{'explanation': 'Mr. Scrooge's character unde...  8.5  The high impact severity rating reflects the p...
3   86d8af0d-738c-45b4-a36c-ec9cd25b56dd        45      2  ...  [{'explanation': 'Bob Cratchit is depicted as ...  7.5  The impact severity rating is relatively high ...
4   77882794-545d-4aba-98c7-2f1294a9e8df        46      2  ...  [{'explanation': 'Martha is a central figure i...  7.5  The impact severity rating is moderately high ...
5   829803bc-0928-4825-9c5a-d07284f8309d        41      2  ...  [{'explanation': 'Ebenezer Scrooge's story beg...  8.5  The high impact severity rating reflects the p...
6   834f9502-3bfd-4500-91c5-0cb14a7ae254        20      1  ...  [{'explanation': 'Jacob Marley's appearance to...  7.5  The impact severity rating is high due to the ...
7   1df0b304-686f-4ac0-92ce-5a9244ad36cf        21      1  ...  [{'explanation': 'The Ghost, often interpreted...  8.5  The impact severity rating is high due to The ...
8   24802e92-18e3-4638-90d0-3872159110dd        22      1  ...  [{'explanation': 'The firm 'Scrooge and Marley...  4.0  The impact severity rating is moderate due to ...
9   bd71ffb7-2324-4987-9947-dc3c49809e18        23      1  ...  [{'explanation': 'Scrooge wields considerable ...  4.0  The impact severity rating is moderate, reflec...

auxten avatar Jul 29 '24 09:07 auxten

Currently, ClickHouse will read the column "findings" as Array(Tuple(explanation Nullable(String), summary Nullable(String))) from parquet. If the Semistructured Columns is done. Maybe chDB can use JSON type for this kind of column which is more generic and easy to handle in Python.

So, I will wait for https://github.com/ClickHouse/ClickHouse/pull/66444

auxten avatar Aug 06 '24 08:08 auxten