polars
Panic when mismatching types between glob files
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
Log output
(med-data) user@macos:~/git/med-data $ POLARS_VERBOSE=1 rp
Python 3.10.11 (main, May 7 2023, 18:32:37) [Clang 16.0.3 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
thread 'thread 'polars-4polars-0' panicked at ' panicked at crates/polars-parquet/src/arrow/read/statistics/mod.rs/rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs::376250::435:
:
called `Option::unwrap()` on a `None` valueExpected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
0: 0x1114238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
1: 0x10ec8239b - core::fmt::write::h4a73583a3886d3b0
2: 0x1113f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
3: 0x1114279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
4: 0x111427269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
5: 0x111428f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
6: 0x111427cda - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
7: 0x111427c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
8: 0x111427c56 - _rust_begin_unwind
9: 0x1115e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
10: 0x1115e38e4 - core::panicking::panic::hb3e838924bd2f646
11: 0x1115e3ca8 - core::option::unwrap_failed::h8fd98a81a93ecfe7
12: 0x110973982 - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
13: 0x10fb59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
14: 0x10fb5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
15: 0x10ffbe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
16: 0x10ffc0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
17: 0x111931120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
18: 0x1111ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
19: 0x1111ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
20: 0x11142c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
21: 0x7ff803c5d18b - __pthread_start
stack backtrace:
0: 0x1114238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
1: 0x10ec8239b - core::fmt::write::h4a73583a3886d3b0
2: 0x1113f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
3: 0x1114279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
4: 0x111427269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
5: 0x111428f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
6: 0x111427d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
7: 0x111427c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
8: 0x111427c56 - _rust_begin_unwind
9: 0x1115e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
10: 0x1109767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
11: 0x110972ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
12: 0x10fb59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
13: 0x10fb5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
14: 0x10ffbe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
15: 0x10ffc0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
16: 0x111931120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
17: 0x1111ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
18: 0x1111ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
19: 0x11142c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
20: 0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
>>>
Issue description
The issue does not occur if I remove the drop_nulls part, e.g.
pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).collect()
The issue also does not occur if I change the glob to ANY specific parquet file, e.g.
>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
shape: (994_078, 2)
┌─────────────────────────────────┬──────────────────────────────┐
│ Genericname ┆ Diagnosis │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════╪══════════════════════════════╡
│ 灯盏生脉胶囊 ┆ 类风湿性关节炎;心绞痛;银屑病 │
│ 头孢克洛分散片 ┆ 皮肤感染;皮肤裂伤 │
│ 阿司匹林肠溶片;甲硝唑片;牙痛停 ┆ 牙周炎 │
│ 滴丸 ┆ │
│ 奥硝唑分散片;头孢泊肟酯胶囊 ┆ 阑尾炎 │
│ 玻璃酸钠滴眼液;肠胃宁片 ┆ 干眼症;泄泻病 │
│ … ┆ … │
│ 达格列净片;复方酮康唑发用洗剂 ┆ 糖尿病;头皮糠疹 │
│ 六神丸;维生素A软胶囊 ┆ 痤疮;咽炎 │
│ 甲钴胺片;腰痛宁胶囊;依托考昔片 ┆ 腰椎病 │
│ 急支糖浆;盐酸氨溴索糖浆 ┆ 上呼吸道感染 │
│ 桂枝茯苓丸(浓缩水丸);血府逐瘀颗 ┆ 闭经;血瘀证 │
│ 粒 ┆ │
└─────────────────────────────────┴──────────────────────────────┘
>>>
Expected behavior
No panics.
Installed versions
--------Version info---------
Polars: 0.20.31
Index type: UInt32
Platform: macOS-14.5-x86_64-i386-64bit
Python: 3.10.11 (main, May 7 2023, 18:32:37) [Clang 16.0.3 ]
----Optional dependencies----
adbc_driver_manager:
@coastalwhite we don't have a repro, but we do have a panic on statistics unwrap. Maybe you know what it is?
It is difficult to see, but I think there are two panics here.
- An unwrap of an Option::None at crates/polars-parquet/src/arrow/read/statistics/mod.rs:376.
- An expect_as_binary, I suspect in crates/polars-parquet/src/arrow/read/statistics/mod.rs, somewhere between lines 527 and 532.
I don't see an immediate problem, but since the problem only happens when globbing there might be a schema mismatch?
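One way to check for a schema mismatch between globbed files is to compare the per-file schemas directly. The sketch below is only an illustration: `find_dtype_mismatches` and `load_schemas` are hypothetical helper names, not part of Polars. It assumes `pl.read_parquet_schema` is available in the installed version, which reads the file metadata rather than the row data.

```python
from typing import Dict, Iterable, List, Mapping


def find_dtype_mismatches(
    schemas: Mapping[str, Mapping[str, str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Return {column: {dtype: [files]}} for columns whose dtype differs across files."""
    by_column: Dict[str, Dict[str, List[str]]] = {}
    for path, schema in schemas.items():
        for column, dtype in schema.items():
            # Group the files by the dtype they report for this column.
            by_column.setdefault(column, {}).setdefault(str(dtype), []).append(path)
    # Keep only columns that appear with more than one dtype.
    return {col: dtypes for col, dtypes in by_column.items() if len(dtypes) > 1}


def load_schemas(paths: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: collect each file's schema as {column: dtype name}."""
    import polars as pl  # imported lazily so the comparison helper has no dependencies

    return {
        path: {name: str(dtype) for name, dtype in pl.read_parquet_schema(path).items()}
        for path in paths
    }
```

A possible usage is `find_dtype_mismatches(load_schemas(sorted(glob.glob("data/*.parquet"))))` after `import glob`. An empty result would mean every column has the same dtype in every file.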
Hello, there are 3 files in total. Not sure if this information helps.
user@macos:~/git/med-data $ ll data/*-*-*-*-*-*.parquet
-rw-r--r-- 1 user staff 41M Oct 20 2021 data/2020-02-04-2020-11-01.parquet
-rw-r--r-- 1 user staff 35M Oct 20 2021 data/2020-11-01-2021-03-01.parquet
-rw-r--r-- 1 user staff 59M Oct 20 2021 data/2021-03-01-2021-09-05.parquet
user@macos:~/git/med-data $ rp
Python 3.10.11 (main, May 7 2023, 18:32:37) [Clang 16.0.3 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (789_880, 2)
┌─────────────────────────────────┬──────────────────────────┐
│ Genericname ┆ Diagnosis │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════╪══════════════════════════╡
│ 磷酸奥司他韦颗粒 ┆ 预防性抗流行性感冒治疗 │
│ 硝苯地平控释片 ┆ 原发性高血压 │
│ 富马酸替诺福韦二吡呋酯片 ┆ 慢性乙型肝炎 │
│ 苯磺酸氨氯地平片 ┆ 原发性高血压 │
│ 布地奈德福莫特罗粉吸入剂 ┆ 哮喘 │
│ … ┆ … │
│ 金匮肾气丸;尿感宁颗粒 ┆ 尿路感染;肾气不足证 │
│ 地奈德乳膏;非洛地平缓释片;复方 ┆ 高血压病;气滞血瘀证;湿疹 │
│ 丹参滴丸 ┆ │
│ 牛黄解毒片;蒲地蓝消炎口服液;头 ┆ 牙龈炎 │
│ 孢呋辛酯胶囊 ┆ │
│ 地奈德乳膏;替米沙坦片 ┆ 高血压;脂溢性皮炎 │
│ 头孢氨苄片 ┆ 毛囊炎;中耳炎 │
└─────────────────────────────────┴──────────────────────────┘
>>> pl.scan_parquet("data/2020-11-01-2021-03-01.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (601_951, 2)
┌─────────────────────────────────┬───────────────────────────────┐
│ Genericname ┆ Diagnosis │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════╪═══════════════════════════════╡
│ 复方酮康唑软膏;鲜竹沥 ┆ 皮肤真菌感染;上呼吸道感染 │
│ 甲钴胺分散片;双氯芬酸钠缓释胶囊 ┆ 腰椎间盘突出 │
│ 氨溴特罗口服溶液;孟鲁司特钠片 ┆ 上呼吸道感染;上呼吸道过敏反应 │
│ 护肝片;心脑康胶囊 ┆ 肝气郁结证;瘀血阻络证 │
│ 奥硝唑片;双氯芬酸钠缓释胶囊;头 ┆ 慢性牙周炎 │
│ 孢克洛分散片 ┆ │
│ … ┆ … │
│ 埃索美拉唑镁肠溶片;玻璃酸钠滴眼 ┆ 干眼症;十二指肠溃疡 │
│ 液 ┆ │
│ 丹黄祛瘀胶囊;散结镇痛胶囊 ┆ 血瘀证;子宫内膜异位症 │
│ 陈香露白露片 ┆ 慢性胃炎;特指急性胃炎 │
│ 玉龙油 ┆ 关节炎;痛风 │
│ 罗红霉素片;清热散结片 ┆ 口腔溃疡;皮肤感染 │
└─────────────────────────────────┴───────────────────────────────┘
>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (994_078, 2)
┌─────────────────────────────────┬──────────────────────────────┐
│ Genericname ┆ Diagnosis │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════╪══════════════════════════════╡
│ 灯盏生脉胶囊 ┆ 类风湿性关节炎;心绞痛;银屑病 │
│ 头孢克洛分散片 ┆ 皮肤感染;皮肤裂伤 │
│ 阿司匹林肠溶片;甲硝唑片;牙痛停 ┆ 牙周炎 │
│ 滴丸 ┆ │
│ 奥硝唑分散片;头孢泊肟酯胶囊 ┆ 阑尾炎 │
│ 玻璃酸钠滴眼液;肠胃宁片 ┆ 干眼症;泄泻病 │
│ … ┆ … │
│ 达格列净片;复方酮康唑发用洗剂 ┆ 糖尿病;头皮糠疹 │
│ 六神丸;维生素A软胶囊 ┆ 痤疮;咽炎 │
│ 甲钴胺片;腰痛宁胶囊;依托考昔片 ┆ 腰椎病 │
│ 急支糖浆;盐酸氨溴索糖浆 ┆ 上呼吸道感染 │
│ 桂枝茯苓丸(浓缩水丸);血府逐瘀颗 ┆ 闭经;血瘀证 │
│ 粒 ┆ │
└─────────────────────────────────┴──────────────────────────────┘
>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").select(["Genericname", "Diagnosis"]).describe()
shape: (9, 3)
┌────────────┬───────────────────────┬─────────────────────────────────┐
│ statistic ┆ Genericname ┆ Diagnosis │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════════╪═══════════════════════╪═════════════════════════════════╡
│ count ┆ 789880 ┆ 789880 │
│ null_count ┆ 0 ┆ 0 │
│ mean ┆ null ┆ null │
│ std ┆ null ┆ null │
│ min ┆ ;特非那定片
 ┆ 肠炎 ;上呼吸道感染 │
│ 25% ┆ null ┆ null │
│ 50% ┆ null ┆ null │
│ 75% ┆ null ┆ null │
│ max ┆ (畅迪5号)粉尘螨滴剂 ┆ A族高甘油三脂血症;高血压病;脑 │
│ ┆ ┆ 梗死后遗症 │
└────────────┴───────────────────────┴─────────────────────────────────┘
>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").describe()
shape: (9, 17)
┌────────────┬───────────────┬──────────────────────┬──────────────────────┬───┬─────────┬───────────┬───────────┬──────────────────────┐
│ statistic ┆ Id ┆ Genericname ┆ Diagnosis ┆ … ┆ Checker ┆ CheckTime ┆ Confirmer ┆ ConfirmTime │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │
╞════════════╪═══════════════╪══════════════════════╪══════════════════════╪═══╪═════════╪═══════════╪═══════════╪══════════════════════╡
│ count ┆ 789880.0 ┆ 789880 ┆ 789880 ┆ … ┆ 601427 ┆ 601427 ┆ 601427 ┆ 601427 │
│ null_count ┆ 0.0 ┆ 0 ┆ 0 ┆ … ┆ 188453 ┆ 188453 ┆ 188453 ┆ 188453 │
│ mean ┆ 419382.573516 ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ std ┆ 234739.505744 ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ min ┆ 1.0 ┆ ;特非那定片
 ┆ 肠炎 ;上呼吸道感染 ┆ … ┆ 何黎敏 ┆ 2020/10/1 ┆ 何黎敏 ┆ 2020/10/1 15:00:13 │
│ ┆ ┆ ┆ ┆ ┆ ┆ 14:29:43 ┆ ┆ │
│ 25% ┆ 217296.0 ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ 50% ┆ 421387.0 ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ 75% ┆ 622312.0 ┆ null ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │
│ max ┆ 823498.0 ┆ (畅迪5号)粉尘螨滴 ┆ A族高甘油三脂血症; ┆ … ┆ 黄羡 ┆ 2020/9/30 ┆ 黄羡 ┆ 2020/9/30 15:11:42 │
│ ┆ ┆ 剂 ┆ 高血压病;脑梗死后遗 ┆ ┆ ┆ 15:22:40 ┆ ┆ │
│ ┆ ┆ ┆ 症 ┆ ┆ ┆ ┆ ┆ │
└────────────┴───────────────┴──────────────────────┴──────────────────────┴───┴─────────┴───────────┴───────────┴──────────────────────┘
>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']
>>> pl.scan_parquet("data/2020-11-01-2021-03-01.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']
>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']
>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']
>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").drop_nulls().columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']
>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").drop_nulls().collect().columns
thread 'polars-1' panicked at /rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs:250:5:
Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
0: 0x1113238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
1: 0x10eb8239b - core::fmt::write::h4a73583a3886d3b0
2: 0x1112f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
3: 0x1113279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
4: 0x111327269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
5: 0x111328f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
6: 0x111327d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
7: 0x111327c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
8: 0x111327c56 - _rust_begin_unwind
9: 0x1114e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
10: 0x1108767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
11: 0x110872ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
12: 0x10fa59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
13: 0x10fa5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
14: 0x10febe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
15: 0x10febf682 - rayon_core::join::join_context::{{closure}}::hdb785e885a11ecf5
16: 0x10febec58 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
17: 0x10fec0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
18: 0x111831120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
19: 0x1110ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
20: 0x1110ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
21: 0x11132c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
22: 0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
>>>
One thing I notice here is that there are columns that are typed as strings but contain numbers. Could it maybe be that one of the files has the same column but with different types?
That seems to be the issue.
>>> files = ["data/2020-02-04-2020-11-01.parquet", "data/2020-11-01-2021-03-01.parquet", "data/2021-03-01-2021-09-05.parquet"]
>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']
>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").collect().row(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented
>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented
>>> import pandas as pd
>>> for f in files:
... df = pd.read_parquet(f)
... print(df.info())
...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789880 entries, 0 to 789879
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 789880 non-null int64
1 Genericname 789880 non-null object
2 Diagnosis 789880 non-null object
3 InquiryId 789880 non-null object
4 CreateTime 789880 non-null object
5 UpdateTime 9165 non-null object
6 InqCount 789880 non-null int64
7 Level 789880 non-null int64
8 UpdateBy 254 non-null object
9 Creater 789880 non-null object
10 Platform 712729 non-null object
11 Remark 15070 non-null object
12 Checker 601427 non-null object
13 CheckTime 601427 non-null object
14 Confirmer 601427 non-null object
15 ConfirmTime 601427 non-null object
dtypes: int64(3), object(13)
memory usage: 96.4+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 601952 entries, 0 to 601951
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 601952 non-null int64
1 Genericname 601952 non-null object
2 Diagnosis 601951 non-null object
3 InquiryId 601952 non-null int64 <----------------------------------------- DIFFERENCE
4 CreateTime 601952 non-null object
5 UpdateTime 1857 non-null object
6 InqCount 601952 non-null int64
7 Level 601952 non-null int64
8 UpdateBy 91 non-null object
9 Creater 601952 non-null object
10 Platform 599108 non-null object
11 Remark 8939 non-null object
12 Checker 599108 non-null object
13 CheckTime 599108 non-null object
14 Confirmer 599108 non-null object
15 ConfirmTime 599108 non-null object
dtypes: int64(4), object(12)
memory usage: 73.5+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 994078 entries, 0 to 994077
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 994078 non-null int64
1 Genericname 994078 non-null object
2 Diagnosis 994078 non-null object
3 InquiryId 994078 non-null object
4 CreateTime 994078 non-null object
5 UpdateTime 2329 non-null object
6 InqCount 994078 non-null int64
7 Level 994078 non-null int64
8 UpdateBy 8 non-null object
9 Creater 994078 non-null object
10 Platform 981809 non-null object
11 Remark 2654 non-null object
12 Checker 981809 non-null object
13 CheckTime 981809 non-null object
14 Confirmer 981809 non-null object
15 ConfirmTime 981809 non-null object
dtypes: int64(3), object(13)
memory usage: 121.3+ MB
None
>>> pl.scan_parquet([files[0], files[2]]).collect()
shape: (1_783_958, 16)
┌─────────┬───────────────────┬───────────────────┬───────────────────┬───┬──────────────┬───────────┬──────────────┬───────────────────┐
│ Id ┆ Genericname ┆ Diagnosis ┆ InquiryId ┆ … ┆ Checker ┆ CheckTime ┆ Confirmer ┆ ConfirmTime │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══╪══════════════╪═══════════╪══════════════╪═══════════════════╡
│ 1 ┆ 磷酸奥司他韦颗粒 ┆ 预防性抗流行性感 ┆ 0 ┆ … ┆ null ┆ null ┆ null ┆ null │
│ ┆ ┆ 冒治疗 ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2 ┆ 硝苯地平控释片 ┆ 原发性高血压 ┆ 0 ┆ … ┆ null ┆ null ┆ null ┆ null │
│ 4 ┆ 富马酸替诺福韦二 ┆ 慢性乙型肝炎 ┆ 0 ┆ … ┆ null ┆ null ┆ null ┆ null │
│ ┆ 吡呋酯片 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 5 ┆ 苯磺酸氨氯地平片 ┆ 原发性高血压 ┆ 0 ┆ … ┆ null ┆ null ┆ null ┆ null │
│ 6 ┆ 布地奈德福莫特罗 ┆ 哮喘 ┆ 0 ┆ … ┆ null ┆ null ┆ null ┆ null │
│ ┆ 粉吸入剂 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 2428039 ┆ 达格列净片;复方酮 ┆ 糖尿病;头皮糠疹 ┆ 14346925769166766 ┆ … ┆ 智能审方判断 ┆ 2021/9/6 ┆ 智能审方判断 ┆ 2021/9/6 9:39:23 │
│ ┆ 康唑发用洗剂 ┆ ┆ 96 ┆ ┆ ┆ 9:39:23 ┆ ┆ │
│ 2428040 ┆ 六神丸;维生素A软 ┆ 痤疮;咽炎 ┆ 3862545784759552 ┆ … ┆ 智能审方判断 ┆ 2021/9/6 ┆ 智能审方判断 ┆ 2021/9/6 9:39:28 │
│ ┆ 胶囊 ┆ ┆ ┆ ┆ ┆ 9:39:28 ┆ ┆ │
│ 2428041 ┆ 甲钴胺片;腰痛宁胶 ┆ 腰椎病 ┆ 3862545878415360 ┆ … ┆ 智能审方判断 ┆ 2021/9/6 ┆ 智能审方判断 ┆ 2021/9/6 9:39:35 │
│ ┆ 囊;依托考昔片 ┆ ┆ ┆ ┆ ┆ 9:39:35 ┆ ┆ │
│ 2428042 ┆ 急支糖浆;盐酸氨溴 ┆ 上呼吸道感染 ┆ 4347993676906752 ┆ … ┆ 智能审方判断 ┆ 2021/9/6 ┆ 智能审方判断 ┆ 2021/9/6 9:39:36 │
│ ┆ 索糖浆 ┆ ┆ ┆ ┆ ┆ 9:39:36 ┆ ┆ │
│ 2428043 ┆ 桂枝茯苓丸(浓缩水 ┆ 闭经;血瘀证 ┆ 14346923868894577 ┆ … ┆ 智能审方判断 ┆ 2021/9/6 ┆ 智能审方判断 ┆ 2021/9/6 9:39:41 │
│ ┆ 丸);血府逐瘀颗粒 ┆ ┆ 53 ┆ ┆ ┆ 9:39:41 ┆ ┆ │
└─────────┴───────────────────┴───────────────────┴───────────────────┴───┴──────────────┴───────────┴──────────────┴───────────────────┘
>>> pl.scan_parquet([files[0], files[2]]).drop_nulls().collect()
shape: (48, 16)
┌────────┬───────────────────┬───────────────────┬───────────────────┬───┬──────────────┬────────────┬──────────────┬───────────────────┐
│ Id ┆ Genericname ┆ Diagnosis ┆ InquiryId ┆ … ┆ Checker ┆ CheckTime ┆ Confirmer ┆ ConfirmTime │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │
╞════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══╪══════════════╪════════════╪══════════════╪═══════════════════╡
│ 210034 ┆ 利拉鲁肽注射液;缬 ┆ 缺血性脑血管病;糖 ┆ 3692776318482176 ┆ … ┆ 陈佩斯 ┆ 2020/5/15 ┆ 韩丽琴 ┆ 2020/5/15 │
│ ┆ 沙坦氢氯噻嗪胶囊; ┆ 尿病;原发性高血压 ┆ ┆ ┆ ┆ 4:04:59 ┆ ┆ 19:26:42 │
│ ┆ 银杏叶提取物片 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 210734 ┆ 阿奇霉素分散片;枸 ┆ 高血压病;男性勃起 ┆ 3692783911425792 ┆ … ┆ 陈佩斯 ┆ 2020/5/15 ┆ 唐明嵩 ┆ 2020/5/15 │
│ ┆ 橼酸西地那非片;马 ┆ 障碍;软组织感染 ┆ ┆ ┆ ┆ 2:57:29 ┆ ┆ 20:35:13 │
│ ┆ 来酸依那普利片;双 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ 氯芬酸… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 211476 ┆ 复方酮康唑发用洗 ┆ 肺动脉高压;甲状腺 ┆ 3692745775624705 ┆ … ┆ 陈佩斯 ┆ 2020/5/15 ┆ 唐明嵩 ┆ 2020/5/15 │
│ ┆ 剂;枸橼酸西地那非 ┆ 功能减退症;心绞痛 ┆ ┆ ┆ ┆ 4:59:10 ┆ ┆ 22:01:27 │
│ ┆ 片;通脉颗粒;左甲 ┆ ;脂溢性皮炎 ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ 状腺素钠… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 214903 ┆ 苯磺酸左氨氯地平 ┆ 高血压病;男性勃起 ┆ 12607392863288279 ┆ … ┆ 唐明嵩 ┆ 2020/5/16 ┆ 吴雪静 ┆ 2020/5/16 │
│ ┆ 片;枸橼酸西地那非 ┆ 障碍 ┆ 22 ┆ ┆ ┆ 16:17:25 ┆ ┆ 17:19:47 │
│ ┆ 片 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 220492 ┆ 富马酸比索洛尔片; ┆ 不稳定性心绞痛;男 ┆ 3693396635109120 ┆ … ┆ 陈祉羽 ┆ 2020/5/17 ┆ 苏锡茵 ┆ 2020/5/18 8:05:50 │
│ ┆ 枸橼酸西地那非片 ┆ 性勃起障碍 ┆ ┆ ┆ ┆ 14:54:51 ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 743981 ┆ 阿司匹林肠溶片;胱 ┆ 头皮糠疹;脱发 ┆ 13166628469894104 ┆ … ┆ 智能审方判断 ┆ 2020/10/15 ┆ 智能审方判断 ┆ 2020/10/15 │
│ ┆ 氨酸片 ┆ ┆ 02 ┆ ┆ ┆ 16:57:01 ┆ ┆ 16:57:01 │
│ 745923 ┆ 坎地沙坦酯片;马来 ┆ 高血压病 ┆ 3724199280330496 ┆ … ┆ 翁庸徳 ┆ 2020/10/10 ┆ 苏锡茵 ┆ 2020/10/15 │
│ ┆ 酸依那普利片 ┆ ┆ ┆ ┆ ┆ 18:19:03 ┆ ┆ 22:40:58 │
│ 751981 ┆ 酚酞片;牛黄解毒片 ┆ 便秘病;热毒证 ┆ 12906196614272082 ┆ … ┆ 黄羡 ┆ 2020/10/12 ┆ 翁庸徳 ┆ 2020/10/16 │
│ ┆ ┆ ┆ 79 ┆ ┆ ┆ 9:25:59 ┆ ┆ 22:29:34 │
│ 809815 ┆ 地特胰岛素注射液; ┆ 1型糖尿病;高血压 ┆ 3713215386539776 ┆ … ┆ 翁庸徳 ┆ 2020/10/28 ┆ 苏锡茵 ┆ 2020/10/29 │
│ ┆ 厄贝沙坦片;罗红霉 ┆ 病;支气管炎 ┆ ┆ ┆ ┆ 16:21:42 ┆ ┆ 11:17:13 │
│ ┆ 素氨溴索片 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 811401 ┆ 非诺贝特胶囊;门冬 ┆ 1型糖尿病;高脂血 ┆ 3732088374658816 ┆ … ┆ 翁庸徳 ┆ 2020/10/28 ┆ 苏锡茵 ┆ 2020/10/29 │
│ ┆ 胰岛素注射液 ┆ 症 ┆ ┆ ┆ ┆ 16:35:08 ┆ ┆ 17:47:00 │
└────────┴───────────────────┴───────────────────┴───────────────────┴───┴──────────────┴────────────┴──────────────┴───────────────────┘
>>> pl.scan_parquet([files[0], files[1]]).drop_nulls().collect()
thread 'polars-1' panicked at /rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs:250:5:
Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
0: 0x1113238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
1: 0x10eb8239b - core::fmt::write::h4a73583a3886d3b0
2: 0x1112f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
3: 0x1113279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
4: 0x111327269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
5: 0x111328f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
6: 0x111327d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
7: 0x111327c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
8: 0x111327c56 - _rust_begin_unwind
9: 0x1114e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
10: 0x1108767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
11: 0x110872ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
12: 0x10fa59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
13: 0x10fa5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
14: 0x10febe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
15: 0x10fec0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
16: 0x111831120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
17: 0x1110ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
18: 0x1110ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
19: 0x11132c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
20: 0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
But does that mean that even when specific columns are selected, the whole schema is still checked?
pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
Is the error polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented related? With either error message, I find it hard to locate the problem.
@failable if this still occurs after #17321, can you open a new issue with a proper reproducible example? We cannot take action on this one.
@ritchie46 Thanks, seems the issue has been fixed now!
When will we have a release? It took me an hour to build the main branch on my local machine.
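For anyone who hits the same type mismatch on a version without the fix, one possible workaround is to scan each file on its own, cast the mismatching column to a common dtype, and concatenate the lazy frames. This is a sketch and has not been tested against the data in this thread. `scan_with_common_dtype` is a hypothetical helper name, and it assumes the column that differs is InquiryId, as the pandas output above suggests.

```python
from typing import Iterable


def scan_with_common_dtype(paths: Iterable[str], column: str = "InquiryId"):
    """Scan each file separately, cast `column` to a string, then concatenate lazily."""
    import polars as pl  # lazy import: only needed when the function is called

    frames = [
        # Each scan uses the file's own schema, so the cast sees that file's real dtype.
        pl.scan_parquet(path).with_columns(pl.col(column).cast(pl.Utf8))
        for path in paths
    ]
    return pl.concat(frames)
```

The result is a LazyFrame, so the original query could then be written as `scan_with_common_dtype(files).select(["Genericname", "Diagnosis"]).drop_nulls().collect()`, with `files` being the list of paths defined earlier in the thread.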