bug: Decimal columns dropped from Parquet
Description
When I create a table and load the attached Parquet file, the transfer_value column is not present in the loaded data.
Here's the file's contents:
import pandas as pd
pd.read_parquet("nft_df.parquet")
This displays the file's contents, which include a final column named transfer_value:
entity_id transfer_block_number contract_address token_id transfer_value
0 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|314 17183150 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 314 1041900000000000000
1 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|2095 17181436 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 2095 0
2 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|271 17180562 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 271 0
3 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|493 17174850 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 493 1037900000000000000
4 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|1538 17173914 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 1538 0
.. ... ... ... ... ...
95 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|7059 17180319 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 7059 0
96 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|1357 17180319 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 1357 0
97 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|1335 17180319 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 1335 0
98 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|767 17178469 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 767 0
99 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|7317 17178469 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 7317 0
Now, let's create a table for the data and load it.
import pandas as pd
from kaskada.api.session import LocalBuilder
from kaskada import table as ktable
from kaskada import query

# Start a local Kaskada session.
ksession = LocalBuilder().build()

# Create a table keyed by entity_id, with transfer_block_number as the time column.
ktable.create_table(
    table_name="NFTEvents2",
    entity_key_column_name="entity_id",
    time_column_name="transfer_block_number",
)

# Load the Parquet file and query the full table.
ktable.load("NFTEvents2", "nft_df.parquet")
query.create_query("NFTEvents2")

# Read back the result file written by the engine (path elided).
pd.read_parquet("/path/to/result.parquet")
This produces the following result (notice there's no transfer_value column):
_time _subsort _key_hash ... transfer_block_number contract_address token_id
0 1970-01-01 00:00:00.017149235 10214858896208958069 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
1 1970-01-01 00:00:00.017149235 10214858896208958079 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
2 1970-01-01 00:00:00.017149235 10214858896208958089 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
3 1970-01-01 00:00:00.017149235 10214858896208958099 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
4 1970-01-01 00:00:00.017149235 10214858896208958109 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
.. ... ... ... ... ... ... ...
95 1970-01-01 00:00:00.017183149 10214858896208958124 17866332839928913180 ... 17183149 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 3874
96 1970-01-01 00:00:00.017183149 10214858896208958125 11355771492542101722 ... 17183149 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 1543
97 1970-01-01 00:00:00.017183149 10214858896208958126 11050314825651612082 ... 17183149 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 1279
98 1970-01-01 00:00:00.017183150 10214858896208958040 14759331064377829079 ... 17183150 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 314
99 1970-01-01 00:00:00.017183150 10214858896208958050 11196409809721123352 ... 17183150 0xf486f696b80164b5943191236eca114f4efab2ff 1660
File (unzip first): nft_df.parquet.zip
Thanks to @kerinin for filing this and providing steps to reproduce. I am able to reproduce the bug. Here's some further investigation into the issue:
I took the provided Parquet file and ran the same query. The returned schema suggests the problem may be in prepare.

Furthermore, the prepared file appears to drop the column as well. I've attached the prepared version of the provided Parquet file: prepared_table.parquet.zip

I'll continue looking at the issue.
Unfortunately, the results are due to our lack of support for the Decimal datatype. During data preparation, we filter and drop columns that are not supported by the system.
The local run of the preparation produced the following logs: 2023-05-04T08-52-14-engine-stdout.log. When a column is not supported, we log at INFO that it was dropped:
2023-05-04T13:52:21.300806Z  INFO Discarding decimal column 'transfer_value'
Running pqrs on the provided parquet file, we get the following schema:
Metadata for file: nft_df.parquet
version: 2
num of rows: 100
created by: parquet-cpp-arrow version 10.0.1
metadata:
pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 100, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "entity_id", "field_name": "entity_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "transfer_block_number", "field_name": "transfer_block_number", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "contract_address", "field_name": "contract_address", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "token_id", "field_name": "token_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "transfer_value", "field_name": "transfer_value", "pandas_type": "decimal", "numpy_type": "object", "metadata": {"precision": 19, "scale": 0}}], "creator": {"library": "pyarrow", "version": "10.0.1"}, "pandas_version": "1.3.5"}
ARROW:schema: /////3gFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAwEAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAANcDAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6ICJlbnRpdHlfaWQiLCAiZmllbGRfbmFtZSI6ICJlbnRpdHlfaWQiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidHJhbnNmZXJfYmxvY2tfbnVtYmVyIiwgImZpZWxkX25hbWUiOiAidHJhbnNmZXJfYmxvY2tfbnVtYmVyIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImNvbnRyYWN0X2FkZHJlc3MiLCAiZmllbGRfbmFtZSI6ICJjb250cmFjdF9hZGRyZXNzIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRva2VuX2lkIiwgImZpZWxkX25hbWUiOiAidG9rZW5faWQiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidHJhbnNmZXJfdmFsdWUiLCAiZmllbGRfbmFtZSI6ICJ0cmFuc2Zlcl92YWx1ZSIsICJwYW5kYXNfdHlwZSI6ICJkZWNpbWFsIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogeyJwcmVjaXNpb24iOiAxOSwgInNjYWxlIjogMH19XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEwLjAuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjUifQAFAAAACAEAAKwAAAB0AAAARAAAAAQAAAAc////AAABBxAAAAAoAAAABAAAAAAAAAAOAAAAdHJhbnNmZXJfdmFsdWUAAAAABgAIAAQABgAAABMAAABY////AAABBRAAAAAcAAAABAAAAAAAAAAIAAAAdG9rZW5faWQAAAAASP///4T///8AAAEFEAAAACQAAAAEAAAAAAAAABAAAABjb250cmFjdF9hZGRyZXNzAAAAAHz///+4////AAABAhAAAAAwAAAABAAAAAAAAAAVAAAAdHJhbnNmZXJfYmxvY2tfbnVtYmVyAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAIAAAAAQAAAAAAAAACQAAAGVudGl0eV9pZAAAAAQABAAEAAAAAAAAAA==
message schema {
OPTIONAL BYTE_ARRAY entity_id (STRING);
OPTIONAL INT64 transfer_block_number;
OPTIONAL BYTE_ARRAY contract_address (STRING);
OPTIONAL BYTE_ARRAY token_id (STRING);
OPTIONAL FIXED_LEN_BYTE_ARRAY (9) transfer_value (DECIMAL(19,0));
}
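For reference, the decimal type can also be confirmed with pyarrow. This is a minimal sketch; it only assumes the attached nft_df.parquet file:

import pyarrow.parquet as pq

# Inspect the Arrow schema of the provided file; transfer_value should
# show up as decimal128(19, 0).
schema = pq.read_schema("nft_df.parquet")
print(schema.field("transfer_value").type)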
It appears that pandas wrote the transfer_value column with a decimal type. Unfortunately, we do not support decimal columns (we should file a feature request for this).
Suggestion: Given that the file is produced by pandas, the datatype can be converted from a decimal to an int before loading. The workaround is to cast the column to an int type until we support the decimal type; a sketch of that cast is shown below.
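This is a minimal sketch of the workaround, assuming the values fit in an int64 (as they do in the attached file); the output file name is arbitrary:

import pandas as pd

# Cast the decimal column to int64 (or float) before writing the Parquet
# file so that Kaskada can ingest it.
df = pd.read_parquet("nft_df.parquet")
df["transfer_value"] = df["transfer_value"].astype("int64")
df.to_parquet("nft_df_int.parquet")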
We should also be more transparent about the fact that a column was dropped during preparation.
In the immediate term, could we change from silently dropping the column to returning an error on data load? The error message could suggest converting the decimal to a float. Silently dropping data isn't a great user experience. A sketch of what such a check might look like is below.
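As a hypothetical illustration (not the engine's actual validation, which lives in Rust), a pre-load schema check along these lines could fail fast with a helpful message:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical pre-load check: reject unsupported decimal columns with a
# clear message instead of silently dropping them.
def check_supported_schema(path: str) -> None:
    for field in pq.read_schema(path):
        if pa.types.is_decimal(field.type):
            raise ValueError(
                f"column '{field.name}' has unsupported type {field.type}; "
                "cast it to a float or integer before loading"
            )

check_supported_schema("nft_df.parquet")  # raises for transfer_value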
From the old repo:
As part of the quick fix for #321 (closed) we dropped decimal columns. At some point, we should figure out how to handle them. The likely medium-term solution is to cast them to a float, which would allow using the columns, but Arrow doesn't support that cast yet. The resulting error is likely less critical in our context, since the model is likely to be trained on floating-point values between 0 and 1. The long-term solution is probably to support Decimal the way SQL does.
Ben said:
Grooming note: Arrow may have added support for decimal columns, and Arrow2 may already have it as well, but we are still dropping them. These haven't come up frequently as an ask. One thing to consider is that decimals would add a lot of complexity to arithmetic: what happens when you add two decimals with different precision, etc.?
Potentially relevant:
- https://github.com/apache/arrow-rs/issues/131
- https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Decimal256
One of the main difficulties here is determining the output precision. Many operations have different rules (for example, in SQL multiplying DECIMAL(p1, s1) by DECIMAL(p2, s2) typically yields DECIMAL(p1 + p2 + 1, s1 + s2), while addition follows a different rule). This would likely need to be part of the type system, or arithmetic would need special rules for operating on decimal types. We should do this, but it isn't a small amount of work.
One idea for a "quick fix" which we could do (and file a feature request for full support) would be to treat the decimal type as a mostly opaque type, but still pull it into the system. Instead of being able to do arithmetic directly on it, we could only support explicit casts:
- Cast to a float: decimal as f64 to work with it as a float (e.g., for ML where we're going to float anyway)
- Cast to an integer: decimal as i64 if we're OK with rounding
- decimal_to_int_parts(decimal, precision) to produce an i64 representing the number of 0.1^precision units. For instance, decimal_to_int_parts(decimal, 2) would produce the number of cents (0.01) in the decimal (see the sketch after this list).
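As a minimal Python sketch of the proposed decimal_to_int_parts semantics (the function name and truncation behavior are assumptions, not an existing API):

from decimal import Decimal

# Hypothetical: count whole 0.1**precision units in a decimal value,
# truncating any remainder (precision=2 counts cents).
def decimal_to_int_parts(value: Decimal, precision: int) -> int:
    return int(value.scaleb(precision))

assert decimal_to_int_parts(Decimal("12.34"), 2) == 1234  # number of cents
assert decimal_to_int_parts(Decimal("12.34"), 0) == 12    # whole units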
This would allow us to support decimal columns now, with explicit conversion to supported types, and introduce math operations later. If we go this route, we would likely also want a useful error message for things like decimal + decimal, saying "arithmetic on decimals is not supported -- convert to a number using one of these options" (which we could do by detecting cases where the decimal type is converted to a number or ordered).
We'd also need to figure out how to reflect the scale in the type -- possibly decimal[n] or some such.
In the latest changes (#323), the flow now fails at file ingestion rather than implicitly dropping the column. The next step is to display a friendlier error message rather than an INTERNAL ERROR.