bug: Decimal columns dropped from Parquet
Description
When I create a table and load the attached Parquet file, the transfer_value column is not present in the loaded data.
Here's the file's contents:
import pandas as pd
pd.read_parquet("nft_df.parquet")
This displays the file's contents, which include a final column named transfer_value:
entity_id transfer_block_number contract_address token_id transfer_value
0 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|314 17183150 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 314 1041900000000000000
1 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|2095 17181436 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 2095 0
2 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|271 17180562 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 271 0
3 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|493 17174850 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 493 1037900000000000000
4 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a|1538 17173914 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 1538 0
.. ... ... ... ... ...
95 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|7059 17180319 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 7059 0
96 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|1357 17180319 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 1357 0
97 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|1335 17180319 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 1335 0
98 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|767 17178469 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 767 0
99 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51|7317 17178469 0xdbc61a1685c6f70c11cbb7da70338352b1fa4d51 7317 0
Now, let's create a table for the data and load it.
import pandas as pd
from kaskada.api.session import LocalBuilder
from kaskada import table as ktable
from kaskada import query

# Start a local Kaskada session.
ksession = LocalBuilder().build()

# Create a table keyed by entity_id, with transfer_block_number as the time column.
ktable.create_table(
    table_name="NFTEvents2",
    entity_key_column_name="entity_id",
    time_column_name="transfer_block_number",
)

# Load the Parquet file and query the full table.
ktable.load("NFTEvents2", "nft_df.parquet")
query.create_query("NFTEvents2")

# Read back the result file written by the engine (path elided).
pd.read_parquet("/path/to/result.parquet")
This produces the following result (notice there's no transfer_value column):
_time _subsort _key_hash ... transfer_block_number contract_address token_id
0 1970-01-01 00:00:00.017149235 10214858896208958069 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
1 1970-01-01 00:00:00.017149235 10214858896208958079 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
2 1970-01-01 00:00:00.017149235 10214858896208958089 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
3 1970-01-01 00:00:00.017149235 10214858896208958099 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
4 1970-01-01 00:00:00.017149235 10214858896208958109 5737523040439182766 ... 17149235 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 2133
.. ... ... ... ... ... ... ...
95 1970-01-01 00:00:00.017183149 10214858896208958124 17866332839928913180 ... 17183149 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 3874
96 1970-01-01 00:00:00.017183149 10214858896208958125 11355771492542101722 ... 17183149 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 1543
97 1970-01-01 00:00:00.017183149 10214858896208958126 11050314825651612082 ... 17183149 0x40a0f793ccfd9bc6f2532d06d022be0f978ecbf3 1279
98 1970-01-01 00:00:00.017183150 10214858896208958040 14759331064377829079 ... 17183150 0xcb45dce241a6f6f1e874e9925bea5840f3fd7d3a 314
99 1970-01-01 00:00:00.017183150 10214858896208958050 11196409809721123352 ... 17183150 0xf486f696b80164b5943191236eca114f4efab2ff 1660
File (unzip first): nft_df.parquet.zip
Thanks to @kerinin for filing this and providing steps to reproduce. I am able to reproduce the bug. Here's some further investigation into the issue:
I took the provided Parquet file and ran the same query. The returned schema suggests the problem may be in prepare.

Furthermore, the prepared file appears to drop the column as well. I've attached the prepared version of the provided Parquet file: prepared_table.parquet.zip

I'll continue looking at the issue.
Unfortunately, the results are due to our lack of support for the Decimal datatype. During data preparation, we filter and drop columns that are not supported by the system.
The local run of the preparation produced the following logs: 2023-05-04T08-52-14-engine-stdout.log. When a column is not supported, we log at INFO that it was dropped:
2023-05-04T13:52:21.300806Z  INFO Discarding decimal column 'transfer_value'
Running pqrs on the provided parquet file, we get the following schema:
Metadata for file: nft_df.parquet
version: 2
num of rows: 100
created by: parquet-cpp-arrow version 10.0.1
metadata:
pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 100, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "entity_id", "field_name": "entity_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "transfer_block_number", "field_name": "transfer_block_number", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "contract_address", "field_name": "contract_address", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "token_id", "field_name": "token_id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "transfer_value", "field_name": "transfer_value", "pandas_type": "decimal", "numpy_type": "object", "metadata": {"precision": 19, "scale": 0}}], "creator": {"library": "pyarrow", "version": "10.0.1"}, "pandas_version": "1.3.5"}
ARROW:schema: /////3gFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAAwEAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAANcDAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDEwMCwgInN0ZXAiOiAxfV0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogeyJlbmNvZGluZyI6ICJVVEYtOCJ9fV0sICJjb2x1bW5zIjogW3sibmFtZSI6ICJlbnRpdHlfaWQiLCAiZmllbGRfbmFtZSI6ICJlbnRpdHlfaWQiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidHJhbnNmZXJfYmxvY2tfbnVtYmVyIiwgImZpZWxkX25hbWUiOiAidHJhbnNmZXJfYmxvY2tfbnVtYmVyIiwgInBhbmRhc190eXBlIjogImludDY0IiwgIm51bXB5X3R5cGUiOiAiaW50NjQiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogImNvbnRyYWN0X2FkZHJlc3MiLCAiZmllbGRfbmFtZSI6ICJjb250cmFjdF9hZGRyZXNzIiwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRva2VuX2lkIiwgImZpZWxkX25hbWUiOiAidG9rZW5faWQiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidHJhbnNmZXJfdmFsdWUiLCAiZmllbGRfbmFtZSI6ICJ0cmFuc2Zlcl92YWx1ZSIsICJwYW5kYXNfdHlwZSI6ICJkZWNpbWFsIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogeyJwcmVjaXNpb24iOiAxOSwgInNjYWxlIjogMH19XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEwLjAuMSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMS4zLjUifQAFAAAACAEAAKwAAAB0AAAARAAAAAQAAAAc////AAABBxAAAAAoAAAABAAAAAAAAAAOAAAAdHJhbnNmZXJfdmFsdWUAAAAABgAIAAQABgAAABMAAABY////AAABBRAAAAAcAAAABAAAAAAAAAAIAAAAdG9rZW5faWQAAAAASP///4T///8AAAEFEAAAACQAAAAEAAAAAAAAABAAAABjb250cmFjdF9hZGRyZXNzAAAAAHz///+4////AAABAhAAAAAwAAAABAAAAAAAAAAVAAAAdHJhbnNmZXJfYmxvY2tfbnVtYmVyAAAACAAMAAgABwAIAAAAAAAAAUAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQUQAAAAIAAAAAQAAAAAAAAACQAAAGVudGl0eV9pZAAAAAQABAAEAAAAAAAAAA==
message schema {
OPTIONAL BYTE_ARRAY entity_id (STRING);
OPTIONAL INT64 transfer_block_number;
OPTIONAL BYTE_ARRAY contract_address (STRING);
OPTIONAL BYTE_ARRAY token_id (STRING);
OPTIONAL FIXED_LEN_BYTE_ARRAY (9) transfer_value (DECIMAL(19,0));
}
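For reference, the decimal type can also be confirmed with pyarrow. This is a minimal sketch; it only assumes the attached nft_df.parquet file:

import pyarrow.parquet as pq

# Inspect the Arrow schema of the provided file; transfer_value should
# show up as decimal128(19, 0).
schema = pq.read_schema("nft_df.parquet")
print(schema.field("transfer_value").type)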
It appears that pandas wrote the transfer_value column with a decimal type. Unfortunately, we do not support decimal columns (we should file a feature request for this).
Suggestion: Given that the file is produced by pandas, the datatype can be converted from a decimal to an int before loading. The workaround is to cast the column to an int type until we support the decimal type; a sketch of that cast is shown below.
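This is a minimal sketch of the workaround, assuming the values fit in an int64 (as they do in the attached file); the output file name is arbitrary:

import pandas as pd

# Cast the decimal column to int64 (or float) before writing the Parquet
# file so that Kaskada can ingest it.
df = pd.read_parquet("nft_df.parquet")
df["transfer_value"] = df["transfer_value"].astype("int64")
df.to_parquet("nft_df_int.parquet")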
We should also be more transparent about the fact that a column was dropped during preparation.
In the immediate term, could we change from silently dropping the column to returning an error on data load? The error message could suggest converting the decimal to a float. Silently dropping data isn't a great user experience. A sketch of what such a check might look like is below.
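As a hypothetical illustration (not the engine's actual validation, which lives in Rust), a pre-load schema check along these lines could fail fast with a helpful message:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical pre-load check: reject unsupported decimal columns with a
# clear message instead of silently dropping them.
def check_supported_schema(path: str) -> None:
    for field in pq.read_schema(path):
        if pa.types.is_decimal(field.type):
            raise ValueError(
                f"column '{field.name}' has unsupported type {field.type}; "
                "cast it to a float or integer before loading"
            )

check_supported_schema("nft_df.parquet")  # raises for transfer_value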
From the old repo:
As part of the quick fix for #321 (closed) we dropped decimal columns. At some point, we should figure out how to handle them. The likely medium-term solution is to cast them to a float, which would allow using the columns, but Arrow doesn't support that cast yet. The resulting error is likely less critical in our context, since the model is likely to be trained on floating-point values between 0 and 1. The long-term solution is probably to support Decimal the way SQL does.
Ben said:
Grooming note: Arrow may have added support for decimal columns, and Arrow2 may already have it as well, but we are still dropping them. These haven't come up frequently as an ask. One thing to consider is that decimals would add a lot of complexity to arithmetic: what happens when you add two decimals with different precision, etc.?
Potentially relevant:
- https://github.com/apache/arrow-rs/issues/131
- https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Decimal256
One of the main difficulties here is determining the output precision. Many operations have different rules (for example, in SQL multiplying DECIMAL(p1, s1) by DECIMAL(p2, s2) typically yields DECIMAL(p1 + p2 + 1, s1 + s2), while addition follows a different rule). This would likely need to be part of the type system, or arithmetic would need special rules for operating on decimal types. We should do this, but it isn't a small amount of work.
One idea for a "quick fix" which we could do (and file a feature request for full support) would be to treat the decimal type as a mostly opaque type, but still pull it into the system. Instead of being able to do arithmetic directly on it, we could only support explicit casts:
- Cast to a float: decimal as f64 to work with it as a float (e.g., for ML where we're going to float anyway)
- Cast to an integer: decimal as i64 if we're OK with rounding
- decimal_to_int_parts(decimal, precision) to produce an i64 representing the number of 0.1^precision units. For instance, decimal_to_int_parts(decimal, 2) would produce the number of cents (0.01) in the decimal (see the sketch after this list).
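As a minimal Python sketch of the proposed decimal_to_int_parts semantics (the function name and truncation behavior are assumptions, not an existing API):

from decimal import Decimal

# Hypothetical: count whole 0.1**precision units in a decimal value,
# truncating any remainder (precision=2 counts cents).
def decimal_to_int_parts(value: Decimal, precision: int) -> int:
    return int(value.scaleb(precision))

assert decimal_to_int_parts(Decimal("12.34"), 2) == 1234  # number of cents
assert decimal_to_int_parts(Decimal("12.34"), 0) == 12    # whole units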
This would allow us to support decimal columns now, with explicit conversion to supported types, and introduce math operations later. If we go this route, we would likely also want a useful error message for things like decimal + decimal, saying "arithmetic on decimals is not supported -- convert to a number using one of these options" (which we could do by detecting cases where the decimal type is converted to a number or ordered).
We'd also need to figure out how to reflect the scale in the type -- possibly decimal[n] or some such.
In the latest changes (#323), the flow now fails at file ingestion rather than implicitly dropping the column. The next step is to display a friendlier error message rather than an INTERNAL ERROR.