Pre-aggregations containing nullable decimal measures fail in Athena Driver
Describe the bug
I'm trying to pre-aggregate a measure that contains decimal values while other measures can be NULL, and I get the following error during pre-aggregation:
Error: Error during create table: CREATE TABLE prod_pre_aggregations.fct_my_table_name_main20241001_asdfghjkl ('list of fields and data types') WITH (input_format = 'csv_no_header', delimiter = '^A', build_range_end = '2024-10-31T23:59:59.999') LOCATION ?, ?: Create table failed: Can't parse '\N
4010' column value for 'fct_user_progress_contents__value_workload_total_completed' column: Internal: invalid float literal
at WebSocket.<anonymous> (/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
at WebSocket.emit (node:events:519:28)
at Receiver.receiverOnMessage (/node_modules/ws/lib/websocket.js:1008:20)
at Receiver.emit (node:events:519:28)
at Receiver.dataMessage (/node_modules/ws/lib/receiver.js:502:14)
at Receiver.getData (/node_modules/ws/lib/receiver.js:435:17)
at Receiver.startLoop (/node_modules/ws/lib/receiver.js:143:22)
at Receiver._write (/node_modules/ws/lib/receiver.js:78:10)
at writeOrBuffer (node:internal/streams/writable:570:12)
at _write (node:internal/streams/writable:499:10)
at Receiver.Writable.write (node:internal/streams/writable:508:10)
at TLSSocket.socketOnData (/node_modules/ws/lib/websocket.js:1102:35)
at TLSSocket.emit (node:events:519:28)
at addChunk (node:internal/streams/readable:559:12)
at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
at TLSSocket.Readable.push (node:internal/streams/readable:390:5)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:191:23)
To Reproduce
Steps to reproduce the behavior:
- Create a cube YAML file with the following measures:

      - name: value_workload_total
        sql: 1000 * ROUND(value_workload_minutes/cast(value_workload_minutes_count as double), 3)
        type: sum
      - name: value_workload_total_completed
        sql: 1000 * ROUND(value_workload_minutes/cast(value_workload_minutes_count as double), 3)
        type: sum
        filters:
          - sql: flag_enrollment_completed_pathway = TRUE

- Add both measures to a pre-aggregation
- Build the pre-aggregation
Expected behavior
I expect the measures to be pre-aggregated successfully.

Version: 1.0.5

Additional context
I found a similar issue for the Snowflake driver: https://github.com/cube-js/cube/issues/6693. I wonder how to solve it for Athena.
Hi @Larissa-Rocha 👋
I've tried to reproduce this and while I wasn't able to get exactly the same error, I acknowledge that this case could have been handled better.
The data model that I used:
cubes:
  - name: test
    sql: >
      SELECT
        NULL AS value,
        NULL AS value_workload_minutes,
        NULL AS value_workload_minutes_count,
        TRUE AS flag_enrollment_completed_pathway

    dimensions:
      - name: value
        sql: value
        type: string

    measures:
      - name: value_workload_total
        sql: 1000 * ROUND(CAST(value_workload_minutes AS DOUBLE) / CAST(value_workload_minutes_count AS DOUBLE), 3)
        type: sum
      - name: value_workload_total_completed
        sql: 1000 * ROUND(CAST(value_workload_minutes AS DOUBLE) / CAST(value_workload_minutes_count AS DOUBLE), 3)
        type: sum
        filters:
          - sql: flag_enrollment_completed_pathway = TRUE

    pre_aggregations:
      - name: main
        dimensions:
          - value
        measures:
          - value_workload_total
          - value_workload_total_completed
The pre-aggregation does not build; here's the error:
Error: Error during create table: CREATE TABLE prod_pre_aggregations.test_main_srp5fcpr_isgxjz4g_1jl16fr (`test__value` unknown) LOCATION ?: Custom type 'unknown' is not supported
at WebSocket.<anonymous> (/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
at WebSocket.emit (node:events:519:28)
at Receiver.receiverOnMessage (/node_modules/ws/lib/websocket.js:1008:20)
at Receiver.emit (node:events:519:28)
at Receiver.dataMessage (/node_modules/ws/lib/receiver.js:502:14)
at Receiver.getData (/node_modules/ws/lib/receiver.js:435:17)
at Receiver.startLoop (/node_modules/ws/lib/receiver.js:143:22)
at Receiver._write (/node_modules/ws/lib/receiver.js:78:10)
at writeOrBuffer (node:internal/streams/writable:570:12)
at _write (node:internal/streams/writable:499:10)
at Receiver.Writable.write (node:internal/streams/writable:508:10)
at TLSSocket.socketOnData (/node_modules/ws/lib/websocket.js:1102:35)
at TLSSocket.emit (node:events:519:28)
at addChunk (node:internal/streams/readable:559:12)
at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
at TLSSocket.Readable.push (node:internal/streams/readable:390:5)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:191:23)
My Cube version was v1.1.7.
That being said, I guess that a possible workaround would be to safeguard against NULL values as much as possible. For example, all potentially NULL values can be surrounded with COALESCE(..., 0) or something similar so that they evaluate to 0 instead of NULL. Obviously, what to substitute NULLs with depends on your domain.
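For illustration, here's a rough sketch of how that could look in the model above (assuming 0 is an acceptable stand-in for NULL in your domain; pick whatever fallback value makes sense for your data):

measures:
  - name: value_workload_total
    # Sketch only: fall back to 0 whenever the inputs (and therefore the whole
    # expression) are NULL, so the pre-aggregation doesn't receive NULL values
    sql: >
      COALESCE(
        1000 * ROUND(
          CAST(value_workload_minutes AS DOUBLE) /
          CAST(value_workload_minutes_count AS DOUBLE),
          3
        ),
        0
      )
    type: sum
  - name: value_workload_total_completed
    sql: >
      COALESCE(
        1000 * ROUND(
          CAST(value_workload_minutes AS DOUBLE) /
          CAST(value_workload_minutes_count AS DOUBLE),
          3
        ),
        0
      )
    type: sum
    filters:
      - sql: flag_enrollment_completed_pathway = TRUE

Keep in mind that the filtered measure can still end up as NULL for groups where no row matches the filter (SUM over an empty or all-NULL set is NULL), so this reduces, rather than fully eliminates, the chances of hitting the parsing error.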
Hi @igorlukanin, after further investigation, I found that pre-aggregation of columns containing special characters such as ", á, and â also fails with a Can't parse column value error. Do you have any tips on how to escape those values, or will this issue be addressed any time soon? I'm using Cube Cloud 1.0.5.
Hi @Larissa-Rocha 👋
I've quickly checked it with the following data model:
cubes:
  - name: test_2
    sql: >
      SELECT 'á' AS value UNION ALL
      SELECT 'â' AS value UNION ALL
      SELECT '"' AS value

    dimensions:
      - name: value
        sql: value
        type: string

    pre_aggregations:
      - name: main
        dimensions:
          - value
And it worked as expected. So I guess something else must be causing that error. Could you please help narrow this down?
Hi @igorlukanin! We’ve identified an issue when using the schema below in Cube.js. The pre-aggregation configured to be partitioned by month encounters errors when Cube attempts to create a CSV file to store the pre-aggregation data. The problem specifically occurs when one of the columns contains more than two double quotes ("). This seems to cause an error during the CSV creation process using the csv-write-stream library, resulting in a malformed file. The generated CSV cannot be read correctly, which breaks queries that depend on the pre-aggregation. Here’s the schema used for testing:
cubes:
  - name: fct_test
    public: false
    sql: |
      SELECT
        title,
        value_sum,
        value_count,
        date_reference
      FROM (
        SELECT
          '"Title" x "Subtitle"' AS title,
          97.8 AS value_sum,
          10 AS value_count,
          CAST('2024-10-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title" \ "Subtitle"' AS title,
          27.3 AS value_sum,
          20 AS value_count,
          CAST('2024-10-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title 3"' AS title,
          30.9 AS value_sum,
          30 AS value_count,
          CAST('2024-11-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title\\4"' AS title,
          40.7 AS value_sum,
          40 AS value_count,
          CAST('2024-11-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title"5"' AS title,
          58.1 AS value_sum,
          50 AS value_count,
          CAST('2024-12-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title" 6' AS title,
          69.8 AS value_sum,
          60 AS value_count,
          CAST('2024-12-10' AS TIMESTAMP(3)) AS date_reference
      )

    dimensions:
      - name: title
        sql: title
        type: string
      - name: dateReference
        sql: date_reference
        type: time

    measures:
      - name: valueSum
        sql: value_sum
        type: sum
      - name: valueCount
        sql: value_count
        type: sum
      - name: valueAvg
        sql: ({valueSum} / {valueCount})
        type: number

    preAggregations:
      - name: main
        type: rollup
        dimensions:
          - title
        measures:
          - valueSum
          - valueCount
        time_dimension: dateReference
        granularity: day
        partition_granularity: month
In the example above, titles like '"Title" x "Subtitle"' and '"Title" \ "Subtitle"' contain multiple double quotes, which appears to be causing the issue. When titles include only one or two double quotes, the CSV appears to be generated correctly. Could you test this schema and verify if the problem occurs in your environment? It would be great to confirm whether the issue is directly related to csv-write-stream or if there’s something we can adjust in Cube.js to prevent these errors. Thank you in advance for your attention and support!
We also faced the same issue and tried to modify Cube Store, but it didn't work for all cases.
@Larissa-Rocha Found any solutions on this?
@kk-r We found that simply replacing the double quotes in the affected columns in the schema was the way to go:
REGEXP_REPLACE(column_1, '"', '''')
This replaces double quotes with single quotes.
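Applied to the fct_test model above, that workaround would look roughly like this (a sketch only; note that it alters the stored text, since every double quote becomes a single quote):

dimensions:
  - name: title
    # Sketch: swap double quotes for single quotes before the value reaches the
    # pre-aggregation, so the generated CSV no longer contains " characters
    sql: REGEXP_REPLACE(title, '"', '''')
    type: string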