Pre-aggregations containing nullable decimal measures fail in Athena Driver
Describe the bug
I'm trying to pre-aggregate a measure that contains decimal values while other measures can be NULL, and I get the following error during pre-aggregation:
Error: Error during create table: CREATE TABLE prod_pre_aggregations.fct_my_table_name_main20241001_asdfghjkl ('list of fields and data types') WITH (input_format = 'csv_no_header', delimiter = '^A', build_range_end = '2024-10-31T23:59:59.999') LOCATION ?, ?: Create table failed: Can't parse '\N
4010' column value for 'fct_user_progress_contents__value_workload_total_completed' column: Internal: invalid float literal
at WebSocket.<anonymous> (/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
at WebSocket.emit (node:events:519:28)
at Receiver.receiverOnMessage (/node_modules/ws/lib/websocket.js:1008:20)
at Receiver.emit (node:events:519:28)
at Receiver.dataMessage (/node_modules/ws/lib/receiver.js:502:14)
at Receiver.getData (/node_modules/ws/lib/receiver.js:435:17)
at Receiver.startLoop (/node_modules/ws/lib/receiver.js:143:22)
at Receiver._write (/node_modules/ws/lib/receiver.js:78:10)
at writeOrBuffer (node:internal/streams/writable:570:12)
at _write (node:internal/streams/writable:499:10)
at Receiver.Writable.write (node:internal/streams/writable:508:10)
at TLSSocket.socketOnData (/node_modules/ws/lib/websocket.js:1102:35)
at TLSSocket.emit (node:events:519:28)
at addChunk (node:internal/streams/readable:559:12)
at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
at TLSSocket.Readable.push (node:internal/streams/readable:390:5)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:191:23)
To Reproduce
Steps to reproduce the behavior:
- Create a cube YAML file with the following measures:

      - name: value_workload_total
        sql: 1000 * ROUND(value_workload_minutes/cast(value_workload_minutes_count as double), 3)
        type: sum
      - name: value_workload_total_completed
        sql: 1000 * ROUND(value_workload_minutes/cast(value_workload_minutes_count as double), 3)
        type: sum
        filters:
          - sql: flag_enrollment_completed_pathway = TRUE

- Add both measures to a pre-aggregation
- Build the pre-aggregation
Expected behavior
I expect the measures to be pre-aggregated successfully.

Version: 1.0.5

Additional context
I found a similar issue for the Snowflake driver: https://github.com/cube-js/cube/issues/6693. I wonder how to solve it for Athena.
Hi @Larissa-Rocha 👋
I've tried to reproduce this and while I wasn't able to get exactly the same error, I acknowledge that this case could have been handled better.
The data model that I used:
cubes:
  - name: test
    sql: >
      SELECT
        NULL AS value,
        NULL AS value_workload_minutes,
        NULL AS value_workload_minutes_count,
        TRUE AS flag_enrollment_completed_pathway

    dimensions:
      - name: value
        sql: value
        type: string

    measures:
      - name: value_workload_total
        sql: 1000 * ROUND(CAST(value_workload_minutes AS DOUBLE) / CAST(value_workload_minutes_count AS DOUBLE), 3)
        type: sum
      - name: value_workload_total_completed
        sql: 1000 * ROUND(CAST(value_workload_minutes AS DOUBLE) / CAST(value_workload_minutes_count AS DOUBLE), 3)
        type: sum
        filters:
          - sql: flag_enrollment_completed_pathway = TRUE

    pre_aggregations:
      - name: main
        dimensions:
          - value
        measures:
          - value_workload_total
          - value_workload_total_completed
The pre-aggregation does not build; here's the error:
Error: Error during create table: CREATE TABLE prod_pre_aggregations.test_main_srp5fcpr_isgxjz4g_1jl16fr (`test__value` unknown) LOCATION ?: Custom type 'unknown' is not supported
at WebSocket.<anonymous> (/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
at WebSocket.emit (node:events:519:28)
at Receiver.receiverOnMessage (/node_modules/ws/lib/websocket.js:1008:20)
at Receiver.emit (node:events:519:28)
at Receiver.dataMessage (/node_modules/ws/lib/receiver.js:502:14)
at Receiver.getData (/node_modules/ws/lib/receiver.js:435:17)
at Receiver.startLoop (/node_modules/ws/lib/receiver.js:143:22)
at Receiver._write (/node_modules/ws/lib/receiver.js:78:10)
at writeOrBuffer (node:internal/streams/writable:570:12)
at _write (node:internal/streams/writable:499:10)
at Receiver.Writable.write (node:internal/streams/writable:508:10)
at TLSSocket.socketOnData (/node_modules/ws/lib/websocket.js:1102:35)
at TLSSocket.emit (node:events:519:28)
at addChunk (node:internal/streams/readable:559:12)
at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
at TLSSocket.Readable.push (node:internal/streams/readable:390:5)
at TLSWrap.onStreamRead (node:internal/stream_base_commons:191:23)
My Cube version was v1.1.7.
That being said, I guess that a possible workaround would be to safeguard against NULL values as much as possible. For example, all potentially NULL values can be surrounded with COALESCE(..., 0) or something similar so that they evaluate to 0 instead of NULL. Obviously, what to substitute NULLs with depends on your domain.
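For illustration, here's a rough sketch of how that could look in the model above (assuming 0 is an acceptable stand-in for NULL in your domain; pick whatever fallback value makes sense for your data):

measures:
  - name: value_workload_total
    # Sketch only: fall back to 0 whenever the inputs (and therefore the whole
    # expression) are NULL, so the pre-aggregation doesn't receive NULL values
    sql: >
      COALESCE(
        1000 * ROUND(
          CAST(value_workload_minutes AS DOUBLE) /
          CAST(value_workload_minutes_count AS DOUBLE),
          3
        ),
        0
      )
    type: sum
  - name: value_workload_total_completed
    sql: >
      COALESCE(
        1000 * ROUND(
          CAST(value_workload_minutes AS DOUBLE) /
          CAST(value_workload_minutes_count AS DOUBLE),
          3
        ),
        0
      )
    type: sum
    filters:
      - sql: flag_enrollment_completed_pathway = TRUE

Keep in mind that the filtered measure can still end up as NULL for groups where no row matches the filter (SUM over an empty or all-NULL set is NULL), so this reduces, rather than fully eliminates, the chances of hitting the parsing error.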
Hi @igorlukanin, after further investigation, I found that pre-aggregation of columns containing special characters such as ", á, and â also fails with a Can't parse column value error. Do you have any tips on how to escape those values, or will this issue be addressed any time soon? I'm using Cube Cloud 1.0.5.
Hi @Larissa-Rocha 👋
I've quickly checked it with the following data model:
cubes:
  - name: test_2
    sql: >
      SELECT 'á' AS value UNION ALL
      SELECT 'â' AS value UNION ALL
      SELECT '"' AS value

    dimensions:
      - name: value
        sql: value
        type: string

    pre_aggregations:
      - name: main
        dimensions:
          - value
And it worked as expected. So I guess something else must be causing that error. Could you please help narrow this down?
Hi @igorlukanin! We’ve identified an issue when using the schema below in Cube.js. The pre-aggregation configured to be partitioned by month encounters errors when Cube attempts to create a CSV file to store the pre-aggregation data. The problem specifically occurs when one of the columns contains more than two double quotes ("). This seems to cause an error during the CSV creation process using the csv-write-stream library, resulting in a malformed file. The generated CSV cannot be read correctly, which breaks queries that depend on the pre-aggregation. Here’s the schema used for testing:
cubes:
  - name: fct_test
    public: false
    sql: |
      SELECT
        title,
        value_sum,
        value_count,
        date_reference
      FROM (
        SELECT
          '"Title" x "Subtitle"' AS title,
          97.8 AS value_sum,
          10 AS value_count,
          CAST('2024-10-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title" \ "Subtitle"' AS title,
          27.3 AS value_sum,
          20 AS value_count,
          CAST('2024-10-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title 3"' AS title,
          30.9 AS value_sum,
          30 AS value_count,
          CAST('2024-11-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title\\4"' AS title,
          40.7 AS value_sum,
          40 AS value_count,
          CAST('2024-11-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title"5"' AS title,
          58.1 AS value_sum,
          50 AS value_count,
          CAST('2024-12-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT
          '"Title" 6' AS title,
          69.8 AS value_sum,
          60 AS value_count,
          CAST('2024-12-10' AS TIMESTAMP(3)) AS date_reference
      )

    dimensions:
      - name: title
        sql: title
        type: string
      - name: dateReference
        sql: date_reference
        type: time

    measures:
      - name: valueSum
        sql: value_sum
        type: sum
      - name: valueCount
        sql: value_count
        type: sum
      - name: valueAvg
        sql: ({valueSum} / {valueCount})
        type: number

    preAggregations:
      - name: main
        type: rollup
        dimensions:
          - title
        measures:
          - valueSum
          - valueCount
        time_dimension: dateReference
        granularity: day
        partition_granularity: month
In the example above, titles like '"Title" x "Subtitle"' and '"Title" \ "Subtitle"' contain multiple double quotes, which appears to be causing the issue. When titles include only one or two double quotes, the CSV appears to be generated correctly. Could you test this schema and verify if the problem occurs in your environment? It would be great to confirm whether the issue is directly related to csv-write-stream or if there’s something we can adjust in Cube.js to prevent these errors. Thank you in advance for your attention and support!
We also faced the same issue and tried to modify Cube Store, but it didn't work for all cases.
@Larissa-Rocha Found any solutions on this?
@kk-r We found that simply replacing the double quotes in the affected columns in the schema was the way to go:
REGEXP_REPLACE(column_1, '"', '''')
This replaces double quotes with single quotes.
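Applied to the fct_test model above, that workaround would look roughly like this (a sketch only; note that it alters the stored text, since every double quote becomes a single quote):

dimensions:
  - name: title
    # Sketch: swap double quotes for single quotes before the value reaches the
    # pre-aggregation, so the generated CSV no longer contains " characters
    sql: REGEXP_REPLACE(title, '"', '''')
    type: string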