
Support writing arbitrary precision decimal columns

Open dburtsev opened this issue 3 years ago • 7 comments

Hello, I am trying to copy data from an MS SQL table with a NUMERIC(18,9) column "num" to Redshift, into a NUMERIC(18,9) column. When I try to load the data into Redshift I get an error: ERROR [XX000] [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: [SQLState XX000] ERROR: Spectrum Scan Error DETAIL:

  error:  Spectrum Scan Error
  code:      15001
  context:   File 'https://s3.us-east-1.amazonaws.com/xyz/new_line.parquet' column 'num' has an invalid type length. Expecting: 8 len in file: 13
  query:     50511242
  location:  dory_util.cpp:1141
  process:   worker_thread [pid=12646]

When I changed the destination column precision to 29, I was able to load the data. I would like the destination table precision to be the same as in the source table. Thanks, Dmitriy

dburtsev avatar Feb 08 '22 16:02 dburtsev
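For context on the error: the Parquet format can store a decimal of precision up to 18 in an 8-byte int64, while higher precisions need a wider fixed-length byte array (13 bytes is the minimum for precision 29), so "Expecting: 8 len in file: 13" means Redshift expected the compact layout for NUMERIC(18,9) but the file used the wider one. Below is a hedged sketch of how such a column ends up being written with ParquetSharp's Column/LogicalWriter API; the file name and values are illustrative, and at the time of this thread the library only accepted precision 29 for decimal columns.

```csharp
using ParquetSharp;

// Sketch of the write path discussed above. ParquetSharp's decimal
// support at this time required precision 29, which forces the wide
// fixed-length layout regardless of the source SQL type.
var columns = new Column[]
{
    new Column<decimal>("num", LogicalType.Decimal(precision: 29, scale: 9)),
};

using var writer = new ParquetFileWriter("new_line.parquet", columns);
using var rowGroup = writer.AppendRowGroup();
using var numWriter = rowGroup.NextColumn().LogicalWriter<decimal>();
numWriter.WriteBatch(new[] { 1.234567890m, 2.5m });
writer.Close();
```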

I'm not familiar with Redshift but this documentation suggests it's possible to load data from a Parquet file that uses a different decimal precision by using an external table definition: https://aws.amazon.com/premiumsupport/knowledge-center/redshift-spectrum-data-errors/

It would be a fair amount of work to implement support for arbitrary precision decimal columns in ParquetSharp and that's not currently a high priority feature.

adamreeve avatar Feb 08 '22 21:02 adamreeve

"To resolve the invalid type length error in Redshift Spectrum, use an external table definition. The table definition must match the "precision" and "scale" values defined in the external file. For example:

create external table ext_schema.tablename (c1 int, c2 decimal (6,2)) stored as PARQUET location 's3://.../.../';

In this example, the updated values (in the c2 decimal column) for "precision" and "scale" are set to 6 and 2, respectively. Therefore, the CREATE EXTERNAL TABLE definition values listed in the c2 column must match the values defined in the Apache Parquet file."

This is not a critical issue because I am loading data into staging tables, but it looks like other libraries can do this without a problem. Can you please put this issue on your to-do list? Thanks, Dmitriy

dburtsev avatar Feb 08 '22 21:02 dburtsev

Yes, I will leave this issue open to track this feature.

adamreeve avatar Feb 08 '22 21:02 adamreeve

Hello, any updates on decimal columns?

dburtsev avatar Apr 25 '23 19:04 dburtsev

Hi @dburtsev, no sorry, I can't give an update yet on when this will be fixed. There have been some small improvements to decimal data reading, which now allows decimals stored as int32 or int64 to be read, but there has been no progress on the write side.

adamreeve avatar Apr 25 '23 22:04 adamreeve
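For readers following along, reading such a column back goes through a `LogicalColumnReader<decimal>`. A minimal hedged sketch against the ParquetSharp reader API (file name and column index are illustrative):

```csharp
using ParquetSharp;

// Minimal read sketch; per the comment above, newer ParquetSharp
// versions can also decode decimal columns that are physically
// stored as int32 or int64, not just fixed-length byte arrays.
using var reader = new ParquetFileReader("new_line.parquet");
using var rowGroup = reader.RowGroup(0);
var numRows = checked((int)rowGroup.MetaData.NumRows);
using var column = rowGroup.Column(0).LogicalReader<decimal>();
decimal[] values = column.ReadAll(numRows);
reader.Close();
```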

I just ran into this and hand-jammed the decimals to be 29, and I'm still scratching my head over the whys and wherefores.

new PrimitiveNode("Tot_Drug_Cst", Repetition.Optional, LogicalType.Decimal(29, 2), PhysicalType.ByteArray)

houstonhaynes avatar Aug 02 '24 00:08 houstonhaynes
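The 29 is not arbitrary: C#'s `System.Decimal` carries a 96-bit integer part, which holds at most 29 significant digits, so ParquetSharp pins decimal columns to precision 29. A hedged sketch of the equivalent declaration through the higher-level `Column` API (column name copied from the snippet above):

```csharp
using ParquetSharp;

// Equivalent higher-level schema declaration: the nullable decimal
// maps to Repetition.Optional, and the logical type override pins
// precision 29 / scale 2 as in the PrimitiveNode version.
var column = new Column<decimal?>(
    "Tot_Drug_Cst", LogicalType.Decimal(precision: 29, scale: 2));
```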

I made a start on fixing this a while ago, but it dropped off my priorities, sorry. I think I had it mostly working, so I will try to pick it up again soon.

adamreeve avatar Aug 02 '24 00:08 adamreeve