odbc2parquet
Memory allocation error due to Longtext column
Hey, not entirely sure this issue will make sense, but I'm an absolute Parquet noob and found your tool as a way of dumping data from a MariaDB database to Parquet to hand over to other folks.
I'm encountering a memory allocation error:
# odbc2parquet -vvv query --connection-string "Driver={MySQL ODBC 8.0 Unicode Driver};Server=${MARIADB_HOST};Database=${DATABASE};Uid=${MARIADB_USER};Pwd=${MARIADB_PASS};" --batch-size 100000 --batches-per-file 100 "/tmp/${TABLE}.par" "SELECT * FROM ${TABLE}"
2021-03-19T12:45:21+00:00 - DEBUG - ODBC Environment created.
2021-03-19T12:45:21+00:00 - INFO - Batch size set to 100000
2021-03-19T12:45:21+00:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [114, 101, 115, 101, 108, 108, 101, 114, 95, 105, 100], data_type: Integer, nullability: Nullable }
2021-03-19T12:45:21+00:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: I32 }
[...]
2021-03-19T12:45:21+00:00 - DEBUG - ODBC column description for column 182: ColumnDescription { name: [101, 120, 116, 114, 97, 95, 97, 116, 116, 114, 105, 98, 117, 116, 101, 115], data_type: Other { data_type: SqlDataType(-10), column_size: 65535, decimal_digits: 0 }, nullability: Nullable }
2021-03-19T12:45:21+00:00 - DEBUG - ODBC buffer description for column 182: BufferDescription { nullable: true, kind: Text { max_str_len: 21845 } }
memory allocation of 143165576600000 bytes failed
Aborted (core dumped)
which I'm fairly certain is connected to the following:
2021-03-19T12:45:21+00:00 - DEBUG - ODBC column description for column 166: ColumnDescription { name: [112, 114, 105, 118, 97, 99, 121, 95, 112, 111, 108, 105, 99, 121], data_type: Other { data_type: SqlDataType(-10), column_size: 4294967295, decimal_digits: 0 }, nullability: Nullable }
2021-03-19T12:45:21+00:00 - DEBUG - ODBC buffer description for column 166: BufferDescription { nullable: true, kind: Text { max_str_len: 1431655765 } }
which in MySQL is this
| privacy_policy | longtext | YES | | NULL | |
The factor between the column's max_str_len and the memory allocation is a bit more than 100000, so this appears too connected to be random to me.
I have no influence over the source data, so I will not be able to convince anyone to change the type of this field from LONGTEXT to something more reasonable. The largest entry in this column is 366211 characters, so there's definitely no data in there that would require a memory allocation of 143TB.
I'm not entirely sure why this happens though, hence this issue. The maximum length for an entry in a LONGTEXT column is 4.3 GB, which, again, none of the entries are even close to reaching, but nobody will be touching the column type. But how could this lead to a memory allocation of a bit more than 100,000 times that? I'm guessing the allocation happens somewhere around https://github.com/pacman82/odbc2parquet/blob/master/src/query.rs#L417-L428, given that this is a field of type Other? The entire loop runs through though, as you can see above: the table in question has 182 columns and we see column/buffer descriptions for every one of them. The memory allocation error happens after that.
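For what it's worth, the numbers line up exactly if each of the 100000 rows in a batch reserves max_str_len plus one byte (a terminating zero, I assume, that part is a guess on my end):

# max_str_len of column 166, plus one byte, times the 100000 rows of one batch
echo $(( (1431655765 + 1) * 100000 ))
# -> 143165576600000, exactly the failed allocation from above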
Do you have any ideas of what could be done about this? Would be really nice to dump this data into Parquet but with it randomly crashing right now I'm entirely at a loss :)
I'm running odbc2parquet 0.5.9, installed via cargo on Debian Buster.
If I can provide any more data that could help here I'm completely up for that!
Hello, I can look at this later, but your report has been very helpful so far. May I draw your attention to the --batch-size parameter? Its default value is 100000 and it describes the number of rows transmitted between data source and buffer in one go. This tool aims to lower IO overhead by fetching many rows at once. To do so it allocates enough memory to hold batch_size entries of the largest possible row.
I guess in your case that is too much. You can of course lower the batch size. That is likely to work, because modern machines have stupid amounts of memory, but it will be slower, since a smaller batch means more roundtrips to the database. Currently I have no option for you to provide the domain knowledge of the maximum value size via a command line parameter (would this help you?). Sometimes casting to a fixed VARCHAR() in the SELECT statement helps (if you try it, please tell me how it goes).
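For illustration only, untested on my side, something along these lines; the 1000 and the CHAR(400000) are just values to try out (MariaDB's CAST wants CHAR rather than VARCHAR as the target type, if I remember correctly), and you would of course list the remaining columns of interest as well:

odbc2parquet query --connection-string "Driver={MySQL ODBC 8.0 Unicode Driver};Server=${MARIADB_HOST};Database=${DATABASE};Uid=${MARIADB_USER};Pwd=${MARIADB_PASS};" --batch-size 1000 "/tmp/${TABLE}.par" "SELECT reseller_id, CAST(privacy_policy AS CHAR(400000)) AS privacy_policy FROM ${TABLE}"

The idea being that the driver then reports a bounded column size, so the buffer for that column shrinks from roughly 1.4 GB per row to something on the order of the cast length.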
The very least I can do is to provide a better error message. I am totally open to suggestions and ideas from your side.
Well, do I feel stupid now :D I did actually start specifying the batch size after I ran into this for the first time, but I didn't make the connection at all between the 100000 in that argument and the error, wow. Thank you, that completely makes sense. I re-tried now with a batch size of 2 and 100000 batches per file and that worked completely fine, so that already is a great solution for my problem.
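For reference, the invocation that works for me is essentially the one from above with the adjusted sizes:

odbc2parquet -vvv query --connection-string "Driver={MySQL ODBC 8.0 Unicode Driver};Server=${MARIADB_HOST};Database=${DATABASE};Uid=${MARIADB_USER};Pwd=${MARIADB_PASS};" --batch-size 2 --batches-per-file 100000 "/tmp/${TABLE}.par" "SELECT * FROM ${TABLE}"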
The domain knowledge bit could be interesting, but I'm guessing it's a pain in the ass to implement, and given the very, let's call it "questionable", design of my source DB here, I'm not sure it would really work out great. This isn't the only field that's completely off in its sizing, just the largest of the ones that are off. And since the largest value in there could technically change (given that the data type obviously allows it), that could be tricky for exports that run more frequently. Unless there's some way of determining something like max(length) of the largest columns and using that instead of the max(data_type_length) to keep it dynamic, it could also incur some additional maintenance needs.
I'll have to look into the cast option just out of curiosity, but the smaller batch size is fine for me now, so this would just be to see if I can make it work with that at all. Currently the cast yells at me because of Unicode/ASCII issues, so I'm not entirely sure yet whether that's a feasible way or not.
Yeah, I'm not sure what options you have regarding the error message. Could you "catch" the allocation error and print something along the lines of what you mentioned here, i.e. largest column times batch size? It completely makes sense, it just didn't occur to me at all.
I think you are far from stupid, and I am happy you raised the issue. I do not know how much I can do to make this work out of the box, but at the very least the tool has a UX problem in case of out-of-memory errors.
I could calculate the required size beforehand and give a fair warning if it goes over a certain amount. I also came across this gem in the ODBC 4.0 specification:
https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/ODBC%204.0.md#381-sql_attr_length_exception_behavior
That would allow me to set an upper bound and fetch truncated values later. Documentation is sparse though, and this probably needs support from the driver, too (don't know, just a guess). Yet it is worth checking out.
How would you feel about specifying the --batch-size in memory rather than in number of rows?
That sounds like a great idea, it would make this really easy to plan for. E.g. in my use case this is running in a Kubernetes pod, so I could allocate X GB of RAM to the pod and then use that (minus a buffer) for the batch sizing without running the risk of out-of-memory errors etc. Worst case, the operation just takes longer because more small queries are made against tables that expose pointlessly large column sizes, but it would be almost guaranteed to work :)
So far my strategy for handling these large columns and memory allocations in general is the following (see the back-of-the-envelope numbers after the list):
- Calculate and log the amount of memory required per row
- Make batch_size configurable in terms of memory consumption
- Set default to 1GB
- Warn users if more than that is required for a single row
- Try something fancy using ODBC 4.0 and length exception behaviour
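Back-of-the-envelope with the numbers from this thread, assuming the per-row memory stays dominated by that one LONGTEXT buffer of max_str_len + 1 bytes:

# 1 GiB default budget divided by the ~1.4 GB a single row of that text buffer needs
echo $(( 1024 * 1024 * 1024 / (1431655765 + 1) ))
# -> 0, not even one row fits, which is exactly the case the warning is meant to catch

So for this particular table the default would warn immediately; fitting even a single row takes roughly 1.4 GB unless the column size can be bounded, e.g. via the cast discussed above.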
The newest version allows specifying the desired memory usage. It defaults to 2 GiB on 64-bit platforms. There is still more that could be done, either in terms of streaming large values or in terms of being able to increase the buffer size upon encountering larger values. At least the latter requires driver support, and some research on my side is needed to find out whether this would work with MariaDB.