C++ UDF produces fatal crash with carriage return chars
What happens?
When using a UDF in C++, which manipulates string values, if duckdb::string_t is passed a string which contains \r, \v, etc, the resulting query crashes with error:
{"exception_type":"Invalid Input","exception_message":"Invalid unicode (byte sequence mismatch) detected in value construction"}
This happens when you provide a std::string with \r to the duckdb::string_t constructor, whether or not you received the string from string_t.GetString().
Hint: - I think this may be due to how the string_t(const string &value) is implemented ??
To Reproduce
Create a UDF which returns a string_t type
duckdb::string_t my_udf(duckdb::string_t value) {
// Create a std::string from this value...
std::string copy = value.GetString();
// Do any manipulation you might want on the std::string...
// Create a new string_t value to return back...
duckdb::string_t outputString = duckdb::string_t(copy);
return outputString;
}
Setup connection and add the function
duckdb::DuckDB db(nullptr);
duckdb::Connection con(db);
// Add our DuckDB UDF...
con.CreateScalarFunction<duckdb::string_t, duckdb::string_t>("my_udf", &my_udf);
Execute the Query against a value with \r or other special character
Example file: failing-value.parquet.zip
string query = "SELECT my_udf(\"Class\") AS x FROM 'failing-value.parquet'";
std::unique_ptr<duckdb::QueryResult> result = con.Query(query);
Fetch a data chunk via GetValue and receive error
while (true) {
auto chunk = result->Fetch();
if (!chunk || chunk->size() == 0) {
break;
}
auto finalResult = chunk->GetValue(0, 0); // Only 0,0 because we have 1 value, and 1 chunk in this test...
cout << finalResult.GetValueUnsafe<string>() << endl;
}
Error
libc++abi: terminating due to uncaught exception of type duckdb::InvalidInputException: {"exception_type":"Invalid Input","exception_message":"Invalid unicode (byte sequence mismatch) detected in value construction"}
OS:
macOS
DuckDB Version:
1.0.0
DuckDB Client:
C++
Full Name:
Ryan Melehan
Affiliation:
Coco Alemana, Inc
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- [X] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [X] Yes, I have
I noticed this might be a function of string inlining, where a new pointer is created for non inlined strings?
Also, I noticed that the test suite does not take into account functions that modify string values and return new ones:
https://github.com/duckdb/duckdb/blob/10ea4832d3f1850685a65369e0b19c27ec81e638/test/api/udf_function/udf_functions_to_test.hpp#L117-L125
Only returning the original value...
Am I incorrect here?
This is blocking critical development for us, so any help would be much appreciated !
Any updates on this @szarnyasg ? I can provide any further information that you or your colleagues require.
Hi @prmelehan, the issue was added to our internal tracker but has not yet been assigned to an engineer. Unfortunately, we cannot provide a timeline for the fix. If you are interested in priority support, please contact DuckDB Labs via https://duckdblabs.com/contact/.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This unfortunately still occurs
as you mentioned, when it's inline the buffer is part of string_t, but when it's not it's just pointing to an address, that may no exists after the UDF is returned.
If you changed that to a const char * it will work for example.
An alternative could be to use the StringHeap and destroy it after the UDF is called?
Please note that unique_ptr is not needed, here, i was just testing...
#include "duckdb.hpp"
#include <memory>
#include <stdio.h>
#include <cstdio>
using namespace duckdb;
static unique_ptr<duckdb::StringHeap> my_heap;
static duckdb::string_t my_udf(duckdb::string_t value) {
//you can copy the string as before in your test.
std::string ars = "This is not inlined anymore - ";
return my_heap->AddString(ars);
}
int main() {
DuckDB db(nullptr);
Connection con(db);
my_heap = make_uniq<duckdb::StringHeap>();
// Add our DuckDB UDF...
con.CreateScalarFunction<duckdb::string_t, duckdb::string_t>("my_udf", &my_udf);
string query = "SELECT my_udf(\"Class\") AS x FROM 'failing-value.parquet'";
std::unique_ptr<duckdb::QueryResult> result = con.Query(query);
while (true) {
auto chunk = result->Fetch();
if (!chunk || chunk->size() == 0) {
break;
}
auto finalResult = chunk->GetValue(0, 0); // Only 0,0 because we have 1 value, and 1 chunk in this test...
printf("Result -> %s\n", finalResult.GetValueUnsafe<string>().c_str());
}
my_heap->Destroy();
}
as you mentioned, when it's inline the buffer is part of string_t, but when it's not it's just pointing to an address, that may no exists after the UDF is returned.
If you changed that to a const char * it will work for example.
An alternative could be to use the StringHeap and destroy it after the UDF is called?
Please note that unique_ptr is not needed, here, i was just testing...
#include "duckdb.hpp" #include <memory> #include <stdio.h> #include <cstdio> using namespace duckdb; static unique_ptr<duckdb::StringHeap> my_heap; static duckdb::string_t my_udf(duckdb::string_t value) { //you can copy the string as before in your test. std::string ars = "This is not inlined anymore - "; return my_heap->AddString(ars); } int main() { DuckDB db(nullptr); Connection con(db); my_heap = make_uniq<duckdb::StringHeap>(); // Add our DuckDB UDF... con.CreateScalarFunction<duckdb::string_t, duckdb::string_t>("my_udf", &my_udf); string query = "SELECT my_udf(\"Class\") AS x FROM 'failing-value.parquet'"; std::unique_ptr<duckdb::QueryResult> result = con.Query(query); while (true) { auto chunk = result->Fetch(); if (!chunk || chunk->size() == 0) { break; } auto finalResult = chunk->GetValue(0, 0); // Only 0,0 because we have 1 value, and 1 chunk in this test... printf("Result -> %s\n", finalResult.GetValueUnsafe<string>().c_str()); } my_heap->Destroy(); }
Hello Icostantino and all,
Actually while i building this segment codes under visual studio 2022 with DuckDB v1.1.3, a error throwing at link stage:
main.obj : error LNK2019: 无法解析的外部符号 "public: static void __cdecl duckdb::UDFWrapper::RegisterFunction(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,class duckdb::vector<struct duckdb::LogicalType,1>,struct duckdb::LogicalType,class std::function<void __cdecl(class duckdb::DataChunk &,struct duckdb::ExpressionState &,class duckdb::Vector &)>,class duckdb::ClientContext &,struct duckdb::LogicalType)" (?RegisterFunction@UDFWrapper@duckdb@@SAXV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@V?$vector@ULogicalType@duckdb@@$00@2@ULogicalType@2@V?$function@$$A6AXAEAVDataChunk@duckdb@@AEAUExpressionState@2@AEAVVector@2@@Z@4@AEAVClientContext@2@2@Z),函数 "public: static void __cdecl duckdb::UDFWrapper::RegisterFunction<struct duckdb::string_t,struct duckdb::string_t>(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &,class std::function<void __cdecl(class duckdb::DataChunk &,struct duckdb::ExpressionState &,class duckdb::Vector &)>,class duckdb::ClientContext &,struct duckdb::LogicalType)" (??$RegisterFunction@Ustring_t@duckdb@@U12@@UDFWrapper@duckdb@@SAXAEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@V?$function@$$A6AXAEAVDataChunk@duckdb@@AEAUExpressionState@2@AEAVVector@2@@Z@3@AEAVClientContext@1@ULogicalType@1@@Z) 中引用了该符号
have you met this error? and kindly tell/help me to resolve this issue? BTW, if i comment this call CreateScalarFunction, this building will alright.
Looking for any reply, thank you so much!
After a quick look through the udf wrapper code, this just looks broken.
There is no overload for string_t to perform a call to StringVector::AddString(result, returned_string) which makes this inherently unsafe for any string over 12 bytes (maximum inlined size)
After a quick look through the udf wrapper code, this just looks broken. There is no overload for
string_tto perform a call toStringVector::AddString(result, returned_string)which makes this inherently unsafe for any string over 12 bytes (maximum inlined size)
I'm not sure what you mean. Is a StringHeap not an StringVector
Btw, this is just an example, if the string_t can be inlined it should be returned instead.
string_t StringHeap::AddString(const char *data, idx_t len) { D_ASSERT(Utf8Proc::Analyze(data, len) != UnicodeType::INVALID); return AddBlob(data, len); }
string_t StringHeap::AddString(const char *data) { return AddString(data, strlen(data)); }
string_t StringHeap::AddString(const string &data) { return AddString(data.c_str(), data.size()); }
string_t StringHeap::AddString(const string_t &data) { return AddString(data.GetData(), data.GetSize()); }
string_t StringHeap::AddString(const char *data, idx_t len) { D_ASSERT(Utf8Proc::Analyze(data, len) != UnicodeType::INVALID); return AddBlob(data, len); }
I'm not talking about your comments, I'm talking about the issue at hand, the system should be making sure the strings returned by the UDF are owned by the Vector. The C++ UDF code could use some love, but I don't know when we (labs) will get around to it as it's not currently a high priority
Using a static StringHeap in your udf code is not a solution