duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

C++ UDF produces fatal crash with carriage return chars

Open prmelehan opened this issue 1 year ago • 3 comments

What happens?

When using a UDF in C++, which manipulates string values, if duckdb::string_t is passed a string which contains \r, \v, etc, the resulting query crashes with error:


{"exception_type":"Invalid Input","exception_message":"Invalid unicode (byte sequence mismatch) detected in value construction"}

This happens when you provide a std::string with \r to the duckdb::string_t constructor, whether or not you received the string from string_t.GetString().

Hint: - I think this may be due to how the string_t(const string &value) is implemented ??

To Reproduce

Create a UDF which returns a string_t type


duckdb::string_t my_udf(duckdb::string_t value) {
    
    // Create a std::string from this value...
    std::string copy = value.GetString();
    
    // Do any manipulation you might want on the std::string...
    
    
    // Create a new string_t value to return back...
    duckdb::string_t outputString = duckdb::string_t(copy);
    
    return outputString;
}

Setup connection and add the function


duckdb::DuckDB db(nullptr);
duckdb::Connection con(db);

// Add our DuckDB UDF...
con.CreateScalarFunction<duckdb::string_t, duckdb::string_t>("my_udf", &my_udf);

Execute the Query against a value with \r or other special character

Example file: failing-value.parquet.zip


string query = "SELECT my_udf(\"Class\") AS x FROM 'failing-value.parquet'";
    
std::unique_ptr<duckdb::QueryResult> result = con.Query(query);

Fetch a data chunk via GetValue and receive error

while (true) {
        auto chunk = result->Fetch();
        if (!chunk || chunk->size() == 0) {
            break;
        }
        
        auto finalResult = chunk->GetValue(0, 0); // Only 0,0 because we have 1 value, and 1 chunk in this test...
        
        cout << finalResult.GetValueUnsafe<string>() << endl;
}

Error

libc++abi: terminating due to uncaught exception of type duckdb::InvalidInputException: {"exception_type":"Invalid Input","exception_message":"Invalid unicode (byte sequence mismatch) detected in value construction"}

OS:

macOS

DuckDB Version:

1.0.0

DuckDB Client:

C++

Full Name:

Ryan Melehan

Affiliation:

Coco Alemana, Inc

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • [X] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [X] Yes, I have

prmelehan avatar Aug 21 '24 20:08 prmelehan

I noticed this might be a function of string inlining, where a new pointer is created for non inlined strings?

Also, I noticed that the test suite does not take into account functions that modify string values and return new ones:

https://github.com/duckdb/duckdb/blob/10ea4832d3f1850685a65369e0b19c27ec81e638/test/api/udf_function/udf_functions_to_test.hpp#L117-L125

Only returning the original value...

Am I incorrect here?

This is blocking critical development for us, so any help would be much appreciated !

prmelehan avatar Aug 22 '24 04:08 prmelehan

Any updates on this @szarnyasg ? I can provide any further information that you or your colleagues require.

prmelehan avatar Aug 23 '24 03:08 prmelehan

Hi @prmelehan, the issue was added to our internal tracker but has not yet been assigned to an engineer. Unfortunately, we cannot provide a timeline for the fix. If you are interested in priority support, please contact DuckDB Labs via https://duckdblabs.com/contact/.

szarnyasg avatar Aug 23 '24 04:08 szarnyasg

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Nov 22 '24 00:11 github-actions[bot]

This unfortunately still occurs

prmelehan avatar Nov 22 '24 00:11 prmelehan

as you mentioned, when it's inline the buffer is part of string_t, but when it's not it's just pointing to an address, that may no exists after the UDF is returned.

If you changed that to a const char * it will work for example.

An alternative could be to use the StringHeap and destroy it after the UDF is called?

Please note that unique_ptr is not needed, here, i was just testing...

#include "duckdb.hpp"
#include <memory>
#include <stdio.h>
#include <cstdio>
using namespace duckdb;

static unique_ptr<duckdb::StringHeap> my_heap;

static duckdb::string_t my_udf(duckdb::string_t value) {
//you can copy the string as before in your test.
	std::string ars = "This is not inlined anymore - "; 
	return my_heap->AddString(ars);
}
int main() {
	DuckDB db(nullptr);

	Connection con(db);
	my_heap = make_uniq<duckdb::StringHeap>();
	// Add our DuckDB UDF...
	con.CreateScalarFunction<duckdb::string_t, duckdb::string_t>("my_udf", &my_udf);
	string query = "SELECT my_udf(\"Class\") AS x FROM 'failing-value.parquet'";

	std::unique_ptr<duckdb::QueryResult> result = con.Query(query);
	while (true) {
		auto chunk = result->Fetch();
		if (!chunk || chunk->size() == 0) {
			break;
		}
		auto finalResult = chunk->GetValue(0, 0); // Only 0,0 because we have 1 value, and 1 chunk in this test...
		printf("Result -> %s\n", finalResult.GetValueUnsafe<string>().c_str());
	}

	my_heap->Destroy();
}

lcostantino avatar Jan 07 '25 21:01 lcostantino

as you mentioned, when it's inline the buffer is part of string_t, but when it's not it's just pointing to an address, that may no exists after the UDF is returned.

If you changed that to a const char * it will work for example.

An alternative could be to use the StringHeap and destroy it after the UDF is called?

Please note that unique_ptr is not needed, here, i was just testing...

#include "duckdb.hpp"
#include <memory>
#include <stdio.h>
#include <cstdio>
using namespace duckdb;

static unique_ptr<duckdb::StringHeap> my_heap;

static duckdb::string_t my_udf(duckdb::string_t value) {
//you can copy the string as before in your test.
	std::string ars = "This is not inlined anymore - "; 
	return my_heap->AddString(ars);
}
int main() {
	DuckDB db(nullptr);

	Connection con(db);
	my_heap = make_uniq<duckdb::StringHeap>();
	// Add our DuckDB UDF...
	con.CreateScalarFunction<duckdb::string_t, duckdb::string_t>("my_udf", &my_udf);
	string query = "SELECT my_udf(\"Class\") AS x FROM 'failing-value.parquet'";

	std::unique_ptr<duckdb::QueryResult> result = con.Query(query);
	while (true) {
		auto chunk = result->Fetch();
		if (!chunk || chunk->size() == 0) {
			break;
		}
		auto finalResult = chunk->GetValue(0, 0); // Only 0,0 because we have 1 value, and 1 chunk in this test...
		printf("Result -> %s\n", finalResult.GetValueUnsafe<string>().c_str());
	}

	my_heap->Destroy();
}

Hello Icostantino and all,

Actually while i building this segment codes under visual studio 2022 with DuckDB v1.1.3, a error throwing at link stage:

main.obj : error LNK2019: 无法解析的外部符号 "public: static void __cdecl duckdb::UDFWrapper::RegisterFunction(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,class duckdb::vector<struct duckdb::LogicalType,1>,struct duckdb::LogicalType,class std::function<void __cdecl(class duckdb::DataChunk &,struct duckdb::ExpressionState &,class duckdb::Vector &)>,class duckdb::ClientContext &,struct duckdb::LogicalType)" (?RegisterFunction@UDFWrapper@duckdb@@SAXV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@V?$vector@ULogicalType@duckdb@@$00@2@ULogicalType@2@V?$function@$$A6AXAEAVDataChunk@duckdb@@AEAUExpressionState@2@AEAVVector@2@@Z@4@AEAVClientContext@2@2@Z),函数 "public: static void __cdecl duckdb::UDFWrapper::RegisterFunction<struct duckdb::string_t,struct duckdb::string_t>(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &,class std::function<void __cdecl(class duckdb::DataChunk &,struct duckdb::ExpressionState &,class duckdb::Vector &)>,class duckdb::ClientContext &,struct duckdb::LogicalType)" (??$RegisterFunction@Ustring_t@duckdb@@U12@@UDFWrapper@duckdb@@SAXAEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@V?$function@$$A6AXAEAVDataChunk@duckdb@@AEAUExpressionState@2@AEAVVector@2@@Z@3@AEAVClientContext@1@ULogicalType@1@@Z) 中引用了该符号

have you met this error? and kindly tell/help me to resolve this issue? BTW, if i comment this call CreateScalarFunction, this building will alright.

Looking for any reply, thank you so much!

yiivon avatar Jan 08 '25 17:01 yiivon

After a quick look through the udf wrapper code, this just looks broken. There is no overload for string_t to perform a call to StringVector::AddString(result, returned_string) which makes this inherently unsafe for any string over 12 bytes (maximum inlined size)

Tishj avatar Jan 08 '25 19:01 Tishj

After a quick look through the udf wrapper code, this just looks broken. There is no overload for string_t to perform a call to StringVector::AddString(result, returned_string) which makes this inherently unsafe for any string over 12 bytes (maximum inlined size)

I'm not sure what you mean. Is a StringHeap not an StringVector

Btw, this is just an example, if the string_t can be inlined it should be returned instead.

string_t StringHeap::AddString(const char *data, idx_t len) { D_ASSERT(Utf8Proc::Analyze(data, len) != UnicodeType::INVALID); return AddBlob(data, len); }

string_t StringHeap::AddString(const char *data) { return AddString(data, strlen(data)); }

string_t StringHeap::AddString(const string &data) { return AddString(data.c_str(), data.size()); }

string_t StringHeap::AddString(const string_t &data) { return AddString(data.GetData(), data.GetSize()); }

string_t StringHeap::AddString(const char *data, idx_t len) { D_ASSERT(Utf8Proc::Analyze(data, len) != UnicodeType::INVALID); return AddBlob(data, len); }

lcostantino avatar Jan 08 '25 19:01 lcostantino

I'm not talking about your comments, I'm talking about the issue at hand, the system should be making sure the strings returned by the UDF are owned by the Vector. The C++ UDF code could use some love, but I don't know when we (labs) will get around to it as it's not currently a high priority

Using a static StringHeap in your udf code is not a solution

Tishj avatar Jan 09 '25 20:01 Tishj