
Valid parameters with utf8 flag causing fatal errors

Open The-Builder opened this issue 6 years ago • 6 comments

Several methods (e.g. connection, fetch) return an unspecific error when they are given strings with the utf8 flag set. Configuration values are often read from YAML files, and these, like any other parameters, may carry the utf8 flag even though the string is plain ASCII.
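
For illustration, a quick way to check whether a value loaded from a config file arrives with the utf8 flag set (a minimal sketch, assuming YAML::XS as the loader; the exact flagging behaviour depends on the loader used):

use YAML::XS qw(Load);

my $config = Load("host: localhost\n");
print utf8::is_utf8( $config->{host} )
    ? "utf8 flag is set\n"
    : "utf8 flag is not set\n";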

Reproduce (test.pl):

use Kafka::Connection;
my $h = 'localhost';

Setting the utf8 flag generates the error:

utf8::upgrade($h);
print utf8::is_utf8($h) ? "isUtf8\n" : "isNOTUtf8\n";

Kafka::Connection->new(host => $h);
exit;

Output without the flag:

~/temp$ ./test.pl
isNOTUtf8

Output with the flag being set:

~/temp$ ./test.pl
isUtf8
Invalid argument: host

Trace begun at /opt/perl-5.12.3/lib/site_perl/5.12.3/Kafka/Connection.pm line 1592
Kafka::Connection::_error('Kafka::Connection=HASH(0x2536cf0)', -1000, 'host') called at /opt/perl-5.12.3/lib/site_perl/5.12.3/Kafka/Connection.pm line 447
Kafka::Connection::new('Kafka::Connection', 'host', 'localhost') called at test.pl line 28

Workaround: use utf8::downgrade() on the parameters before passing them to the Kafka methods.
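
A minimal sketch of this workaround, assuming a hypothetical %config hash holding values that arrived with the utf8 flag set:

use strict;
use warnings;
use Kafka::Connection;

my %config = ( host => 'localhost' );
utf8::upgrade( $config{host} );    # simulate a value that arrived with the utf8 flag set

# Clear the flag in place; safe here because the string is plain ASCII.
utf8::downgrade( $config{host} );

my $connection = Kafka::Connection->new( host => $config{host} );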

Ikegami strongly discourages using is_utf8() anywhere but in debugging/development: https://stackoverflow.com/questions/14579560/am-i-using-utf8is-utf8-correctly

Here Kafka should probably leave the parameter error handling to e.g. the network layer.

Also, error messages should be more explicit, telling the user exactly what went wrong.

Thanks to @eserte who helped me with debugging this case.

The-Builder avatar Jan 11 '19 19:01 The-Builder

Passing Unicode data down to the network or protocol internals causes even more obscure errors, and this is why there is an argument validation step to prevent accidental use of utf8 data. Its error message is not adequate, though, so I'm fixing this to throw a better error message instead.

As for preventing the issue, the answer is simple: use Encode::encode to convert utf8 data (if you need or expect it) into bytes before calling any methods.
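
A minimal sketch of that approach, assuming a value that may genuinely contain non-ASCII characters (the variable names are illustrative only):

use Encode qw(encode);

my $client_id = "t\x{0117}st-client";          # character string containing a codepoint >= \x{0100}
my $bytes     = encode( 'UTF-8', $client_id ); # plain octets, safe to pass on

# ... pass $bytes rather than $client_id to the Kafka methods.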

asolovey avatar Jan 19 '19 20:01 asolovey

This is what happens with your example on bf76926:

$ perl -E 'use Kafka::Connection; my $h = q(localhost); utf8::upgrade($h); Kafka::Connection->new(host => $h);'
Unicode data is not allowed: host

Trace begun at lib/Kafka/Connection.pm line 1612
Kafka::Connection::_error('Kafka::Connection=HASH(0x7fffcd6caf58)', -1014, 'host') called at lib/Kafka/Connection.pm line 464
Kafka::Connection::new('Kafka::Connection', 'host', 'localhost') called at -e line 1

The error message says that Unicode data is not allowed and identifies which argument is invalid: Unicode data is not allowed: host.

asolovey avatar Jan 19 '19 21:01 asolovey

I understand your point about catching any possible causes of otherwise hard-to-detect errors at the protocol level as early as possible. On the other hand, getting a Unicode error when passing a seemingly pure ASCII string might be just as puzzling to an unsuspecting user who has little or no knowledge of encodings at all.

As a convenience for the user, I would suggest instead checking whether high bytes are present at all and, if not, simply downgrading the string.
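
Something along these lines could work, as a sketch (a hypothetical helper, not existing code in the module):

use strict;
use warnings;

# Downgrade strings that contain no characters above \xFF and reject only
# genuinely wide data.
sub _bytes_or_die {
    my ($value) = @_;
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string holds characters above \xFF.
    utf8::downgrade( $value, 1 )
        or die "Unicode data is not allowed\n";
    return $value;    # a copy of the argument, now guaranteed flag-free
}

my $host = _bytes_or_die('localhost');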

The-Builder avatar Jan 20 '19 16:01 The-Builder

host now has the changed error message, but broker_list does not.

Anyway, I am still not convinced that the utf8 flag is problematic here. We're talking about hostnames, which are usually ASCII anyway. If the string is ASCII, it shouldn't matter whether the utf8 flag is set or not. In fact, I tried removing some of the utf8::is_utf8 checks from the code and it still seems to work.

eserte avatar Jan 31 '19 10:01 eserte

While it may seem inconvenient sometimes, the current behavior was (AFAIK) designed to be consistent and predictable. If there may be Unicode data in the input, it should be taken care of before calling this API. Otherwise the API may cause subtle failures where the same data flow works and then, suddenly, does not, depending on the input data. But the input data may be perfectly valid and, if encoded properly by the code that provides it (and thus has more knowledge about it), would be processed just fine every time.

Note that some languages are able to enforce this requirement by using a different input data type, e.g. byte[] instead of string, so the user is forced to encode data before calling such an API. We do not have this option in Perl, though, so validation can only happen at runtime.

I also understand that the server host and broker list may seem to be an exception, but it is hard to draw a line - some of these values come from the protocol metadata and can never be Unicode, so allowing Unicode in the user-supplied values feels like an inconsistency.

asolovey avatar Feb 04 '19 05:02 asolovey

Just two notes on this topic:

  • At work we use the same approach for storing database or middleware configuration data: configuration for hostnames/users/tables/queues/topics/... is in YAML files. This worked without problems for many systems (MySQL, MongoDB, ActiveMQ, RabbitMQ) — Kafka.pm was the first module which gave us the surprise with the problematic utf8 flag.

  • Perl itself also has checks in some functions against problematic Unicode data — but the approach is different here: having the utf8 flag set is not problematic per se. It only becomes a problem if a function that can only deal with bytes (e.g. syswrite, sysread, gethostbyname) is called with codepoints ≥ \x{0100}; only in that case does perl throw an exception (see the sketch below this list). Maybe this is an approach that Kafka.pm could also adopt?
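
A small self-contained demonstration of that difference, as a sketch (any reasonably recent perl; no Kafka involved):

use strict;
use warnings;

my $ascii = "localhost";
utf8::upgrade($ascii);        # utf8 flag set, content still pure ASCII
syswrite( STDOUT, $ascii );   # fine: no complaint from perl

my $wide = "h\x{263A}st";     # contains a codepoint >= \x{0100}
syswrite( STDOUT, $wide );    # fatal: "Wide character in syswrite"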

eserte avatar Feb 04 '19 14:02 eserte