rabbitmq-server icon indicating copy to clipboard operation
rabbitmq-server copied to clipboard

Ensure that strings from external sources eventually end up as utf-8 encoded binaries

Open lukebakken opened this issue 3 years ago • 8 comments

This is necessary since cmd.exe uses ISO-8859-1 encoding and directories can have latin1 characters, like RabbitMQ Sérvér

Follow-up to #5486

Discovered by @dumbbell

lukebakken avatar Aug 18 '22 18:08 lukebakken

AFAIK, Windows enforces filename encoding to UTF-16 (LE?) on disk. Our use of cmd.exe is the source of the problem because I think it uses legacy encodings like ISO-8859-1(5) on my french Windows to maintain backward compatibility. Therefore I suspect it performs string conversions.

Linux and other similar Unix don't enforce anything. Distributions all default to a *.UTF-8 locale and most applications expect to deal with UTF-8-encoded filenames. But non-UTF-8-encoded filenames are still possible (while still using the default locale) and applications will display a "square character" or question mark when they fail to decode that non-UTF-8 string at best, or stop working.

Erlang strings are supposed to be lists of Unicode codepoints. I think RabbitMQ should convert input strings to Unicode the earliest possible in key places, like when it reads the output of rabbitmq-env-conf.bat and expect to work with Unicode strings and UTF-8-encoded binaries everywhere internally.

Therefore to solve the problem with the JSON encoding crash, I would prefer to convert whatever we try to JSON-encode at the time we get/read that string instead. In the end, there should be nothing to convert on Linux and a few strings we get because of our use of cmd.exe on Windows.

dumbbell avatar Aug 19 '22 07:08 dumbbell

@michaelklishin I'll check that out, thanks.

@dumbbell I still got the encoding error with the fix for the output of rabbitmq-env-conf.bat. I'll see if I can figure out how that latin1 string made it to the JSON encoder.

lukebakken avatar Aug 19 '22 12:08 lukebakken

@michaelklishin that bug is fixed. [] was incorrectly being passed as options to thoas.

lukebakken avatar Aug 19 '22 13:08 lukebakken

Thanks @dumbbell. I will continue reviewing our use of list_to_binary/1 to see if there are other places that should be updated. I will also test using the data given in https://github.com/rabbitmq/rabbitmq-server/discussions/5805

cc @michaelklishin

lukebakken avatar Sep 20 '22 15:09 lukebakken

Currently testing with RABBITMQ_BASE set to C:\Users\bakkenl\Евгений

lukebakken avatar Sep 21 '22 16:09 lukebakken

Aha!

2022-09-21T09:58:43.174000-07:00 info: FORMATTER CRASH: {"Stopping message store for directory '~s'",[[99,58,47,85,115,101,114,115,47,98,97,107,107,101,110,108,47,1045,1074,1075,1077,1085,1080,1081,47,100,98,47,114,97,98,98,105,116,64,98,97,107,107,101,110,108,45,122,48,49,45,109,110,101,115,105,97,47,109,115,103,95,115,116,111,114,101,115,47,118,104,111,115,116,115,47,57,69,82,72,66,81,72,57,52,72,52,83,54,89,55,68,90,66,49,77,88,73,78,67,48,47,109,115,103,95,115,116,111,114,101,95,112,101,114,115,105,115,116,101,110,116]]}

lukebakken avatar Sep 21 '22 17:09 lukebakken

Running into cuttlefish errors as well when setting configuration paths using unicode:

rabbitmq.conf.txt

lukebakken avatar Sep 21 '22 17:09 lukebakken

OK to keep the scope reasonable for this PR, I think I will deal with the format crash above, and leave the other unicode issues to the future, since I think fixes in cuttlefish will be required.

lukebakken avatar Sep 21 '22 17:09 lukebakken

@dumbbell @michaelklishin - I fixed all the format strings dealing with vhost names, since I tested with a vhost named RabbitMQ Sérvér/Евгений (with tags Евгений, foo bar, Sérvér).

My guess is that all ~s and ~p format strings in the codebase can be changed to ~ts and ~tp.

In order to see the unicode correctly in the werl.exe window, I had to set the following:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS=+pc unicode

I of course set the following as a system env var to test everything out:

RABBITMQ_BASE=C:\ProgramData\RabbitMQ Sérvér\Евгений

Config files from that location: Евгений.zip

PS I used this branch to fix some of ra's output - https://github.com/rabbitmq/ra/pull/305

I'm going to do a similar set of tests on Linux

lukebakken avatar Sep 23 '22 00:09 lukebakken

Here is how I tested on Linux:

  • Create a generic unix package from this branch
  • Create a directory like /tmp/RabbitMQ Sérvér - Евгений
  • Copy the configuration files from here into that directory. You'll have to edit rabbitmq-env.conf to point the base dir to whatever you created above.
  • Note that unicode is still not allowed in rabbitmq.conf, due to a bug in cuttlefish. I'm going to work on that next.
  • Start the generic unix package this way:
    RABBITMQ_CONF_ENV_FILE="/path/to/rabbitmq-env.conf" ./sbin/rabbitmq-plugins enable rabbitmq_management rabbitmq_top
    LOG=debug RABBITMQ_ALLOW_INPUT=true RABBITMQ_CONF_ENV_FILE="/path/to/rabbitmq-env.conf" ./sbin/rabbitmq-server
    
  • Investigate the output and log files to ensure there aren't any lists of integers or mis-encoded unicode strings
  • Click around the management UI to ensure no 500s are returned, and that strings look correct
  • Create a vhost named using strings with non-ASCII, non latin characters, like RabbitMQ Sérvér/Евгений. I also used Евгений for the description and a tag value like Евгений, foo bar baz , Sérvér, which correctly ends up as three tags:
    $ ./sbin/rabbitmqctl list_vhosts name tags
    Listing vhosts ...
    name    tags
    /       []
    RabbitMQ Sérvér/Евгений [Евгений, foo bar baz, Sérvér]
    

lukebakken avatar Sep 23 '22 20:09 lukebakken

I run into the following exception when following the QA instructions. First, here is my /tmp/Евгений/rabbitmq-env.conf which I simplified since the exception seems to happen when the node attempts to load and parse it:

ADVANCED_CONFIG_FILE="/tmp/Евгений/advanced.config"
CONFIG_FILE="/tmp/Евгений/rabbitmq.conf"
ENABLED_PLUGINS_FILE="/tmp/Евгений/enabled_plugins"
LOG_BASE="/tmp/Евгений/log"
MNESIA_BASE="/tmp/Евгений/db"
SERVER_ADDITIONAL_ERL_ARGS='+pc unicode'

klishin-error-5551.txt

michaelklishin avatar Sep 24 '22 17:09 michaelklishin

Thanks @michaelklishin I'll check into that!

lukebakken avatar Sep 24 '22 18:09 lukebakken

@michaelklishin I'm not exactly sure how, but commit https://github.com/rabbitmq/rabbitmq-server/pull/5551/commits/3c1c7ada5898b8512642af2faf203a0da6e6b976 appears to have fixed the issue. I made those changes to get some additional output as to what the problem was in unicode:characters_to_list/1 but the issue seems to be resolved 🤔

Let me know if it works in your env, thanks!

2022-09-24 12:25:36.792985-07:00 [debug] <0.140.0> Sourcing $RABBITMQ_CONF_ENV_FILE: /tmp/Евгений/rabbitmq-env.conf
2022-09-24 12:25:36.796221-07:00 [debug] <0.140.0> $RABBITMQ_CONF_ENV_FILE exit status: 0
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0> $RABBITMQ_CONF_ENV_FILE output:
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + . /tmp/Евгений/rabbitmq-env.conf
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + ADVANCED_CONFIG_FILE=/tmp/Евгений/advanced.config
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + CONFIG_FILE=/tmp/Евгений/rabbitmq.conf
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + ENABLED_PLUGINS_FILE=/tmp/Евгений/enabled_plugins
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + LOG_BASE=/tmp/Евгений/log
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + MNESIA_BASE=/tmp/Евгений/db
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + SERVER_ADDITIONAL_ERL_ARGS=+pc unicode
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + echo -----VARS-PID-2296543-----
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   -----VARS-PID-2296543-----
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   + set
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   ADVANCED_CONFIG_FILE='/tmp/Евгений/advanced.config'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   ASDF_DIR='/home/lbakken/.asdf'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   BINDIR='/home/lbakken/.asdf/installs/erlang/25.1/erts-13.1/bin'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   CLOUDSDK_PYTHON='/usr/bin/python'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   CLOUDSDK_PYTHON_ARGS='-S'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   CLOUDSDK_ROOT_DIR='/opt/google-cloud-sdk'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   CONFIG_FILE='/tmp/Евгений/rabbitmq.conf'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   CONF_ENV_FILE_PHASE='rabbtimq-prelaunch'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   DBUS_SESSION_BUS_ADDRESS='unix:path=/run/user/1000/bus'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   DOCKER_BUILDKIT='1'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   DOTNET_BUNDLE_EXTRACT_BASE_DIR='/home/lbakken/.cache/dotnet_bundle_extract'
2022-09-24 12:25:36.796635-07:00 [debug] <0.140.0>   DOTNET_ROOT='/usr/share/dotnet'

lukebakken avatar Sep 24 '22 19:09 lukebakken

Yup, it works as expected now 👍

michaelklishin avatar Sep 25 '22 23:09 michaelklishin

Finally backported for 3.11.5.

michaelklishin avatar Dec 02 '22 10:12 michaelklishin