Persistence does not seem to handle IDs correctly cross-endian
Hello there. I'm trying to package Mosquitto 2.0.14 for Adélie Linux, which has a focus on building software for many supported CPUs. While running the test suite on 64-bit IBM Power (big endian), we've had some test failures:
Suite: Persist read
Test: Persistence disabled ...passed
Test: Empty file ...passed
Test: Corrupt header ...passed
Test: Unsupported version ...passed
Test: v3 config ok ...FAILED
1. persist_read_test.c:105 - CU_ASSERT_EQUAL(db.last_db_id,0x7856341200000000)
Test: v3 config bad truncated ...passed
Test: v3 config bad dbid ...passed
Test: v3 bad chunk ...FAILED
1. persist_read_test.c:177 - CU_ASSERT_EQUAL(db.last_db_id,0x17)
Test: v3 message store ...FAILED
1. persist_read_test.c:199 - CU_ASSERT_EQUAL(db.msg_store->db_id,1)
Test: v3 client ...passed
Test: v3 client message ...passed
Test: v3 retain ...FAILED
1. persist_read_test.c:313 - CU_ASSERT_EQUAL(db.msg_store->db_id,0x54)
Test: v3 sub ...passed
Test: v4 config ok ...FAILED
1. persist_read_test.c:123 - CU_ASSERT_EQUAL(db.last_db_id,0x7856341200000000)
Test: v4 message store ...FAILED
1. persist_read_test.c:392 - CU_ASSERT_EQUAL(db.msg_store->db_id,0xFEDCBA9876543210)
Test: v5 client ...passed
Test: v5 config bad truncated ...passed
Test: v5 bad chunk ...FAILED
1. persist_read_test.c:459 - CU_ASSERT_EQUAL(db.last_db_id,0x17)
Test: v6 config ok ...FAILED
1. persist_read_test.c:423 - CU_ASSERT_EQUAL(db.last_db_id,0x7856341200000000)
Test: v6 message store ...FAILED
1. persist_read_test.c:481 - CU_ASSERT_EQUAL(db.msg_store->db_id,1)
Test: v6 message store+props ...FAILED
1. persist_read_test.c:521 - CU_ASSERT_EQUAL(db.msg_store->db_id,1)
Test: v6 client ...passed
Test: v6 client message ...passed
Test: v6 client message+props ...passed
Test: v6 retain ...FAILED
1. persist_read_test.c:723 - CU_ASSERT_EQUAL(db.msg_store->db_id,0x54)
Test: v6 sub ...passed
Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites      1      1    n/a      0        0
               tests     26     26     15     11        0
             asserts    222    222    211     11      n/a
Elapsed time = 0.001 seconds
make[1]: *** [Makefile:177: test-broker] Error 11
I added a simple printf to the first test, v3 config ok, to print the actual value:
Test: v3 config ok ...12345678
FAILED
1. persist_read_test.c:105 - CU_ASSERT_EQUAL(db.last_db_id,0x7856341200000000)
Looks like we are not handling endianness correctly somewhere. I've noticed that the majority of the persistence layer uses htonX when writing and ntohX when reading, but this is not done for the last_db_id field, which is written in host byte order. I'm not sure of the best way to approach this from a compatibility point of view; the obvious fix (writing the field in network byte order like everything else) would likely break restoring existing DBs on little-endian machines, which are in all likelihood the larger audience. Perhaps we could instead byte-swap the last_db_id field when running on BE machines? I'm thinking of something akin to the Linux kernel's leXX_to_cpu macros.
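To make the idea concrete, here is a minimal sketch of what I have in mind; the helper names and the runtime endianness check are my own and do not exist in Mosquitto today. The conversion is a no-op on little-endian hosts and a byte swap on big-endian ones, so the on-disk layout that LE machines have always written stays unchanged, and the same operation works for writing too (cpu_to_le64 would be identical).

/* Sketch only -- these helpers are hypothetical.  The idea is the
 * kernel-style le64_to_cpu(): a no-op on little-endian hosts, a byte swap
 * on big-endian hosts, so persistence files written by LE machines keep
 * loading unchanged. */
#include <stdint.h>

static int host_is_little_endian(void)
{
    const uint16_t probe = 1;

    return *(const uint8_t *)&probe == 1;
}

static uint64_t persist__le64_to_cpu(uint64_t v)
{
    if(host_is_little_endian()){
        return v;   /* host order already matches the on-disk order */
    }
    return ((v & 0x00000000000000FFULL) << 56)
         | ((v & 0x000000000000FF00ULL) << 40)
         | ((v & 0x0000000000FF0000ULL) << 24)
         | ((v & 0x00000000FF000000ULL) <<  8)
         | ((v & 0x000000FF00000000ULL) >>  8)
         | ((v & 0x0000FF0000000000ULL) >> 24)
         | ((v & 0x00FF000000000000ULL) >> 40)
         | ((v & 0xFF00000000000000ULL) >> 56);
}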
I've started something at https://github.com/awilfox/mosquitto/commit/165005742c1c6a5b0b6d037c9e83cbe4a18a3092. It is not portable as-is, since it uses the BSD <endian.h> header; that header is also available on Linux and is part of POSIX-next, but it is not widely available on other platforms (such as Windows or Solaris). It does, however, pass all tests on my Power9 system in big-endian mode.
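If relying on <endian.h> turns out to be a blocker, a fully portable alternative (again only a sketch, with hypothetical helper names) is to assemble and emit the field byte by byte. The shifts are defined in terms of values rather than memory layout, so the same code behaves identically on Windows, Solaris or anything else, with no byte-order header at all:

/* Sketch of a header-free alternative: rather than reinterpreting raw
 * bytes as a uint64_t and swapping afterwards, assemble the value from
 * the byte buffer directly. */
#include <stdint.h>

static uint64_t read_le64(const unsigned char buf[8])
{
    return  (uint64_t)buf[0]
         | ((uint64_t)buf[1] <<  8)
         | ((uint64_t)buf[2] << 16)
         | ((uint64_t)buf[3] << 24)
         | ((uint64_t)buf[4] << 32)
         | ((uint64_t)buf[5] << 40)
         | ((uint64_t)buf[6] << 48)
         | ((uint64_t)buf[7] << 56);
}

static void write_le64(unsigned char buf[8], uint64_t v)
{
    int i;

    for(i = 0; i < 8; i++){
        buf[i] = (unsigned char)(v >> (8 * i));
    }
}

The write helper is the mirror image of the read helper, so old LE-written files and newly written ones stay byte-for-byte compatible.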
This will likely be relevant for #2536 as well, at least if the users on AIX want to be able to share persisted db files with other systems.
Sorry, I missed the broker tests. Two of them fail:
1653814066: The 'port' option is now deprecated and will be removed in a future version. Please use 'listener' instead.
1653814066: mosquitto version 2.0.14 starting
1653814066: Config loaded from 06-bridge-b2br-late-connection-retain.conf.
1653814066: Opening ipv4 listen socket on port 1898.
1653814066: Opening ipv6 listen socket on port 1898.
1653814066: Bridge local.gwyn.foxkit.us.bridge_sample doing local SUBSCRIBE on topic bridge/#
1653814066: Connecting bridge bridge_sample (127.0.0.1:1894)
1653814066: Bridge gwyn.foxkit.us.bridge_sample sending CONNECT
1653814066: mosquitto version 2.0.14 running
1653814066: New connection from ::1:37400 on port 1898.
1653814066: Client <unknown> closed its connection.
1653814066: Received CONNACK on connection local.gwyn.foxkit.us.bridge_sample.
1653814086: Client local.gwyn.foxkit.us.bridge_sample closed its connection.
1653814086: mosquitto version 2.0.14 terminating
1653814086: Saving in-memory database to 06-bridge-b2br-late-connection-retain.db.
20.467s : ./06-bridge-b2br-late-connection-retain.py
and
1653814045: The 'port' option is now deprecated and will be removed in a future version. Please use 'listener' instead.
1653814045: mosquitto version 2.0.14 starting
1653814045: Config loaded from 11-pub-props.conf.
1653814045: Opening ipv4 listen socket on port 1889.
1653814045: Opening ipv6 listen socket on port 1889.
1653814045: mosquitto version 2.0.14 running
1653814045: New connection from ::1:33018 on port 1889.
1653814045: Client <unknown> closed its connection.
1653814045: New connection from 127.0.0.1:35910 on port 1889.
1653814045: New client connected from 127.0.0.1:35910 as persistent-props-test (p5, c1, k60).
1653814045: No will message specified.
1653814045: Sending CONNACK to persistent-props-test (0, 0)
1653814045: Received SUBSCRIBE from persistent-props-test
1653814045: subpub/qos1 (QoS 0)
1653814045: persistent-props-test 0 subpub/qos1
1653814045: Sending SUBACK to persistent-props-test
1653814065: mosquitto version 2.0.14 terminating
1653814065: Saving in-memory database to mosquitto-1889.db.
Traceback (most recent call last):
File "/home/awilcox/Code/awilfox/user-next/user/mosquitto/src/mosquitto-2.0.14/test/broker/./11-pub-props.py", line 62, in <module>
mosq_test.expect_packet(sock, "publish2", publish2_packet)
File "/home/awilcox/Code/awilfox/user-next/user/mosquitto/src/mosquitto-2.0.14/test/mosq_test.py", line 91, in expect_packet
packet_recvd = sock.recv(rlen)
TimeoutError: timed out
20.628s : ./11-pub-props.py
I've had a look and, while it's not explicitly documented, there has never been a guarantee that the persistence files are portable across architectures. In particular, moving between 32-bit and 64-bit architectures will fail. htons and htonl are standard, but as you've implied there is no standard equivalent for 64-bit integers. I seem to recall looking at the mess around this just on Unix platforms and deciding not to worry about it too much, on the assumption that the number of people swapping persistence files between machines of different endianness would be absolutely minuscule.
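For reference, the usual workaround when a 64-bit conversion is needed is to build it out of two htonl() calls plus a host byte-order check. This is only an illustrative sketch with a hypothetical name, not something Mosquitto currently ships:

/* Illustrative only: a 64-bit host-to-network conversion built from the
 * standard 32-bit htonl(). */
#include <stdint.h>
#include <arpa/inet.h>   /* htonl(); use <winsock2.h> on Windows */

static uint64_t my_hton64(uint64_t v)
{
    const uint32_t probe = 1;

    if(*(const unsigned char *)&probe == 0){
        return v;   /* big-endian host: already in network byte order */
    }
    /* little-endian host: convert each 32-bit half, then swap the halves */
    return ((uint64_t)htonl((uint32_t)(v & 0xFFFFFFFFUL)) << 32)
            | (uint64_t)htonl((uint32_t)(v >> 32));
}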
In your case the problem is that the test itself is at fault, not the persistence read/write functions. If the hard-coded values the tests compare against were converted to big endian, then I imagine the first set of tests would pass just fine.
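If that route were taken, one way to keep a single set of test sources working on both byte orders would be to wrap the expected constants in a small helper that swaps them on big-endian hosts. The helper below is purely hypothetical and only illustrates the idea against the assertion shown in the failure output:

/* Hypothetical test helper: interpret a constant that was written down
 * for little-endian hosts in the current host's byte order.  On LE it
 * returns the value unchanged; on BE it reverses the bytes. */
#include <stdint.h>

static uint64_t le2host64(uint64_t v)
{
    const uint16_t probe = 1;
    uint64_t r = 0;
    int i;

    if(*(const uint8_t *)&probe == 1){
        return v;   /* little-endian host: constant already matches */
    }
    for(i = 0; i < 8; i++){
        r = (r << 8) | ((v >> (8 * i)) & 0xFF);
    }
    return r;
}

/* The assertion from the failure output could then become, for example:
 *     CU_ASSERT_EQUAL(db.last_db_id, le2host64(0x7856341200000000));
 */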