Starting fdbserver coordinator with mismatched local and public ports leads to internal errors
If one starts an fdbserver process with different local and public ports, the server repeatedly complains about internal errors but doesn't crash (which is fun).
For example, I can create a cluster file:
alec$ cat fdb.testports.cluster
asdfasdf:[email protected]:4556
Then, if I start a single-instance fdbserver that claims to be the coordinator but listens on a different port:
alec$ /usr/local/libexec/fdbserver -L logs/ -d data/ -p 127.0.0.1:4557 -l 127.0.0.1:4556 -C fdb.testports.cluster
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbserver.debug -arch x86_64 -l 0x1035e5000 0x104890f60 0x104804ff9 0x104803cad 0x10377b048 0x1048fb626 0x1048f699e 0x103cde08a 0x7fff61e223d5
FDBD joined cluster.
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbserver.debug -arch x86_64 -l 0x1035e5000 0x104890f60 0x104804ff9 0x104803cad 0x10377b048 0x1048fb626 0x1048f699e 0x103cde08a 0x7fff61e223d5
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbserver.debug -arch x86_64 -l 0x1035e5000 0x104890f60 0x104804ff9 0x104803cad 0x10377b048 0x1048fb626 0x1048f699e 0x103cde08a 0x7fff61e223d5
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbserver.debug -arch x86_64 -l 0x1035e5000 0x104890f60 0x104804ff9 0x104803cad 0x10377b048 0x1048fb626 0x1048f699e 0x103cde08a 0x7fff61e223d5
Then, if I try to connect to that cluster:
alec$ fdbcli -C fdb.testports.cluster
Using cluster file `fdb.testports.cluster'.
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbcli.debug -arch x86_64 -l 0x10ccd5000 0x10d060ea0 0x10cfe0ca9 0x10cfdf95d 0x10cce2908 0x10d0ccb16 0x10d0c793e 0x10ce3cb87 0x10ccff8a9 0x7fff61e223d5
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbcli.debug -arch x86_64 -l 0x10ccd5000 0x10d060ea0 0x10cfe0ca9 0x10cfdf95d 0x10cce2908 0x10d0ccb16 0x10d0c793e 0x10ce3cb87 0x10ccff8a9 0x7fff61e223d5
Internal Error @ fdbrpc/FlowTransport.actor.cpp 789:
atos -o fdbcli.debug -arch x86_64 -l 0x10ccd5000 0x10d060ea0 0x10cfe0ca9 0x10cfdf95d 0x10cce2908 0x10d0ccb16 0x10d0c793e 0x10ce3cb87 0x10ccff8a9 0x7fff61e223d5
Same thing.
I am running version 6.1.8 of fdbserver and fdbcli. The line that it's vomiting on in 6.1.8 is here:
https://github.com/apple/foundationdb/blob/bd6b10cbcee08910667194e6388733acd3b80549/fdbrpc/FlowTransport.actor.cpp#L789
It's an assert statement about ports matching, so I guess it's not that surprising that we hit it with certain port mismatches.
I also tried to see what would happen if I chose a non-coordinator process. I started two servers, one of which was not a coordinator and had mismatched ports:
alec$ /usr/local/libexec/fdbserver -L logs/ -d data/ -p 127.0.0.1:4557 -l 127.0.0.1:4554 -C ~/foundation/fdb.testports.cluster
Then I started one that had the right ports:
alec$ /usr/local/libexec/fdbserver -d data2/ -L logs2/ -p 127.0.0.1:4556 -C ~/foundation/fdb.testports.cluster
FDBD joined cluster.
Then I tried to connect to it via the CLI:
alec$ fdbcli -C fdb.testports.cluster
Using cluster file `fdb.testports.cluster'.
The database is unavailable; type `status' for more information.
Welcome to the fdbcli. For help, type `help'.
fdb> status
Using cluster file `fdb.testports.cluster'.
The coordinator(s) have no record of this database. Either the coordinator
addresses are incorrect, the coordination state on those machines is missing, or
no database has been created.
127.0.0.1:4556 (reachable)
Unable to locate the data distributor worker.
Unable to locate the ratekeeper worker.
fdb> configure new single memory
WARNING: Long delay (Ctrl-C to interrupt)
As might be expected, I was unable to connect to it. Somewhat more surprisingly, though, when I tried to configure it as a new database, that command hung. I think something was trying to communicate with the mismatched process at its advertised (public) address, but since nothing was actually listening on that port, it never heard back (that process was still pinging out, though, so maybe the cluster thought it was still alive?). Something weird, for sure.
These tests were done on macOS, but I've seen similar results on certain flavors of Linux, so I don't think it's a platform-specific thing. I didn't try to see what happens if one uses this feature "correctly", i.e., starting up some container that remaps ports so that the correct configuration really is to have different listen and public ports.
I'm not entirely sure what the "right" behavior should be. If it really is the case that we don't support setting different listen and public ports (which you could conceivably want when, say, your ports are remapped by some kind of network container), then I suppose we should throw an error on startup if they mismatch. Even just better warnings or error messages might help clients diagnose what's going on.
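For what it's worth, the kind of configuration where you'd legitimately want different listen and public ports would look roughly like the sketch below. This is only an illustration of the intended usage, not something I've verified works; the host IP 10.0.0.5, host port 14500, and the assumption that the environment forwards host port 14500 to the process's port 4500 are all hypothetical:

alec$ cat fdb.nat.cluster
asdfasdf:[email protected]:14500
# hypothetical NAT: the host forwards 10.0.0.5:14500 to this process's port 4500
alec$ /usr/local/libexec/fdbserver -L logs/ -d data/ -p 10.0.0.5:14500 -l 0.0.0.0:4500 -C fdb.nat.cluster

In other words, the process listens on 4500 locally but advertises the externally reachable address in both the cluster file and its public address, which is exactly the combination that currently trips the assert.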
See, e.g., #1690, where I believe this behavior causes issues if one tries to set the FDB_PORT environment variable.
#222 is also about this sort of problem, and notes that an ASSERT isn't a great way to communicate what's wrong.
Any hope this will get fixed, or is it fundamentally Hard somehow?
We're running FDB in a container as part of a Testcontainers setup: (1) start an FDB container, (2) run a client-side test suite against the FDB container, (3) stop the FDB container. We want to be able to run multiple test suites at the same time, without all the containers trying to listen on the same host port.
The container machinery makes it easy to map container port 4500 to a random available host port, so each FDB container can use port 4500 internally but gets its own separate port number in host-port space. But the issue described above keeps this from working, because in that setup FDB's public port necessarily differs from its listen port.
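To make that concrete, here's a rough sketch of the flow using plain docker commands rather than Testcontainers (the image name, the container name, and the mapped host port 32768 are illustrative assumptions, not taken from the issue above):

# publish container port 4500 to a random available host port (illustrative image)
alec$ docker run -d --name fdb-test -p 4500 foundationdb/foundationdb
alec$ docker port fdb-test 4500/tcp
0.0.0.0:32768
# the host-side cluster file has to point at the mapped host port:
alec$ cat fdb.docker.cluster
docker:[email protected]:32768
# but inside the container fdbserver listens on 4500, so its public port (32768)
# and its listen port (4500) differ -- exactly the mismatch described above

The only workaround we've found is to pin every container to the same fixed host port so that the ports match, which is what prevents the test suites from running in parallel.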