Issues faced while getting mongodb test suite running locally

Open rhishikeshj opened this issue 4 years ago • 23 comments

Here are some issues I faced while getting this MongoDB Jepsen suite running locally with Docker. The code I am using:

Jepsen: commit a2bcad59f0df5bd39cea1e61d9b64376c479df9c (HEAD -> main)
MongoDB: commit 83548bb8e054170ecc4b8fda70390e40fcca5e30 (origin/master, origin/HEAD)

Initially I didn't have enough nodes (by default Jepsen starts 5 nodes in Docker), as is evident from the function jepsen.mongodb.db/shard-node-plan. I fixed that by adding 2 more nodes. Then I hit another roadblock: while installing MongoDB on each node, it errored out saying that a required dependency, libcurl3, couldn't be found. Apparently libcurl4 and libcurl3 don't work well together, and in spite of my efforts I wasn't able to get libcurl3 and Mongo running. So I changed the way Jepsen was installing MongoDB and followed the official documentation that installs Mongo 4.2. That worked. But I am still unable to run the tests: every time, there seems to be some SSH-related exception saying the control node can't reach the DB nodes.

I changed the installation instructions for MongoDB because the default instructions in setup! were erroring out due to the libcurl3 dependency. Here are the instructions I have coded into setup! instead:

(defn install!
  "Installs MongoDB on the current node."
  [test]
  (c/su
   (c/exec :mkdir :-p "/tmp/jepsen")
   (let [version   (:version test)
         ;; Release series used in the key and repo URLs: "4.2" from the
         ;; hard-coded "4.2.10"
         m-version (str/join "." (butlast (str/split "4.2.10" #"\.")))
         ;; Builds a pinned package spec such as :mongodb-org=<version>
         versioner #(keyword (str "mongodb-" %1 "=" version))]
     ;; Recover from a half-configured dpkg/apt state on the node
     (c/exec :dpkg :--configure :-a)
     (c/exec :apt :-y :--fix-broken :install)
     ;; -y so apt-get never waits for confirmation over SSH
     (c/exec :apt-get :install :-y :gnupg)
     ;; Import MongoDB's signing key and register the apt repository
     (c/exec :wget :-qO :-
             (str "https://www.mongodb.org/static/pgp/server-" m-version ".asc")
             :| :apt-key :add :-)
     (c/exec :echo (str "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/" m-version " multiverse") :| :tee (str "/etc/apt/sources.list.d/mongodb-org-" m-version ".list"))
     (c/exec :apt-get :update)
     ;; Install the server, shell and mongos packages at the pinned version
     (c/exec :apt-get :install :-y
             (versioner "org")
             (versioner "org-server")
             (versioner "org-shell")
             (versioner "org-mongos"))
     (c/exec :systemctl :daemon-reload))))
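
For anyone who wants to try the same steps by hand on a node, the commands issued by install! correspond roughly to the shell sketch below. The helper names mongo_series, mongo_repo_line and install_mongodb are made up for illustration, the dpkg and --fix-broken recovery steps are left out, and the release series is derived from the version argument rather than hard-coded. Treat it as an approximation under those assumptions.

```shell
#!/bin/sh
# Sketch only: mirrors the steps of install! as plain shell.
# mongo_series, mongo_repo_line and install_mongodb are hypothetical names.

# Derive the release series ("4.2") from a full version ("4.2.10")
# by dropping the last dot-separated component.
mongo_series() {
  printf '%s\n' "${1%.*}"
}

# Build the apt source line for a release series
# (same Ubuntu bionic repository as in the function above).
mongo_repo_line() {
  printf 'deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/%s multiverse\n' "$1"
}

# Meant to be run as root on a DB node; it is only defined here, not called.
install_mongodb() {
  version="$1"
  series="$(mongo_series "$version")"
  mkdir -p /tmp/jepsen
  apt-get install -y gnupg
  wget -qO - "https://www.mongodb.org/static/pgp/server-${series}.asc" | apt-key add -
  mongo_repo_line "$series" > "/etc/apt/sources.list.d/mongodb-org-${series}.list"
  apt-get update
  apt-get install -y \
    "mongodb-org=${version}" \
    "mongodb-org-server=${version}" \
    "mongodb-org-shell=${version}" \
    "mongodb-org-mongos=${version}"
  systemctl daemon-reload
}
```

On a node this would be called as `install_mongodb 4.2.10`, where the version number is only an example.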

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

@aphyr And what do you know, just as I went to run the tests again hoping to send you a stack trace, they worked! 🍻 🙂 I will try running them again to see if there is some instability. Other than that, if you see any obvious steps that I have missed, do let me know. I'll paste the SSH-related exceptions here as soon as I encounter them :)

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

Some of the code here, for example the --fix-broken stuff, is for fixing a weird state that my Debian nodes were getting into. Please ignore it.

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

Huh, okay... I can say that the test is designed for a specific version of debian--it's been a while since I poked my head into the docker and mongo tests, but this miiiight be due to a mismatch between those versions? The libcurl transition has been a real bear: some systems need 3, some 4, etc. etc.

aphyr avatar Dec 03 '20 16:12 aphyr

If this change (the change in setup! to install MongoDB) works well, can I open a PR to submit it? What other kinds of tests do you require before taking contributions? Are there any guidelines for code contributions?

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

I think it'd be good to figure out what version of Debian worked before, and what version it works with now, and to document that in the README, for starters! I do apologize, this was a rush job in my free time, and I wasn't as diligent about future-proofing things as I should have been!
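
One way to collect the Debian version of each node for the README is to read /etc/os-release. The sketch below assumes the nodes ship that file with a PRETTY_NAME entry; os_pretty_name is a hypothetical helper name, and the node names in the commented loop are only examples.

```shell
#!/bin/sh
# Sketch only: report which OS release a node is running.
# os_pretty_name is a hypothetical helper name.

# usage: os_pretty_name [OS_RELEASE_FILE]
# Sources the file in a subshell so its variables do not leak into the caller.
os_pretty_name() {
  (
    . "${1:-/etc/os-release}" && printf '%s\n' "$PRETTY_NAME"
  )
}

# Example loop from the control node (node names are assumptions):
# for n in n1 n2 n3 n4 n5; do
#   ssh "$n" '. /etc/os-release && echo "$PRETTY_NAME"'
# done
```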

aphyr avatar Dec 03 '20 16:12 aphyr

So running it 5 times caused 1 instance of the test suite crashing:

com.mongodb.MongoSocketOpenException: Exception opening socket
        at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:70) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:127) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:131) ~[mongodb-driver-core-4.0.2.jar:na]
        at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:na]
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[na:na]
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[na:na]
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[na:na]
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403) ~[na:na]
        at java.base/java.net.Socket.connect(Socket.java:609) ~[na:na]
        at com.mongodb.internal.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:63) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:79) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:65) ~[mongodb-driver-core-4.0.2.jar:na]
        ... 3 common frames omitted
WARN [2020-12-03 16:40:00,246] main - jepsen.core Test crashed!
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Cannot write jepsen.control$session$fn__3025@54bb1068 as tag null
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:na]
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[na:na]
        at clojure.core$deref_future.invokeStatic(core.clj:2300) ~[clojure-1.10.0.jar:na]
        at clojure.core$future_call$reify__8439.deref(core.clj:6974) ~[clojure-1.10.0.jar:na]
        at clojure.core$deref.invokeStatic(core.clj:2320) ~[clojure-1.10.0.jar:na]
        at clojure.core$deref.invoke(core.clj:2306) ~[clojure-1.10.0.jar:na]
        at clojure.core$map$fn__5851.invoke(core.clj:2753) ~[clojure-1.10.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:42) ~[clojure-1.10.0.jar:na]
        at clojure.lang.LazySeq.seq(LazySeq.java:51) ~[clojure-1.10.0.jar:na]
        at clojure.lang.RT.seq(RT.java:531) ~[clojure-1.10.0.jar:na]
        at clojure.core$seq__5387.invokeStatic(core.clj:137) ~[clojure-1.10.0.jar:na]
        at clojure.core$dorun.invokeStatic(core.clj:3133) ~[clojure-1.10.0.jar:na]
        at clojure.core$dorun.invoke(core.clj:3133) ~[clojure-1.10.0.jar:na]
        at jepsen.store$save_1_BANG_.invokeStatic(store.clj:376) ~[jepsen-0.1.19.jar:na]
        at jepsen.store$save_1_BANG_.invoke(store.clj:372) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_$fn__10005$fn__10012.invoke(core.clj:633) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_$fn__10005.invoke(core.clj:619) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_.invokeStatic(core.clj:605) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_.invoke(core.clj:531) ~[jepsen-0.1.19.jar:na]
        at jepsen.cli$test_all_run_tests_BANG_$fn__10790.invoke(cli.clj:422) ~[jepsen-0.1.19.jar:na]
        at clojure.core$map_indexed$mapi__8533$fn__8534.invoke(core.clj:7308) ~[clojure-1.10.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:42) ~[clojure-1.10.0.jar:na]

I think the exceptions I was seeing earlier were of a similar nature.

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

Ah, well that looks like there's a problem in the MongoDB setup process--it's not accepting connections. Likely a race condition between the code and MongoDB itself, if it's sporadic. Maybe there needs to be some additional health checks during db/setup!...

aphyr avatar Dec 03 '20 16:12 aphyr

From the Dockerfile, I can see that the Docker image is based on this Debian Docker image: https://github.com/jgoerzen/docker-debian-base-standard

I am not sure I understand what you mean by "what version of Debian worked before, and what version it works with now". Do you mean the Mongo Debian distribution that the code is pulling from https://repo.mongodb.org/apt/debian/dists/stretch/mongodb-org/4.2/main/binary-amd64/ ?

If I can help with updating the README, do let me know.

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

Ah, well that looks like there's a problem in the MongoDB setup process--it's not accepting connections. Likely a race condition between the code and MongoDB itself, if it's sporadic. Maybe there needs to be some additional health checks during db/setup!...

Do you mean something like

echo 'db.runCommand("ping").ok' | mongo localhost:27017/test --quiet

to check whether the mongo service is up and running?
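
A single ping only helps if something retries it until the server is ready. A minimal sketch of such a retry loop follows. wait_for and mongo_ping are hypothetical helper names, not part of Jepsen or MongoDB, and the sketch assumes the mongo shell exits with a non-zero status when it cannot connect.

```shell
#!/bin/sh
# Sketch only: retry a readiness probe until it succeeds or attempts run out.
# wait_for and mongo_ping are hypothetical helpers.

# usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    # Sleep only when another attempt is still to come.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# The probe suggested above, wrapped so it can be passed to wait_for.
mongo_ping() {
  echo 'db.runCommand("ping").ok' | mongo localhost:27017/test --quiet
}
```

For example, `wait_for 30 1 mongo_ping || echo "mongod did not come up" >&2` would poll once a second for up to 30 attempts; the numbers are arbitrary.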

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

More that I'm not sure whether this ever worked with the Docker setup, and if you're having problems, it might be because this version of Jepsen and the version of Mongo it installs were intended to run on, say, Jessie, when the Docker env is giving you, say, Bullseye. I honestly forget, so much has happened this year. I'd love to go dig into this for you but I am scrambling to keep up with waaaay too much client stuff right now!

aphyr avatar Dec 03 '20 16:12 aphyr

echo 'db.runCommand("ping").ok' | mongo localhost:27017/test --quiet

Maybe. I think the current code probably does its own health checks already... lemme check. Ah, yes, here it is:

https://github.com/jepsen-io/mongodb/blob/83548bb8e054170ecc4b8fda70390e40fcca5e30/src/jepsen/mongodb/db.clj#L183-L185

We've got blocking on individual node startup, blocking on cluster join, blocking on elections, blocking on the cluster, blocking on the primary. That is, apparently, not enough blocking! This isn't just you: Mongo's... historically been difficult to set up reliably.

aphyr avatar Dec 03 '20 16:12 aphyr

More that I'm not sure whether this ever worked with the Docker setup, and if you're having problems, it might be because this version of Jepsen and the version of Mongo it installs were intended to run on, say, Jessie, when the Docker env is giving you, say, Bullseye. I honestly forget, so much has happened this year. I'd love to go dig into this for you but I am scrambling to keep up with waaaay too much client stuff right now!

Aah, that makes sense. FWIW, the Debian version that the current Jepsen main branch sets up is Buster. If you do come across some pointers on what this was originally supposed to run on, let me know. I can look at the Jepsen code as of the time this MongoDB test suite was initially created. Maybe that can give some pointers on the Debian version?

rhishikeshj avatar Dec 03 '20 16:12 rhishikeshj

Ooof, yeah. Again, I'm sorry. This is a holdover from an older time in Jepsen when Debian versions lasted (compared to the lifetime of a test) forever and were often cross-compatible: we never really established a convention around OS versioning. Now that people are trying to dredge up tests written n years ago (or even 7 months ago!), those assumptions don't always hold.

This is a good reminder to me to write more of that documentation, and start splitting out future jepsen.os.debian/os objects into specific versions.

It looks like this test uses jepsen 0.1.19, which... I think should be using Jessie. Jepsen 0.2.1 transitioned to Buster.

aphyr avatar Dec 03 '20 17:12 aphyr

From this commit, it seems the control node used Ubuntu and the DB nodes used Stretch around the time these Mongo tests were written. Am I looking at this correctly?

rhishikeshj avatar Dec 03 '20 17:12 rhishikeshj

Oh, yeah, but that doesn't (and I am so sorry, I know this is confusing) mean this test was supposed to work with Docker. The docker directory was contributed by other people--I hadn't used it myself, and its maintainers drifted off to do other things, so it drifted behind. I test primarily using LXC and AWS, and was running Jessie at the time, I think. That's why this test was written for Jessie, and probably won't work with either the old or new docker setups, since they're for Stretch and Buster.

So, I think you've got two options here. One is if you get a Jessie environment going (are the mirrors still around?) you should be able to run the test as-is. The other is using Buster and figuring out how to port the test forward to Buster, which miiight be as simple as bumping the version of jepsen in project.clj to 0.2.1+.

aphyr avatar Dec 03 '20 17:12 aphyr

Okay, I understand now. Thanks.

As regards the 2 options, I would say bringing the tests up to date is more fruitful in the long run. I can give that a crack to see what else needs changing. Right off the bat, I think some code changes might be needed. Currently mongodb.clj seems to depend on [jepsen.generator.pure :as gen], which isn't there in jepsen/0.2.1.

Strangely, in the source code, I see this namespace mentioned in the docs but only see it used in the dgraph code. Where does this namespace come from in the latest Jepsen code?

rhishikeshj avatar Dec 03 '20 17:12 rhishikeshj

Currently mongodb.clj seems to depend on [jepsen.generator.pure :as gen] which isn't there in jepsen/0.2.1

Ah, now THIS I actually have good docs for! https://github.com/jepsen-io/jepsen/releases/tag/0.2.0

aphyr avatar Dec 03 '20 17:12 aphyr

(also be advised there's a bug in 0.2.0 that might affect generators--best to jump straight to 0.2.1 I think)

aphyr avatar Dec 03 '20 17:12 aphyr

Okay, so this morning I seem to be getting the original SSH-related exceptions rather frequently:

WARN [2020-12-04 03:17:54,150] jepsen node n4 - jepsen.control Encountered error with conn [:control "n4"]; reopening
java.lang.InterruptedException: sleep interrupted
        at java.base/java.lang.Thread.sleep(Native Method)
        at clj_ssh.ssh$ssh_exec.invokeStatic(ssh.clj:690)
        at clj_ssh.ssh$ssh_exec.invoke(ssh.clj:670)
        at clj_ssh.ssh$ssh.invokeStatic(ssh.clj:723)
        at clj_ssh.ssh$ssh.invoke(ssh.clj:699)
        at jepsen.control.SSHRemote.execute_BANG_(control.clj:331)
        at jepsen.control$ssh_STAR_$fn__3063.invoke(control.clj:172)
        at jepsen.control$ssh_STAR_.invokeStatic(control.clj:172)
        at jepsen.control$ssh_STAR_.invoke(control.clj:168)
        at jepsen.control$exec_STAR_.invokeStatic(control.clj:194)
        at jepsen.control$exec_STAR_.doInvoke(control.clj:191)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:665)
        at clojure.core$apply.invoke(core.clj:660)
        at jepsen.control$exec.invokeStatic(control.clj:210)
        at jepsen.control$exec.doInvoke(control.clj:204)
        at clojure.lang.RestFn.invoke(RestFn.java:436)
        at jepsen.db$tcpdump$reify__3446.teardown_BANG_(db.clj:112)
        at jepsen.mongodb.db.ShardedDB.teardown_BANG_(db.clj:406)
        at jepsen.db$fn__3273$G__3269__3277.invoke(db.clj:11)
        at jepsen.db$fn__3273$G__3268__3282.invoke(db.clj:11)
        at clojure.core$partial$fn__5824.invoke(core.clj:2625)
        at jepsen.control$on_nodes$fn__3161.invoke(control.clj:430)

This is for node n4, but similar exceptions happen for all nodes. A simple ssh n4 from the control node seems to work, so there isn't an obvious problem with the Docker cluster. Any pointers for me to explore here?

rhishikeshj avatar Dec 04 '20 03:12 rhishikeshj

That's a long-standing bug in the SSH library--some kind of race condition I think. We can generally recover transparently.

aphyr avatar Dec 04 '20 03:12 aphyr

@aphyr And what do you know, just as I went to run the tests again hoping to send you a stack trace, they worked! 🍻 🙂 I will try running them again to see if there is some instability. Other than that, if you see any obvious steps that I have missed, do let me know. I'll paste the SSH-related exceptions here as soon as I encounter them :)

Oh bro! It is exciting that you have dealt with the problem of running the MongoDB Jepsen test in docker-compose, even though the test may crash in some situations. In my previous work, I rented some servers to run this test suite, which was expensive, so I didn't continue. You have done two things, and only two things, to fix the bug, right?

  1. Adding 2 more nodes
  2. Changing the installation instructions in the setup! function

I am interested in your work, and it would be helpful if you could share your config and fixes. Thanks.

Tsunaou avatar Dec 04 '20 09:12 Tsunaou

I had a chance to go through the mongo code today and get everything fixed up for the latest Jepsen and Debian Buster.

aphyr avatar Dec 06 '20 21:12 aphyr

@aphyr nice! 😊 Would love to see that happen. Also, I have opened a pull request making some of the changes for jepsen 0.2.1. Let me know if that's mergeable.

rhishikeshj avatar Dec 07 '20 07:12 rhishikeshj