xtdb icon indicating copy to clipboard operation
xtdb copied to clipboard

Checkpointing with LMDB fails with "Directory exists"...once?

Open arichiardi opened this issue 2 years ago • 6 comments

Hi there, we detected a very weird issue that does not seem to be a blocker but we are not yet sure.

EDIT: this seems to result in empty query results so there is definitely something going on.

Basically with this configuration:

(defn configure-index
  [{:keys [index index-path checkpoint-enabled? checkpoint-duration-hours checkpoint-path]
    :or {checkpoint-path "xtdb/checkpoints" index-path "xtdb/index"}}]
  (condp = (keyword index)
    ;;;;;;;;;;;;;;;; We use this branching
    :lmdb (cond-> {:xtdb/module 'xtdb.lmdb/->kv-store :db-dir (io/file index-path)}
            checkpoint-enabled?
            (merge
             ;;;;;;;;;;;;;;;; 
             ;;;;;;;;;;;;;;;; Note no `:checkpoint-dir`
             ;;;;;;;;;;;;;;;; 
             {:checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                             :store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
                                       :path checkpoint-path}
                             :approx-frequency (Duration/ofSeconds 1)}}))
    :kv-memory {:xtdb/module `xtdb.mem-kv/->kv-store}))

We get an error at the first checkpointing, then everything seems to work?

user=> 2022-04-11 15:36:28,693 WARN xtdb.checkpoint - Checkpointing failed
xtdb.IllegalArgumentException: Directory exists: /tmp/checkpointing15491662742877653634
at xtdb.error$illegal_arg.invokeStatic(error.clj:12)
at xtdb.error$illegal_arg.invoke(error.clj:3)
at xtdb.error$illegal_arg.invokeStatic(error.clj:7)
at xtdb.error$illegal_arg.invoke(error.clj:3)
at xtdb.lmdb$env_copy.invokeStatic(lmdb.clj:104)
at xtdb.lmdb$env_copy.invokePrim(lmdb.clj)
at xtdb.lmdb.LMDBKv.save_checkpoint(lmdb.clj:258)
at xtdb.checkpoint$checkpoint.invokeStatic(checkpoint.clj:40)
at xtdb.checkpoint$checkpoint.invoke(checkpoint.clj:37)
at xtdb.checkpoint.ScheduledCheckpointer$run__120762$fn__120766.invoke(checkpoint.clj:70)
at xtdb.checkpoint.ScheduledCheckpointer$run__120762.invoke(checkpoint.clj:69)
at xtdb.checkpoint.ScheduledCheckpointer$schedule__120773$fn__120774.invoke(checkpoint.clj:90)
at clojure.lang.AFn.run(AFn.java:22)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
2022-04-11 15:36:29,702 INFO xtdb.checkpoint - Uploading checkpoint at '{:xtdb.api/tx-time #inst "2022-04-11T21:36:28.421-00:00", :xtdb.api/tx-id 617}'
2022-04-11 15:36:29,712 INFO xtdb.checkpoint - Uploaded checkpoint: {:xtdb.checkpoint/cp-format {:index-version 20, :xtdb.lmdb/version "3"}, :tx {:xtdb.api/tx-time #inst "2022-04-11T21:36:28.421-00:00", :xtdb.api/tx-id 617}, :xtdb.checkpoint/cp-uri "xtdb/checkpoints/checkpoint-617-2022-04-11T21:36:29.702-00:00", :xtdb.checkpoint/checkpoint-at #inst "2022-04-11T21:36:29.702-00:00"}
2022-04-11 15:36:31,695 INFO xtdb.checkpoint - Uploading checkpoint at '{:xtdb.api/tx-time #inst "2022-04-11T21:36:28.421-00:00", :xtdb.api/tx-id 617}'
2022-04-11 15:36:31,707 INFO xtdb.checkpoint - Uploaded checkpoint: {:xtdb.checkpoint/cp-format {:index-version 20, :xtdb.lmdb/version "3"}, :tx {:xtdb.api/tx-time #inst "2022-04-11T21:36:28.421-00:00", :xtdb.api/tx-id 617}, :xtdb.checkpoint/cp-uri "xtdb/checkpoints/checkpoint-617-2022-04-11T21:36:31.696-00:00", :xtdb.checkpoint/checkpoint-at #inst "2022-04-11T21:36:31.696-00:00"}
2022-04-11 15:36:33,694 INFO xtdb.checkpoint - Uploading checkpoint at '{:xtdb.api/tx-time #inst "2022-04-11T21:36:28.421-00:00", :xtdb.api/tx-id 617}'
2022-04-11 15:36:33,703 INFO xtdb.checkpoint - Uploaded checkpoint: {:xtdb.checkpoint/cp-format {:index-version 20, :xtdb.lmdb/version "3"}, :tx {:xtdb.api/tx-time #inst "2022-04-11T21:36:28.421-00:00", :xtdb.api/tx-id 617}, :xtdb.checkpoint/cp-uri "xtdb/checkpoints/checkpoint-617-2022-04-11T21:36:33.694-00:00", :xtdb.checkpoint/checkpoint-at #inst "2022-04-11T21:36:33.694-00:00"}

Not entirely sure this helps/is a bug but worth bringing it up I guess.

arichiardi avatar Apr 11 '22 21:04 arichiardi

Thanks for opening the issue - it does seem weird to me also.

Are checkpoints being uploaded every ~2 seconds indefinitely? (i.e. beyond your presumably truncated log output) The configured duration shows 6 hours :thinking: :thinking:

Are you running multiple nodes at the same time?

refset avatar Apr 11 '22 22:04 refset

@refset Sorry I forgot to change that part, I was using ofSeconds 1 for my dev testing - yes indefinitely and no only one node.

arichiardi avatar Apr 11 '22 22:04 arichiardi

Hi, I think I've worked out what's causing the issue. The lmdb module has the following code:

(defn- env-copy [^long env path]
  (let [file (io/file path)]
    (when (.exists file)
      (throw (err/illegal-arg :directory-exists
                              {::err/message (str "Directory exists: " (.getAbsolutePath file))})))
    (.mkdirs file)
    (success? (LMDB/mdb_env_copy env (.getAbsolutePath file)))))

The temporary folder for creating the checkpoint must not exist before the lmdb env-copy command is run. However in the checkpointer code we have:

  (start [this src {::keys [cp-format]}]
    (let [checkpoint-dir (or (some-> checkpoint-dir .toFile) (xio/create-tmpdir "checkpointing"))

which calls xio/create-tmpdir which creates the temporary folder via File/createTempDirectory.

So on the first run the /tmp/checkpointing temporary folder is actually created and will cause the error as seen above. However the checkpointer will remove this folder:

                (try
                  (checkpoint {:approx-frequency approx-frequency,
                               :dir checkpoint-dir,
                               :src src,
                               :store store,
                               ::cp-format cp-format})
                  (catch Exception e
                    (log/warn e "Checkpointing failed"))
                  (catch Throwable t
                    (log/warn t "Checkpointing failed, stopping checkpointing")
                    (throw t))
                  (finally
                    ;; FOLDER REMOVED HERE:
                    (when-not keep-dir-between-checkpoints?
                      (xio/delete-dir checkpoint-dir))))

So subsequent runs succeed. A workaround is to explicitly set the :checkpoint-dir because it isn't created automatically.

smitjh avatar Apr 12 '22 00:04 smitjh

Would a good solution here just to drop xio/create-tmpdir and just return a random Path (or string) from that or?

For instance we could just use the java.io.tmpdir property for getting the path to the OS temp and manually build the path, adding a random string to it like the original function does, similar to what this gist does.

arichiardi avatar Apr 19 '22 22:04 arichiardi

On more thought, we don't even need to have random number cause in case of LMDB the files within the dir will contain them.

arichiardi avatar Apr 19 '22 23:04 arichiardi

This is what I am using now internally and seems to work as checkpoint-dir quite fine:

(defn tmp-dir-path
  ([] (tmp-dir-path nil))
  ([subdir]
   (let [base-dir (System/getProperty "java.io.tmpdir")
         base-name (or subdir (str (System/currentTimeMillis) "-" (long (rand 1000000000))))]
     (io/file base-dir base-name))))

We don't clean it up at the moment on app shutdown, unlike yours.

arichiardi avatar Apr 20 '22 23:04 arichiardi

I think we are still using this workaround but somehow we see the error again in production

310a13ed139f 2023-10-08 00:53:48,075 WARN xtdb.checkpoint - Checkpointing failed
310a13ed139f xtdb.IllegalArgumentException: Directory exists: /tmp/xtdb/checkpointing
310a13ed139f   at xtdb.error$illegal_arg.invokeStatic(error.clj:12)
310a13ed139f   at xtdb.error$illegal_arg.invoke(error.clj:3)
310a13ed139f   at xtdb.error$illegal_arg.invokeStatic(error.clj:7)
310a13ed139f   at xtdb.error$illegal_arg.invoke(error.clj:3)
310a13ed139f   at xtdb.lmdb$env_copy.invokeStatic(lmdb.clj:130)
310a13ed139f   at xtdb.lmdb$env_copy.invokePrim(lmdb.clj)
310a13ed139f   at xtdb.lmdb.LMDBKv.save_checkpoint(lmdb.clj:295)
310a13ed139f   at xtdb.checkpoint$checkpoint.invokeStatic(checkpoint.clj:83)
310a13ed139f   at xtdb.checkpoint$checkpoint.invoke(checkpoint.clj:79)
310a13ed139f   at xtdb.checkpoint.ScheduledCheckpointer$run__125780$fn__125784.invoke(checkpoint.clj:127)

as I was doing this:

(assoc :checkpoint-tmp-dir (utils/tmp-dir-path "xtdb/checkpointing"))

@refset suggested it might be enough to just add a xio/delete-dir here?

arichiardi avatar Oct 11 '23 20:10 arichiardi