
LambdaCD does not start when persistent state is corrupted

Open abendt opened this issue 7 years ago • 5 comments

We use LambdaCD with file-based persistence. Sometimes during shutdown it seems that a persisted file gets corrupted. Afterwards LambdaCD does not start anymore:

Aug 27 12:29:58 tyr-ci-01 java[1878]: Exception in thread "main" java.lang.NumberFormatException: null
Aug 27 12:29:58 tyr-ci-01 java[1878]: at java.lang.Integer.parseInt(Integer.java:542)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at java.lang.Integer.parseInt(Integer.java:615)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.util.internal.sugar$parse_int.invokeStatic(sugar.clj:5)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.util.internal.sugar$parse_int.invoke(sugar.clj:4)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$build_number_from_path.invokeStatic(default_pipeline_state_persistence.clj:45)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$build_number_from_path.invoke(default_pipeline_state_persistence.clj:44)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_pipeline_structure_edn.invokeStatic(default_pipeline_state_persistence.clj:96)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_pipeline_structure_edn.invoke(default_pipeline_state_persistence.clj:95)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$map$fn__5587.invoke(core.clj:2747)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.LazySeq.sval(LazySeq.java:40)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.LazySeq.seq(LazySeq.java:49)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.Cons.next(Cons.java:39)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.RT.next(RT.java:706)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$next__5108.invokeStatic(core.clj:64)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7852.invokeStatic(protocols.clj:169)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7852.invoke(protocols.clj:124)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7807$G__7802__7816.invoke(protocols.clj:19)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7835.invokeStatic(protocols.clj:75)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7835.invoke(protocols.clj:75)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7781$G__7776__7794.invoke(protocols.clj:13)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$reduce.invokeStatic(core.clj:6748)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$into.invokeStatic(core.clj:6815)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$into.invoke(core.clj:6807)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_build_datas.invokeStatic(default_pipeline_state_persistence.clj:105)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_build_datas.invoke(default_pipeline_state_persistence.clj:101)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state$new_default_pipeline_state.invokeStatic(default_pipeline_state.clj:76)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state$new_default_pipeline_state.doInvoke(default_pipeline_state.clj:73)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.RestFn.invoke(RestFn.java:410)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.core$assemble_pipeline.invokeStatic(core.clj:42)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.core$assemble_pipeline.invoke(core.clj:37)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd_pipeline.wishlistui.wishlistui$wishlistui_pipeline.invokeStatic(wishlistui.clj:56)

We fixed the problem by deleting the workspace. Maybe there are ways this could be improved within LambdaCD, e.g. by ignoring a previous build when its file cannot be read?

abendt avatar Aug 27 '18 17:08 abendt

It's definitely possible to do, even though I usually prefer to fail fast as it's more explicit to the user: "oh, my state is corrupted" vs "hmm, somehow one of my builds disappeared".
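To illustrate the fail-fast side of that trade-off, here is a minimal Java sketch (LambdaCD itself is written in Clojure, so this is not its code). It assumes build directories are named `build-<number>`, as the stack trace suggests. The names `StartupCheck`, `CorruptStateException` and `parseBuildNumberOrFail` are hypothetical. The point is only that startup can still abort while naming the offending directory, instead of surfacing a bare `NumberFormatException: null`.

```java
// Hypothetical illustration, not LambdaCD's implementation.
final class StartupCheck {

    // Hypothetical exception type that signals unreadable persisted state.
    static final class CorruptStateException extends RuntimeException {
        CorruptStateException(String message) {
            super(message);
        }
    }

    // Returns the build number encoded in a directory name such as "build-31",
    // or fails with a message that names the directory that could not be read.
    static int parseBuildNumberOrFail(String dirName) {
        String prefix = "build-";
        if (dirName == null || !dirName.startsWith(prefix)) {
            throw new CorruptStateException("Not a build directory: " + dirName);
        }
        String suffix = dirName.substring(prefix.length());
        if (!suffix.matches("\\d+")) {
            throw new CorruptStateException(
                "Persistent state looks corrupted: '" + dirName + "' has no valid build number");
        }
        try {
            return Integer.parseInt(suffix);
        } catch (NumberFormatException e) {
            // Only reachable when the digits do not fit into an int.
            throw new CorruptStateException(
                "Persistent state looks corrupted: build number out of range in '" + dirName + "'");
        }
    }
}
```

With this shape, the user still sees a hard failure at startup, but the message points at the directory to inspect or delete.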

That aside, I'd like to understand how we got into this state in the first place. From looking at the stack trace and the code it looks like there were directories like build-something-thats-not-a-number in the home-directory and I'm wondering how they got there.

If this happens again, can you have a look into the home directory and post an ls?

flosell avatar Aug 28 '18 01:08 flosell

I just looked at the code a bit more and found it definitely inconsistent: it looked at all directories starting with build- but then expected a build number after the dash. That should be fixed now and is released in 0.14.2, so I'm closing this issue for now. If the problem reappears, feel free to reopen.
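The inconsistency described here (selecting entries by the `build-` prefix, then assuming the rest is a number) can be avoided by selecting and parsing with the same pattern. The following is a rough Java sketch of that idea, not the actual 0.14.2 change. `BuildDirNames`, `buildNumber` and `buildNumbers` are hypothetical names. For context, the `NumberFormatException: null` at the top of the trace is consistent with `Integer.parseInt` having been handed a null value on Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not LambdaCD's actual code.
final class BuildDirNames {

    // One pattern decides both "is this a build directory?" and "what is its number?".
    private static final Pattern BUILD_DIR = Pattern.compile("build-(\\d+)");

    // Returns the build number, or an empty OptionalInt if the name does not match.
    static OptionalInt buildNumber(String dirName) {
        if (dirName == null) {
            return OptionalInt.empty();
        }
        Matcher m = BUILD_DIR.matcher(dirName);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // Digits that do not fit into an int are treated as "not a build directory".
            return OptionalInt.empty();
        }
    }

    // Keeps only the entries that carry a valid build number, in input order.
    static List<Integer> buildNumbers(List<String> dirNames) {
        List<Integer> result = new ArrayList<>();
        for (String name : dirNames) {
            buildNumber(name).ifPresent(result::add);
        }
        return result;
    }
}
```

Because the number is taken from the same match that selected the entry, a name such as `build-something-thats-not-a-number` never reaches `Integer.parseInt`.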

I'd still be curious how such directories ended up there, so if you find out, please drop a note. Maybe there's another bug hiding somewhere.

flosell avatar Sep 01 '18 08:09 flosell

@flosell We just upgraded to 0.14.2. However, it does not seem to resolve the issue. Directory listing:

build-31 build-32 build-33 build-34 build-35 build-36 build-37 build-38 build-39 build-40 lambdacd730621214759903273 lambdacd-artifacts

abendt avatar Sep 05 '18 11:09 abendt

Hi @abendt, sorry for the late reply, was busy with a few other things lately.

I looked into your problem again but couldn't find a way to reproduce it or understand why it's happening. However, I refactored the code to make it easier to reason about and possibly more robust. I'll release this as 0.14.3; please check whether it fixes the problem. If it does, could you set logging to DEBUG level and post any messages that contain "doesn't seem to contain a valid build number"? Maybe we'll find out this way which files are responsible.
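As a rough picture of the "skip and log" behaviour this comment asks to observe, the Java sketch below drops entries without a valid build number and logs each one at debug level. It is an assumption about the general shape, not the 0.14.3 code. `LenientBuildScan` and `validBuildNumbers` are hypothetical names, and `java.util.logging`'s `FINE` level stands in for DEBUG.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical illustration, not LambdaCD's implementation.
final class LenientBuildScan {

    private static final Logger LOG = Logger.getLogger(LenientBuildScan.class.getName());
    private static final String PREFIX = "build-";

    // Returns the build numbers of all valid entries and logs every skipped entry,
    // so that the responsible files can be identified from the debug log.
    static List<Integer> validBuildNumbers(List<String> entries) {
        List<Integer> numbers = new ArrayList<>();
        for (String entry : entries) {
            if (entry != null && entry.matches("build-\\d+")) {
                try {
                    numbers.add(Integer.parseInt(entry.substring(PREFIX.length())));
                    continue;
                } catch (NumberFormatException e) {
                    // Number too large for an int: fall through and log it as skipped.
                }
            }
            LOG.fine(entry + " doesn't seem to contain a valid build number");
        }
        return numbers;
    }
}
```

Applied to the directory listing posted earlier in this thread, such a scan would keep build-31 through build-40 and log the two `lambdacd…` entries as skipped.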

flosell avatar Sep 22 '18 07:09 flosell

Will do. Thank you!

abendt avatar Sep 22 '18 08:09 abendt