Add checkpointing to metagraph build
As discussed via chat, there are multiple checkpoints, one after each major operation: generating k-mers, generating reverse complements, splitting into chunks, generating dummy-1 k-mers, concatenating chunks, and generating dummy-2..k k-mers.
The client interface exposes only two phases (sketched below):
- phase 1 builds everything up to and including the dummy k-mers
- phase 2 finishes the build
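For concreteness, driving such a two-phase build from a script could look like the sketch below. The `--phase` flag and the binary path are assumptions made for the illustration, not the actual metagraph CLI.

```python
import subprocess

METAGRAPH = "./metagraph"  # assumed path to the binary

def build_in_phases(fasta, output):
    # Phase 1: everything up to and including the dummy k-mers.
    # "--phase" is a hypothetical flag standing in for the real option.
    subprocess.run([METAGRAPH, "build", "--phase", "1", "-o", output, fasta],
                   check=True)
    # Phase 2: finish the build from the phase-1 checkpoint left on disk.
    subprocess.run([METAGRAPH, "build", "--phase", "2", "-o", output, fasta],
                   check=True)
```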
I tested by inserting exit statements in the code after each phase and making sure the build is successful on resume. I also added functional tests for building in phases.
Could you add a test that starts building one graph and kills the process, so that the build fails to run to the end; then does the same for a different graph; and then restarts building both graphs in parallel and checks that both graphs are correctly generated?
This way we check that there are no collisions and that multiple graphs can be correctly generated in parallel on the same machine.
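Spelled out, the proposed test might look roughly like the sketch below; the invocation, the kill timing, and the final verification are all assumptions for illustration.

```python
import signal
import subprocess
import time

METAGRAPH = "./metagraph"  # assumed path to the binary

def start_build(fasta, output):
    # Any full build invocation works here; the flags are illustrative.
    return subprocess.Popen([METAGRAPH, "build", "-o", output, fasta])

def test_resume_two_killed_builds_in_parallel(tmp_path):
    graphs = [("a.fa", str(tmp_path / "graph_a")),
              ("b.fa", str(tmp_path / "graph_b"))]
    # Kill each build mid-way, so that neither runs to the end.
    for fasta, output in graphs:
        proc = start_build(fasta, output)
        time.sleep(1.0)  # let it get past at least one checkpoint
        proc.send_signal(signal.SIGKILL)
        proc.wait()
    # Restart both builds in parallel on the same machine.
    procs = [start_build(fasta, output) for fasta, output in graphs]
    assert all(proc.wait() == 0 for proc in procs)
    # ...then check both graphs, e.g. against from-scratch reference builds.
```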
I can add a test that builds two graphs in parallel (using phases), but I won't kill the process, as that would make the test non-deterministic.
OK, I added a test that builds two graphs at the same time. First it builds phase 1 for both graphs, then it builds phase 2 for both and asserts that the graphs are correctly built.
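A hedged sketch of the shape of that test; the `--phase` flag and helper names are assumptions, not the real test code.

```python
import subprocess

METAGRAPH = "./metagraph"  # assumed path to the binary

def start_phase(phase, fasta, output):
    # "--phase" is a hypothetical flag standing in for the real option.
    return subprocess.Popen([METAGRAPH, "build", "--phase", str(phase),
                             "-o", output, fasta])

def test_parallel_phased_builds(tmp_path):
    graphs = [("a.fa", str(tmp_path / "graph_a")),
              ("b.fa", str(tmp_path / "graph_b"))]
    # Run each phase for both graphs concurrently on the same machine,
    # which exercises potential collisions between intermediate files.
    for phase in (1, 2):
        procs = [start_phase(phase, fasta, output) for fasta, output in graphs]
        assert all(proc.wait() == 0 for proc in procs)
    # ...then assert that both graphs are correctly built.
```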
It's better to have non-deterministic tests than no tests. The expected result is well determined: the second run of metagraph, started after a previous run was killed, must construct a valid graph. So it's a perfectly well-defined test, isn't it?
Flaky/non-deterministic tests are a pest. There is general agreement about this in the testing community; just google "flaky tests". Your argument that the expected result is well determined doesn't mean that the test is not flaky: all tests have a determined expected result. It's not the expected result that makes a test flaky, but the actual result. A test that sometimes passes and sometimes doesn't is harmful. It erodes the developers' trust in the testing system and often gets ignored with a "well, it works for me" shrug. Flaky tests are also extremely hard to debug, as they are not reproducible. There is a wealth of literature teaching developers how not to write non-deterministic tests, so writing a test that is intentionally non-deterministic really goes against the grain here.
If we were writing aviation software, where people would die if a smooth recovery were not always possible, then we'd figure out a way to run such tests (e.g. introduce exit points at well-defined places in the code). But in our case, recovering a failed computation is a best-effort thing, so if there is a hidden bug that makes the computation unrecoverable when the crash happens exactly at that point, well, then so be it.
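One way to get deterministic versions of such tests is exactly those exit points: a crash hook that fires at a named checkpoint. A minimal sketch of the idea (the hook and the environment variable are hypothetical names, not existing metagraph code):

```python
import os

def checkpoint_reached(name):
    # Deterministic fault injection: if the test environment asks us to
    # die at this named checkpoint, exit hard and skip all cleanup,
    # simulating a crash. METAGRAPH_DIE_AT is a hypothetical variable.
    if os.environ.get("METAGRAPH_DIE_AT") == name:
        os._exit(1)

# A test could run the build once with METAGRAPH_DIE_AT=concatenate_chunks,
# assert a non-zero exit code, rerun without the variable, and then check
# the resumed build. The failure point is now exactly reproducible.
```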
> A test that sometimes passes and sometimes not is harmful.

I agree that we should never write such tests. But I don't see what it has to do with our case. The second run of metagraph must always build the graph, so the test must always pass. This is what I mean by the deterministic expected result.
Can you explain why you think this would be "flaky"?
A flaky test is one that:
- passes most of the time, but fails under relatively rare conditions
- fails under conditions that are unknown and cannot be (easily) reproduced
Let's assume that there is the following completely plausible problem in my current code: if the execution of the program is stopped right after the chunks have finished merging, but before the buffer of reverse complements is flushed to disk, the checkpointing won't work, because the data is in an inconsistent state: the old chunks were deleted, and the new reverse complements are not yet flushed out. Depending on the size of the graph, the chance that we stop the program right at this point can be one in a thousand or one in a trillion. Let's say one in a thousand, since the test graphs are small. So we'd have a test that passes 999 times out of 1000 tries. That's a flaky test. Even when it fails, developers tend to ignore it, because it passes most of the time, so they just assume that something went wrong on the file system and move on. So the test brings no value. There is little incentive to fix it, as the cost is very high (hard to reproduce) and the benefit is very low (well, it passes most of the time anyway). At Google there are entire teams of test engineers whose role, among others, is to make sure that tests are reliable and either always pass or fail reproducibly.
In fact, I am almost convinced that although the current code will work in most cases when the computation dies (because I inserted random breakpoints in the code and tried it), there are still some rare circumstances under which the data is left inconsistent and the computation can't be resumed.
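The scenario above is exactly the kind of window that a write-then-rename ordering closes: never delete inputs before the new data is durably on disk. A generic sketch of the pattern (an illustration, not the actual metagraph code):

```python
import os

def write_checkpoint(data, final_path, old_chunks):
    # 1) Write the new data under a temporary name and force it to disk.
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # 2) Atomically rename it into place (atomic on POSIX filesystems).
    os.replace(tmp_path, final_path)
    # 3) Only now is it safe to delete the chunks the new file replaces:
    #    a crash before the rename leaves the old chunks intact, a crash
    #    after it leaves a complete new file, so a resumed run always
    #    sees a consistent state.
    for chunk in old_chunks:
        os.remove(chunk)
```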
I thought this could be used to resume crashed builds, so that there is a guarantee that any run which finishes with exit code 0 builds a valid graph.
But it sounds like this is not reliable enough to be used in such scenarios, and one should start from scratch if a run fails.
So maybe we should use this to split the build into stages, but not use this to resume crashed runs?
Or maybe it was not meant to work in these scenarios (resume a crashed build) in the first place?
In that case, this test, of course, doesn't make sense. But we should document somewhere that one should restart the build from scratch (checkpoint 0) if a previous run fails.
This discussion is getting surreal. That was just a thought experiment. To the best of my knowledge, any build that finishes with exit code 0 builds a valid graph. As you could see in the code review, I took extra care to make sure that we never leave the data in an inconsistent state and that we never delete data too early. What I am saying is that there is a very small chance that if the program crashes in a very specific way, we might have to restart the build. And that is acceptable. Even if out of 1000 crashed builds we can save 999, this code is pretty valuable.
To make an analogy, it is already the case that there is a small chance that we give the program an input and it crashes while building the graph. So `metagraph build` does suffer from some flaw that is on my list to investigate. Out of maybe 50,000 graphs built, there is one graph where it just crashes, and it's not because of insufficient memory. Should we just delete metagraph and build graphs by hand because of this? No, of course not. The same is true for this code. I simply don't understand why we are even having this discussion.
> To the best of my knowledge, any build that finishes with exit code 0 builds a valid graph.

Up to this moment, this has been true, and we do check it in the integration tests, by the way. However, if this PR introduces a feature for resuming a crashed run, where the build starts from some pre-generated output, it is no longer obvious that the result will be a valid graph even if the exit code is zero. The test I'm proposing is supposed to check exactly this (exit code 0 => the graph is correct).
> To make an analogy, it is already the case that there is a small chance that we give the program an input and it crashes while building the graph. So `metagraph build` does suffer from some flaw that is on my list to investigate. Out of maybe 50,000 graphs built, there is one graph where it just crashes, and it's not because of insufficient memory. Should we just delete metagraph and build graphs by hand because of this? No, of course not. The same is true for this code. I simply don't understand why we are even having this discussion.
I wasn't aware of this. That might be a bug, which we have to find and fix. But I don't think this example can be used as an analogy to our case; the two have nothing in common.
The thing I want to avoid is that we claim that a failed run can be resumed, then someone does this, and the second run of metagraph finishes with exit code 0, but the graph it constructs is not correct (has missing k-mers, or is invalid altogether). So if we claim that resuming is supported, we must be able to add the test. Otherwise we are saying that one can resume a failed run even though the result can sometimes be incorrect, with no way to check it except rebuilding the graph from scratch and comparing the two. I hope it's clear that this is not a good thing.
Just to clarify, I'm proposing to add a test that checks the generated graph only if the last run finishes with exit code 0. If the resumed run fails too, the test would still pass. So it's OK to have some states from which we can't continue, but we must be able to detect that case, and this is exactly what the test would check.
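In code, the proposed contract could be expressed as below; the helpers `kill_partial_build`, `resume_build`, and `graph_is_valid` are hypothetical stand-ins for real test utilities.

```python
def test_resumed_build_is_valid_or_fails(kill_partial_build, resume_build,
                                         graph_is_valid):
    # The fixtures are hypothetical: they would leave a partial build
    # behind, rerun metagraph from the checkpoint, and validate the
    # result against a from-scratch reference build.
    kill_partial_build()
    exit_code = resume_build()
    if exit_code == 0:
        # Exit code 0 must imply a correct graph (no missing k-mers).
        assert graph_is_valid()
    # A non-zero exit code also passes: some states may be unrecoverable,
    # as long as the failure is detected instead of silently producing
    # an invalid graph.
```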