holochain-rust icon indicating copy to clipboard operation
holochain-rust copied to clipboard

Remove instance drop impl

Open timotree3 opened this issue 5 years ago • 5 comments

PR summary

The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped.

Since we used a derived implementation of Clone and the data was held in an Arc, it was possible to exploit by calling .clone(), making a "shallow clone", and then drop one of the instances, leaving the other one invalid and causing a panic if it was used.

I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.

testing/benchmarking notes

A regression test was added.

changelog

Please check one of the following, relating to the CHANGELOG-UNRELEASED.md

  • [ ] this is a code change that effects some consumer (e.g. zome developers) of holochain core so it is added to the CHANGELOG-UNRELEASED.md (linked above), with the format - summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)
  • [x] this is not a code change, or doesn't effect anyone outside holochain core development

timotree3 avatar Oct 20 '19 17:10 timotree3

Thanks for reviewing this. :-) Let me do my best to explain my thinking about removing the Drop implementation for Instance. Sorry... I should've given a PR summary that explained this better.

The reason I deleted that part of the code is because I only see two possible things it could do, both of which are undesirable:

  • Case 1: That instance is the only remaining holder of a reference to the State, the manual drop doesn't matter because after the instance drops its fields, it will call the Arc drop implementation which drops the State if it is the last reference left.
    • Outcome: the code didn't matter because the data it manually drops would've been dropped anyway.
  • Case 2: Other parts of the code still hold references to that State, (presumably because those parts of the code intend to read from those references), after we replace the valid state with a sentinel value that causes a panic when accessed (i.e. StateWrapper(None)), we are sure to cause a panic! after it is read from.
    • Outcome: the code panic!s

I admit that I'm assuming two things:

  • That panic!s are undesirable and should only be the result of a broken invariant that the programmer didn't foresee.
  • That parts of the code would drop an Arc that they don't intend to use.

I also admit that the above explanation lied about something. The Drop impl does do more than drop the inner State. It has three lines of code:

https://github.com/holochain/holochain-rust/blob/6ea8fa63170728a6170f8c26d56638386e2308bf/crates/core/src/instance.rs#L338-L346

The first line calls a function that returns a future. Since futures are evaluated lazily, it does nothing.

The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that.

The third line is the one I just explained why I think is misguided.


Now let me show an example of where I think the second line is harmful (taken from the regression test added in 6ea8fa6):

#[test]
pub fn can_ping_instance() {
    let instance = test_instance_blank();
    instance.action_channel().send(ActionWrapper::new(Action::Ping)).unwrap();
}

#[test]
pub fn can_clone_instance() {
    let instance = test_instance_blank();
    {
        let _instance2 = instance.clone();
    }
    instance.action_channel().send(ActionWrapper::new(Action::Ping)).unwrap();
}

Intuitively, I would expect those two cases to be equivalent, after all, cloning something shouldn't have side effects, Sure enough, if you ran these two tests you'd probably see them both pass. The troubling thing is that the second one doesn't always pass! It contains a race-condition, and actually fails some portion of the time.

Here's what happens:

  • We call Clone on an Instance, which since it's derived, forwards to the impls given by Instance's fields, all of which shallowly clone. In other words, we now have a second Instance object that still points to all the same data that the first one did.
  • The scope ends and we drop _instance2.
  • We call the Drop impl for Instance.
  • We send a kill-signal to the shared action-listener thread. It will receive this in between 0-1000 ms.
  • We send an action using the Sender owned by instance1.
  • If the thread was killed in time
    • The channel is disconnected and we panic!.
  • Otherwise
    • The test passes.

I believe that this bug occurs because of a conflict of assumptions:

  • That it is valid to implement Clone for Instance because they don't logically-own the thread and the state.
  • That it is valid to kill the thread and valid state when an Instance is dropped because they do logically-own the state.

I checked, and this Clone implementation is used in one spot in the code. I didn't look into it very deeply, but when I said:

I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.

that is what I meant.


The main change of this PR seems to be the removal of StateWrapper. I would expect that to re-instantiate the memory leak that got fixed by introducing it.

Why would there have been a memory leak? Arcs drop their contents. Was there an Arc cycle somewhere in the code?

I would guess that you were led down this path after seeing thread 'store_entry_content/puid-f-42' panicked at 'Tried to use dropped state', src/libcore/option.rs:1065:5 in the failing tests of #1701 - but that is a red herring!

Actually, no. I was working on an Instance refactor and I noticed that this Drop implementation combined with Clone was bound for failure.

So it is an intentional panic.

If that's true, then one of my assumptions above is wrong. Are we actually using panics for something other than fatal unforeseen logic errors?

Let me know if I'm making some mistake in my logic here.

timotree3 avatar Oct 22 '19 23:10 timotree3

I haven't tested your added tests yet, but some of the CI tests are failing with e.g. with the below. Also the branch is out-of-date. However, I guess you may want to wait for review e.g. about:

The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that.

https://circleci.com/gh/holochain/holochain-rust/43055?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

#    Compiling jsonrpc-http-server v14.0.1 (https://github.com/holochain/jsonrpc?branch=broadcaster-getter#3a043aba)
# error[E0592]: duplicate definitions with name `broadcaster`
#    --> /root/project/.cargo/git/checkouts/jsonrpc-b3276f041818a130/3a043ab/ws/src/server.rs:42:2
#     |
# 42  |       pub fn broadcaster(&self) -> Broadcaster {
#     |  _____^
# 43  | |         Broadcaster {
# 44  | |             broadcaster: self.broadcaster.clone(),
# 45  | |         }
# 46  | |     }
#     | |_____^ duplicate definitions for `broadcaster`
# ...
# 148 |       pub fn broadcaster(&self) -> ws::Sender {
#     |  _____-
# 149 | |         self.broadcaster.clone()
# 150 | |     }
#     | |_____- other definition for `broadcaster`
# 
# error: aborting due to previous error
# 
# For more information about this error, try `rustc --explain E0592`.
# error: Could not compile `jsonrpc-ws-server`.
# warning: build failed, waiting for other jobs to finish...
# error: failed to compile `hc v0.0.32-alpha2 (/root/project/crates/cli)`, intermediate artifacts can be found at `/root/project/target`

https://circleci.com/gh/holochain/holochain-rust/43047?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link:

> wat2wasm /tmp/holochain/target/wasm32-unknown-unknown/release/summer.wat -o /tmp/holochain/target/wasm32-unknown-unknown/release/summer.wasm
Created DNA package file at "dist/app_spec.dna.json"
DNA hash: QmPELfsnYiJnBxPbVQZdwCM98Rqr1j9ymXXNCAoTiEjbsv
Spawning conductor0 process...
Conductor0 process spawning successful
events.js:170
      throw er; // Unhandled 'error' event
      ^

Error: spawn ./.cargo/bin/holochain ENOENT
    at Process.ChildProcess._handle.onexit (internal/child_process.js:247:19)
    at onErrorNT (internal/child_process.js:429:16)
    at processTicksAndRejections (internal/process/task_queues.js:81:17)
    at process.runNextTicks [as _tickCallback] (internal/process/task_queues.js:56:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:880:11)
    at internal/main/run_main_module.js:21:11
Emitted 'error' event at:
    at Process.ChildProcess._handle.onexit (internal/child_process.js:253:12)
    at onErrorNT (internal/child_process.js:429:16)
    [... lines matching original stack trace ...]
    at internal/main/run_main_module.js:21:11
Exited with code 1

jamesray1 avatar Oct 31 '19 05:10 jamesray1

Neither of those things seem to have anything to do with my change. Maybe a conflict was mis-merged? That would explain the duplicate function. I'll rebase my branch and get rid of these merge commits.

timotree3 avatar Oct 31 '19 14:10 timotree3

Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop? See #1823.

jamesray1 avatar Nov 01 '19 02:11 jamesray1

Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop?

I believe that you just got very lucky. The test that was added in that commit can still spuriously pass, it tries a data-racy thing 100 times so if they all spuriously succeed then it will succeed as well.

I just ran them on my computer and they failed which means that they can fail but by chance it passed that time. If you want to check surefire is this is an issue, I suggest you try running the updated tests created in 35919a0. They don't pass spuriously.

timotree3 avatar Nov 04 '19 17:11 timotree3