dora icon indicating copy to clipboard operation
dora copied to clipboard

Add support for node integration testing

Open phil-opp opened this issue 4 months ago • 1 comments

How it works

  • Set the DORA_TEST_WITH_INPUTS env variable with the path to your input JSON file
  • (Optional) Set the DORA_TEST_WRITE_OUTPUTS_TO env variable with the path where the outputs should be written. If not set, dora will write a outputs.jsonl file next to the given inputs file
  • Start the node executable/script

The node will be run as usual, but its event channel will be filled from the given inputs JSON file. No connection to a dora daemon will be made.

Input JSON file format example:

{
     // ID of the node
    "id": "foo",
    // defines the events that the node should receive
    "events": [
        {
            // specifies when the event arrives (seconds since node start)
            "time_offset_secs": 0.7,
            // type of the event (supported types are `Input`, `Stop`, `InputClosed`, `AllInputsClosed`
            "type": "Input",
            // input ID
            "id": "tick"
            // optional: `data` field with input data
        },
        {
            "time_offset_secs": 0.9,
            "type": "Input",
            "id": "tick"
        },
        {
            "time_offset_secs": 1.2,
            "type": "Stop"
        }
    ]
    // other supported fields: name, description, args, env, outputs, inputs, send_stdout_as (they all behave like in dataflow.yaml)
}

Output JSON file format example:

{"id":"random","data":9267023440904143729,"time_offset_secs":0.700793541}
{"id":"random","data":5753749540645363621,"time_offset_secs":0.900897584}

TODO:

  • [ ] Documentation
    • [ ] API docs
    • [ ] dora-rs.ai docs
  • [ ] add some tests for our examples and use them on our CI
  • [x] take a look at arrow_integration_test JSON format -> this might be better suitable than our custom input json format
  • [ ] add option (via env variable) to write out received inputs as inputs.json files during normal dataflow operation -> to make creating files with complex input data easier
  • [x] add option (via env variable) to omit time offsets in output formats -> to make them diff-able with expected outputs (the time offsets are a bit different on each run)
  • [x] ~use ArrowTestUnwrap format for outputs.jsonl~ (not possible because of https://github.com/apache/arrow-rs/issues/8684)
    • [x] instead: Include data type in output JSON file

phil-opp avatar Oct 15 '25 15:10 phil-opp

I opened https://github.com/apache/arrow-rs/pull/8737 to add support for binary decoding to arrow-json.

phil-opp avatar Oct 29 '25 14:10 phil-opp

Could you add an example of a node been integration tested?

haixuanTao avatar Nov 20 '25 03:11 haixuanTao

There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd

phil-opp avatar Nov 20 '25 09:11 phil-opp

I also added detailed docs in https://github.com/dora-rs/dora/pull/1163/commits/618689d7cdc6f42eabda82da90fcf6b0a24f1a6a

phil-opp avatar Nov 20 '25 10:11 phil-opp

There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd

I find the test example very difficult to decipher. Could we make them easier to understand?

haixuanTao avatar Nov 20 '25 11:11 haixuanTao

Also could you put somewhere within the test folder the way you generated the test data?

haixuanTao avatar Nov 20 '25 11:11 haixuanTao

There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd

I find the test example very difficult to decipher. Could we make them easier to understand?

Which part do you find hard to decipher?

We just call two commands for each test. One build command and one run command that sets some env variables. So basically:

  • cargo build -p rust-dataflow-example-node
  • DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 DORA_TEST_WRITE_OUTPUTS_TO=<tempfile> target/debug/rust-dataflow-example-node

To make this extensible, I introduced a test function that takes the variable parameters as input.

The other code in that file is necessary to account for the differences between Windows and Linux (e.g. exe extension, line endings) and make it work with the CARGO_TARGET_DIR env variable that we set on CI for Windows.

I'm not sure how we can simplify this more?

phil-opp avatar Nov 20 '25 11:11 phil-opp

Also could you put somewhere within the test folder the way you generated the test data?

Done in 554ae3a8

phil-opp avatar Nov 20 '25 11:11 phil-opp

There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd

I find the test example very difficult to decipher. Could we make them easier to understand?

Which part do you find hard to decipher?

We just call two commands for each test. One build command and one run command that sets some env variables. So basically:

  • cargo build -p rust-dataflow-example-node
  • DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 DORA_TEST_WRITE_OUTPUTS_TO=<tempfile> target/debug/rust-dataflow-example-node

To make this extensible, I introduced a test function that takes the variable parameters as input.

The other code in that file is necessary to account for the differences between Windows and Linux (e.g. exe extension, line endings) and make it work with the CARGO_TARGET_DIR env variable that we set on CI for Windows.

I'm not sure how we can simplify this more?

I don't think that as a user of dora, I would be able to reproduce the node test on my own on my own codebase.

Would it be possible to make node integration test be something like within the main.rs of the node:

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_integration() {
        let input_json = ...
        let output_json = ...
        
        // Maybe add them to env variable 
        // ... 

        let result = main();

        assert_eq! ...
  }
}

So that it's easier to read but also use native testing that can be used with: cargo test -p rust-dataflow-example-node

P.S: Edited for better understanding

haixuanTao avatar Nov 21 '25 01:11 haixuanTao

I don't think that there is a way around using Command to spawn a subprocess. Setting an env variable for the current process is an unsafe operation and I don't think that we should encourage users to use unsafe in tests. Also, the main function does not exists when building a crate in test mode (it is replaced by a test handling main function).

What we can do is using cargo run instead of build + manual run. This way we can also avoid the manual exe extension handling and specifying the target/debug/x path.

Moving the test to the respective crate is also possible of course, we just have to duplicate the test function then. But I agree that it makes it easier to copy a test for your own node.

I look into it, thanks for the suggestion!

phil-opp avatar Nov 21 '25 06:11 phil-opp

I mean I genuinely think that using cargo test is the optimal approach as it's well integrated and native within the rust ecosystem.

I find using cargo run to be less readable and harder to use overall as we can see within the current run.rs file.

Would there be a way to pass mock input and output data into the node whether it's using compile time env variable or something like like https://doc.rust-lang.org/cargo/reference/config.html#env, or configuration within the cargo or basically just have a predefined mock_input.json mock_output.json name that we can use?

p.s: Edit env variable to compile time env variable

haixuanTao avatar Nov 21 '25 06:11 haixuanTao

I mean I genuinely think that using cargo test is the optimal approach as it's well integrated and native within the rust ecosystem.

We're using cargo test, we just need to invoke cargo run within it.

The difference to standard cargo tests is that we want to test a binary, not a library.

Would there be a way to pass mock input and output data into the node whether it's using compile time env variable or something like like https://doc.rust-lang.org/cargo/reference/config.html#env, or configuration within the cargo or basically just have a predefined mock_input.json mock_output.json name that we can use?

One of the goals of this PR is to make every node integration testable, without recompiling it. This way, you can test the actual main function and it will behave exactly as during a dataflow run.

We can of course also add some DoraNode::init_with_test_inputs function, but this way the main function is not part of the test anymore.

phil-opp avatar Nov 21 '25 07:11 phil-opp

We're using cargo test, we just need to invoke cargo run within it.

The difference to standard cargo tests is that we want to test a binary, not a library.

But wouldn't this be way harder to grok, more error prone and harder to develop with than just using a test as a library?

Especially for people beginner with rust and cargo.

One of the goals of this PR is to make every node integration testable, without recompiling it. This way, you can test the actual main function and it will behave exactly as during a dataflow run.

Yeah, but can't you just call the whole main() function within a main.rs test section? Just like any other test in rust?

haixuanTao avatar Nov 21 '25 08:11 haixuanTao

I pushed 05eed12 to simplify the test run function to use cargo run.

I also added a DoraNode::init_testing function in 6e138e3, which can be used to write library-style tests. It returns a custom DoraNode and EventStream, so you cannot run the main function directly. However, you can move most of the main function to a run(node, event_stream) function and then call that.

I used the new init_testing feature in d274190 to add some internal testing to the rust-dataflow-example-node crate. I also added a self-contained test_sample_output function that shows how to write test cases with loading/writing external files.

phil-opp avatar Nov 21 '25 10:11 phil-opp

I'm not against the proposed changes but this seems to requires user to learn about init_testing that does not seems to be easy to understand.

I think that what we're trying to do is really close to what cargo nextest is doing natively: https://github.com/nextest-rs/nextest?tab=readme-ov-file . Especially the section on: https://nexte.st/docs/configuration/env-vars/#altering-the-environment-within-tests

I would be in favor of instead of rewriting our test use nextest with a small bin test section that is native in rust:

// test main
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_status_node_main_1() {
        set_var(DORA_TEST_WITH_INPUTS_1); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
    
    #[test]
    fn test_status_node_main_2() {
        set_var(DORA_TEST_WITH_INPUTS_2); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
}

And then:

cargo nextest run -p rust-dataflow-example-node

haixuanTao avatar Nov 21 '25 13:11 haixuanTao

So you think it's easier for users to learn about and understand a third-party testing tool than to call a function? This seems unlikely to me.

In general, one goal of Dora is support the standard build and test frameworks for each language, so that users don't need to learn extra tools. Removing support for cargo test would be directly against that goal.

I also don't understand how set_var can become safe, just by using cargo nextest. As far as I understand it, nextest is just an alternative test runner that is compatible with standard Rust tests. So the test functions would still need to call the unsafe std::env::set_var function, right?

The docs you linked just seems to say that set_var won't invoke undefined behavior under the right circumstances, if using cargo nextest. That seems like a very weak guarantee. To me it sounds like there is still undefined behavior if you accidentally run cargo test instead of cargo nextest, which seems quite dangerous.

phil-opp avatar Nov 21 '25 15:11 phil-opp

I pushed another improvement in 68723bbc. There is now a setup_integration_testing function that will store the testing config in thread local data. The init_from_env function checks for this thread-local data and enters integration testing mode if set.

This means that you can now test the main function like this:

#[test]
fn test_main_function() -> eyre::Result<()> {
    let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
        "../../../tests/sample-inputs/inputs-rust-node.json".into(),
    );
    let mut output_file = Arc::new(tempfile::tempfile()?);
    let testing_output =
        dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
    let options = TestingOptions {
        skip_output_time_offsets: true,
    };

    integration_testing::setup_integration_testing(inputs, testing_output, options);

    crate::main()?;

    let mut output = String::new();
    output_file.seek(std::io::SeekFrom::Start(0))?;
    output_file.read_to_string(&mut output)?;
    let expected =
        std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;

    assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings

    Ok(())
}

phil-opp avatar Nov 21 '25 15:11 phil-opp

I pushed another improvement in 68723bb. There is now a setup_integration_testing function that will store the testing config in thread local data. The init_from_env function checks for this thread-local data and enters integration testing mode if set.

This means that you can now test the main function like this:

#[test]
fn test_main_function() -> eyre::Result<()> {
    let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
        "../../../tests/sample-inputs/inputs-rust-node.json".into(),
    );
    let mut output_file = Arc::new(tempfile::tempfile()?);
    let testing_output =
        dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
    let options = TestingOptions {
        skip_output_time_offsets: true,
    };

    integration_testing::setup_integration_testing(inputs, testing_output, options);

    crate::main()?;

    let mut output = String::new();
    output_file.seek(std::io::SeekFrom::Start(0))?;
    output_file.read_to_string(&mut output)?;
    let expected =
        std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;

    assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings

    Ok(())
}

ok this works for me.

haixuanTao avatar Nov 24 '25 09:11 haixuanTao