
[Enhancement] Configure unit tests to run during CI action and set a goal for test coverage

Open weklund opened this issue 8 months ago • 19 comments

Describe Enhancement

Improve the reliability of IBind by revamping our current test suite, running those tests in CI, and collecting the code coverage so we can work towards a goal.

Things I'd like to cover and align with folks on in subsequent comments in this issue:

  1. Establish Clear Test Patterns. Implement a standard pattern for our unit and integration tests.

  2. Improve Test Coverage. Add tests for components we don't cover yet, so that core client logic (ibind/base, ibind/client) and ibind/oauth have robust unit and integration tests.

  3. Validate current tooling. Should we stick with unittest, or adopt pytest? Don't want to make a hasty decision here so happy to dive into it.

  4. Automate Testing in CI. Configure the GitHub Actions workflow to automatically run all tests in our PRs. This will give us better confidence that IBind is extensively tested, with the test suites running against all the Python versions we support and both OSs.

  5. Implement Code Coverage Tracking. Integrate with Codecov.io (or a similar service) to monitor test coverage and identify untested code paths. Align on a coverage goal to slowly strive for.

Why?

  • Catch bugs earlier in the development cycle.
  • Provide greater assurance that changes don't break existing functionality.
  • Speed up development, since unit tests provide faster feedback.
  • Enhance maintainability: well-structured tests are easier to understand, modify, and use to debug IBind's implementation.
  • Help contributors: clear testing practices and automated checks lower the barrier for community contributions.

Will post specific ideas for each in the comments below.

weklund avatar Apr 11 '25 00:04 weklund

  1. Testing Principles:

1.a Have a test isolate a single unit. A unit test must focus on a single, small piece of code (a function, a method, or sometimes a small class) in complete isolation. Tests should verify that, given specific inputs, the function produces the expected output. We don't need to worry about external systems interfering.

1.b Structure for Clarity (Arrange-Act-Assert). Use comments to clearly delineate the Arrange, Act, and Assert phases within each test, making the test flow easy to follow.

1.c Use Descriptive Naming. Focus on the behavior being tested, not the underlying implementation.

def test_ws_client_on_error_logs_error_message_and_attempts_reconnect():
    # ... test implementation ...
    pass

The name clearly states the method (on_error), the condition/trigger (an error occurring), and the expected outcome (logs_error_message_and_attempts_reconnect).

Now we can also have the discussion on test name format. I prefer trying to use natural language as much as possible, so I would personally name that test:

class WsClientTests:
    def it_should_log_error_message_and_attempt_reconnect_when_error(self):
        # ... test implementation ...
        pass

Some of the names could get verbose, but overall I think it makes it easier to grok what the test is doing. Here's some test names from my personal project for reference.

test/test_main.py::it_should_log_an_error_when_parsing_fails PASSED                                  [ 33%]
test/test_main.py::it_should_determine_fixed_actions_correctly[Band Change-CLOSE] PASSED             [ 35%]
test/test_main.py::it_should_determine_fixed_actions_correctly[Price Crossing XL-CLOSE] PASSED       [ 36%]
test/test_main.py::it_should_determine_enter_action_for_new_trade_design PASSED                      [ 38%]
test/test_main.py::it_should_raise_value_error_for_unknown_action PASSED                             [ 40%]
test/test_main.py::it_should_return_healthy_status_when_dependencies_are_ok PASSED                   [ 41%]
test/test_main.py::it_should_have_correct_cors_headers PASSED                                        [ 43%]
test/test_main.py::it_should_return_degraded_status_when_tickle_data_fails PASSED                    [ 45%]
test/test_main.py::it_should_process_open_action_successfully PASSED                                 [ 46%]
test/test_main.py::it_should_process_close_action_successfully PASSED                                [ 48%]
test/test_main.py::it_should_handle_no_signal_from_analysis PASSED                                   [ 50%]
test/test_main.py::it_should_handle_analysis_exception PASSED                                        [ 51%]
test/test_main.py::it_should_handle_execution_exception PASSED                                       [ 53%]
test/test_main.py::it_should_handle_parsing_exception_explicitly PASSED                              [ 55%]
test/utils/logging/test_cloudwatch_logger.py::it_should_initialize_cloudwatch_logger_successfully PASSED [ 56%]

Now I'm just posing an option here for test name guidance. I'd be interested to get thoughts here, but I'm happy to stick with the 'test_*' pattern I see in the current tests.

1.d Keep Tests Small and Focused

Have each test function assert over one target outcome. Break up the various paths a function can take into separate tests, one per path.

Here's a contrived example of what applying these principles could look like.

def it_should_process_close_action_successfully(mock_run_analysis, test_client):
    """Test the webhook endpoint for a 'CLOSE' action (e.g., Exit Signal)."""
    # Arrange
    payload = "MES! Exit Signal, Price = 4500.75"

    # Act
    response = test_client.post(
        "/webhook",
        content=payload,
        headers={"Content-Type": "text/plain"},
    )

    # Assert
    assert response.status_code == 200
    data = response.json()
    assert data["status"] == "received_and_processed"
    details = data["details"]
    assert details["status"] == "ignored"  # Check ignored status for CLOSE action
    assert "Closing logic pending" in details["message"]

    # Ensure analysis was NOT called for CLOSE action
    mock_run_analysis.assert_not_called()


def it_should_handle_parsing_exception_explicitly(mock_parse, test_client):
    """Test the webhook endpoint specifically for parsing exceptions."""
    # Arrange
    payload = "Invalid Payload"

    # Act
    response = test_client.post(
        "/webhook",
        content=payload,
        headers={"Content-Type": "text/plain"},
    )

    # Assert
    assert response.status_code == 400
    data = response.json()
    assert "detail" in data
    assert "Failed to parse webhook payload" in data["detail"]
    assert "Parsing error" in data["detail"]

Let me know what you think! 😁

weklund avatar Apr 11 '25 02:04 weklund

Hey @weklund great suggestions all across the board 👏 Let me address them:

Establish Clear Test Patterns. Implement a standard pattern for our unit and integration tests.

As in to distinguish between the unit and integration tests? To avoid ambiguity? Or is this something else you're talking about?

Improve Test Coverage

+1👍✨

Validate current tooling. Should we stick with unittest, or adopt pytest? Don't want to make a hasty decision here so happy to dive into it.

I'm fine with unittest, although open to adopting pytest, which is meant to be easier and better. I've never seen a sufficiently ground-breaking argument to switch, but I'm not against it. Given that IBind's tests are already written in unittest, switching would mean having to rewrite all of them first. In some cases this would be trivial; in others - such as the websocket asynchronicity tests - it would be tricky. We should discuss whether there's anything in pytest that would improve over unittest substantially enough to justify the rewrite. Because having two test systems wouldn't make sense, would it?

Automate Testing in CI. Configure the GitHub Actions workflow to automatically run all tests in our PRs.

Sure, that shouldn't be too hard with the setup you've implemented, right?

Integrate with Codecov.io (or a similar service) to monitor test coverage and identify untested code paths.

Sure. I run it with internal coverage.py and we're currently at 91% files, 76% lines.

I feel it could be reasonable to hold off on publishing these on GH until we can get these numbers higher, though - thoughts?

1.a Have a test isolate a single unit A unit test must focus on a single, small piece of code (a function, a method, or sometimes a small class) in complete isolation. Tests should verify that given specific inputs, the function produces the expected output. We don't need to worry about external systems interfering.

That sounds about right. Maybe 'a small class' would already sound like an integration test to me, unless 'small' means 'has only one method'.

AAA: The comments clearly delineate the Arrange, Act, and Assert phases within each test, making the test flow easy to follow.

Sure. I personally rarely find these comments useful when reading others' tests, as it's quite self-explanatory after a quick glance which parts of the test do which job, but I'm not against it if it would make your life easier.

1.c Use Descriptive Naming Focus on the behavior being tested, not the underlying implementation.

Good suggestion 👍 I've never paid too much attention to test naming, so this is definitely an accurate observation and a place where I could improve.

Now we can also have the discussion on test name format. I prefer trying to use natural language as much as possible

Two points here:

  • if we stick with unittest, I think the test names need to start with the word 'test'
  • if all tests start with it_should, it sounds like we could just ditch it and call the tests with that it_should in our hearts, i.e.:
test/test_main.py::log_an_error_when_parsing_fails PASSED                                  [ 33%]
test/test_main.py::determine_fixed_actions_correctly[Band Change-CLOSE] PASSED             [ 35%]
test/test_main.py::determine_fixed_actions_correctly[Price Crossing XL-CLOSE] PASSED       [ 36%]
test/test_main.py::determine_enter_action_for_new_trade_design PASSED                      [ 38%]
test/test_main.py::raise_value_error_for_unknown_action PASSED                             [ 40%]
test/test_main.py::return_healthy_status_when_dependencies_are_ok PASSED                   [ 41%]

Keep Tests Small and Focused

In general, yes, 100% agreed.

Two asterisks:

  1. 'small and focused' is a relative term. A 'small and focused' unit test will have a different length than a 'small and focused' integration test.

  2. The following isn't really to try to convince you otherwise, more to bring up some experience in this domain: in practice I find that there are a few cases where not keeping each test small and focused is better long term, namely when there are tests that have a lot of long set-up and assert code that is equivalent to that of other tests.

    This leads to large chunks of essentially duplicated code (or an additional setup system within the test to avoid duplicates), which makes tests less maintainable and less likely to be updated when the codebase changes. These are edge cases, but I've seen areas where 'keeping tests small and focused' translated to writing 4-5 tests, each 50+ lines of code, which were equivalent save for two lines. Merging these into a single test with multiple focus points helps improve maintainability, despite disobeying this guideline.

    So, my suggestion would be to use this guideline, but be open to sparsely accepting cases where one test has a slightly broader focus, if it can be shown that sticking to the guideline would bloat the test file substantially.


Overarching this whole topic - where should we crystallise these decisions? TESTING.md?


On a parallel note: we should discuss which tests to write. I used to cover the whole codebase with unit tests in my projects, and add only a few integration/e2e tests. I've learned that this is not the most useful way to cover a project with tests. In my experience, having a lot more integration tests and fewer unit tests makes a lot more sense, as integration tests are:

  • easier to write - less setup, less mocking
  • easier to write correctly - less mocking means fewer places to get something wrong, where a bad mock could silently invalidate the entire test
  • easier to maintain - the underlying assumption that integration tests run over several units of code means there are likely going to be fewer of them
  • more meaningful - testing actual interactions between units, rather than made-up isolated scenarios, means it's more likely to catch real bugs. I.e. integration tests are more likely to fail than unit tests (which is good!)
  • it's harder to write useless integration tests than unit tests, where by 'useless' I mean tests that raise code coverage but don't meaningfully contribute to verifying the business logic and catching bugs

On the flip side:

  • they're harder to debug when they fail, since more code needs to be analysed
  • they're harder to generate using currently available AI tools (this comment will age like fine wine)

To me unit tests make more sense when there is a crucial function that will be used in multiple places. In practice these usually are functions that do some kind of math or universally-useful logic. A good example would be the py_utils functions in IBind: ensure_list_arg, execute_in_parallel and wait_until.

Happy to hear your thoughts on this 👍

Voyz avatar Apr 16 '25 11:04 Voyz

I'm coming back to this! I should have lots more capacity starting next week :)

weklund avatar Apr 28 '25 14:04 weklund

Alright. Let's dig in :) - Sorry for the delay; you had a really thoughtful and detailed response, so I wanted to give it the time and attention it deserves.

re: finding the balance between unit and integration tests:

I completely agree with your points on the value of integration tests: they are key for verifying that components work together correctly, catch bugs at interaction points, and often mirror real-world usage more closely. They're absolutely essential for testing correct behavior.

I actually think we need many more unit tests :) Here are a few thoughts:

  1. Precision and Isolation: When a well-written unit test fails, it points directly to the specific piece of logic that's broken. This dramatically speeds up debugging compared to an integration test where the failure could be in one of many interacting parts. For instance, if we had a subtle off-by-one error in a calculation within py_utils, a unit test for that specific function would catch it immediately.

  2. Comprehensive Edge Case Coverage: Unit tests excel at covering the myriad of edge cases, boundary conditions, and specific failure scenarios for individual functions or methods (e.g., empty inputs, nulls, invalid types, expected exceptions). It's often complex and cumbersome to set up all these specific states through an integration test. Think about the ensure_list_arg function – we'd want unit tests for when it receives None, an empty list, a single item, a list, etc., and with keyword vs. positional arguments (see the sketch after this list). Replicating all those specific preconditions in an integration test for every component that uses ensure_list_arg would be inefficient.

  3. Refactoring Safety Net: A strong suite of unit tests provides a tight safety net when refactoring. If you change the internals of a function but its unit tests still pass, you have high confidence you haven't broken its core contract - assuming you have asserted over the correct function behavior and not its underlying implementation. This is harder to guarantee with integration tests alone.

  4. Design Driver: Often, the process of writing unit tests first (or alongside development) encourages better, more modular design in the production code itself. If a piece of code is hard to unit test, it's often a sign it's doing too much or has too many tight dependencies.

  5. Coverage Gaps: You mentioned 76% line coverage. While file coverage is high, that 24% of un-exercised lines often hides in these specific edge cases or error handling paths that unit tests are perfectly suited to target. Boosting this line coverage will likely come from more focused unit tests.
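
To make that concrete, here's a rough sketch of the kind of edge-case tests I mean for ensure_list_arg. I'm assuming the decorator wraps a non-list value passed for the named parameter into a one-element list; the dummy collect function and expected values are purely illustrative, so treat this as a sketch rather than a spec:

from ibind.support.py_utils import ensure_list_arg


@ensure_list_arg('items')
def collect(items):
    # Dummy target function, used only to exercise the decorator.
    return items


def test_ensure_list_arg_wraps_a_single_positional_value():
    assert collect('AAPL') == ['AAPL']


def test_ensure_list_arg_wraps_a_single_keyword_value():
    assert collect(items='AAPL') == ['AAPL']


def test_ensure_list_arg_leaves_a_list_unchanged():
    assert collect(['AAPL', 'MSFT']) == ['AAPL', 'MSFT']


def test_ensure_list_arg_handles_an_empty_list():
    assert collect([]) == []


# What collect(None) should return (None, [] or [None]) is exactly the kind of
# behaviour these edge-case tests would pin down against the implementation.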

Now I really did not want to use AI for this illustration, but I hope it proves the point (some of these are pretty pedantic) that we can use more unit tests to ensure our code is robust, handles failures and edge cases, and gives us confidence if we ever need to refactor:


Test Coverage Analysis for py_utils.py

The current test suite (test_py_utils_u.py) provides partial coverage for ibind/support/py_utils.py. While some core functionalities are tested, a significant number of utility functions and specific scenarios within covered functions remain untested, many of which are currently marked with # pragma: no cover.

Covered (at least partially):

  1. ensure_list_arg:
    • Tested with list and non-list inputs for both positional and keyword arguments.
    • Tested the TypeError when a required argument is missing (this primarily tests the underlying function's requirement, as ensure_list_arg doesn't make arguments optional).
  2. execute_in_parallel:
    • Tested with both dict and list for the requests parameter.
    • Tested basic functionality (correct results, function call count).
    • Tested rate-limiting behavior.
  3. execute_with_key:
    • Tested successful execution.
    • Tested exception handling (ensuring it returns the exception).
  4. wait_until:
    • Tested when the condition is met immediately.
    • Tested when the condition is not met (timeout).
    • Tested logging of timeout_message.
    • Tested that it waits for approximately the timeout duration.

Not Covered or Potentially Insufficiently Covered:

Many functions in py_utils.py are marked with # pragma: no cover. While this excludes them from coverage reports, robust unit tests are still needed to verify their behavior.

  1. ensure_list_arg:
    • Multiple arguments: The decorator supports *arg_names, but tests only cover a single arg_name. A test case for ensure_list_arg('arg1', 'arg2') is needed.
    • Argument not present: The behavior when an arg_name specified in ensure_list_arg does not exist as a parameter in the decorated function (currently raises ValueError) should be explicitly tested.
    • Interaction with *args and **kwargs in decorated function: How does it behave if arguments to be listified are part of the decorated function's *args or **kwargs? (More complex, depends on inspect.signature).
  2. VerboseEnumMeta and VerboseEnum:
    • Entirely untested (marked pragma: no cover).
    • Key Scenarios for VerboseEnumMeta:
      • __getitem__ (e.g., MyEnum['VALUE_STRING']): Test successful lookup (case-insensitivity, whitespace stripping) and AttributeError for invalid keys.
      • from_string: Explicitly test with valid and invalid strings.
      • values: Test that it returns a list of enum member values.
    • Key Scenarios for VerboseEnum:
      • __str__: Test it returns self.value.
      • __repr__: Test the string representation.
      • __reduce_ex__: Important for pickling.
      • __lt__: Test comparison.
      • to_json: Test JSON representation.
      • copy: Test copying.
  3. execute_in_parallel:
    • Empty requests: Behavior with an empty list or dict (should return empty list/dict).
    • max_workers parameter: Test impact of varying max_workers (harder to assert precisely without inspecting thread usage).
    • Function func raises an exception: While execute_with_key handles this, an explicit test where func consistently raises exceptions for some/all requests in execute_in_parallel is needed to ensure correct aggregation of results (e.g., a dict/list of exceptions).
  4. filter_none:
    • Entirely untested (marked pragma: no cover).
    • Key Scenarios (sketched as tests after this list):
      • Empty dictionary.
      • Dictionary with no None values.
      • Dictionary with top-level None values.
      • Dictionary with nested None values.
      • Input that is not a mapping (should return unchanged).
      • Dictionary with mixed types, including other collections (e.g., lists) that should remain unchanged.
  5. TimeoutLock:
    • Entirely untested (marked pragma: no cover).
    • Key Scenarios:
      • Successful acquire and release.
      • Timeout when trying to acquire an already held lock (requires threads).
      • Reentrant behavior (same thread acquiring multiple times).
      • Usage as a context manager (with TimeoutLock():).
  6. exception_to_string:
    • Entirely untested (marked pragma: no cover).
    • Key Scenarios:
      • Simple exception.
      • Exception with a cause (raise NewException from OldException).
      • Exception with a long traceback.
  7. make_clean_stack:
    • Entirely untested (marked pragma: no cover).
    • Requires setting up a call stack and asserting that specific frames are filtered.
  8. wait_until:
    • condition raises an exception: Current implementation propagates the exception. This behavior should be documented and possibly tested.
  9. tname:
    • Untested (marked pragma: no cover). A basic test for its output format.
  10. params_dict:
    • Untested.
    • Key Scenarios:
      • Only required provided.
      • Only optional provided (with None and non-None values).
      • Both required and optional provided.
      • preprocessors used: Test that preprocessor functions are called.
      • optional values that are [None] (special case).
      • All inputs None or empty (test for None or {} return based on logic).
  11. print_table:
    • Untested. Requires redirecting sys.stdout to test.
    • Key Scenarios: Empty input, input with data, column_order specified/unspecified.
  12. patch_dotenv:
    • Untested. Involves side effects (patching another module).
    • Testing might involve:
      • Mocking the dotenv module.
      • Checking if dotenv.load_dotenv is replaced.
      • Verifying warning issuance.
      • Verifying original load_dotenv is called.
      • Testing ImportError case (dotenv not installed).
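
To make item 4 concrete, the filter_none scenarios could translate into tests roughly like this (purely illustrative - this assumes filter_none drops None values from mappings, recurses into nested dicts and returns non-mapping inputs unchanged; the exact semantics would need checking against the implementation):

from ibind.support.py_utils import filter_none


def test_filter_none_returns_empty_dict_unchanged():
    assert filter_none({}) == {}


def test_filter_none_keeps_dict_without_none_values_intact():
    assert filter_none({'a': 1, 'b': 'x'}) == {'a': 1, 'b': 'x'}


def test_filter_none_removes_top_level_none_values():
    assert filter_none({'a': 1, 'b': None}) == {'a': 1}


def test_filter_none_removes_nested_none_values():
    assert filter_none({'a': {'b': None, 'c': 2}}) == {'a': {'c': 2}}


def test_filter_none_returns_non_mapping_input_unchanged():
    assert filter_none([1, None, 2]) == [1, None, 2]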

Summary and Recommendations:

The test suite for py_utils.py requires significant expansion to ensure the reliability and correctness of its diverse utility functions. Many functions, particularly those marked with # pragma: no cover, currently lack any unit tests.

High Priority Areas for New Tests:

  1. VerboseEnumMeta and VerboseEnum: Foundational for robust enumerations.
  2. filter_none: Common utility needing tests for recursive behavior and various input types.
  3. params_dict: Logic for dictionary construction with optional parameters and preprocessors.
  4. TimeoutLock: Critical for concurrency; timeout and reentrant behavior must be tested.
  5. exception_to_string: Ensure correct formatting of exceptions, especially chained ones.

Suggestions for Existing Tests:

  • ensure_list_arg: Add tests for multiple arg_names and behavior when an arg_name is not in the function signature.
  • execute_in_parallel: Add tests for empty requests input and when the target function consistently raises exceptions.

Happy to be challenged here!

weklund avatar May 13 '25 18:05 weklund

This leads to large chunks of essentially duplicated code (or an additional setup system within the test to avoid duplicates), which makes tests less maintainable and less likely to be updated when the codebase changes.

This is absolutely a great concern! Yeah I've had way too many scenarios where I need to test a file that's 100 lines, and the test file is over 500.

While I'm pretty confident we won't be able to find a great solution here, there are ways to mitigate this:

  1. Use great helper functions to do the setup and mocking for groups of tests, so theoretically we get closer to less duplication and only have the Act/Assertion code.
  2. If we were using pytest, we could use parametrized tests. In unittest, we can achieve similar results with subTest or by dynamically generating test methods, though it requires a bit more setup. Parametrization can reduce boilerplate significantly (see the sketch after this list).
  3. Hopefully, if the code changes, the forcing function for maintainability is simply the failing tests we see when we submit PRs. PRs should be blocked until they're fixed.
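
To illustrate point 2, here's a minimal sketch of both flavours side by side (determine_action is a made-up stand-in, not an actual IBind function):

import unittest

import pytest


def determine_action(payload: str) -> str:
    # Made-up stand-in for whatever function is under test.
    return 'CLOSE' if 'Exit' in payload else 'ENTER'


# pytest: one test function, many parametrized cases.
@pytest.mark.parametrize('payload, expected_action', [
    ('MES! Exit Signal, Price = 4500.75', 'CLOSE'),
    ('MES! New Trade Design, Price = 4480.00', 'ENTER'),
])
def test_determine_action(payload, expected_action):
    assert determine_action(payload) == expected_action


# unittest: a similar effect with subTest, at the cost of a bit more boilerplate.
class DetermineActionTests(unittest.TestCase):
    def test_determine_action(self):
        cases = [
            ('MES! Exit Signal, Price = 4500.75', 'CLOSE'),
            ('MES! New Trade Design, Price = 4480.00', 'ENTER'),
        ]
        for payload, expected in cases:
            with self.subTest(payload=payload):
                self.assertEqual(determine_action(payload), expected)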

weklund avatar May 13 '25 19:05 weklund

re: defining a 'unit'

That sounds about right. Maybe 'a small class' would already sound like an integration test to me, unless 'small' means 'has only one method'.

This is a good point for clarification. I think we may have different definitions of unit and integration tests so maybe we should baseline.

In my view, a unit test can absolutely cover a class. The key is isolation. If you're testing a class and all its external dependencies (other classes, services, databases, network calls) are mocked or stubbed out, you're still unit testing the logic within that class.

It becomes an integration test when you start testing it with its actual, live dependencies. My assumption of an integration test is that we would create a test suite that actually calls the IBKR API and asserts that we have the correct contract with the API. An integration test should have zero mocking. My definition of integration is, I think, closer to what you have defined as E2E tests.

At work we have an integration test suite that, after merging into mainline, runs against test-environment APIs and asserts over all API calls and database entries. We even run the integration tests nightly (we call those canaries), so if an outside dependency (like the IBKR API) changes underneath us, we catch it in failed integration tests. It's not that the tests failing is bad - it's more of a clear signal that something has changed, which we would otherwise only find out about when we attempted to make that call ourselves.

So, a small, well-defined class with a clear responsibility can and should be unit tested with mocking.

Looking through all the integration test code we currently have in IBind, I actually consider those just more unit tests to be honest.

weklund avatar May 13 '25 19:05 weklund

Ideally, every time we resolve an issue from someone that turns out to be some edge case in the code, we should be able to write a new unit test for it, to verify that we never regress and that someone else doesn't run into the same issue.

weklund avatar May 13 '25 19:05 weklund

PS: Looking at some of those bad-input-related unit tests recommended by AI, if we had the Pydantic models (I know I still need to write that up! 😅 ) we should be getting all that input handling for free, which would cut hundreds of lines of unit tests otherwise needed to get the coverage up.
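
Just to illustrate the idea - the WebhookPayload model and its field names below are made up, not a proposal for an actual IBind schema:

import pytest
from pydantic import BaseModel, ValidationError


class WebhookPayload(BaseModel):
    # Hypothetical fields - only here to show where the validation would live.
    symbol: str
    price: float


def test_rejects_malformed_payload():
    # One test covers a whole class of bad-input cases that we'd otherwise
    # hand-write individual unit tests for.
    with pytest.raises(ValidationError):
        WebhookPayload(symbol='MES', price='not-a-number')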

weklund avatar May 13 '25 19:05 weklund

Great writeup @weklund and I appreciate that you challenge my existing approach 🙌 I've reflected on it and can see where I got some things wrong. More concretely though:

I think we may have different definitions of unit and integration tests so maybe we should baseline.

Well spotted. To clarify, my existing understanding - based on which I wrote the tests - was:

  • unit - you mock everything except the one block of code you're testing. Testing multiple methods of a class in a single test falls outside that scope, because it's no longer just 'one block of code'.
  • e2e - you mock nothing AND your test interacts with things outside your project, eg. database, online API, etc.
  • integration - everything else in between, i.e. you allow application code to interact with itself, but there are no interactions outside of the project

If I understand you correctly, your definition is more like my integration & my unit -> unit tests, while my e2e -> integration tests.

I did a bit more reading on this, and it seems that:

  • it's fuzzy, hence please read any "correct" and "incorrect" words in this discussion as if in italics and quotation marks
  • single classes are indeed unit tests - here my definition seems to have been incorrect
  • integration tests don't use external services, these indeed are called 'e2e' or 'application' or 'component' - here your definition seems to have been incorrect

https://circleci.com/blog/unit-testing-vs-integration-testing https://www.practitest.com/resource-center/article/unit-test-vs-integration-test/ https://www.testim.io/blog/unit-test-vs-integration-test/ https://www.twilio.com/en-us/blog/unit-integration-end-to-end-testing-difference https://www.functionize.com/blog/battle-of-testing-strategies-end-to-end-vs-integration-testing

Once again, the distinctions are fuzzy. How about we define it, concretely in context of IBind and Python, as:

  • unit - test one block of code; 'block' being defined as anything up to a Python module. Testing multiple methods of a class in a single test falls inside that scope. Mock anything that comes from other modules.
  • integration - test code from multiple Python modules. Mock only external interactions.
  • e2e - test interactions with services outside IBind, eg. database, IBKR API, etc.

I purposefully selected 'a module' as a concrete line, hoping to help us decide whether we call a test unit or integration.

Thoughts?

Voyz avatar May 15 '25 09:05 Voyz

Let's reflect that if my integration & my unit -> your unit tests, while my e2e -> your integration tests, then in that write up you did on why we need more unit tests we're essentially both talking about the same thing. In that context, what I meant is that it's more useful to focus on tests that cover multiple interactions than isolated single block of code. I'm gonna roll with the proposed definition I made above, in context of which I'll ignore the points that originate in different definitions we've had until now:

  1. Precision and Isolation: When a well-written unit test fails, it points directly to the specific piece of logic that's broken. This dramatically speeds up debugging.

I partially disagree - it's faster, yes, but I don't think the increase is dramatic. A well-written integration test also points towards the broken code, and although it takes longer to nail down the culprit, we're usually talking an extra 10-30 seconds, not minutes or hours.

  2. [...] Think about the ensure_list_arg function

ensure_list_arg function is in fact one of the functions I wrote unit tests for 😉

  3. Refactoring Safety Net - whole point
  4. Design Driver - whole point

These points are true for any tests, and I agree with them.

  5. Coverage Gaps: You mentioned 76% line coverage. While file coverage is high, that 24% of un-exercised lines often hides in these specific edge cases or error handling paths that unit tests are perfectly suited to target.

Absolutely! We're in the same boat - I'd like to see that number higher.

Voyz avatar May 15 '25 09:05 Voyz

Now I really did not want to use AI for this illustration, but I hope it proves the point (some of these are pretty pedantic) that we can use more unit tests to ensure our code is robust, handles failures and edge cases, and gives us confidence if we ever need to refactor

A quick side note: The whole py_utils.py module is a great example of what I'd cover with unit tests (as per my original definition, but also having just revised it), hence I'm not sure analysing it contributes to the unit/integration discussion meaningfully - I always believed that we need unit tests here.

Your analysis nevertheless highlights a very important point that I've briefly touched before: avoiding writing useless tests.

Should this code be tested?

def execute_with_key(key, func, *args, **kwargs):
    try:
        return key, func(*args, **kwargs)
    except Exception as e:
        return key, e

I firmly believe that the answer is no, and that writing tests for it would be a seemingly-benign mistake that would just add clutter and increase complexity of the repository without contributing anything in terms of robustness or maintainability - though I'm open to hearing if there's a good reason to believe otherwise.

Your analysis also suggests other units I wouldn't cover with tests:

  • filter_none
  • exception_to_string
  • make_clean_stack
  • tname
  • print_table

I think the 'art of covering a repo with automated tests' is about striking a balance between robustness and maintainability. Spitting out hundreds of tests only for them to never be maintained and to make refactoring difficult is not a good balance. Hence, I think that purposefully choosing what needs to be covered is important; I consider it along the lines of other 'good programming practices', sitting next to not writing an inline comment above every line of code, or not giving variables overly verbose names - it's subjective and decided on a case-by-case basis.

On the flip side, I do agree that not thoroughly covering ensure_list_arg, VerboseEnumMeta, TimeoutLock and params_dict is a mistake on my end.

Lastly, your analysis only focuses on test_py_utils_u.py. That's an incomplete view, as the module is also covered by the integration tests.


To reach some conclusion based on all these observations: my view is that when it comes to writing automated tests, unless we're equipped with sufficiently large QA and developer resources (which we aren't here), it's important to sit down and ask ourselves what needs to be tested, then make a decision and be open to other developers challenging that decision in the future - just like you did right now. I think it would be a mistake to follow that AI analysis' suggestion and just cover everything everywhere with exhaustive tests.

Voyz avatar May 15 '25 09:05 Voyz

I read and agree to pretty much everything else you wrote, thanks for giving your thoughts 👍

If you could, let's loop back to whether rewriting everything in pytest would be worth the effort. Once again, my reply was:

I'm fine with unittest, although open to adopting pytest, which is meant to be easier and better. I've never seen a sufficiently ground-breaking argument to switch, but I'm not against it. Given that IBind's tests are already written in unittest, switching would mean having to rewrite all of them first. In some cases this would be trivial; in others - such as the websocket asynchronicity tests - it would be tricky. We should discuss whether there's anything in pytest that would improve over unittest substantially enough to justify the rewrite. Because having two test systems wouldn't make sense, would it?

Thoughts on that?

Voyz avatar May 15 '25 09:05 Voyz

Great thoughts!

Your analysis also suggests other units I wouldn't cover with tests:

filter_none, exception_to_string, make_clean_stack, tname, print_table

I think the 'art of covering a repo with automated tests' is about striking a balance between robustness and maintainability.

Ah, these are great examples! I think in these cases it would be intentional to set our coverage goal at 90-95% line coverage. That way we can decide on the places where it doesn't make sense to have tests.

AI analysis' suggestion and just cover everything everywhere with exhaustive tests

Ideally, a Pydantic model would reject a lot of the bad-input cases those tests are checking for at runtime. Looking at the analysis again, I would hope that most of those exhaustive test cases would go away, while we'd still be confident those cases are handled.

weklund avatar May 15 '25 13:05 weklund

So there is the possibility to run both for a while, as the pytest command can run both pytest and unittest tests and produce the same pass/fail reports and coverage. Plus, we could keep using MagicMock with pytest if we chose.
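
For example, these two styles can live in the same test folder and a plain pytest run collects both (add here is just a made-up stand-in to keep the sketch self-contained):

import unittest


def add(a, b):
    # Made-up stand-in for the code under test.
    return a + b


class AddTests(unittest.TestCase):
    # Existing unittest-style test - pytest discovers and runs this as-is.
    def test_adds_two_numbers(self):
        self.assertEqual(add(2, 3), 5)


def test_adds_negative_numbers():
    # New pytest-style test sitting alongside it, no rewrite required.
    assert add(-2, -3) == -5

With the pytest-cov plugin installed, pytest --cov=ibind would run both and report coverage in one pass.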

So given this, if we were to align here we could work incrementally:

  1. Add the coverage and integrate with codecov so we get the badge status as our goal posts for improvement.
  2. Install pytest and run it without changing any unittest tests.
  3. Work in steps on how best to convert each file. I wonder if the websocket tests wouldn't be so bad given we're currently mocking WebSocketApp for on_open, on_message, etc. Because the test methods themselves are mocked, I think the tests are all run synchronously? Happy to be corrected. I'm more worried about the RaiseLogsContext and SafeAssertLogs - I think we can use the Pytest fixture caplog to help us here (PyTest fixtures are really cool btw) but I think we would need to experiment some here to figure out how best to do this; rough sketch below.
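
Roughly what I have in mind with caplog - just a sketch, where the 'ibind' logger name and the reconnect stand-in are assumptions for illustration, and whether this can fully replace RaiseLogsContext/SafeAssertLogs is exactly what we'd need to experiment with:

import logging

_LOGGER = logging.getLogger('ibind')  # assumed logger name, for illustration only


def reconnect():
    # Stand-in for the code under test; it just emits a warning.
    _LOGGER.warning('Connection lost, attempting to reconnect')


def test_logs_a_warning_on_reconnect(caplog):
    # caplog captures records emitted while the test runs at the given level.
    with caplog.at_level(logging.WARNING, logger='ibind'):
        reconnect()

    assert 'attempting to reconnect' in caplog.text
    assert any(record.levelno == logging.WARNING for record in caplog.records)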

weklund avatar May 15 '25 17:05 weklund

it would be intentional to set our coverage goal at 90-95% line coverage

Unless I'm mistaken, we'd still aim for 100%. The # pragma: no cover marker excludes code blocks from coverage completely. In other words, we exclude some code, then aim for 100% coverage on the rest. Unless you're talking about how much % can/should be excluded, in which case I don't know of any standards or goals. I think it's quite case-dependent, and we should cross-check any pragma: no covers to validate the decisions behind them.
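
For reference, this is the kind of exclusion we're talking about - coverage.py skips the marked branch entirely, so the 100% goal applies only to what's left (load_config here is a made-up example):

import json


def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:  # pragma: no cover - deliberately excluded defensive branch
        # This branch never counts against the coverage percentage.
        return {}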

So there is the possibility to run both for a while, as the pytest command can run both pytest and unittest tests and produce the same pass/fail reports and coverage. Plus, we could keep using MagicMock with pytest if we chose.

I've checked that and indeed, all existing tests work with pytest. Great news!

Add the coverage and integrate with codecov so we get the badge status as our goal posts for improvement.

You really think we should add the badge already? I'm cautious about showing something below 90%, as it may deter some users.

I wonder if the websocket tests wouldn't be so bad given we're currently mocking WebSocketApp for on_open on_message etc etc.

It's a bit more than just these two, when it comes to starting, restarting and recovery. It's complicated but not crazy - it's doable.

Because of the test methods themselves being mocked, I think the tests are all run synchronously?

It is synchronous, yes.

I'm more worried about the RaiseLogsContext and SafeAssertLogs - I think we can use the Pytest fixture caplog to help us here (PyTest fixtures are really cool btw) but I think we would need to experiment some here to figure out how best to do this.

Yupp, that may need a completely different approach. In all fairness, SafeAssertLogs is a wrapper that ensures running tests works nicely with the singleton logging module. If there's a clean way to do this in pytest we may not need it.


Okay then, pytest it is 👏 How do you propose we proceed? I was thinking:

  • write TESTING.md, establish the distinction between unit/integration tests, the Arrange, Act, and Assert phases, naming standards, etc.
  • add testing to CI/CD
  • establish a timeline for reaching the 100% target, assign segments to test between us

Voyz avatar May 19 '25 12:05 Voyz

hey @weklund just wanted to follow up on this - it would be awesome if we could make some progress here with tests 👍

Voyz avatar Jun 23 '25 11:06 Voyz

Here's a draft PR. I could make some more adjustments but happy to take feedback now! :D

https://github.com/Voyz/ibind/pull/121

weklund avatar Jun 29 '25 19:06 weklund

You really think we should add the badge already? I'm cautious about showing something below 90%, as it may deter some users.

I can understand the hesitation, as it shows the quality bar (or lack thereof)

2 thoughts here:

  1. I may have mentioned it already, but I'm inclined to see the coverage % as a motivator and forcing function to make the improvements as quickly as possible so we CAN show users that we maintain a high bar. Seeing the current number is the forcing function.

  2. If it's the case that we have 68% currently, it's not the best level of ownership if we're obfuscating that fact from users. For users to discover that for themselves, they would have to jump through a lot of hoops. This gives them a false sense of safety.

This is really just my opinion, as I'm also perfectly fine with having 3 PRs:

  1. PR to configure Pytest with CI
  2. Get coverage to above 90%
  3. Report coverage with status badge

weklund avatar Jul 04 '25 01:07 weklund

Hey @weklund thanks for sharing your thoughts on this. I appreciate you helping me understand your view here. Yeah, I think I'd feel uncomfortable publishing it at 68% - let's revisit this at a higher %

Voyz avatar Jul 07 '25 04:07 Voyz