
Restructuring help and questions/discussion. (Issue was: honor max-iteration setting)

inktrap opened this issue 3 years ago · 9 comments

I am running Pynguin 0.17.0 with Python 3.9.10. I created some example code here:

https://github.com/inktrap/pynguin-example

Basically, it seems that Pynguin does not honor the max-iteration option, and perhaps other options too.

inktrap · Feb 15 '22

Dear @inktrap,

Thank you for reporting this. It is, however, not a bug, but perhaps a gap in the documentation. In order to use maximum iterations as a stopping criterion, one not only needs to set the number of iterations, as you did in your example, but also to select the stopping condition using --stopping-condition MAX_ITERATIONS.

I've tested the following with Pynguin 0.17.0 and Python 3.10.2 running from its source checkout root:

pynguin \
    --project-path ./docs/source/_static \
    --output-path /tmp/pynguin-results \
    --module-name queue_example \
    -v \
    --seed 1629381673714481067 \
    --stopping-condition MAX_ITERATIONS \
    --maximum-iteration 5

Doing this shows you five algorithm iterations; note that iteration 0 in the output is the initial population of the default DynaMOSA algorithm, which is not counted as an iteration. I am using a seed that, on my system, would require eight iterations for full coverage (different seeds might converge faster), in order to show the early stop of the search.

I admit that the documentation lacks information about selecting other stopping criteria. We will improve it for a future release of Pynguin.

Best, Stephan

stephanlukasczyk · Feb 15 '22

Thanks @stephanlukasczyk

  • I changed the options and no test output is created, see example

Depending on the use cases of your intended audience, I would advocate simplifying the CLI. Here is my unsolicited personal opinion ;)

  • perhaps group options by category, like git does?
  • maybe some options can be considered expert options if they are used rarely?
  • perhaps most of your intended audience's use cases could be met with some merged options?
  • what is the point of max-iteration if there are more than max-iterations? Even if max-iteration is not a stopping condition and there is something else to be done, I would expect max-iterations to limit the iterations …
  • perhaps a per-project config file ./.pynguin.ini and a per-user config file like ~/.pynguin.ini or ~/.config/pynguin.ini could reduce the number of options that have to be given via the CLI? I would want to put stuff that changes rarely in these and hide those CLI options as part of the expert help mentioned above.
  • btw. I just want to start with some simple unittest skeletons (cf. auger, pythoscope)

If the CLI were simpler, there would be less confusion in user issues, and less documentation (or lack thereof) to worry about. I think it is a little bit overwhelming. I don't want to sound ungrateful … I hope this feedback is helpful to you.

inktrap · Feb 15 '22

Hi @inktrap,

I changed the options and no test output is created, see example

When you launch Pynguin, it employs a search algorithm, namely DynaMOSA in your example, to find test cases which

  • achieve maximum branch coverage on the given module (primary goal)
  • are minimal in size (secondary goal)

The output that you observed, i.e., a single test case containing only a call to main(), is the best solution that Pynguin could find during its search, the one closest to fulfilling both goals. Pynguin is also 'smart' enough to know that main is a void function, i.e., one that does not return any meaningful value, which is why it does not add any assertions either. So far, everything works as expected. See this preprint for more details.

I assume your argument is that just calling main() is not a very meaningful test, which I absolutely agree with, but as explained above, from Pynguin's perspective it's the best solution. The module that you are testing is a stand-alone script, i.e., it has a main() function which is executed when the script is launched directly (if __name__ == "__main__":). I guess it would make sense to enhance Pynguin so that it does not consider some functions/methods for test generation, e.g., main() in your case, as these are most likely not ideal for testing; for example, by adding a comment like # pragma: no cover, as used by Coverage.py. I guess the only reason we haven't implemented this yet is that it seems to be a lot of engineering work to reliably figure out to which function/method a non-docstring comment belongs. A quick fix for your case would be to split your module into two modules, one for the functionality (minus and plus) and one for the main function, and then let Pynguin run on the former.
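
For illustration, a minimal sketch of such a split; the module names and function bodies here are hypothetical stand-ins for your example code:

# example/ops.py -- the pure functionality; point Pynguin at this module
def plus(a: int, b: int) -> int:
    return a + b

def minus(a: int, b: int) -> int:
    return a - b

# example/main.py -- the entry-point script, kept out of test generation
from example.ops import minus, plus

def main() -> None:
    print(plus(1, 2), minus(3, 4))

if __name__ == "__main__":
    main()

Pynguin would then be run on the functionality module only, e.g. with --module-name example.ops.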

Note that the test cases that will be generated are probably still too large, i.e., contain superfluous lines, as explicit minimization of test cases is also yet to be implemented.

Depending on the use cases of your intended audience, I would advocate simplifying the CLI. Here is my unsolicited personal opinion ;)

Thank you for your feedback; we will try to take it into account :) For now, Pynguin is primarily a research prototype and thus might still have a lot of rough edges for end users. As Stephan pointed out, we are continually working to improve Pynguin in various ways, including usability, but sadly we don't have enough time to work on everything we'd like to.

Maybe to also address a few points directly:

perhaps a per-project config file ./.pynguin.ini and a per-user config file like ~/.pynguin.ini or ~/.config/pynguin.ini could reduce the number of options that have to be given via the CLI? I would want to put stuff that changes rarely in these and hide those CLI options as part of the expert help mentioned above.

While we don't have explicit support for a config file (yet), our CLI parser is based on Python's argparse, so you should be able to create a file with your default options, e.g.

-v 
--project-path ./ 
--output-path ./pynguin 
--module-name example.main 
--stopping-condition MAX_ITERATIONS 
--maximum-iteration 100

and launch Pynguin with pynguin @your-file.

perhaps group options by category, like git does?

I know we have a plethora of config options, but we have already tried to group them into categories, e.g., options related to the search, statistics, output, etc. You can see those by running pynguin --help.

I hope my answers help you a bit.

Best regards, Florian

Wooza · Feb 16 '22

Dear @inktrap,

Closing this immediately was maybe a bit of a quick shot, please excuse me.

I changed the options and no test output is created, see example

I've just checked out your example repo, and using the provided run.sh script I am able to generate one test case:

# Automatically generated by Pynguin.
import example.main as module_0


def test_case_0():
    module_0.main()

This test case reaches 80% branch coverage. To inspect which branches are not covered, I've generated an HTML coverage report by adding the options --report-dir . --create-coverage-report True to the command line, which creates an HTML report called cov_report.html in the current directory. The report shows that the if __name__ == "__main__" branch causes the issue. This condition will never be true for Pynguin, because it does not set the value of __name__. Thus, Pynguin cannot achieve higher coverage for your example module.

Regarding your truly helpful feedback (thank you for this), allow me to give a few comments:

perhaps group options by category, like git does?

I was thinking about this; redesigning the CLI (as well as the API) is on my wish list, however, it is not the highest priority currently.

maybe some options can be considered expert options if they are used rarely?

That's true, many options are of interest mostly for research purposes but maybe not for practical usage. We try to set somewhat reasonable default values (taken from previous research) such that a user does not have to deal with them.

perhaps most of your intended audience's use cases could be met with some merged options?

I am not sure whether merging options would help or cause other confusion. The audience's use cases might vary: researchers might be interested in tweaking each and every parameter separately, while developers might prefer a pre-configured out-of-the-box setting.

what is the point of max-iteration if there are more than max-iterations? Even if max-iteration is not a stopping condition and there is something else to be done, I would expect max-iterations to limit the iterations …

The stopping condition itself works; I've confirmed that. Using it might not be intuitive, I agree on that. Regarding the one additional iteration that is listed in the log output: that is a technical detail of how DynaMOSA (and other algorithms) work; it starts with a randomly created initial population of test cases (iteration 0) and afterwards starts evolving the population.

perhaps a per-project config file ./.pynguin.ini and a per-user config file like ~/.pynguin.ini or ~/.config/pynguin.ini could reduce the number of options that have to be given via the CLI? I would want to put stuff that changes rarely in these and hide those CLI options as part of the expert help mentioned above.

Something like this should already be possible: Pynguin's command-line interface is based on Python's argparse library (although using simple_parsing as a front end to it). As stated in argparse's documentation, one can also specify options in a file; the prefix character used by Pynguin is @.
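
To illustrate the mechanism with plain argparse (this is not Pynguin's actual parser setup; the option names are just examples):

import argparse

parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("--module-name")
parser.add_argument("--maximum-iteration", type=int)

# Assuming a file `pynguin.args` exists with one argument per line
# (vanilla argparse reads each line as a single argument, so the option
# name and its value go on separate lines unless the parser customizes
# convert_arg_line_to_args):
#
#   --module-name
#   example.main
#
args = parser.parse_args(["@pynguin.args", "--maximum-iteration", "100"])
print(args.module_name, args.maximum_iteration)  # example.main 100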

btw. I just want to start with some simple unittest skeletons (cf. auger, pythoscope)

I only know of auger, which in my understanding executes existing unit tests, traces these executions, and infers further tests from the traces. This differs from what Pynguin aims to achieve. However, you can use existing test cases as a basis by setting --initial-population-seeding True --initial-population-data /path/to/test.py. Please note that this is just a very prototypical implementation which does not cover all aspects present in many test suites.
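
For illustration, a full command line combining those flags with the example paths used earlier in this thread (all paths are placeholders):

pynguin \
    --project-path ./ \
    --output-path ./pynguin \
    --module-name example.main \
    --initial-population-seeding True \
    --initial-population-data /path/to/test.py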

Best, Stephan

stephanlukasczyk · Feb 16 '22

Hi @Florian! Thank you for explaining some internals and getting back to me, I appreciate it.

I assume your argument is that just calling main() is not a very meaningful test

Yes, let me clarify: I don't think it is a problem that a test for main() is created, but that it is just for main. I expected tests for the included functions minus() and plus(). (This is already addressed by @stephanlukasczyk.)

I guess it would make sense to enhance Pynguin so that it does not consider some functions/methods for test generation, e.g., main() in your case, as these are most likely not ideal for testing; for example, by adding a comment like # pragma: no cover, as used by Coverage.py. I guess the only reason we haven't implemented this yet is that it seems to be a lot of engineering work to reliably figure out to which function/method a non-docstring comment belongs.

But to add something on the exclusion of functions/methods: why shouldn't it be part of the docstring as extra info? And if it shouldn't be … it could be appended to the end of the line, like mypy's # type: ignore, which has scope over the import statement it is appended to. But why shouldn't something like this work?

>>> def foo(bar): # testgen: ignore
...     return True
...

In general, I would expect that if something like this is at the end of a syntactic element that spans multiple lines, it should apply to the whole element.
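
To sketch how such a marker could be detected (a rough, hypothetical implementation using only the standard library; real tooling would also need to handle decorators, methods, and markers on lines other than the def line):

import ast
import io
import tokenize

SOURCE = '''\
def foo(bar):  # testgen: ignore
    return True

def baz(qux):
    return False
'''

def ignored_functions(source):
    # Collect the line numbers of all "# testgen: ignore" comments.
    ignore_lines = {
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT and "testgen: ignore" in tok.string
    }
    # A function is ignored if the marker sits on its "def" line.
    return {
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef) and node.lineno in ignore_lines
    }

print(ignored_functions(SOURCE))  # {'foo'}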

A quick fix for your case would be to split your module into two modules, one for the functionality (minus and plus) and one for the main function, and then let Pynguin run on the former.

Yes, obviously it was just an example. I separated it and it worked, but I had to specify the module explicitly as in src.lib.ops. I haven't looked at all the options yet but ideally I would prefer some auto-discovery.

Note that the test cases that will be generated are probably still too large, i.e., contain superfluous lines, as explicit minimization of test cases is also yet to be implemented.

Surprisingly, it was not; it was much smaller than expected. I am missing tests for all the numeric types I included in the annotation and for different sets of values.

You already mentioned in the docs that the seed value introduces some randomness, but this seems weird:

  • sometimes one or two successful tests are generated
  • sometimes one or two unsuccessful tests are generated
  • sometimes one of each is generated.

I should open a separate issue for restructuring the help/CLI, but if someone already plans to do that, I won't. But just to clarify:

pynguin --help produces 376 lines of output. I think this can be split up by grouping or by commands. This example is very git-oriented:

pynguin --help            # gives help for the most used commands (no dash or double dash)
pynguin command --help    # gives help for a specific command
pynguin --help -a         # gives help for all the commands, see "porcelain"
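
As a rough sketch of how the grouping could look with plain argparse argument groups (option names and group labels are made up, not Pynguin's actual CLI):

import argparse

parser = argparse.ArgumentParser(prog="pynguin")

common = parser.add_argument_group("common options")
common.add_argument("--project-path")
common.add_argument("--module-name")

expert = parser.add_argument_group("expert options (rarely needed)")
expert.add_argument("--stopping-condition")
expert.add_argument("--maximum-iteration", type=int)

parser.print_help()  # options are printed under their group headings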

inktrap · Feb 16 '22

@stephanlukasczyk

Dear @inktrap,

Closing this immediately was maybe a bit of a quick shot, please excuse me.

No worries. This discussion got way more informative, and longer, than I anticipated anyway. As I mentioned before, separate issues, e.g. for the docs, could be useful.

This condition will never be true for Pynguin, because it does not set the value of __name__. Thus, Pynguin cannot achieve higher coverage for your example module.

I know that Pynguin is not static and executes the code, and the ifmain guard is the way to ensure that something runs if it is part of a script that is executed directly. I guess this would reach more into fuzzing territory if complex entry points with arguments/options were tested.

However, I am providing the name of the module explicitly, and if that module is executed, __name__ is set to "__main__" (see https://docs.python.org/3/reference/import.html#__name__). Aren't you using inspect at some point? Couldn't that be used to test the functions individually, except for what is run during ifmain? (Edit: I mean … this sounds paradoxical because the point is the execution, but I want to create unit tests … and the smallest units there are the functions I wrote.) pythoscope is abandoned. I guess most people rely on their IDE/PyCharm for that.

As for the discussion about options … I really appreciate your points on that. Yes, merging options and changing semantics might not be a good idea. I gave some git-oriented examples earlier; if you want to create an issue or work out what a good structure for Pynguin would be, I would like to have a look at that.

what is the point of max-iteration if there are more than max-iterations? Even if max-iteration is not a stopping condition and there is something else to be done, I would expect max-iterations to limit the iterations …

The stopping condition itself works; I've confirmed that. Using it might not be intuitive, I agree on that. Regarding the one additional iteration that is listed in the log output: that is a technical detail of how DynaMOSA (and other algorithms) work; it starts with a randomly created initial population of test cases (iteration 0) and afterwards starts evolving the population.

But to get back to the original issue:

My point is that this:

export PYNGUIN_DANGER_AWARE=1; pynguin -v --create-coverage-report True --project-path ./ --output-path ./pynguin --module-name example.main --maximum-iteration 100

runs for more than 100 iterations.

Also, if I don't specify a stopping condition and run Pynguin with a separate module, it exits very quickly; the stopping condition is time:

$ ./run-lib.sh
[13:43:44] INFO     Start Pynguin Test Generation…                                                                                                         generator.py:97
           INFO     No seed given. Using 1645015423754040869                                                                                              generator.py:173
           INFO     Collecting constants from SUT.                                                                                                        generator.py:182
           INFO     Using strategy: Algorithm.DYNAMOSA                                                                                   generationalgorithmfactory.py:235
           INFO     Instantiated 3 fitness functions                                                                                     generationalgorithmfactory.py:321
           INFO     Using CoverageArchive                                                                                                generationalgorithmfactory.py:279
           INFO     Using selection function: Selection.TOURNAMENT_SELECTION                                                             generationalgorithmfactory.py:254
           INFO     Using stopping condition: StoppingCondition.MAX_TIME                                                                  generationalgorithmfactory.py:90
           INFO     Using crossover function: SinglePointRelativeCrossOver                                                               generationalgorithmfactory.py:267
           INFO     Using ranking function: RankBasedPreferenceSorting                                                                   generationalgorithmfactory.py:287
           INFO     Start generating test cases                                                                                                           generator.py:295
           INFO     Iteration:     0, Coverage: 1.000000                                                                                              searchobserver.py:66
           INFO     Algorithm stopped before using all resources.                                                                                         generator.py:300
           INFO     Stop generating test cases                                                                                                            generator.py:301
           INFO     Start generating assertions                                                                                                           generator.py:321
           INFO     Setup mutation controller                                                                                                        mutationadapter.py:68
           INFO     Build AST for example.lib.ops                                                                                                    mutationadapter.py:54
           INFO     Mutate module example.lib.ops                                                                                                    mutationadapter.py:56
           INFO     Generated 2 mutants                                                                                                              mutationadapter.py:64
           INFO     Running tests on mutant   1/2                                                                                                assertiongenerator.py:157
           INFO     Running tests on mutant   2/2                                                                                                assertiongenerator.py:157
           INFO     Export 1 successful test cases to ./pynguin/test_example_lib_ops.py                                                                   generator.py:338
           INFO     Export 1 failing test cases to ./pynguin/test_example_lib_ops_failing.py                                                              generator.py:348
           INFO     Writing statistics                                                                                                                   statistics.py:350
           INFO     Stop Pynguin Test Generation… 

inktrap · Feb 16 '22

Dear @inktrap,

Sorry for replying late. Let me address some great points from your two posts above:

Hi @Florian! Thank you for explaining some internals and getting back to me, I appreciate it.

I assume your argument is that just calling main() is not a very meaningful test

Yes, let me clarify: I don't think it is a problem that a test for main() is created, but that it is just for main. I expected tests for the included functions minus() and plus(). (This is already addressed by @stephanlukasczyk.)

I totally understand that expectation; I would have had a similar one, I guess. An automated tool, unfortunately, might not see why calling minus() and plus() might be more reasonable than calling only main() (or at least calling all three of them). Since executing main already covers the other two functions, there is no need to execute them separately.

But to add something on the exclusion of functions/methods: why shouldn't it be part of the docstring as extra info? And if it shouldn't be … it could be appended to the end of the line, like mypy's # type: ignore, which has scope over the import statement it is appended to. But why shouldn't something like this work?

I agree on having some way to exclude functions/methods from test generation. It is, however, non-trivial to build something like this reliably, since a comment such as # testgen: ignore is neither part of Python's AST nor its bytecode. Still, it should be doable to make it work for most cases.

Yes, obviously it was just an example. I separated it and it worked, but I had to specify the module explicitly as in src.lib.ops. I haven't looked at all the options yet but ideally I would prefer some auto-discovery.

We thought about auto-discovery some time ago and back then decided not to do it in the first place. It might, however, be a nice future extension; other tools such as mypy or pytest also provide something like that.

Surprisingly, it was not; it was much smaller than expected. I am missing tests for all the numeric types I included in the annotation and for different sets of values.

That's down to how Pynguin treats the code internally. Running it with one of the types is sufficient to fully cover the code, thus it will not attempt any further types. One could argue, of course, that there should be tests for all types; however, covering all combinations could require exponentially many test cases, which might not be ideal either.

You already mentioned in the docs that the seed value introduces some randomness, but this seems weird:

That is caused by how the initial population is sampled, I guess. It combines random statements, basically, which might fail during execution; these test cases then end up among the failing test cases. I agree that this might not be the best way of doing this when using the tool in practice.

I know that Pynguin is not static and executes the code, and the ifmain guard is the way to ensure that something runs if it is part of a script that is executed directly. I guess this would reach more into fuzzing territory if complex entry points with arguments/options were tested.

However, I am providing the name of the module explicitly, and if that module is executed, __name__ is set to "__main__" (see https://docs.python.org/3/reference/import.html#__name__). Aren't you using inspect at some point? Couldn't that be used to test the functions individually, except for what is run during ifmain? (Edit: I mean … this sounds paradoxical because the point is the execution, but I want to create unit tests … and the smallest units there are the functions I wrote.) pythoscope is abandoned. I guess most people rely on their IDE/PyCharm for that.

I'll look into this again. Yes, we are using inspect. I'll make up my mind whether we can come up with something better suited to unit-test generation than the current implementation.

Also, if I don't specify a stopping condition and run Pynguin with a separate module, it exits very quickly; the stopping condition is time:

I'll check this as well.

Regarding the CLI: it is something that is on my list. Unfortunately, various personal issues and other work keep me away from this (and other things in Pynguin). If I may, I'd come back to you regarding the CLI specifically, once I've made up my mind about how it could look, to get some feedback?

Best, Stephan

stephanlukasczyk · Feb 21 '22

If I may, I'd come back to you regarding the CLI specifically, once I've made up my mind about how it could look, to get some feedback?

Sure, I'll be glad to have a look at it and give some feedback :)

Also, the other points, like tests for different types, are just suggestions.

One could limit the number of tests per type and start with one test per argument per (basic) type … So for a function with an arity of two where each argument is of Union[int, float, complex], this would be 9 combinations (set is not even needed):

>>> arg_1 = arg_2 = ["int", "float", "complex"]
>>> set([(a1, a2) for a1 in arg_1 for a2 in arg_2])

But of course that is naive … how would you get all the different argument types defined by an abstract base class? What about async types (https://docs.python.org/3/library/typing.html#asynchronous-programming)? And so on.
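
A slightly more general version of that enumeration with itertools.product, using one illustrative sample value per basic numeric type (the values are made up):

from itertools import product

SAMPLE_VALUES = {"int": 1, "float": 1.0, "complex": 1j}

def candidate_inputs(arity):
    # One tuple per assignment of a basic type to each argument,
    # i.e. 3 ** arity combinations in total.
    return list(product(SAMPLE_VALUES.values(), repeat=arity))

print(len(candidate_inputs(2)))  # 9 combinations for a two-argument function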

What I like about Pynguin is the potential … it could be useful for end users who are developers/testers/QA, it could be integrated into an IDE at some point, or even become an additional layer in a fully automated CI setup. Btw. is it still an open research question how the quality of a test (suite) is measured and what constitutes minimal complete coverage?

inktrap · Feb 21 '22

For restructuring the help, or maybe reformatting some of the output, I'd like to point to:

  • https://github.com/Textualize/rich
  • https://typer.tiangolo.com/
  • https://github.com/kislyuk/argcomplete (didn't try)

inktrap · Mar 05 '22