common library: Support for unique test names in library

Open MattWherry opened this issue 5 years ago • 18 comments

Summary

This is a continuation of a discussion on the Slack channel with @tooky @brasmusson @mattwynne and others.

I've tried to cover as much background as I can given the limitations of markdown. Really this is about trying to arrive at a standard naming convention for Pickles in order for tests from many different test frameworks across the Gherkin ecosystem to be reconciled with their corresponding results.

It's similar, but not quite the same (I think) as some of the previous conversations on linking Gherkin with a test management system.

I've tried to include as much context as I can - and stick to the issue template. Though I'm sure it will read a bit funny. It might be best to look at the "context and motivation" section to understand why I think it would be useful.

Expected Behavior

It would be lovely if all Gherkin test results had a standardised identifier that was easily linkable to the Gherkin that the test result came from without any a priori knowledge of which result belongs to which feature file.

Current Behavior

The issue as I see it is that each test framework uses a different naming convention for each test (pickle) - and the naming conventions that are used are "lossy". For example, SpecFlow's MSTest output doesn't include line number information, and SpecFlow's XUnit output truncates test names at a certain number of characters (a problem for long scenario outline names with quite a lot of columns or long content in the table). Karma-Gherkin-Feature's names were so inconsistent that we had to fork it because we just couldn't make it work.

In discussion with @brasmusson in the Gherkin Slack space, he suggested using file and line number - which may work (in principle) for automated tests - except the frameworks out there don't include this information in their test output - or at least, don't all include this information.

Possible Solution

I wondered if we could encourage consistency in the tool ecosystem by providing one (or more) standard identifiers for tests from the Gherkin parser.

It wouldn't fix all problems, and it would need uptake from the folks who write the test frameworks, but if it's easier to be consistent than to be inconsistent, we might see improved interoperability between tools in the ecosystem.

One thing to note is that I think there's a marked difference between our automation and manual use cases (see context and motivation below):

For the automation use cases, we can assume that the test results are "contemporaneous" with the feature file - that is, if they claim to come from a given feature file, they all come from the same version. So somehow using feature name and line numbers in an identification scheme might work (or even something more unique, like a digest of the whole feature file content and the line number).
For the manual use cases, we need to consider the case where we have a feature with (say) 8 scenarios, one of which is manual and the remaining 7 automated. Were we to use something like feature name and line number, then when anyone changed one of the 7 automated scenarios, it would change the ID of the manual test also - and would invalidate the linkage to the previous result. In this instance, I wondered if something like a digest of the feature name, and the content (including arguments, data tables etc.) of each pickle would be useful.
In either case, even using manually added unique tags on scenarios wouldn't really help - as most of the test frameworks I've seen don't include the tags in the results anyway - and it would require a herculean manual effort to maintain consistency.
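To make the two digest ideas above a bit more concrete, here is a rough Python sketch. Nothing here exists in the Gherkin parser: the function names, the choice of SHA-256 and the truncation lengths are all assumptions made purely for illustration.

```python
import hashlib


def automation_id(feature_text: str, line: int) -> str:
    """Hypothetical ID for contemporaneous results: file digest plus line number."""
    file_digest = hashlib.sha256(feature_text.encode("utf-8")).hexdigest()[:12]
    return f"{file_digest}:{line}"


def content_id(feature_name: str, scenario_name: str, steps: list[str]) -> str:
    """Hypothetical ID for manual results: digest of feature name and pickle content."""
    hasher = hashlib.sha256()
    for part in [feature_name, scenario_name, *steps]:
        data = part.strip().encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        hasher.update(len(data).to_bytes(4, "big"))
        hasher.update(data)
    return hasher.hexdigest()[:16]
```

The first ID changes whenever anything in the file changes, which is fine when results and feature file are known to be the same version. The second only changes when the pickle's own name or steps (or the feature name) change, so editing a neighbouring scenario leaves it alone.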

Because of the new cross-language support in Gherkin 6, it seems a real opportunity to help improve interoperability across the ecosystem as in time, most test framework implementations would come to rely on the new parser (with whatever test ID scheme it contained).

Context & Motivation

We like ATDD/BDD!

We really like BDD/ATDD.

We are a large organisation (ca. 400 developers/testers)

We have a relatively small number of products, but they are large.

We have a large number (~16000) of Gherkin files describing our products. Most written in the style of tests, but increasingly as proper executable specification.

We have a mixture of tools and technologies - some more historic, some less so.

Many of our automation tests are written in Gherkin. We have a number of tools in use - SpecFlow2.0 (MSTest), SpecFlow3.0 (XUnit), karma-jasmine-feature (maybe to be replaced with cucumber.js).

Are we finished yet?

Sometimes tests don't get run. Sometimes tests get disabled. In some frameworks (like SpecFlow), we get a result that says the test was ignored. In other frameworks, we don't get a result at all.

It's therefore really important for us to be able to reconcile the test results with the specification (i.e. the Gherkin) to know we've actually implemented everything we should (and someone hasn't accidentally left a test turned off because it was flickering, or something like that).

So we trace to our Gherkin Specifications by tagging the Feature, Scenario, Outline or Example table with the external requirement ID (so we can review which detailed specs address each requirement, and whether we think that's enough)

and we reconcile the test results we have with the Gherkin Specification to make sure all the right tests have been run.

Automation

Because we use a mixture of technologies, we also have a mixture of test frameworks (some SpecFlow/MSTest, some karma-gherkin-feature, some SpecFlow/XUnit, some cucumberjs). Each has a different convention for how it names pickles in the results - and all are in some way "lossy".

Our results can be quite fragmented - because our Gherkin is a behavioural specification (and we use them to generate product documentation), when they are written, we pay no mind to how fast or slow a behaviour is to test - only whether we have refined and specified the right behaviour.

This means that for a given feature file, some pickles are run at one (fast) stage of the pipeline, and some at later (e.g. overnight) stages of the pipeline, and at the end, we have to aggregate many different results files together and tie them back to the feature file (see earlier) to make sure we've covered everything.
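As a rough sketch of that aggregation step, assuming every results file could be reduced to (pickle ID, status) pairs using some shared identifier (which is exactly the thing we don't have today), the reconciliation itself is simple. The function name and the ID format below are made up for illustration.

```python
from collections import defaultdict


def reconcile(expected_ids, result_batches):
    """Compare the pickle IDs expected from the Gherkin with the collected results.

    Returns (missing, unknown, statuses):
    missing  - expected pickles with no result in any batch
    unknown  - results whose ID matches no expected pickle
    statuses - every reported status per pickle ID, across all batches
    """
    statuses = defaultdict(list)
    for batch in result_batches:
        for pickle_id, status in batch:
            statuses[pickle_id].append(status)
    expected = set(expected_ids)
    reported = set(statuses)
    return sorted(expected - reported), sorted(reported - expected), dict(statuses)
```

Each batch would be one results file from one pipeline stage (fast, overnight, manual PDF and so on). Anything in `missing` is a pickle nobody ran or reported.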

In many of our pipelines (cloud and deployment environment), the tools don't have reliable access to the original Gherkin at the point the tests are run.

We have similar arrangements with our nightly builds - where tests are run off-line in a deployment environment.

Our cloud pipelines build the test executables, pack them into docker images, spin them out to multiple deployment environments in the cloud, and then post the results back.

Because of the multiplicity of tools involved (and their not-quite-consistent naming conventions), the fact that lots of those tools don't exist in the deployment environments, and the fact that many points in our build pipeline don't have access to both the Gherkin and the test results at the same time, reconciling our tests and test results is really quite challenging.

Manual

Also - because we really like BDD/ATDD, we don't pay any mind when writing feature files as to whether we will automate a given scenario or whether we will run it manually, because the feature file is the specification - which doesn't really care how we intend to test it.

In these instances, we tag scenarios (or other taggable constructs) "@manualtest", and have a tool which generates a pro-forma PDF form which our manual testers use to complete their results.

Those PDFs are also reconciled against the original feature, showing that between the automation and manual test results, we have results covering the whole specification.

Your Environment

Version used:
Operating System and version:
Link to your project:

May 29 '19 12:05 MattWherry

@MattWherry - looks like you forgot to add any details to this ticket. Please open a new ticket with more details.

May 29 '19 12:05 aslakhellesoy

@aslak @SabotageAndi @gaspar, @thomas, @romain - @mpkorstanje suggested you might be interested in this too....

Apologies for the empty ticket. I fat fingered the submit button while typing the description.

May 29 '19 12:05 MattWherry

Yay! Reopening

May 29 '19 13:05 aslakhellesoy

There's a lot here to parse @MattWherry!

I think that what I'm hearing is you'd like each pickle to have an identifier (ID) that is:

1. unique within the whole set of possible pickles from the project's feature files (even if only a few were run)
2. persistent across some reasonable modifications to the feature files
3. linked back to the source scenario that the pickle was compiled from

Do I have that right?

For (3) I presume (though I'm not entirely clear) you wouldn't want to include git revision info or anything in the ID itself, that you could capture that separately?

May 29 '19 21:05 mattwynne

Yeah. Sorry about that. The template said "motivation and context".... So....

I'll try to be briefer:

1: yes

2: yes

3: That would be splendid. Or something like. But I thought there might be a few practicalities:

3.1: It might (and probably rightly) get me publicly ridiculed for suggesting building vc awareness (and a specific vc at that) into a language parser :-).

3.2: Would it be practical for a test framework to use something like that in a test name? For example, there are all sorts of "what's a reserved character or valid symbol name in js, c#, language-du-jour etc." questions. I've seen the hoops SpecFlow 2 and karma-jasmine-feature jump through to avoid this.

But other than that - yes!

3.2 was what got me thinking about digests. Hexified, they're most likely short and ascii enough to be valid symbol names in any auto code generation that test frameworks might use, and probably still just about unique enough to be useful...
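Something like the following is what I have in mind, as a Python sketch only. The `pickle_` prefix, the SHA-256 choice and the 12-character truncation are assumptions, not anything a framework does today.

```python
import hashlib


def symbol_name(pickle_text: str, length: int = 12) -> str:
    """Turn arbitrary pickle text into a short name that is safe to use as a symbol."""
    digest = hashlib.sha256(pickle_text.encode("utf-8")).hexdigest()[:length]
    # A hex digest can start with a digit, which most languages reject at the
    # start of an identifier, so a fixed alphabetic prefix is added.
    return f"pickle_{digest}"
```

The result only contains lowercase letters, digits and an underscore, whatever characters the scenario name or table cells contain. Truncating the digest trades uniqueness for length, so the right length would need some thought.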

May 29 '19 21:05 MattWherry

You guys must have similar fun with Jam, right? - Tying the results back to the feature file when they've been posted in through the Rest api?

May 29 '19 21:05 MattWherry

Also - To point 1: Unique - to the extent possible across many projects. Some of our dev projects span 30 repos....

May 29 '19 21:05 MattWherry

Yeah, with Jam we mostly use the file:line along with the Git sha. For Java IIRC we have to do something else as the line number is lost somewhere along the way.

May 29 '19 22:05 mattwynne

Cucumber 1.x used a rather old version of Gherkin. The line numbers should be available since 2.x. At least that's where I would imagine the problem to be.

May 29 '19 22:05 mpkorstanje

I think that what I'm hearing is you'd like each pickle to have an identifier (ID) that is:

1. unique within the whole set of possible pickles from the project's feature files (even if only a few were run)
2. persistent across some reasonable modifications to the feature files
3. linked back to the source scenario that the pickle was compiled from

Do I have that right?

1: yes 2: yes 3: That would be splendid. Or something like.

Requirement 1 is just a tooling issue and should be easy enough to guarantee for any reasonable solution to the rest of the problems.

Requirement 2 more or less necessitates that the identifier be a distinct and explicit part of the Gherkin text in the feature file instead of being implicit like line numbers are. This could be a new addition to the language (we added the Rule keyword recently, so changes are certainly possible) or an existing construct could be co-opted (which is what I ended up doing with my cataloging solution).

Requirement 3 would be implementation specific unless all implementations generate results in the same way. I know that in the Ruby version of Cucumber, the test objects provided to the hooks provide access to the various pieces of the test. Presumably, the shared formatters would be a way to ensure that the hypothetical identifier was always included in the results.

Jun 05 '19 17:06 enkessler

Thanks for the feedback @enkessler -

1: I would have thought so.

2: I guess it depends on what we consider to be "reasonable modifications" - That was more-or-less what was behind my ramblings about digests of pickle contents.

3: Yes - there's a lot of inconsistency. In SpecFlow, there seems to be very little information available in the MSTest or XUnit output. I don't know about cucumber.js.

Re: Shared formatter ensuring identifier was always in the results - Well. I was more thinking "encouraging", by making it easy. But if there was a way to ensure, that would be awesome! I don't have enough understanding of the architectural vision for the parser and frameworks to know if that's possible.

Jun 06 '19 12:06 MattWherry

I like your cataloging solution - we'd considered doing something similar by adding GUIDs as tags to the relevant grammar elements in our git checkin hooks - that helps with test management solutions, but tags don't appear in all of the test result formats.

Jun 06 '19 12:06 MattWherry

I guess it depends on what we consider to be "reasonable modifications"

Any modification is a 'reasonable' modification. ;)

Don't like the name of a test? Change it. An extra pop-up window now exists and has to be handled? Stick a new step in the test. The team remembers that that kind of thing shouldn't be handled at the feature file level? Yank the step back out. Your company rolls out the new standard that all tests must be written in 2nd person future perfect tense and/or Klingon? Everyone will immediately quit but, sure, go ahead and rewrite all of the Gherkin.

In none of those cases did any of the tests actually stop being the same tests. Still proving the same things for the same reason. The How may have changed but the What did not. That is why my view is that the only way to properly identify a test is by adding some aspect whose only point is to be an immutable identifier. By doing so, all of the other parts of the test which already have a purpose that they shouldn't be restricted from changing in order to fulfill are free to do whatever. The additional identifier property, on the other hand, should only change under one circumstance: when a human decides "yeah, this test isn't really the same test anymore".

tags don't appear in all of the test result formats

Custom formatter. Bam! Unless the implementation that you are using (again, I'm only familiar with the Ruby implementation) does not allow access to all of the fiddly bits of the test via whatever object is handed to the formatter methods, you should be able to ensure that any information that you like is in a test result. In the past, I've made custom formatters to stick the results in a DB and saving off the ID was just one more column to populate.

Alternatively, a two step process:

1. Use whatever formatter you are already using and also use a test hook to jot down the test's ID at runtime.
2. Somehow unify those two things after the fact.

Jun 06 '19 17:06 enkessler

Very thoughtful. Makes me think:

1: I wish all the implementations were like the Ruby one. Maybe that changes the complexion of my earlier thoughts. Maybe if the information were readily accessible in other test frameworks, just some reasonably unique ff, content, line based thing would be perfectly adequate for automated tests.

2: maybe the manual test thing is something we could bite internally.

3: Though that still makes my head spin c.f. CI pipeline and staging...

FTR: If a butterfly flaps its wings in Brazil, we would be rerunning our manual tests anyway. And roundly refusing any Klingon language policy changes. Or at least making the person proposing it rewrite the feature files and rerun the tests themselves :-)

Manually tagging is such a horrid solution though. I get your points, but it soooo, soooo would interfere with the fluency of the whole process.

Which is one reason we love it so much in the first place...

Certainly food for thought...

Jun 06 '19 17:06 MattWherry

Manually tagging is such a horrid solution though. I get your points, but it soooo, soooo would interfere with the fluency of the whole process.

Manually tag things? My dear Matt, if I thought that I could trust a person to have the consistent motivation and accuracy to do it, I wouldn't have created a computer program to do it instead. :P

Unless I am misunderstanding what you mean?

All the person does is run a script and the tags appear. It sounds like you have both manual and automated tests living in the same feature files, so there doesn't even have to be a distinction between the two, as far as unique identification goes. The automated tests will just, presumably, have results produced more often.
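To give a flavour of what such a script does, here is a deliberately simplified Python sketch. It is not my actual tool: the `@test_id_` tag format and the line-based regex handling are assumptions for illustration, and a real implementation should work from a proper Gherkin parse (doc strings containing "Scenario:" would fool this one).

```python
import re
import uuid

# Matches scenario-like keywords; "Examples:" tables are deliberately not matched.
SCENARIO = re.compile(r"^\s*(Scenario Outline|Scenario|Example):")
ID_TAG = re.compile(r"@test_id_\S+")


def add_missing_ids(feature_text: str, new_id=lambda: uuid.uuid4().hex[:8]) -> str:
    """Add a hypothetical @test_id_ tag above every scenario that lacks one."""
    out = []
    for line in feature_text.splitlines():
        if SCENARIO.match(line):
            # Walk back over the tag lines directly above the keyword.
            j = len(out) - 1
            tagged = False
            while j >= 0 and out[j].lstrip().startswith("@"):
                tagged = tagged or bool(ID_TAG.search(out[j]))
                j -= 1
            if not tagged:
                indent = line[: len(line) - len(line.lstrip())]
                out.append(f"{indent}@test_id_{new_id()}")
        out.append(line)
    result = "\n".join(out)
    return result + "\n" if feature_text.endswith("\n") else result
```

Running it a second time changes nothing, because scenarios that already carry an ID tag are skipped. That is the property that lets the ID stay put while names, steps and line numbers change around it.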

Jun 06 '19 18:06 enkessler

I would love to try and find a solution that didn't involve putting additional stuff into the Gherkin text if possible, manually or automatically.

The additional identifier property, on the other hand, should only change under one circumstance: when a human decides "yeah, this test isn't really the same test anymore".

It has always seemed to me that a reasonable proxy for this human decision is if the scenario's name changes. The obvious problem with using this as your actual identifier is that it's unlikely a scenario name on its own would be unique within the whole possible set of pickles.

Jun 10 '19 12:06 mattwynne

There are many different strategies for assigning a unique id to a pickle. They each have their strengths and weaknesses. We should pick one based on the capabilities we want from various applications.

The most basic capability we want is to link the result of a test case (pickle) back to the source. This can be done with a path:line id. These are unique and simple to implement. Applications that track multiple versions of results/sources can create composite keys based on the path:line id and e.g. the git sha and cucumber execution id.

Another more advanced capability is to model scenarios/tests/pickles as entities with an id that allows us to track and follow their evolution over time. Consider the following scenario:
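A composite key along those lines could look roughly like this in Python. The class and field names are hypothetical, and the string form is just one possible serialisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultKey:
    """Hypothetical composite key: path:line id plus version and run information."""
    git_sha: str
    execution_id: str
    path: str
    line: int

    @property
    def location(self) -> str:
        # The basic path:line id, unique within one version of the sources.
        return f"{self.path}:{self.line}"

    def __str__(self) -> str:
        return f"{self.git_sha}/{self.execution_id}/{self.location}"
```

Because the dataclass is frozen, instances are hashable and can be used directly as dictionary keys when storing results for several versions and runs side by side.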

# features/orders.feature
13: Scenario: Order can be looked up by tracking id

A few weeks later this scenario has been refactored:

# features/shipment.feature
19: Scenario: Shipment can be looked up by tracking id

Many properties have changed:

The file changed from features/orders.feature to features/shipment.feature
The name changed from Order can be looked up by tracking id to Shipment can be looked up by tracking id
The line number changed from 13 to 19
Some steps may also have changed

Conceptually, this is the same scenario at different times. It's like looking at a picture of myself when I was 7 and when I was 45. It's still Aslak, he just moved, changed his name when he married (I'm old-fashioned - I didn't), grew taller and gained a bit of weight.

How do we track this evolution with files and file contents? This is a non-trivial problem to solve. There has been some great research done in this space, and I have been following it with passion and fascination for over 5 years. Here are some of the algorithms:

ldiff
lhdiff (video) - an improvement of ldiff.
anchoring using the Smith-Waterman algorithm - implementation in rust and python (deprecated)

If we were to adopt one of these algorithms, we would have a way to assign unique ids to scenarios that would satisfy both of these capabilities - link results back to source, and follow the evolution of tests, treating them as entities.
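To illustrate the general idea only, here is a much cruder stand-in in Python. It is not ldiff, lhdiff or Smith-Waterman anchoring: it simply pairs old and new scenarios by text similarity using the standard library's `difflib.SequenceMatcher`, and the function name and the 0.6 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def match_scenarios(old: dict[str, str], new: dict[str, str],
                    threshold: float = 0.6) -> dict[str, str]:
    """Greedily map old scenario locations to the most similar new ones.

    Both arguments map a location such as "path:line" to the scenario text.
    """
    pairs = sorted(
        ((SequenceMatcher(None, old_text, new_text).ratio(), old_loc, new_loc)
         for old_loc, old_text in old.items()
         for new_loc, new_text in new.items()),
        reverse=True,
    )
    mapping, used = {}, set()
    for score, old_loc, new_loc in pairs:
        if score < threshold:
            break  # Remaining pairs are too dissimilar to be the same scenario.
        if old_loc in mapping or new_loc in used:
            continue
        mapping[old_loc] = new_loc
        used.add(new_loc)
    return mapping
```

With the example above, `features/orders.feature:13` would be paired with `features/shipment.feature:19` because the two scenario names are still mostly the same text. An old scenario with no sufficiently similar counterpart is left unmapped, which is where the guesswork (and the need for better algorithms) comes in.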

I think we should give it a go. The tricky part is to do it in a way that is portable across all our platforms without having to maintain half a dozen implementations. That is a general challenge we have for other components such as gherkin, cucumber expressions, tag expressions etc. So far it looks like we'll try to implement this shared functionality in Go, and distribute it for many platforms. My preference would be to use wasm as the cross-platform format, but wasm execution isn't well supported across platforms yet (although most of them now compile to wasm).

Jun 10 '19 16:06 aslakhellesoy

I would love to try and find a solution that didn't involve putting additional stuff into the Gherkin text if possible, manually or automatically.

Unfortunately, it is the only way that comes to mind when I think of an 'explicit' style solution. For an 'implicit' style solution, @aslakhellesoy has better ideas but, as he mentioned, they are non-trivial. I tend to lean towards an 'explicit' solution because there is no guesswork involved, however intelligent that guessing may be.

It has always seemed to me that a reasonable proxy for this human decision is if the scenario's name changes.

I once again must envy the environments in which many members of this community seem to work. At the kind of places where I have worked, if a month went by without some feature or other needing a touch up due to the hasty/uninformed/lazy manner in which it was written then I would be pleasantly surprised. ;)

Jun 10 '19 18:06 enkessler
