need closure on how to maintain impl lists
At one point we decided as a community that we were going to list implementations of RFCs in a table at the bottom of each doc. We created such tables and added a unit test that forces us to have such a table. The test also guarantees that we add data to those tables in a consistent way.
However, @swcurran expressed dissatisfaction with this mechanism because it scatters impl knowledge and creates a maintenance burden. He advocated tracking impls on an AIP basis, but not at the granular level of individual RFCs. (Stephen, if I've not summarized your concern completely, please chime in with improvements to that summary.) We talked these concerns through and agreed to many aspects of AIP best practice, but I don't think we ever got to closure on whether AIPs would supersede the tracking of individual RFC implementations, or exist as an additional piece of metadata.
RFC 0302 (AIP 1.0) currently has no listed impls, yet there are several compliant impls in the community. Should we be updating that doc?
I suggest we use this issue to remind us of this unfinished discussion, and possibly to accumulate comments on the topic. Discussed in Aries WG B call on 3 March 2021.
Tagging a few people that I know are interested: @b1conrad @TelegramSam @llorllale @andrewwhitehead
tagging @jonathan-rayback as well
I'd love to see a place where implementations were listed. I don't particularly care if it's on each RFC (though I don't mind), but I would very much like to see something on each AIP version. Thanks for bringing this topic up Daniel!
We think that the Aries Test Harness is getting sufficiently rich that we can use its test cases for this. As such, we've made a start at showing this:
http://aries-interop.info/
The data there is regenerated from time to time as things change (this will become a GHA soon). It reports on a number of interop tests among frameworks (ACA-Py, AF-.NET, AF-JS, and AF-Go) that are executed daily against the latest source code in each repo. There is a set of tests based on RFCs, and the test runs are scoped to those that each framework claims to support.
The tests and GHAs that run the tests daily are all found here: https://github.com/hyperledger/aries-agent-test-harness
We're looking for folks to:
- add test cases (they are in behave)
- fix things in different frameworks to increase the number of tests passing for each interop combination -- aka "the blame game"
- add new agents/frameworks by packaging your agent/framework with an API used by the test harness to run tests.
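For anyone considering the first bullet: behave tests are Gherkin feature files with Python step implementations behind them. A hypothetical sketch of what such a test case might look like (the tag, feature name, and step wording here are invented for illustration, not copied from the AATH repo):

```gherkin
@T001-RFC0160 @AcceptanceTest
Feature: RFC 0160 Connection Protocol
  # Hypothetical example -- tags and step wording are invented for illustration

  Scenario: establish a connection between two agents
    Given we have two agents "Acme" and "Bob"
    When "Acme" generates a connection invitation
    And "Bob" receives the connection invitation
    And "Bob" sends a connection request to "Acme"
    And "Acme" receives the connection request
    And "Acme" sends a connection response to "Bob"
    Then "Acme" and "Bob" have a connection
```

Each step maps onto an HTTP call that the harness makes to the backchannel of the Test Agent playing that role.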
We think this can be a test-driven way to demonstrate interoperability.
When I worked at my previous job, I was strongly opposed to this strategy, mostly for two reasons.
Reason 1: enterprise vs. mobile
Enterprise agents are designed to be automated (that's essentially their entire raison d'etre -- to do policy-driven work for institutions). Therefore, scripting how they behave in a test harness is natural. In addition, they typically communicate over HTTP, so adding a RESTful interface to allow control during a test is a trivial lift.
Mobile agents may naturally consume RESTful APIs, but they do not naturally expose them; most apps communicate over push notifications. In addition, they make decisions in a radically different way; instead of implementing enterprise policy, they are built to ask a human user interactive questions. Even assuming it's easy to proxy up some kind of RESTful channel over which a mobile agent can be controlled (and that this channel can easily short-circuit the push notification infrastructure dependency), there is little infrastructure in such agents to implement automated decisions under test conditions.
The net consequence here is that, if fitting an enterprise agent into the test harness as imagined is an effort of X units, fitting a mobile agent into the test harness is IMO at least a 10x effort. (Declarations that someone has written a trivial harness contradicting this don't convince me, because such a harness ignores the cost of doing the adapter right rather than as a toy -- and it ignores the expense of implementing a robust alternate strategy for decision making to replace human interaction. And I think this is an ongoing tax, not a one-time cost. We're talking about plumbing alternate codepaths with alternate security behaviors throughout a codebase, which is quite different from the situation with enterprise agents.)
It may be possible to get around some of this issue by testing mobile agents at the framework level instead of the app level. However, I think this gives an unrealistic view of interop, because it's focused on libraries or infrastructure rather than products. Just because framework X supports feature Y does not mean that all apps built atop framework X will have feature Y turned on -- or if they do, that the support for Y will come and go on the same schedule or version snapshot as the framework. That doesn't mean that framework interop is uninteresting. Quite the contrary; it tells us a lot. But I think the strategy you're advocating is a good framework (library) strategy, not a good agent (product) strategy -- and should be described as such.
For agents that feel this friction, it seems to me that declaring conformance based on a different test procedure is not unreasonable.
Reason 2: bilateralism decays into a reference agent
Whoever designs the test harness infrastructure will naturally design it to work well (read, "cheaper to code against") with their own agent, and will also naturally tend to use the test harness primarily to evaluate their own agent vis-a-vis all others. (Looking at aries-interop.info, this is exactly the pattern we see.) This is not sinister; even with the best of intentions, I'd almost certainly end up doing this myself, if I wrote a bilateral test harness. How could it be otherwise; of course everybody should prioritize their testing and coding investments to accomplish their own goals. Maintaining a full Cartesian product of all agent-to-agent combinations is expensive.
But given these simple truths, the natural trajectory of a bilateral testing strategy is to degenerate into a test-everything-against-a-reference-agent pattern of behavior. The reference agent gets disproportionate influence ("Is it easier to change the reference agent, or to adjust the spec to reflect what the reference agent is doing?").
This political reality was my main reason for advocating that the community instead develop a suite of tests that all agents must pass (and a test agent that we all agree could be a neutral third party without strategic baggage). I haven't claimed that this has exactly the same utility as a bilateral harness, or that a bilateral harness is undesirable -- only that it's problematic for normalized evaluation. I also admit that a test suite has problems of its own (expense comes to mind).
I believe these two issues may have a multiplicative effect. Because enterprise agents are cheaper and easier to adapt to the test harness, anyone who has to pay the far higher cost of test harness adaptation for a mobile agent incurs a carrying cost that makes it more likely they won't have time to maintain their portions of a Cartesian-product test matrix -- which in turn makes it more likely that their own agents will be marginalized as the bilateral harness decays into a reference agent.
Now that my job has changed, I don't feel compelled to advocate quite as strongly. (And I'm trying to avoid my hard-core pushback from back in the day, which I regret.) I have always acknowledged that our community is a do-ocracy and that code wins -- so I appreciate the investment that http://aries-interop.info and the test harness represent, and I agree that it could be the basis for important insight. It's insight that could be largely automated, which is awesome. But I do feel it's important to articulate the reasons why this strategy is not a no-brainer, and what expenses, misalignments, and rifts it might exacerbate. Perhaps the benefits are worth the drawbacks, especially if we can nuance it a little bit to make it less onerous for mobile teams. Certainly we can derive value X with effort Y, where Y is not overly painful for many members of the community.
Or perhaps we should use bilateral testing to explore, but leave it to human processes to announce results that are nuanced and marketable.
Now that I've said my piece, I'm content to let this conversation evolve in whatever way the community likes. Thanks for listening.
I haven't looked at the test harness in any depth whatsoever - I just know it exists - and I wasn't aware that its test harness pitted each agent/framework against others and not against a reference implementation. Thank you for pointing that out @dhh1128 .
That's a long, thorough note, Daniel -- thanks! However, I don't think the arguments you give are compelling. Here is my rebuttal. I'm particularly thankful for your comment about mobile, as it has given me a great idea -- see below :-)
Mobile vs. Enterprise -- What I think you are claiming is that it is impossible to do CI testing on a mobile agent (can't control programmatically, can't automate), independent of AATH. I'd be very surprised if that were true, but I'm not an expert. The little bit of research I've done and the people I've asked suggest that we should be able to get a mobile agent working with the AATH, but the proof is in the code. Based on your comments, I think you might misunderstand the architecture. The "Component Under Test" (CUT) does not need a REST interface; it just needs a way to be controlled, ideally programmatically -- library calls, calls to a simulator, HTTP calls, manually -- anything. A "Test Agent" is a container that includes a controller (the backchannel) that receives HTTP requests and controls the CUT any way it can. If a mobile app can somehow be automatically tested, it can be put into the AATH. As for the "intelligence" needed: if you can't provide that, you can't automate mobile agent testing at all, whether for conformance or interop testing.
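The architecture described above can be sketched roughly as follows. This is a hypothetical illustration, not actual AATH code: the `MockCUT` class, the command names, and the dispatch shape are all invented for the example. The point is only that the harness speaks one uniform protocol to the backchannel, and the backchannel translates each command into whatever control mechanism its CUT supports.

```python
# Hypothetical sketch of the backchannel pattern described above.
# The harness only ever talks to the backchannel (normally over HTTP);
# the backchannel drives the "Component Under Test" (CUT) any way it
# can -- library calls, a simulator, manual prompts, etc.
# All names here are invented for illustration.

class MockCUT:
    """Stand-in for a real agent/framework under test."""
    def __init__(self):
        self.connections = {}

    def receive_invitation(self, invitation):
        conn_id = f"conn-{len(self.connections)}"
        self.connections[conn_id] = {"state": "invited", "invitation": invitation}
        return conn_id


class Backchannel:
    """Maps harness commands (normally arriving as HTTP requests) onto the CUT."""
    def __init__(self, cut):
        self.cut = cut

    def handle(self, topic, operation, payload):
        # Dispatch: one branch per (topic, operation) pair the harness may send.
        if (topic, operation) == ("connection", "receive-invitation"):
            conn_id = self.cut.receive_invitation(payload)
            return {"connection_id": conn_id, "state": "invited"}
        raise NotImplementedError(f"{topic}/{operation} not supported by this CUT")


bc = Backchannel(MockCUT())
result = bc.handle("connection", "receive-invitation", {"label": "test-harness"})
print(result["state"])  # prints "invited"
```

Swapping `MockCUT` for an adapter over library calls, a device simulator, or even a human-driven QR-code flow leaves the harness side of the interface unchanged, which is the design property being claimed.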
NOTE: If it really is true that CI mobile agent testing is not possible, and mobile agents MUST be tested manually, I'm absolutely certain we can use the AATH to direct a person to do the mobile testing manually. The backchannel would just display QR codes when needed, and text that directs the person with the mobile agent on what to do. That would actually be far less effort than a backchannel for a framework and now that I realize that, we will jump on building that. Good idea!!!
Bilateralism -- This one bugs me. AATH and the tests within are open source -- anyone can define the tests, add to them, complain about them and make them work with their "Component Under Test". There is no reason for it to devolve as you state unless no one else contributes, reviews or even just comments on the tests. In any test suite, there will come a time when something doesn't work and the participants must decide what part is broken (the RFC, the test, one agent, the other agent, etc.) Further, once what is broken is determined, the participants must also decide what to do that is best for all concerned. Usually, that will be to fix that which is broken, but in some cases that might be to fix something else. That's both OK and expected. I don't understand how interop testing benefits one implementation over another.
"leave it to human processes to announce results that are nuanced and marketable" -- We definitely agree on that. If you look at the test results that we published, there are three places where the results are human-edited and nuanced -- the "scope" (what each runset covers), the "summary" for each runset (a narrative about what is covered in the runset and why and what is expected), and the runset "status" section that is a narrative about the state of the testing for the combination of agents tested. The status is expected to be updated after every change in the results for each runset. That data is in the repo and can be updated by anyone that contributes a PR to the repo. Those three data points were our first attempt to achieve what you describe, and we're open to other ideas to provide that necessary nuance to the results.
Some other follow up notes on this:
- In our experience, the cost to build a backchannel is about a person-month (~$20k US), based on the bounty paid to build the .NET backchannel, which runs all the tests from the test suite that AF-.NET supports. The JavaScript backchannel took a lot less than that and again works for all the test cases it supports.
- Maintaining and adding new tests takes even more effort (around an FTE -- though that could be shared across the community), and is rewarded by finding problems in a continuous integration context. We've seen the benefits.
- In each case of adding new Test Agents, the tests have uncovered issues that have been addressed in one component or another, including one instance where the RFC was not being followed by any of the agents, and it was best to adjust the RFC.
- We haven't got AF-Go really working yet, mainly because we don't have a person familiar with AF-Go working on it, but we're getting there. In what we have done, we've discovered a pile of issues -- and we still don't have interoperability.
We truly believe that we should follow the Telco industry in this work and enable easy interop testing across implementations, not just conformance. We believe the AATH is an excellent and (now that it's built) lightweight way to enable such full interoperability testing. Show us interoperability, don't just claim it.
@llorllale -- love to have you or someone from your team look at the tests and help with the fixes. FYI -- we are close on adding issue credential and present proof tests. There is a PR for that in AATH.
[Edit: I responded to some of Stephen's thoughts, but I've decided that was inconsistent with my determination not to say more than my original post -- so I've removed most of my response and will just let my first comment suffice. Stephen makes some good points.]
My two cents:
I really like the idea of the AATH. I need to ramp up on it (I just opened the repo now), but the idea of running these tests as part of CI/CD is appealing. I like BDD. I have no idea how costly implementing these into Evernym agent pipelines would be.
I like the idea of providing a manual testing hook for mobile apps as an alternative method for validating interop. So a +1 vote there.
Thanks.
On Sat, Mar 6, 2021 at 1:56 PM Daniel Hardman [email protected] wrote:
> Mobile vs. Enterprise -- What I think you are claiming is that it is impossible to do CI testing on a mobile agent (can't control programmatically, can't automate), independent of AATH.
No. Evernym's Connect.Me has CI, and of course this is eminently doable. Any app codebase worth its salt will have some kind of CI. However, I know that Connect.Me's is a combination of low-level unit tests (below the level where test harness hookup would be insightful) and high-level tests that simulate user input in the style of Selenium. What would be required to test an app against the test harness is a complete replacement for user input (basically lopping off the UI veneer of the app and guaranteeing a perfect MVC layering with interfaces convenient for a backchannel to plug in) -- or else a way to map backchannel commands to user input, going through a Selenium-type tool. I am claiming that these are considerably more complex than driving an enterprise agent -- 10x your person-month estimate, which is based on enterprise adaptation.
> NOTE: If it really is true that CI mobile agent testing is not possible, and mobile agents MUST be tested manually, I'm absolutely certain we can use the AATH to direct a person to do the mobile testing manually. The backchannel would just display QR codes when needed, and text that directs the person with the mobile agent on what to do. That would actually be far less effort than a backchannel for a framework and now that I realize that, we will jump on building that. Good idea!!!
I love this idea. But let's acknowledge that this approach is incompatible with the CI/automation goal.
> Bilateralism -- This one bugs me. AATH and the tests within are open source -- anyone can define the tests, add to them, complain about them and make them work with their "Component Under Test". There is no reason for it to devolve as you state unless no one else contributes, reviews or even just comments on the tests. In any test suite, there will come a time when something doesn't work and the participants must decide what part is broken (the RFC, the test, one agent, the other agent, etc.) Further, once what is broken is determined, the participants must also decide what to do that is best for all concerned. Usually, that will be to fix that which is broken, but in some cases that might be to fix something else. That's both OK and expected. I don't understand how interop testing benefits one implementation over another.
It seems that you're saying that bilateralism doesn't have to decay if everybody will invest significant effort. I agree with the theory. My argument was about practice, though. In practice, everybody won't invest significant effort -- and we already observe the decay from the theory, even on the four frameworks that are being tested today:
![incomplete test matrix](https://camo.githubusercontent.com/32a2cf08a6e0e6a9fb4460afe30332d4775db08a0a545abd1875ecd8245730ff/68747470733a2f2f692e6962622e636f2f48786e483837672f53637265656e2d53686f742d323032312d30332d30362d61742d312d32352d32382d504d2e706e67)
Half of the bilateral matrix (8 of 16 cells) is missing, and 4 of the green cells are the self-on-self combinations that don't yield interop insight (so only 4/12 of the interop insight is available).
Now, you could say that this is just an artifact of us being early; if the community commits to this path, there's no reason we need to live with an empty matrix. I agree. But again, that's a theoretical argument; I'm claiming that the matrix right now is a pretty good indicator of the unevenness we're going to see, and that this will only get harder when you have a matrix with 6 or 8 test targets (36 or 64 cells). I believe this is inevitable, practically speaking, and will naturally give one codebase the status of a reference agent. It's easy to see which is the reference agent in the graphic.
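The combinatorics behind this concern are simple to state: with n test targets, a full bilateral matrix has n² ordered cells, of which only n·(n−1) are cross-implementation runs. A quick illustrative calculation (plain arithmetic, just to make the scaling concrete):

```python
# Growth of a full bilateral interop matrix as test targets are added.
# With n targets there are n*n ordered (sender, receiver) cells,
# n of which are self-on-self runs that yield no interop insight.
def matrix_sizes(n):
    total = n * n            # all ordered pairings
    cross = n * (n - 1)      # cells that actually test interop
    return total, cross

for n in (4, 6, 8):
    total, cross = matrix_sizes(n)
    print(f"{n} targets: {total} cells, {cross} cross-implementation runs")
# 4 targets: 16 cells, 12 cross-implementation runs
# 6 targets: 36 cells, 30 cross-implementation runs
# 8 targets: 64 cells, 56 cross-implementation runs
```

The quadratic growth of the cross-implementation count is what makes an evenly maintained matrix expensive as the community adds targets.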
> "leave it to human processes to announce results that are nuanced and marketable" -- We definitely agree on that. If you look at the test results that we published, there are three places where the results are human-edited and nuanced -- the "scope" (what each runset covers), the "summary" for each runset (a narrative about what is covered in the runset and why and what is expected), and the runset "status" section that is a narrative about the state of the testing for the combination of agents tested. The status is expected to be updated after every change in the results for each runset. That data is in the repo and can be updated by anyone that contributes a PR to the repo. Those three data points were our first attempt to achieve what you describe, and we're open to other ideas to provide that necessary nuance to the results.
This seems like a pretty nice first cut to me. Some of the nuance I'd like: let us talk about agent versions as they appear in an app store, rather than codebase versions. Let us talk about hotfixes (which may appear on branches other than main). Let us talk about whether the same feature appears in both the Android and iOS versions (or the Windows vs. Mac vs. Linux vs. embedded versions). Let us talk about whether a build contains a fix while the app store release is pending acceptance by third parties. Etc.
What I'm essentially saying here is that codebase interop and released interop are different, and that released interop is for marketeers. It could be totally valid for marketeers to point back to the codebase interop results, as long as they can put asterisks on it when they have a reasonable need to do so.
> - In our experience, the cost to build a backchannel is about a person-month (~$20k US) based on the bounty paid to build the .NET backchannel that runs all the tests from the test suite that AF-.NET supports. The JavaScript backchannel took a lot less than that and again works for all the test cases it supports.
Right. Very believable. But I'm claiming mobile is a different animal.
> - Maintaining and adding new tests takes even more effort (around an FTE -- but that could be shared across the community), and is rewarded by finding problems in a continuous integration context. We've seen the benefits... In what we have done, we've discovered a pile of issues...
Agreed. This is the high-value activity that I've called "exploratory testing."
> Show us interoperability, don't just claim it.
I agree that this is highly desirable. But I don't agree that interop of a framework shows interop of products. It gives a true but different idea of what we're accomplishing.
-- Jonathan Rayback, Vice President Engineering, Evernym, Inc.
> I haven't looked at the test harness in any depth whatsoever - I just know it exists - and I wasn't aware that its test harness pitted each agent/framework against others and not against a reference implementation. Thank you for pointing that out @dhh1128 .
As far as I know there is no Aries reference agent available, nor is anyone planning to build one. However IF a reference agent were available it would be pretty straightforward to add a backchannel api and integrate it into AATH. If anyone has a reference agent and would like some help adding it to AATH I'd be happy to help.
Although Daniel removed his last set of comments from this thread, I was fortunate to get them in email, and they spurred some fast advances to the test runsets and the way the results are presented on the Aries Interop information page. Check out the latest version of the summary here: http://aries-interop.info
Changes:
- We added the reverse runsets for all the viable combinations and published those -- to fill in the holes that Daniel noted in the test matrix. In fact, the test sets already had a few "reverse connections" (e.g. holder to issuer), so we were already doing a bit of that, but now we have all the tests running each way. Note that in doing this, a number of AF-.NET / ACA-Py tests are not working, and we're investigating.
- Term: Runset -- a named, daily run of a given configuration: a set of tests with specific Test Agents in each role.
- Term: Test Agent -- a container with a backchannel and a component it is controlling that is being tested. Currently all Test Agents are frameworks, but a standalone agent (e.g. controller+framework), or other configurations (e.g. mediator+framework) are possible.
- Because we have so many runsets running daily, we reorganized the info and details pages to be "Test Agent" oriented. The new table shows at its intersections all the tests run for the given pair of Test Agents. The details pages are now per Test Agent and show the details of each runset in which that agent participates. There remains a subjective narrative about the current runset status.
- For each Test Agent in a runset, we added the self-reported version of the component under test.
Mobile Testing
As well, @ianco did some very cool work to make mobile testing possible. There is documentation in the repo, but to try it you can do the following, assuming you have git, docker, and ngrok handy on your system.
git clone https://github.com/bcgov/indy-tails-server
cd indy-tails-server/docker
./manage build
./manage start
cd ../..
git clone https://github.com/hyperledger/aries-agent-test-harness
cd aries-agent-test-harness
./manage build -a acapy-main -a mobile
LEDGER_URL_CONFIG=http://test.bcovrin.vonx.io ./manage run -n -d acapy-main -b mobile -t @MobileTest
Pull out your phone with an Aries/Indy mobile wallet on it and start scanning QR codes in the terminal, and following the instructions in the app to accept credentials and present proofs.
We're going to work on improving the mobile run capability so that we can use these runs as "runsets" and keep a history of them for specific mobile agents.
As always, feedback is welcome. I'll post this material on Rocketchat so people can see more about this work and start to get involved.
@jonathan-rayback -- would love to see an Evernym SDK Test Agent in this repo -- or using a private fork of this repo with the results reported back.
FYI -- I had to edit the instructions for running mobile in my previous comment, so heads up if you are trying to run from the email version of the comment.
Also, the success of scanning the QR code depends on the combination of phone and screen. I've got an idea for a better way to do that...
Enjoy!