[CERT-TEST-FAILURE] Test_TC_ACL_2_3/4 are very (too?) project-chip implementation specific
The test cases for ACL (stareting with ACL2_3 and 2_4, but also other, are build very implementation specific to "how project chip" has implemented functionalities like chunked list writes and also ACL list even processing and sending. The tests currently relying on the facts that:
- writes are always process in exactly the chunks that chiptool sends in and each chunked array change is processed for it's alone inluding validation and event sending for the whole ACL list (like also resetting the event sends all former entries as deleted). The counter example would be a logic that collects all list chunks from one write request and processes them as one list change - but this would lead to completely different events in several cases
- acl list change processing is validated implementation specific as chip has implemented it. (see https://github.com/project-chip/connectedhomeip/issues/33487#issuecomment-2118329316) ... so an other implementation, e.g. processing the deletions also "in increasing order" and not decreasing would also lead to a different event order
- even if chip has support for "unchunked lists" in ACl (which would also behave differently as the chunked variant regarding sent events and such) this functionality is never really tested with these tests because chiptool always send list in a chunked way (which formally is against the specs recommendation - yes only "SHOULD", but would still be better regarding the amount of data to be transported over the network especially also with missing tag compression ;-) ...)
This implementation specific tests makes it difficult for alternative implementations, unless all these details would be part of the specification and so become requirements. Sure with the right argumentation vendors using alternative implementations could request exceptions from tests, but it adds a difficulty for alternative implementations of the standard.
So I will leave this open as a "Cert blocker" here and I'm available to chat about the topic.
If I missed specification details where some of the topics are defined then please provide details.
Original issue context
Feature Area
Other
Test Case
Test_TC_ACL_2_3/4
Reproduction steps
Test ACL 2.3: Formally the last successful write is in "step 9" of the test ... but Step 18 expects another value.
It seems Step 18 expects the "first of the second entries" from step 17 ... which gave an ConstraintError because tried to write 2 entries while only one is allowed.
I assume that this is because of the chunked write handling added in https://github.com/project-chip/connectedhomeip/blob/master/src/app/clusters/access-control-server/access-control-server.cpp#L353 which is very chip specific because formally there is no requiremenst that I know that any array needs to be transported in a chunked way ... and also not that writes needs to be handled chunked at all.
In fat it is really unexpected in case of acl (and especially in case of ACL where consistence should be the most important topic) that a write command like in step 17 returns an error, BUT had an effect by writing the first of the two chunked parts ...
It would be great to get some background information for this, thank you!
In general the test will also have issues as soon as a controller implementation is not sending the list "always chunked" as chip does, because the logic in the access-control-server.cpp would return an Constraint error without saving any value when it is >1 entry in the array. So this behavior is not consistent.
The same topic seems to be contained in Test 2.4 - here for ACL entries themself in Step 29/30 https://github.com/project-chip/connectedhomeip/blob/master/src/app/tests/suites/certification/Test_TC_ACL_2_4.yaml#L1068
Bug prevalence
always
GitHub hash of the SDK that was being used
master
Platform
core
Anything else?
PS: matter.js would be such an controller which is not automatically chunks every value. We try to get the array unchunked into the data report and only if this does not fit we fallback to chunk the array.
Follow up question aout behavior because the tests only do two entries. What should actually happen when someone wrtes 3 entries - first valid, second add invalid, third add valid ... Then I would assume with the current logic that the second entry in the target list will have the value from the third chunk, right? Whats the reponse? Still ConstraintError because the second entry had that? Or should the third entry "addition" not be processed because the second failed? Would be cool to get some insights here
@mlepage-google @hasty
Ok the more I look into it the more it becomes clear to me that the expectation with chunked writes is to really process them also in their chunks one by one. Matter.js is currently not doing this that way. I can change that.
With this the question (that is unanswered for me also after consulting specs - or I am too blind) I added above is the one left: If on one chunk an error happens are more chunks for the same path also processed or not? I assume yes. In fact the result would be then one response per chunk, e.g. one ok, one error and one ok ... and it is up to the caller to decide whats now up ;-) Correct?
Thank you for your support.
Another follow up question: how exactly data version protected writes work with chinked writes? On acl cluster the specs tell the administrator to ideally protect writes with data version filters. So… when each chunked part is handled as own write then dataversion changes after each. So after first write that will change unless the controller makes assumptions on how be re ion increases. Is that also correct? So data version protected writes should never be chunked ?!
Hello Appollon.
Some clarifications...
First, the extension attribute is optional and if present, must be of length 1. That is why there is a constraint error when attempting to write a second element to that list.
The ACL attribute is mandatory, and must support at least 3 entries per fabric. Therefore, writing 3 entries will never result in a constraint error due to exhausting list length; writing 4 or more entries might (depending on actual supported length) result in such errors, but they would always be at the end, and never have a successful entry written after exhausting the supported length.
An ACL entry may fail other data validity checks, and fail to write for that reason.
In the SDK, when a list is written (say with 3 elements), it is done as follows:
- List cleared via list replace (with empty list) action.
- First element added via list item append action.
- Second element added via list item append action.
- Third element added via list item append action.
In that case, if the second item resulted in a failure (e.g. due to invalid item data), the list would still be of length 1 after having attempted to append the second item (and failed). Processing would continue, and the third item would be appended, resulting in a list of length 2. It is as if only the two valid items were sent, except that a write error occurred for the second item.
All right. Thank you for confirmation. In fact this is then also no testing issue and could be closed or moved.
Nevertheless the specification lack details on chunked write operations and how this should work. It was completely unclear to me till now.
In my eyes this behavior is maximum inconsistent because if I send an unchunked array with the 3 items where second is invalid then nothing is written and the constraint error happens for all three. But unassigned that this ship has sailed ;-)
Would you also have an info how chunked writes work with dataversion check on the write? (Also because acl specs propose to use that). In my understanding each processed chunk increase the version so the data version check only works for the first chunk. I know that chip always send array full and so resets the array with the first chunk but the protocol does not dictate that. So someone could really just add items to an existing list. ;-)
Ps: (and slightly off topic) is there a reason why Chip always chunks the data in data reports and writes? If the array would fit into the data report then sending it unchanged would need less bytes then always chunked. Would that not be important for low power devices?
But yes then the test here would get issues because they could behave differently ;-))
The CHIP SDK always chunks writes primarily for implementation simplicity, if I recall correctly.
Regarding the question of how exactly DataVersion works with chunked writes, it's likely the DataVersion increment happens when the last chunk is received, but I'm not sure I recall exactly.
Spec section 10.6.4.7. Collection Types (List) seems to have an example showing a list write as both one IB (interaction block) and multiple IBs (the multiple replace/append blocks I mentioned earlier), each with DataVersion=1, so that would match that behaviour.
The CHIP SDK always chunks writes primarily for implementation simplicity, if I recall correctly.
Ok. Still … in regard to best possibly minimizing traffic that’s maybe nor the best choice. I remember matter 1.3 announcement also contained that less data are transferred as an optimization… that here could be another low hanging fruit ;-) and trust me it is not that complex as it feels. Matter.js does that. ;-)
Additionally now that I think about it - it has the consequence that the current tests might now test all relevant code paths … as example in acl code is there for both options but the „full array“ code path should always ever be executed with zero or one element …
…. Just as thoughts from my opinion.
Re your infos on dataversion: ok thank you for the reference. Puuhh that makes things interesting.
So we have kind of a „transaction“ (summarize multiple changes and „commit“ at the end to increase version) but without having a real „transaction“ (aka rollback everything in case of an error).
And all this over multiple potentially chunked write messages ;-)
Good. Will be fun to implement that. ;-)
When I think deeper then I also ask myself why the spec 1.3 in "10.6.4.3.1. Lists" just tals about "full list" or "clear the list and send appends" ... Combined with a dataVersion security approach it should be easiely possible to just send list modifications. Also the spec examples kind of contain all these examples, just the spec tells is "SHOULD" not be done ... This should be even less bytes needed and should be (because of the dataVersion security guard) end in the same final list ... or?! But yes thats more "spec and protocol feedback" and I have no idea if chip has implemented that ... I assume not becuse eg acl just handled additions beside "full array" and last time I checked in the c++ code I only saw places that handle these two cases, nothing else... Sorry to place it here, but I have no other ways to bring such thoughts/questions to any attention ;-)
… as example in acl code is there for both options but the „full array“ code path should always ever be executed with zero or one element …
In access-control-server.cpp for writing the Extension attribute with an entire list, it only ever handles a list of length 0 or 1 because that attribute list can only ever be length 0 or 1.
https://github.com/project-chip/connectedhomeip/blob/22b9c96b691d3bb5a5d3db510db353b3b2bc2f74/src/app/clusters/access-control-server/access-control-server.cpp#L350
For the ACL attribute, the length of the list being written is not assumed.
https://github.com/project-chip/connectedhomeip/blob/22b9c96b691d3bb5a5d3db510db353b3b2bc2f74/src/app/clusters/access-control-server/access-control-server.cpp#L252
I did most adjustments needed and are currently finalizing all the tests but I need to tell you that the tests here (especially ACE 2_5, 2_6 and such) are veeery chip implementation specific. I now need to mimic the event order based on the code how chip is managing the entries in the relevant cases and send out events. Also the events have the "latestValue" which is not really described in the specs, but in the end is a different value depending on if it is a Removal or other action. All this is nowhere mentioned in the specification. In the end thats all valid and ok to require it, but if such things like the exact event behaviour, orders and such are checked in certification in this depth then also the specification should contain these implementation details as requirements. Thats at least my opinion :-)
In fact I spentsome hours today to just run tests, analyze why it broke, adjust my code even if specs did not contain these details, and do this in loops ;) Same for the specific write request handling for "splitted/chunked list processing". The specification (at least not clear to me) is not containing all the details whats required and tested. I think all details in specification would have saved me a lot of this time.
Please see this as feedback from a developer that implements the protocol mainly after specs (as it should work out).
PS: just as example what I mean with implementation specific: see https://github.com/project-chip/connectedhomeip/blob/f4e02d4373e1ad01e8967d721112caf383b3c1f7/src/app/clusters/access-control-server/access-control-server.cpp#L251-L272
New values are added up-counting 0,1,2 ... old values then removed down-counting 5,4,3 ... and tests check exactly this behavior and order of events ;-)
PS: If someone is interested I could try to collect all the things I would love to get added to the specs and post here, (at least collecting the topics)
In addition to above I also see a spec issue here ...
Spec tells in 6.6.4. Access Control Cluster update side-effects
Updates to the Access Control Cluster SHALL take immediate effect in the Access Control system.
and then having the example of "1. Action: Write (1st path)" ... beside the fact that I would not know any real way to do a write to "just a sub field" currently this is strange: So if i reframe that example to the current chunking-lists realitythen the first write path action is the complete reset of the array to be empty! so the first real value gets added in the second Write action. So if this would become effective really directly then all other acl actions will indeed fail because the acl list is empty.
I know that this example wants to show how a "when changing acl and doing another write action afterwards to an non acl attribute" scenario should work this is still leading me to the above question and more assumptions. Like "immediately effective" could now mean to just get effective as soon as the attribute path changes from acl to any other path - that would at least solve the above issue with "resetting becomes effective immediately" ...
So it would be great to know how this should be really handled with all chunked writes. Does ACL list changes really become effective directly, but "ignoring" the initial write action to clear the list? Or is the new entry just effective after all writes are done? In both cases specification would require an update in my pov.
Also in general this "proceed the writes entries one by one" also could have effects in e.g. bindings or groups. In both cases lists are used to handle things. If we also here clear the lists and then "write one by one with immediately becoming effective" means that if I have 5 groups/bindings and I add a sixths then the first write clears the list and then the first entry gets added and formally all others removed. if this becomes effective directly too that would mean we stop listening to all these groups or disconnect from all bound devices except the first and then we rebuild all these connections and listening sockets again newly and add one more. or am I wrong or miss a point?
Ok, I verified one of the things by experimenting with chip: It seems to me that a "write to the acl attribute" is processed completely in the write message and becomes effective then. This is the reason that clearing the array as first write path action or adding acl entries in strange order is now screwing up the whole acl system. So with this assumption is that chunked list writes are processed as a sequence in chip until the "write path" (regarding endpoint/cluster/attribute) changes and are then being updated in the acl system. At least so it seems to me.
@Apollon77 can we close this issue ?
In my eyes not tbh. Because we were in the middle of a certification we now adjusted matter.js to process list writes the exact same way as chip does by never writing the full list but always "chunking" it. If I would change this back to the original way to write the full list at once then by the defined logic of the "how events are needed to be sent defined by spec" we would have different events triggered.
Example:
- currently: the first event is always an update of the first element, followed by delete events (in backwards order which is also nowhere defined in the specs, but implemented that way by chip) for all other former list entries, then it generates more add-events.
- with a read full list write there would be update events for all "overlapping" list indices and then add or delete events for the rest depending on if the new list is longer then the old list
As you can see, based on the event logic defined by spec these two "way to process lists" is different and will cause a different order of events by nature as soon as the list is longer then 1 entry. Additionally the fact that add/update cases are processed in ascending order, but delete cases are process in descending order is undefined by the specs, so also the order of delete events itself would be undefined, but the yaml tests are explicitly requiring them to be in "the" order that chip implements.
So yes - because of several "workarounds" and "codeflow like chip" in matter.js - it is formally no cert blocker for matter.js devices, but mainly because we did not wanted to have this discussing within the certification process. But as soon as another implementation implements the same (e.g rust) (or we would change that back to original logic in matter.js) the same issue could arise again.
Meaning, in my eyes, if the current behavior and tests should stay as is, this would require some spec adjustments to define the chip behavior as "the one" - Including clear requirements that lists always need to be processed that way and such things and also clarifying how "full list" need to be handled because currently chip code may also lead to different results on errornous entries in the list (I am personally unsure here because I do not knoe what the code exactly does and what would be the result in these cases). So other options would be to adjust the tests in a way that relaxes the checks in a way that things work or are ok with all possible implementation variants
Basically this topic also touches the whole topic that chip choose to send "lists always chunked" also if they would match into one message ...
PS: Sure it can be decided as "works for now" and the mentioned topics being accepted ... but if you ask me personally, I would love to not close it :-)
Ok, I thought a bit on how we could ideally discuss this with a chance to come to solutions and not to run in circles ;-)
In my eyes the very first topic is: If there is an agreement on the fact that normal list write interaction are always done chunked as chip sdk is sending them currently and this stays that way forever that would simplify a lot. So that would be the very first topic to discuss.
- if there is an agreement then this should be simply added to the spec as a "SHALL" requirement and 80% of the topic is solved basically. And formally that's just an "editorial" because noone was able to certify a device without this behaviour because else the above mentioned ACL tests would all have failed. When this is decided it would also be a great idea to reflect this in chip code and just support this case and error if someone sends something different to be clear here.
- if this is no agreement then we have a funny discussion :-)
When the above topic is agreed only one main topic is left: order of delete events in the ACL case which relies on the sdk implementation where deletions are processed from the highest list index down to 0. If this is the way to go for ACL, define it in the specs and we are also done here. Or we should think about how we could enhance the yaml tests to just verify event existence and not the order :-)
(side note: the acl yaml tests rely on the fact that the latest acl events are still readable ... which (as I understood) might not be the case, so also test could fail if too many events are generated, but thats other topic)
Closing in favor of
https://github.com/CHIP-Specifications/chip-test-plans/issues/4810 and https://github.com/project-chip/connectedhomeip/issues/37204