Confirmed commit behavior is not RFC compliant

Reference files

a-test.yang

module a-test {
    namespace "http://test.com/yang/";
    prefix test;
  container contain-1 {
    list data {
      key "name";
      leaf name {
        type string;
      }
      leaf value {
        type string;
      }
    }
  }
}

input1.xml

<contain-1 xmlns="http://test.com/yang/">
  <name>key</name>
  <value>first_value</value>
</contain-1>

input2.xml

<contain-1 xmlns="http://test.com/yang/">
  <name>key</name>
  <value>second_value</value>
</contain-1>

input3.xml

<contain-1 xmlns="http://test.com/yang/">
  <name>key</name>
  <value>third_value</value>
</contain-1>

Case 1 | Second confirmed commit before first times out

From RFC6241

If a follow-up confirmed operation is issued before the timer expires, the timer is reset to the new value (600 seconds by default). Both the confirming commit and a follow-up confirmed operation MAY introduce additional changes to the configuration.

The second confirmed commit has to be treated as an update to the values of an existing confirmed commit. It must not overwrite the backup values set by the original confirmed commit operation.

It seems every time a confirmed commit is received, the server creates a backup of the current running configuration. However, in this case, when the second commit is issued, no new backup must be done, since it is still the same transaction (the first confirmed commit was not confirmed by a confirming commit).

> edit-config --target candidate --config=input1.xml
> commit

> get-config --source candidate (at this point "key" has the value "first_value")
> get-config --source running (at this point "key" has the value "first_value")

> edit-config --target candidate --config=input2.xml
> commit --confirmed --confirm-timeout 600

> get-config --source candidate (at this point "key" has the value "second_value")
> get-config --source running (at this point "key" has the value "second_value")

> edit-config --target candidate --config=input3.xml
> commit --confirmed --confirm-timeout 10

> get-config --source candidate (at this point "key" has the value "third_value")
> get-config --source running (at this point "key" has the value "third_value")

wait 10 seconds...

> get-config --source candidate
> get-config --source running

Expected: both candidate and running have the value "first_value". Actual: both candidate and running have the value "second_value".

Case 2 | Rollback is not happening for session terminated (confirmed commit without persist)

From RFC6241

If the session issuing the confirmed commit is terminated for any reason before the confirm timeout expires, the server MUST restore the configuration to its state before the confirmed commit was issued, unless the confirmed commit also included a element.

> edit-config --target candidate --config=input1.xml
> commit

> get-config --source candidate (at this point "key" has the value "first_value")
> get-config --source running (at this point "key" has the value "first_value")

> edit-config --target candidate --config=input2.xml
> commit --confirmed --confirm-timeout 600

> get-config --source candidate (at this point "key" has the value "second_value")
> get-config --source running (at this point "key" has the value "second_value")

> disconnect
> connect
> get-config --source candidate
> get-config --source running

Expected: both candidate and running have the value "first_value". Actual: both candidate and running have the value "second_value".

The value will eventually go back to "first_value" when the timer expires but it should already been reverted since the connection was closed without issuing a confirming commit. The same result is obtained if you use, from another session, the command kill-session to kill the session that issued the confirmed commit.

Case 3 | Concurrent sessions issuing a confirmed commit are not resulting in error

From RFC6241

If the element is not given in the confirmed commit operation, any follow-up commit and the confirming commit MUST be issued on the same session that issued the confirmed commit.

(But, this is also mentioned:)

For shared configurations, this feature can cause other configuration changes (for example, via other NETCONF sessions) to be inadvertently altered or removed, unless the configuration locking feature is used (in other words, the lock is obtained before the operation is started). Therefore, it is strongly suggested that in order to use this feature with shared configuration datastores, configuration locking SHOULD also be used.

session 1
> edit-config --target candidate --config=input1.xml
> commit

> get-config --source candidate (at this point "key" has the value "first_value")
> get-config --source running (at this point "key" has the value "first_value")

> edit-config --target candidate --config=input2.xml
> commit --confirmed --confirm-timeout 600

> get-config --source candidate (at this point "key" has the value "second_value")
> get-config --source running (at this point "key" has the value "second_value")

session 2
> commit --confirmed --confirm-timeout 600

Expected: the session 2 confirmed commit must result in an error. Actual: the session 2 confirmed commit works just as a normal confirmed commit.

Case 4 | Lock is preventing the rollback from happening

From RFC6241

A confirmed operation MUST be reverted if a confirming commit is not issued within the timeout period (by default 600 seconds = 10 minutes).

> lock --target running
> lock --target candidate

> edit-config --target candidate --config=input1.xml
> commit

> get-config --source candidate (at this point "key" has the value "first_value")
> get-config --source running (at this point "key" has the value "first_value")

> edit-config --target candidate --config=input2.xml
> commit --confirmed --confirm-timeout 5

> get-config --source candidate (at this point "key" has the value "second_value")
> get-config --source running (at this point "key" has the value "second_value")

wait 5 seconds...

> get-config --source candidate
> get-config --source running

Server messages:

...
[ERR]: SR: Module "a-test" is DS-locked by session 225.
[ERR]: NP: Failed restoring backup for module "a-test".

Expected: both candidate and running have the value "first_value". Actual: both candidate and running have the value "second_value".

@michalvasko , do you agree these behaviors are not RFC compliant? Maybe my interpretation was not correct so that's why I'd like to confirm. Thank you

May 03 '22 20:05 hcjedi

Everything was tested using the latest devel branch of the stack.

May 03 '22 20:05 hcjedi

1), 2), 3) I agree that it was not working properly and it should now, all your use-cases added into tests. 4) I am not sure what to do about this case, I can now quote text about <lock>

      When the lock is acquired, the server MUST prevent any changes to
      the locked resource other than those requested by this session.
      SNMP and CLI requests to modify the resource MUST fail with an
      appropriate error.

We can talk about how to interpret "requested by this session" but I do not think commit timeout rollback is requested by any session. The previous configuration is actually still available and hence can theoretically be restored manually but will not be automatically. I think this is acceptable and is essentially a problem on the side of the user.

May 04 '22 13:05 michalvasko

Thank you for the quick reply and fix @michalvasko. Case 1, 2 and 3 all fixed now, retested with the latest changes.

Regarding case number 4, I raised the point you mentioned with my teammates and we understand when you say that the rollback does not belongs to any session. But by a user point of view, we believe that it makes more sense if the rollback indeed happens, as long as the rollback is due to a confirmed commit issued by the session which has the lock. In the case a session acquires the lock while the confirmed commit timer was counting for an operation issued in another session, then we can expect the rollback to fail. Let us know your opinion considering this.

May 05 '22 20:05 hcjedi

Let us know your opinion considering this.

I have already shared my opinion. I am not sure it makes sense to change locked data without the session requesting it (again left for interpretation). Also, there is an inconsistency when the change would be considered made by the session if it is still running and not if, for example, the server has crashed and is restoring the configuration after restart. If this change is so important to you, write to the NETCONF mailing list explaining the problem and I will implement whatever is agreed on.

May 06 '22 08:05 michalvasko

I suggest to run the rollback in the context of the obtained lock on the running datastore as much as possible - which at first glance looks like ignoring it, I'm not sure if there is a situation where there is more to it than just ignoring it.

The reasoning is somewhat lengthy, the key is Section 7.5 which disallows to grant a lock on the running datastore if a confirmed commit (persistent or non-persistent) is ongoing.

i) Non-persistent Section 8.4.1 If the .persist. element is not given in the confirmed commit operation, any follow-up commit and the confirming commit MUST be issued on the same session that issued the confirmed commit.

-> run the rollback in the context of the originating session. Do so even on termination before possibly giving up the lock automatically. If a lock has been obtained by that session, use this lock. Even if no lock has been obtained it is not possible for another session to hold the lock (see quotation from §7.5. below for reasons why this is not possible)

ii) persistent Section 8.4.1 If the .persist. element is given in the confirmed .commit. operation, a follow-up commit and the confirming commit can be given on any session, and they MUST include a .persist-id. element with a value equal to the given value of the .persist. element.

-> the originating session may or may not exist at the time of rollback

Section 7.5. A lock MUST NOT be granted if any of the following conditions is true: The target configuration is .candidate., it has already been modified, and these changes have not been committed or rolled back.

  The target configuration is .running., and another NETCONF
  session has an ongoing confirmed commit (Section 8.4).

-> a 'foreign' session cannot possibly lock the datastore if a rollback is possible later

-> if the originating session exists and holds the lock on running then 'join' that lock and perform rollback under that lock

-> if another session held the lock at the time of confirmed-commit start then that commit must have failed and no rollback is expected later

-> even if the originating session does not hold the lock no-one else could possibly have obtained that lock, in this case simply continue the rollback and no lock should hinder the action.

May 06 '22 12:05 gwendlan

You do not understand, such behavior is quite non-standard and constitutes practical challenges. Let's say the commit is being rolled back, it can take a lot of time with all sysrepo callbacks being called. What if during this time an edit-config comes on the session, what exactly should happen? Speaking specifically, sysrepo does not allow using its sessions concurrently in separate threads (for good reason) so I would have to manually lock it, what else. Meaning the user may execute any RPC while the roll back is in process and would need to wait until it finishes, which is hardly expected.

I think commit rollback is a completely asynchronous operation and hence you cannot treat it as being performed by any session.

May 06 '22 12:05 michalvasko

RFC 6241, § 8.4.1: A confirmed operation MUST be reverted if a confirming commit is not issued within the timeout period (by default 600 seconds = 10 minutes). .... For shared configurations .... it is strongly suggested that in order to use this feature with shared configuration datastores, configuration locking SHOULD also be used.

Following this suggestion it is impossible utilise the timeout mechanism.

May 10 '22 10:05 gwendlan

it is strongly suggested that in order to use this feature with shared configuration datastores, configuration locking SHOULD also be used.

When using locks I would expect you know exactly what is going to happen and for how long you are going to be holding the lock. So in the confirmed commit you specify some timeout just to be safe in case something goes wrong. But normally you either confirm or cancel the commit before the timeout and release the lock.

A confirmed operation MUST be reverted...

If taken literally, this is an impossible requirement. We are talking about rolling back potentially huge chunk of configuration and lots of things can fail for such an operation. A reasonable interpretation would be "confirmed operation MUST be attempted to be reverted" which would simply mean that no matter what happens, the roll back process must be initiated after the timeout elapses. That is all we have tried to implement, anyway.

May 10 '22 12:05 michalvasko

netopeer2 netopeer2 copied to clipboard