nftables Set the rule handle after flush

Currently this very basic and useful workflow is not possible:

r1 := c.AddRule(&nftables.Rule{
	Table: filter,
	Chain: prerouting,
	Exprs: []expr.Any{
		&expr.Verdict{
			// [ immediate reg 0 drop ]
			Kind: expr.VerdictDrop,
		},
	},
})

c.Flush()

c.DelRule(r1)

c.Flush()

The solution: Because of NLM_F_ECHO all netlink messages in the transaction are returned back. In the case of rules, there are returned back with Handle. Rules are linked to requests by keeping the index of netlink request of a rule. We rely on sequence number to retrieve the right request when receiving a netlink response.

What does the solution allow?

Manipulating rules without fetching and processing rules before
Adding comment to rules, because we don't need user data to retrieve rules anymore

Dec 26 '19 11:12 alexispires

@alexispires check out this library github.com/sbezverk/nftableslib it is based on github.com/google/nftables and it allows to do easily what you need.

Dec 26 '19 15:12 sbezverk

I think this change would break 1 important use case, when Flush() is called only once for everything, chains, rules, sets etc. We have a use case when everything gets prepared and then programmed in a single shot. I do not see how proposed implementation would address such case.

Dec 26 '19 19:12 sbezverk

Look at my test function, I send 1 table, 1 chain and 2 rules in a single transaction

Dec 26 '19 19:12 alexispires

When there are several chains, multiple rules how do you insure correspondence with returned handle and rule? It seems you rely only on sequence, I do not think it is reliable enough. Still it is better to extend functionality which ensures backward compatibility and not change with breakage.

Dec 26 '19 19:12 sbezverk

I understand your concern. The same problem appeared to nft tool and the same assumptions were made, which makes them acceptable to me.

Look at this commit : http://git.netfilter.org/nftables/commit/?id=bb32d8db9a125d9676f87866e48ffbf0221ec16a

The basic principle is to not return a JSON object freshly created from netlink responses, but just update the existing user-provided one to make sure callers get back exactly what they expect. To achieve that, keep the parsed JSON object around in a global variable ('cur_root') and provide a custom callback to insert handles into it from received netlink messages. The tricky bit here is updating rules since unique identification is problematic. Therefore drop possibly present handles from input and later assume updates are received in order so the first rule not having a handle set is the right one.

The proposal to extend the functionality bothers me since it introduces a behavior specific to the go implementation. The notion of Handle is already used to identify the rules and the echo function exist mainly to return the Handle.

Look at this commit too: http://git.netfilter.org/nftables/commit/?id=b99c4d072d9969f7a0dfc539b2b68b517f90af68

Dec 27 '19 16:12 alexispires

The work you quoted geared towards less dynamic and more static implementations, imho. My focus is in dynamic, concurrent use uses, example kube-proxy using nftables. Here is scenario when this approach pretty much guarantees the issue, there are 2 processes handling ipv4 and ipv6 tables concurrently, there is a possibility that ipv4 adn ipv6 rules are put in the same messages queue for Flush to program, if you cannot identify precisely which rule handle corresponds to which rule, then it is a matter of chance before they get mixed up. I think using user defined field to be able to identify a particular rule and its id is a safe and reliable way covering all possible use cases, well at least none of use cases presented in the past were not covered by this approach.

Dec 27 '19 16:12 sbezverk

The same problem appeared to nft tool and the same assumptions were made, which makes them acceptable to me.

It is a public and shared library, what is acceptable to you might not work for others, let's find a solution which is not breaking the current behaviour and providing you with what you need for your use case.

Dec 27 '19 17:12 sbezverk

I’m not sure yet whether we should support this behavior.

For robustness, programs should generally do idempotent operations where possible. Changing the program from its current “modify/modify” approach to a “modify, query/modify” approach lends itself better to making the program work correctly when interrupted.

Can you share some more details regarding what you’re trying to do?

Jan 01 '20 15:01 stapelberg

Today there's only two way to delete or replace a rule with this library:

Using UserData and put a custom userland id (not standard), then fetch all rules and retrieve it
Fetch all rules and compare field by field to retrieve the right rule

IMO the right way is to get back the handle after flush. I can't tell much about what I'm trying to do (commercial product) but it's like a SDN application. I store in memory inside a container (so not problem with interruption it's almost stateless) all my rules to delete them after.

I'm trying to code it in a strong way by matching the sequence number of request (rule by rule) with sequence of the response wich contain handle. Maybe it can be stronger.

Jan 02 '20 09:01 alexispires

@stapelberg @sbezverk I'm trying to improve my loop control and I'm not sure about the best way to accomplish it. By using strace I see that nftables do a select syscall to detect if a netlink message is going to be received. Do you think this is the right way ?

select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=0}) = 1 (in [3], left {tv_sec=0, tv_usec=0})
recvmsg(......)
select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)

Jan 03 '20 09:01 alexispires

I made a proto with select, what do you think?

conn_linux.go (from github.com/mdlayher/netlink)

func (c *conn) Select() (int, error) {

	var fdSet unix.FdSet
	fdSet.Zero()
	fdSet.Set(c.s.FD())

	n, err := unix.Select(c.s.FD()+1, &fdSet, nil, nil, &unix.Timeval{})

	return n, err
}

conn.go (from github.com/google/nftables)

        smsg, err := conn.SendMessages(batch(cc.messages))

	if err != nil {
		return fmt.Errorf("SendMessages: %w", err)
	}

	// Retrieving of seq number associated to entities
	entitiesBySeq := make(map[uint32]Entity)
	for i, e := range cc.entities {
		entitiesBySeq[smsg[i].Header.Sequence] = e
	}

	// Trigger entities callback
	for i := 1; i > 0; i, _ = conn.Select() {
		rmsg, err := conn.Receive()

		if err != nil {
			return fmt.Errorf("Receive: %w", err)
		}

		for _, msg := range rmsg {
			if e, ok := entitiesBySeq[msg.Header.Sequence]; ok {
				e.HandleResponse(msg)
			}
		}
	}

Jan 04 '20 01:01 alexispires

I just added a more robust test suites (4096 rules added in // by 16 workers) with assert which rely on UserData to check Handle value.

Jan 06 '20 14:01 alexispires

Cleaner version below, but both works and are reliable.

diff --git a/conn.go b/conn.go
index 6a552d8..34cf0b9 100644
--- a/conn.go
+++ b/conn.go
@@ -17,7 +17,6 @@ package nftables
 import (
 	"fmt"
 	"sync"
-	"sync/atomic"
 
 	"github.com/google/nftables/expr"
 	"github.com/mdlayher/netlink"
@@ -41,7 +40,7 @@ type Conn struct {
 	sync.Mutex
 	put      sync.Mutex
 	messages []netlink.Message
-	entities map[int32]Entity
+	entities map[int]Entity
 	it       int32
 	err      error
 }
@@ -52,7 +51,6 @@ func (cc *Conn) Flush() error {
 	defer func() {
 		cc.messages = nil
 		cc.entities = nil
-		cc.it = 0
 		cc.Unlock()
 	}()
 	if len(cc.messages) == 0 {
@@ -71,7 +69,7 @@ func (cc *Conn) Flush() error {
 
 	cc.endBatch(cc.messages)
 
-	_, err = conn.SendMessages(cc.messages[:cc.it+1])
+	_, err = conn.SendMessages(cc.messages)
 
 	if err != nil {
 		return fmt.Errorf("SendMessages: %w", err)
@@ -104,36 +102,29 @@ func (cc *Conn) Flush() error {
 }
 
 // PutMessage store netlink message to sent after
-func (cc *Conn) PutMessage(msg netlink.Message) int32 {
+func (cc *Conn) PutMessage(msg netlink.Message) int {
 	cc.put.Lock()
 	defer cc.put.Unlock()
 
 	if cc.messages == nil {
-		cc.messages = make([]netlink.Message, 16)
-		cc.messages[0] = netlink.Message{
+		cc.messages = append(cc.messages, netlink.Message{
 			Header: netlink.Header{
 				Type:  netlink.HeaderType(unix.NFNL_MSG_BATCH_BEGIN),
 				Flags: netlink.Request,
 			},
 			Data: extraHeader(0, unix.NFNL_SUBSYS_NFTABLES),
-		}
-	}
-
-	i := atomic.AddInt32(&cc.it, 1)
-
-	if len(cc.messages) <= int(i) {
-		cc.messages = resize(cc.messages)
+		})
 	}
 
-	cc.messages[i] = msg
+	cc.messages = append(cc.messages, msg)
 
-	return i
+	return len(cc.messages) - 1
 }
 
 // PutEntity store entity to relate to netlink response
-func (cc *Conn) PutEntity(i int32, e Entity) {
+func (cc *Conn) PutEntity(i int, e Entity) {
 	if cc.entities == nil {
-		cc.entities = make(map[int32]Entity)
+		cc.entities = make(map[int]Entity)
 	}
 	cc.entities[i] = e
 }
@@ -214,23 +205,14 @@ func (cc *Conn) marshalExpr(e expr.Any) []byte {
 
 func (cc *Conn) endBatch(messages []netlink.Message) {
 
-	i := atomic.AddInt32(&cc.it, 1)
-
-	if len(cc.messages) <= int(i) {
-		cc.messages = resize(cc.messages)
-	}
+	cc.put.Lock()
+	defer cc.put.Unlock()
 
-	cc.messages[i] = netlink.Message{
+	cc.messages = append(cc.messages, netlink.Message{
 		Header: netlink.Header{
 			Type:  netlink.HeaderType(unix.NFNL_MSG_BATCH_END),
 			Flags: netlink.Request,
 		},
 		Data: extraHeader(0, unix.NFNL_SUBSYS_NFTABLES),
-	}
-}
-
-func resize(messages []netlink.Message) []netlink.Message {
-	new := make([]netlink.Message, cap(messages)*2)
-	copy(new, messages)
-	return new
+	})
 }

Jan 08 '20 13:01 alexispires

@alexispires could you please demonstrate that with this approach you can create rules with a reference to anonymous set. The reason for this request is the fact that creation of anonymous set must happen within the same transaction of a rule creation. It is not clear to me if it works with this solution.

Jan 08 '20 16:01 sbezverk

@alexispires could you please demonstrate that with this approach you can create rules with a reference to anonymous set. The reason for this request is the fact that creation of anonymous set must happen within the same transaction of a rule creation. It is not clear to me if it works with this solution.

I'm not sure to understand, there are tests for this case (TestCreateUseAnonymousSet), you mean if handle ref is still retrieved ?

Jan 08 '20 16:01 alexispires

@alexispires The test case only tests the sequence of bytes that gets sent, not the actual kernel/netfilter programming, unfortunately travis does not allow this type of "e2e" tests. I was asking you a favour, to build an actual simple binary, based on your change that programs a rule with a reference to an anonymous set and then run 'nft list table {your table name}' to confirm it.

Jan 08 '20 16:01 sbezverk

@sbezverk ok no problem, i'll come back to you

Jan 08 '20 16:01 alexispires

It looks good:

Before flush rule Handle is 0
After flush rule Handle is 3
table ip filter {
        chain nfpoc {
                type filter hook postrouting priority 0; policy accept;
                tcp dport { tftp, 1163 } drop # handle 3
        }
}

Code (also in my github, with binary included for x64: https://github.com/alexispires/nfpoc):

package main

import (
	"fmt"
	"os/exec"

	"github.com/google/nftables"
	"github.com/google/nftables/binaryutil"
	"github.com/google/nftables/expr"
	"golang.org/x/sys/unix"
)

func main() {

	c := &nftables.Conn{}

	table := c.AddTable(&nftables.Table{
		Family: nftables.TableFamilyIPv4,
		Name:   "filter",
	})

	chain := c.AddChain(&nftables.Chain{
		Name:     "nfpoc",
		Table:    table,
		Type:     nftables.ChainTypeFilter,
		Hooknum:  nftables.ChainHookPostrouting,
		Priority: nftables.ChainPriorityFilter,
	})

	set := &nftables.Set{
		Anonymous: true,
		Constant:  true,
		Table:     table,
		KeyType:   nftables.TypeInetService,
	}

	if err := c.AddSet(set, []nftables.SetElement{
		{Key: binaryutil.BigEndian.PutUint16(69)},
		{Key: binaryutil.BigEndian.PutUint16(1163)},
	}); err != nil {
		fmt.Printf("c.AddSet() failed: %s", err.Error())
	}

	r := c.AddRule(&nftables.Rule{
		Table: table,
		Chain: chain,
		Exprs: []expr.Any{
			// [ meta load l4proto => reg 1 ]
			&expr.Meta{Key: expr.MetaKeyL4PROTO, Register: 1},
			// [ cmp eq reg 1 0x00000006 ]
			&expr.Cmp{
				Op:       expr.CmpOpEq,
				Register: 1,
				Data:     []byte{unix.IPPROTO_TCP},
			},

			// [ payload load 2b @ transport header + 2 => reg 1 ]
			&expr.Payload{
				DestRegister: 1,
				Base:         expr.PayloadBaseTransportHeader,
				Offset:       2,
				Len:          2,
			},
			// [ lookup reg 1 set __set%d ]
			&expr.Lookup{
				SourceRegister: 1,
				SetName:        set.Name,
				SetID:          set.ID,
			},
			// [ immediate reg 0 drop ]
			&expr.Verdict{
				Kind: expr.VerdictDrop,
			},
		},
	})

	fmt.Printf("Before flush rule Handle is %d\n", r.Handle)

	if err := c.Flush(); err != nil {
		fmt.Printf(err.Error())
	}

	fmt.Printf("After flush rule Handle is %d\n", r.Handle)

	out, err := exec.Command("/usr/sbin/nft", "-a", "list", "table", "filter").Output()
	if err != nil {
		fmt.Printf(err.Error())
	}
	fmt.Printf("%s\n", out)
}

Jan 08 '20 17:01 alexispires

@stapelberg do you need more informations?

Jan 13 '20 13:01 alexispires

@alexispires @stapelberg Can we preserve old Flush() call the way it is now, but new changes implement in something like FlushWithCallback. I will not be able to switch to new functionality right away, maybe eventually after running some performance tests.

Jan 15 '20 14:01 sbezverk

@alexispires @stapelberg Can we preserve old Flush() call the way it is now, but new changes implement in something like FlushWithCallback. I will not be able to switch to new functionality right away, maybe eventually after running some performance tests.

I let @stapelberg decide about it, but what do you suggest as API ?

Jan 18 '20 18:01 alexispires

@alexispires I was suggesting to have old API and new API in parallel, it will help to preserve the backward compatibility and give us a chance to move to new API when confidence/timing is right.

Jan 18 '20 18:01 sbezverk

Damn I need this! Any changes we can get this merged? 🙏

Sep 08 '22 15:09 prologic

I'm not familiar enough with the codebase to help resolve the merge conflicts :/

Sep 08 '22 15:09 prologic

There was some reluctance about it. I could continue this work if it's deemed useful.

Sep 09 '22 09:09 alexispires

There was some reluctance about it. I could continue this work if it's deemed useful.

Please do 🙏 I could really use the improvement you set out to fix here 🙇‍♂️

I think the comments from @stapelberg should be easily addressable 👌

Sep 09 '22 20:09 prologic

@alexispires Just an FYI, this is where I want to use this: https://git.mills.io/prologic/box/issues/18 so it would be awesome if you could revive this PR of yours, rebase it and hopefully we can get it merged soon 🤞 Let me know if I can help in any way, I'm not that familiar with the codebase (let alone nftables 🤣) but my Go is "okay" 😅

Sep 10 '22 01:09 prologic

@alexispires Just an FYI, this is where I want to use this: https://git.mills.io/prologic/box/issues/18 so it would be awesome if you could revive this PR of yours, rebase it and hopefully we can get it merged soon 🤞 Let me know if I can help in any way, I'm not that familiar with the codebase (let alone nftables 🤣) but my Go is "okay" 😅

I'm still waiting opinion (if it changed) from the maintainers about this feature.

However you can fix your problem by doing as this:

Maintain an unique id for each rule.
When creating your rules, set the UserData field with the defined id.
After the flush, fetch all rules.
Retrieve the right rule by comparing the UserData field with the defined id.
Retrieve the handle from the rule retrieved. After that you can return it, as doing here : https://git.mills.io/prologic/box/src/branch/master/nat/nat.go#L305

Sep 13 '22 15:09 alexispires

I'm still waiting opinion (if it changed) from the maintainers about this feature.

Is that @stapelberg ?

However you can fix your problem by doing as this:

Maintain an unique id for each rule.

When creating your rules, set the UserData field with the defined id.

After the flush, fetch all rules.

Retrieve the right rule by comparing the UserData field with the defined id.

Retrieve the handle from the rule retrieved. After that you can return it, as doing here : https://git.mills.io/prologic/box/src/branch/master/nat/nat.go#L305

Ahh brilliant! I didn't know you could do this 😅

Sep 13 '22 22:09 prologic

@alexispires If you at least rebase this PR I can also test it in my project 👌 I'm not sure I wouldn't screw up the merge conflicts myself 😂

Sep 13 '22 22:09 prologic