[Enhancement] Troubeshooting tools
Improvement Suggestion
Two ideas:
- Add
dae sysdumpto automatically collect network settings (routing, netfilter, sysctl, ...) and archive all info into a tar. This can save us from asking "would you mind showing me your netfilter using xxx cmd". - Add
dae trace --drop-only. This can make the tool much easier as users don't have specify ip or port.
Potential Benefits
An ideal bug report would be providing with sysdump.tar + trace.log if possible.
Thanks for opening this issue!
Should different command set be run based on different OS? Or just traverse the commands and collect the standard output into the pack?
@linglilongyi Hi thanks for asking!
I was wondering if calling syscall (wrapped by 3rd party library of course) rather than relying on external commands is a good idea.
For example, if we want to collect routing information on the system, instead of os.Exec("ip", "route"), we can call vishvananda/netlink#Handle.RouteList in the dae process, something like (demo code generated by ChatGPT ).
package main
import (
"fmt"
"log"
"net"
"github.com/vishvananda/netlink"
)
func main() {
// Create a new netlink handle
handle, err := netlink.NewHandle()
if err != nil {
log.Fatalf("Failed to create netlink handle: %v", err)
}
defer handle.Delete()
// Retrieve all routes
routes, err := handle.RouteList(nil, netlink.FAMILY_ALL)
if err != nil {
log.Fatalf("Failed to list routes: %v", err)
}
// Format and print each route
for _, route := range routes {
fmt.Println(formatRoute(route))
}
}
func formatRoute(route netlink.Route) string {
var dst, gw, dev string
if route.Dst == nil {
dst = "default"
} else {
dst = route.Dst.String()
}
if route.Gw != nil {
gw = route.Gw.String()
}
link, err := netlink.LinkByIndex(route.LinkIndex)
if err == nil {
dev = link.Attrs().Name
}
flags := ""
if route.Flags&netlink.FLAG_ONLINK != 0 {
flags += " onlink"
}
if route.Flags&netlink.FLAG_PERMANENT != 0 {
flags += " permanent"
}
return fmt.Sprintf("%s via %s dev %s%s", dst, gw, dev, flags)
}
This is faster and lower-costly.
I have some trouble.
First it seems that no 3rd party library can handle mission of listing xtable rule.
Second I have wrote a mod name sysdump.go under main/cmd , and I have wrote init function to add subcommand like the others. But there is not a sysdump subcommand showing in dae --help after building. Is there anything else I should modify besides the new go file?
@linglilongyi
netfilter is notorious for the lack of programmable APIs:
4.5 Is there an C/C++ API for adding/removing rules? The answer unfortunately is: No.
https://www.netfilter.org/documentation/FAQ/netfilter-faq-4.html#ss4.5
So feel free to exec command lines in this case.
As of how to add new subcommand, would you mind drafting a PR in your fork so I can look into what's missing?
I build up the ./dae and test the /bin/dae. lol...
Does the "drop-only" refer to conection with TcpFlag "S" ?
@linglilongyi Currently dae trace dumps huge amount of packets, most of them are not necessary. In most cases, we care about packets that are dropped by kernel, aka processed by kernel function kfree_skb_reason, or called by kfree_skbmem but without consume_skb. I can walk you through if you feel interested, but you don't have to push yourself into it. It's always okay (even recommended) to split big tasks into small parts and open PRs separately.
Currently
dae tracedumps huge amount of packets, most of them are not necessary. In most cases, we care about packets that are dropped by kernel, aka processed by kernel functionkfree_skb_reason, or called bykfree_skbmembut withoutconsume_skb.
I think that the "packets" refers to bpfObjects in trace.go. And the key is capture features of droping bpfObjects.
Could I insert
// isDropOnly is true only when called with "--drop-only"
if isDropOnly && NearestSymbol(event.Pc).Name != "kfree_skb_reason" {
continue
}
in https://github.com/daeuniverse/dae/blob/main/trace/trace.go#L276 to achieve that? Or I should drop the accepted package more upstreamly?
@linglilongyi dae trace prints packets' (skbs') lifetime, --drop-only is supposed to print the lifetime of dropped packet. It's more like:
if dropOnly && NearestSymbol() == "kfree_skb_reason" {
droppedSkbEvents := skbEvents[event.Addr]
for _, event := range droppedSkbEvents {
print()
}
}
We can collect events by skb address, hold the output until we see kfree_skb_reason. We should also delete(skbEvents, skbAddr) for normal skbs, otherwise skbEvents map increases infinitely.
It also looks feasible (and fun) to deliver a bpf-only approach. We could use hash of array bpf map (BPF_MAP_TYPE_ARRAY_OF_MAPS?) to temporarily store events in kernel space and push events to userpace only if kfree_skb_reason is seen.
I can walk you though the details if you like.
Should I replace unix mod with dae/tree/main/pkg/ebpf_internal/internal/unix/types_linux.go in #572 ?
And for the drop-only feature, I finally find some func that generated by bpf2go in your repo. I still need some time to go through it.
@jschwinger233 I have drafted a PR in my own fork, would you pleased have a check on it. It is all written in go now.
@linglilongyi Thank you! jschwinger233 and I are busy these days. We'll contact you as soon as free. FYI.