IO Tracing
- [ ] find suitable data sources
- [ ] find out how to map them to otf2xx IO records.
Focus on block level I/O and file level I/O first.
block level I/O
Block level I/O can be traced using these two tracepoints:
- block:block_rq_insert: Event triggered by the insertion of a request into the queue
- block:block_rq_complete: Event triggered by request completion
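As a quick sanity check of what these tracepoints deliver, they can be recorded system-wide with perf; this is just a sketch, the recording duration is arbitrary:

```sh
# Record block layer request insertion and completion system-wide for 10 seconds
perf record -e block:block_rq_insert -e block:block_rq_complete -a -- sleep 10
# Inspect the raw tracepoint payloads (device, sector, number of sectors, rwbs flags, ...)
perf script
```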
Writing read-begin and read-end events from those tracepoints in process mode looks relatively easy.
However, we cannot use begin/end records in system mode, because there is no total order of block I/O issues and completions, so writing them as samples is the best we can do.
biosnoop from the bcc toolkit uses kprobes instead of tracepoints, but as far as I can see the kprobes are not that different from the tracepoints above.
For now I think we should stick with tracepoints, because they can be used without prior set-up through perf probe, but we should keep kprobes in mind if we come across a place where the tracepoint approach misses critical information.
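For comparison, the set-up step that kprobes would require looks roughly like this (sketch; vfs_read is just an example function):

```sh
# Create a kprobe event on a kernel function (needs root)
perf probe --add vfs_read
# The new event can then be recorded like any tracepoint, and removed afterwards
perf record -e probe:vfs_read -a -- sleep 10
perf probe --del probe:vfs_read
```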
file level I/O
File Level I/O should be traceable by using:
- syscalls:sys_enter_open
- syscalls:sys_(enter|exit)_read
- syscalls:sys_(enter|exit)_write
- syscalls:sys_exit_close
At least it would be nice if it worked like this, because these seem to be the only file level tracepoints around. However, open, read, write, and close definitely don't cover the whole zoo of file level operations, so we will probably miss a bunch (mmap? the dozen variants of these syscalls like openat/writev ...? And things that bypass the classical POSIX interface altogether).
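A minimal sketch of recording just the tracepoints listed above with perf (see the openat numbers further down for why sys_enter_open alone is probably not enough):

```sh
perf record \
  -e syscalls:sys_enter_open \
  -e syscalls:sys_enter_read -e syscalls:sys_exit_read \
  -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
  -e syscalls:sys_exit_close \
  -a -- sleep 10
```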
Alternatively, we could trace directly at the virtual file system layer, using kprobes on vfs_open/vfs_read/vfs_write/vfs_close.
But as I already said in the comment above, decent kprobe support in lo2s might be a real pain. It would also run into the issue that some of the interesting information sits behind pointers to kernel memory, like a char *filename, which we cannot access from lo2s. We could copy the char* out into user space using BPF, since BPF has access to kernel memory, but going through the trouble of setting up BPF just to copy some memory out of the kernel sounds like using an ICBM to kill a fly. Or we could mmap() /dev/mem and access kernel memory that way, which sounds like the mother of all hacks.
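To illustrate the BPF route: a bpftrace one-liner can already follow such a pointer and copy the string out. This sketch uses the openat syscall tracepoint rather than a vfs_* kprobe, but the pointer-chasing part is the same idea:

```sh
# Print which process opens which file; str() follows the filename pointer and copies the string
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
```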
A detailed overview of the storage stack in Linux:
https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram
(Information based on "Understanding the Linux Kernel", which covers kernel 2.6, and the kernel source code for 5.something.)
Is there an advantage to tracing vfs_open/vfs_read/... over just tracing the syscalls?
No. Apparently the open/read/... syscalls are little more than thin wrappers around vfs_open/vfs_read etc., so both layers see essentially the same events.
Is there a generic layer below vfs_open/vfs_read without cache effects?
No. The only thing vfs_read does is look up which filesystem handles the file in question and then delegate the call down to the filesystem specific read(). Caching is handled entirely by the filesystem drivers (which makes sense, because not all filesystems need caching; procfs, for example, lives in memory anyway and contains dynamically generated content).
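One way to watch that delegation happen is the function graph tracer: tracing vfs_read shows the call descending into the filesystem specific read_iter implementation (sketch; paths assume tracefs is mounted at /sys/kernel/tracing):

```sh
cd /sys/kernel/tracing               # or /sys/kernel/debug/tracing on older systems
echo vfs_read > set_graph_function
echo function_graph > current_tracer
head -n 40 trace                     # shows vfs_read() calling into e.g. ext4_file_read_iter()
echo nop > current_tracer            # reset
```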
I hope this is a half-way legible representation of what I've learned about the fs stack this week.
The arrow labeled "Probe Here?", which marks the point at which the fs-dependent readpage() operation is called in generic_file_buffered_read(), would be the place where we could learn whether a read on a disk based filesystem* triggered an actual read from disk or was served entirely from the page cache.
The problem is that, while generic_file_read_iter() is very mature and stable code that rarely changes, hard coding a specific offset inside generic_file_read_iter() still seems like something that breaks very easily.
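For reference, placing a probe at a specific spot inside the function would look like this with perf probe; it needs kernel debuginfo, and the line offset ':123' is purely hypothetical, which is exactly the fragile part:

```sh
# List probe-able source lines of the function (requires kernel debuginfo)
perf probe -L generic_file_buffered_read
# Place a kprobe at a hard coded line offset -- breaks as soon as the code changes
perf probe --add 'generic_file_buffered_read:123'
```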
Instrumenting the readpage() functions of the different filesystems probably does not work either, because readpage() is used both by reads that missed the page cache and by readahead, and we are probably only interested in real cache misses, not in readahead doing its work.
* if the disk based fs actually uses generic_file_read_iter(), which almost all, but not all, of them do.
- block I/O via tracepoints
- sys_* syscalls via tracepoints
file level I/O
File Level I/O should be traceable by using:
* syscalls:sys_enter_open
* syscalls:sys_(enter|exit)_read
* syscalls:sys_(enter|exit)_write
* syscalls:sys_exit_close
Actually, it seems like nobody is using the open syscall, but openat instead:
```sh
perf record -e syscalls:sys_enter_open -e syscalls:sys_enter_openat -e syscalls:sys_enter_open_by_handle_at -e syscalls:sys_enter_mq_open -e syscalls:sys_enter_fsopen -a

     0  syscalls:sys_enter_open
  661K  syscalls:sys_enter_openat
     0  syscalls:sys_enter_open_by_handle_at
     0  syscalls:sys_enter_mq_open
     0  syscalls:sys_enter_fsopen
```