libpcap
libpcap copied to clipboard
Should the stdio buffer size be increased for the dump code?
A blog post from Napatech says:
The most obvious bottleneck was in the pcap_dump() function within libpcap and actually the problem should be fixed within libpcap if you ask me, but I fixed it in DPDK temporally. Basically the pcap_dump() function does a fwrite(pcap header of 16Bytes) followed by a fwrite(packet payload) and these two fwrite(s) cause two kernel traps, which are expensive and should be minimized. The solution is to buffer the output to file and write seldom but more data at every write, so in essence do: if (buffer has room) do memcpy header and payload to buffer else fwrite buffer
That's not strictly true; what fwrite() does is, in essence:
if (buffer has room) do
memcpy header and payload to buffer
else
write buffer
so each call to fwrite() does not result in a single system call ("write buffer" is assumed to turn into a system call - write() on UN*X, _write() -> WriteFile() -> NtWriteFile() on modern Windows).
However, he may have used a larger buffer in the buffering wrapped around fwrite() than the standard I/O library was using. On Linux with glibc, and on BSD-flavored UN*Xes including macOS, the buffer size for a regular file will be the st_blksize value supplied by the file system for an fstat() (unless the file system fails to supply it and it's zero, in which case it's 8KiB with glibc and 1KiB on BSD-flavored UN*Xes). Other UN*Xes may also use st_blksize. On Windows, it's 4KiB.
st_blksize, in 4.2BSD, was the FFS block size of the file system, typically 8KiB. It appears to be 4MiB on APFS and 4KiB on HFS+ on High Sierra.
So what we might want to do is:
bufsize = something large (bigger than 8KiB, at least);
#ifdef WE_HAVE_ST_BLKSIZE_IN_STRUCT_STAT
if (fstat(fileno(the file), &statb) succeeds and statb.st_blksize > bufsize)
bufsize = statb.st_blksize;
#endif
buffer = malloc(bufsize);
if (buffer == NULL)
either fail completely or give up on increasing the buffer;
setvbuf(fileno, buffer, _IOFBF, bufsize);
The Single UNIX Specification doesn't guarantee that passing a non-zero buffer size and a null buffer pointer will cause setvbuf() to allocate a buffer of that size internally; the C99 spec says only that
If
bufis not a null pointer, the array it points to may be used instead of a buffer allocated by thesetvbuffunction) and the argumentsizespecifies the size of the array; otherwise,sizemay determine the size of a buffer allocated by thesetvbuffunction.
emphasis on "may". Therefore, we allocate it manually.
This is a bit tricky as it would require us to free the buffer on close; the C99 spec says of fclose() that
Whether or not the call succeeds, the stream is disassociated from the file and any buffer set by the setbuf or setvbuf function is disassociated from the stream (and deallocated if it was automatically allocated).
which seems to indicate that such a buffer won't be freed automatically on close as it was manually allocated outside the standard I/O library.
He did use a larger buffer - 16KiB. That was probably on Linux; the default st_blksize on Linux appears to be the page size, so it was probably 4KiB. If a 16KiB fwrite() bypasses the buffer and just writes directly to the file, as a result of being a multiple of the buffer size, then that'd be a 4x reduction in the number of write() calls and thus a 4x reduction in the number of system calls, which might make a significant difference.
A bigger buffer would result in an even greater reduction, although I don't know whether a buffer that's close to or bigger than the L1 data cache (32K on most of Intel's x86-64 processors) would be a problem.
As for an automatically-allocated buffer:
- The Linux
setvbuf()documentation, which presumably is describing the glibc version, "If the argumentbufisNULL, only the mode is affected; a new buffer will be allocated on the next read or write operation.". The GNU libc documentation forsetvbuf(), however, just says "If you specify a null pointer as the buf argument, thensetvbufallocates a buffer itself usingmalloc. This buffer will be freed when you close the stream. ... Otherwise, buf should be a character array that can hold at least size characters.", which doesn't explicitly indicate that the buffer allocated bysetvbuf()will have size size. - The MUSL libc documenation just says that the official documentation is the C standard and the POSIX standard and that "This portion of the manual is incomplete. Future editions will document musl's behavior where the standards specify that it is implementation-defined, non-standard extensions musl implements, and additional properties of musl's implementation.", so they don't say one way or the other whether
setvbuf()has been extended to support setting the buffer size without explicitly setting the buffer. - The Bionic libc documentation says nothing, but the Bionic
setvbuf()source indicates that it - and probably the rest of the standard I/O library - is BSD-derived, so it probably implements the same "NULL argument plus non-zero size plus not unbuffered means a buffer of the specified size will be automatically allocated" behavior, and that's what the code appears to do. - The FreeBSD 3.0
setvbuf()documentation says "If thesizeargument is not zero butbufisNULL, a buffer of the given size will be allocated immediately, and released on close.". - The NetBSD 2.0, OpenBSD 2.2, DragonFly BSD, and macOS
setvbuf()documentation say the same thing (not surprising, given that they probably all started with the 4.4-Lite libc, and that behavior probably dates back a long time for all those OSes). - The Visual Studio 2015
setvbuf()documentation says "The array pointed to by buffer is used as the buffer, unless it is NULL, in which case setvbuf uses an automatically allocated buffer of length size/2 * 2 bytes.".
However:
- The Solaris 11.4
setvbuf()documentation just says "Ifbufis not the null pointer, the array it points to will be used for buffering, instead of an automatically allocated buffer. Thesizeargument specifies the size of the buffer to be used. If input/output is unbuffered,bufandsizeare ignored.", which could be read as suggesting that ifbufis the null pointer, and the stream is buffered, it automatically allocates a buffer ofsizebytes, but they don't unequivocally state that. - The AIX 5.3
setvbuf()documentation says "If the Buffer parameter is not a null character pointer, the array it points to is used for buffering. The Size parameter specifies the size of the array to be used as a buffer, but all of the Size parameter’s bytes are not necessarily used for the buffer area.", which could again be read as suggesting that a null Buffer argument, non-zero Size argument, and Mode argument other than_IONBFmeans that a buffer of the specified size will automatically be allocated, but they don't say it explicitly; - The HP-UX 11i v3
setvbuf()documentation says "If buf is not the NULL pointer, the array it points to is used for buffering instead of an automatically allocated buffer (from malloc()). size specifies the size of the buffer to be used. size specifies the size of the buffer to be used.", which again suggests that you can set the size of an automatically-allocated buffer but doesn't explicitly state it; - The Tru64 UNIX 5.1
setvbuf()documentation says "If thebufferparameter is not a null pointer, the array that the parameter points to is used for buffering instead of a buffer that is automatically allocated. Thesizeparameter specifies the size of the buffer to be used." - you know the drill; - The IRIX 6.5
setvbuf()documentation says "If buf is the NULL pointer, or size is less than eight, space will be allocated to accommodate a buffer. This buffer will be of length BUFSIZ.", which means that 1) the good news is that a NULL buf means it'll automatically allocate a buffer but 2) the bad news is that, no matter what the value of size is, the buffer will be BUFSIZ bytes.
The System V Release 2 setvbuf() documentation says "If buf is not the NULL pointer, the array it points to will be used for buffering, instead of an automatically allocated buffer. Size specifies the size of the buffer to be used.", so that appears to be where the second group of UN*Xes - all SV-based, with a ton of BSD and vendor stuff added on top (Solaris being based on SVR4 which was a huge merge of SunOS 4.x into the SV code base, giving not only BSD stuff but Sun stuff) - got their man page text, so I guess some tomatoes should be thrown at AT&T for not making it clearer.
So, just for the lulz, let's see if we can find the SV source.
Here on archive.org there's a page that claims to have SV source from several releases; the tarball for SVR2 for NS32xxx (SysVr2.0-32000) has a version of setvbuf() with what appears to be the AT&T copyright notice from my days at Sun, and it...
...will, if the mode argument isn't _IONBF:
- set the size to the minimum of the size argument and
BUFSIZ; - if that's bigger than
BUFSIZ, or a buffer hasn't already been allocated, allocate a buffer of the specified size plus 8 bytes.
So I rather suspect that the `setvbuf() in any UN*X with an SV-derived stdio library - which probably means "all the traditional commercial UN*Xes that still survive, plus Tru64 and even IRIX" - will, if called early enough (i.e., before you do anything that needs a buffer), and passed a null buffer pointer and a non-zero buffer size, allocate a buffer of that size.
Any UN*X with a 4.4 BSD/4.4-Lite BSD-derived stdio library - which means the *BSDs, macOS/iOS/tvOS/watchOS, and, apparently, Android - will do so as well.
Unless the Linux documentation is lying, GNU libc's stdio library will do that.
The Windows libc will do that as well.
And, from a look at the MUSL source, it ignores the size argument if the buf argument is null.
So my inclination is to do the setvbuf() call; if it doesn't increase the buffer size, you lose, I guess, and should go yell at the supplier of your libc to make it work. At least the dumping APIs will still work, with the default buffer size, even if that means more system calls than you'd like.
(Given that a pcap_dumper_t * is just a FILE *, meaning we don't have some data structure that we control, we have no mechanism for getting at the buffer at close time to free it, so we have to have the automatically-allocated buffer.)
Something like this? --- a/sf-pcap.c +++ b/sf-pcap.c @@ -771,7 +771,11 @@ pcap_setup_dump(pcap_t *p, int linktype, FILE *f, const char *fname) else setvbuf(f, NULL, _IONBF, 0); #endif
-
if (sf_write_header(p, f, linktype, p->snapshot) == -1) {
+#define IO_BUF_SIZE (16 * 1024)
- if (f != stdout) {
-
setvbuf(f, NULL, _IOFBF, IO_BUF_SIZE); - }
-
if (sf_write_header(p, f, linktype, p->snapshot) == -1) { pcap_fmt_errmsg_for_errno(p->errbuf, PCAP_ERRBUF_SIZE, errno, "Can't write to %s", fname); if (f != stdout)
With GNU libc (glibc), setvbuf(fp, NULL, _IOFBF, size) will ignore size if buf is NULL, see
libio/iosetvbuf.c:
int _IO_setvbuf(FILE* fp, char* buf, int mode, size_t size) {
int result;
CHECK_FILE(fp, EOF);
_IO_acquire_lock(fp);
switch (mode) {
case _IOFBF:
fp->_flags &= ~(_IO_LINE_BUF | _IO_UNBUFFERED);
if (buf == NULL) {
if (fp->_IO_buf_base == NULL) {
/* ... */
if (_IO_DOALLOCATE(fp) < 0) {
result = EOF;
goto unlock_return;
}
fp->_flags &= ~_IO_LINE_BUF;
}
result = 0;
goto unlock_return;
}
break;
// ...
}
result = _IO_SETBUF(fp, buf, size) == NULL ? EOF : 0;
unlock_return:
_IO_release_lock(fp);
return result;
}
If buf is NULL, then a buffer is allocated via _IO_DOALLOCATE. That is a macro that will probably end up calling _IO_file_doallocate from libio/filedoalloc.c which uses BUFSIZE or (if positive and smaller), st.st_blksize. So this is probably not what you want, and an explicit buffer has to be given.
Yeah, the Linux man page isn't as explicit as I'd like, but "If the argument *buf* is NULL, only the mode is affected" does suggest that the size isn't affected.
If an explicit buffer is given, we don't have any good way to free it in pcap_dump_close(), given that:
- a
pcap_dumper_t *is just another name for aFILE *, so we don't have any place to save a pointer to the buffer; - there's code out there that needs to, for example, do an
fflush()on the file being written, and does so by doing thefflush()on thepcap_dumper_t *cast to aFILE *ifpcap_dump_flush()isn't available, so it's risky to change what apcap_dumper_tis; - there's no generally-available API to get a pointer to the buffer.
So the choices we have are:
- don't do this on Linux;
- develop new APIs for dumping, with a handle type different from
pcap_dumper_t *that points to an opaque structure that includes theFILE *and avoid *pointing to the buffer.
I'm inclined to go for the second option.
Guy Harris [email protected] wrote: > 1 a pcap_dumper_t * is just another name for a FILE *, so we don't have > any place to save a pointer to the buffer; 2 there's code out there > that needs to, for example, do an fflush() on the file being written, > and does so by doing the fflush() on the pcap_dumper_t * cast to a FILE
I guess this was a bad thing. Such code would fail to recompile, but of course it's an ABI change.
> * if pcap_dump_flush() isn't available, so it's risky to change what a
> pcap_dumper_t is; 3 there's no generally-available API to get a pointer
> to the buffer.
:-(
> So the choices we have are:
> 1 don't do this on Linux;
> 2 develop new APIs for dumping, with a handle
> type different from pcap_dumper_t * that points to an opaque structure
> that includes the FILE * and a void * pointing to the buffer.
> I'm inclined to go for the second option.
Me too. We are going to need it for pcapng.
-- ] Never tell me the odds! | ipv6 mesh networks [ ] Michael Richardson, Sandelman Software Works | IoT architect [ ] [email protected] http://www.sandelman.ca/ | ruby on rails [
We are going to need it for pcapng.
Yes. I wouldn't want to require pcapng to get the bigger stdio buffer, and I'm not sure whether we should have a single new API for writing both pcap and pcapng files, but either way we should have support for writing pcap files, not just pcapng files, with a bigger stdio buffer.
pcap_dumper_t is already an opaque structure (already since its initial introduction in 1999), the fact that it is type-casted from FILE * is an internal implementation detail. Are there any issues with converting this to an actual structure?
pcap_dumper_tis already an opaque structure (already since its initial introduction in 1999), the fact that it is type-casted fromFILE *is an internal implementation detail. Are there any issues with converting this to an actual structure?
To quote an earlier comment:
there's code out there that needs to, for example, do an
fflush()on the file being written, and does so by doing thefflush()on thepcap_dumper_t *cast to aFILE *ifpcap_dump_flush()isn't available, so it's risky to change what apcap_dumper_tis;
I'm unclear if there is still work to do here.
I think not, and the lack of activity during 5 months backs me up. Can it be closed?
We are going to do the setvbuf() in case it helps. Ticket remains open for now.