fix: Thread safety issues in nfprofile with TLS implementation (phaag/nfsen#40)
Problem
nfprofile experiences sporadic file handling issues when running with multiple threads. These manifest as:
- Truncated filenames (e.g., "nfcapd.2025" instead of "nfcapd.202505090900")
- "File not found" errors when files exist but at a different path
- Subdirectory path issues where files are looked for in the wrong location
These issues were first introduced in commit 9289f6a when worker threads were added to nfprofile
Solution
This PR implements Thread Local Storage (TLS) for critical shared resources:
-
New
thread_local.hheader: Provides a portable TLS macro that works with both C11's_Thread_localand GCC/Clang's__thread -
Thread-safe static buffers:
- Replace the static buffer in
GetSubDir()with a TLS version - Prevents corruption of subdirectory paths by concurrent threads
- Replace the static buffer in
-
Thread-safe string handling:
- Create thread-local copies of
filenameandfilterfileinInitChannels() - Pass these thread-local copies to functions called from within threads
- Create thread-local copies of
-
Thread-safe library functions:
- Replace unsafe
localtime()with thread-safelocaltime_r()
- Replace unsafe
Testing
- Verifying that file paths and names are correctly handled under load
Related Issues
- Fixes phaag/nfsen#40
Hi,
Digging into the code, I cannot understand this patch, as all the code you changed is touched only by one thread at the time. Nfprofile threads are multiple separate workers, which filter data blocks for individual channels. The file handling for all the channels as well as open/close - subdir names etc. are non concurrent and done before and after the filtering.
for my understanding the InitChannels function runs outside the worker threads and processes the filenames strings before any threading begins. However, these filenames are then accessed by multiple worker threads when they process their assigned channels.
I suspect some building differences in the code. Do you or the package you install, has changed the building options, such as using OpenMP or other libraries/options, which do some artificial concurrent threading? If the thread local vars fix the problem, then this must be some parallel code execution, which was not intended by the original code . That would explain the behaviour.
Not that i'm aware of, nfdump with nfsens is running inside a ubuntu 24.04 docker container and build as follows:
sh autogen.sh
./configure --libdir=/usr/lib/ --enable-nsel --enable-nfprofile --enable-nftrack --enable-maxmind --enable-tor
----------------------------------
Build Settings for nfdump v1.7.6
----------------------------------
host type = linux-gnu
install dir = /usr/local
CC = gcc
CFLAGS = -g -O3 -std=gnu17 -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wmissing-noreturn -fno-strict-aliasing -DNSEL -DBUILDTOR -pthread
CPPFLAGS =
LDFLAGS =
LIBS = -lpthread -latomic -lrrd -llz4 -lbz2 -lzstd -lresolv
Enable liblz4 = yes
Enable libbz2 = yes
Enable libzstd = yes
Enable ja4 = no
Build geolookup = yes
Build torlookup = yes
Build sflow = no
Build nfpcapd = no - with gzip pcap-reader: yes
Build nfprofile = yes
Build ft2nfdump = no
----------------------------------
make
make install
Hi,
Digging into the code, I cannot understand this patch, as all the code you changed is touched only by one thread at the time. Nfprofile threads are multiple separate workers, which filter data blocks for individual channels. The file handling for all the channels as well as open/close - subdir names etc. are non concurrent and done before and after the filtering.
for my understanding the InitChannels function runs outside the worker threads and processes the filenames strings before any threading begins. However, these filenames are then accessed by multiple worker threads when they process their assigned channels.
The worker threads only touch the channel struct for data processing - filtering the source data record against individual channel filters. The workers only do record processing. When all data records are processed, the worker threats exit themselves.
The main thread call InitChannels as well as does open all input/output files - and - after all records are processed, closes the files. The worker threads never touch the filenames, nor use them somewhere. The filenames could even propagated down as const (which I should change). The worker threads only need worker_param_t for their job.
I suspect some building differences in the code. Do you or the package you install, has changed the building options, such as using OpenMP or other libraries/options, which do some artificial concurrent threading? If the thread local vars fix the problem, then this must be some parallel code execution, which was not intended by the original code . That would explain the behaviour.
Not that i'm aware of, nfdump with nfsens is running inside a ubuntu 24.04 docker container and build as follows:
sh autogen.sh ./configure --libdir=/usr/lib/ --enable-nsel --enable-nfprofile --enable-nftrack --enable-maxmind --enable-tor ---------------------------------- Build Settings for nfdump v1.7.6 ---------------------------------- host type = linux-gnu install dir = /usr/local CC = gcc CFLAGS = -g -O3 -std=gnu17 -Wall -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wmissing-noreturn -fno-strict-aliasing -DNSEL -DBUILDTOR -pthread CPPFLAGS = LDFLAGS = LIBS = -lpthread -latomic -lrrd -llz4 -lbz2 -lzstd -lresolv Enable liblz4 = yes Enable libbz2 = yes Enable libzstd = yes Enable ja4 = no Build geolookup = yes Build torlookup = yes Build sflow = no Build nfpcapd = no - with gzip pcap-reader: yes Build nfprofile = yes Build ft2nfdump = no ---------------------------------- make make install
The effect you describe happens only, if InitChannels() -> SetupProfileChannels() is run concurrently, as all the file name processing, and file handling happens there. After that point in code, filenames have been set, files are open and processed. There is no touching of filenames after InitChannels().
Therefore I still do not understand that this fixes the problem.
I can propose some code changes, which cleanup old code anyway.
Hi,
I can propose some code changes, which cleanup old code anyway.
I'm happy to test the code changes, and I'll also try compiling the current code with older compilers on older distributions. Hopefully, this might reveal something that explains the my issue.
Could you let me know if you are using nfdump with nfsen and nfprofiles, and which compiler and version were used to compile it?
I am running just one instance for compatibility checking on a Debian, and using the clang compiler.
In order to exclude other side effects, it is recommended to set the number of profilers in nfsen.conf to 1. This option resulted from old nfdump-1.6.x without threading. By setting the number of profilers to 1, it's guaranteed, that only one instance is running, although the code should theoretically work with multiple profilers .. just to make sure.
I am running just one instance for compatibility checking on a Debian, and using the clang compiler. bookworm?
In order to exclude other side effects, it is recommended to set the number of profilers in nfsen.conf to 1. This option resulted from old nfdump-1.6.x without threading. By setting the number of profilers to 1, it's guaranteed, that only one instance is running, although the code should theoretically work with multiple profilers .. just to make sure.
I also tried this while debugging, but unfortunately, it didn’t help. Next, I’ll try setting up two Bookworm containers: one with the number of profilers set to 1 and the other with it set to 2.
Seems not to make any progress. Therefore closed for now