nDPI
nDPI copied to clipboard
`struct ndpi_detection_module_struct` and multi-thread
I am trying to summarize here the various discussions on multi threads support.
What is the current situation?
*) struct ndpi_detection_module_struct
is not thread-safe and it has never been.
*) This limitation is not well documented
*) ndpiReader example configures one struct ndpi_detection_module_struct
for each thread
*) any implementations sharing the same struct ndpi_detection_module_struct
across multiple threads is buggy
*) struct ndpi_detection_module_struct
is not thread-safe because of the LRU caches (at least)
*) commit 730c236 made struct ndpi_detection_module_struct
even less thread-safe
*) it seems possible to make struct ndpi_detection_module_struct
thread-safe (even after 730c236)
Now, what options do we have?
-
Do nothing (apart from improving documentation). Keep
struct ndpi_detection_module_struct
not thread safe; keep using one instance of it for each thread/core -
We acknowledge that having some shared structures/data among the thread/core might be useful but we don't want to change
struct ndpi_detection_module_struct
. We might implement two different "contexts": one "global context" (one for application; thread safe; to be defined) and we keepstruct ndpi_detection_module_struct
as a "local context" (one for thread/core). Nothing new here: this is a common pattern. -
We make
struct ndpi_detection_module_struct
full thread safe (does that imply a penalty cost when using one instance per thread?) -
...
Let me know if I got something wrong and let's discuss!
It makes sense to discuss.
Fully initializing ndpi_detection_module_struct in every thread is very memory intensive. This is not only the ndpi_detection_module_struct structure itself(228 kB), but also all the data in it. ahocorasick, patricia, ndpi_default_ports_tree_node_t also take up memory. The efficiency of libcache generally drops to 0. Based on this, I think that we should try to make libndpi thread-safe.
At this point, we know for sure that there are two elements in struct ndpi_detection_module_struct that are not multithread safe. These are libcache and LRU. It's easy to fix. Theoretically, there are common structures of the "ndpi_id_struct" type, but there practically all data is atomic except for irc_port, last_time_port_used.
I think that, first of all, we need a test case that could demonstrate two things: working in the mode of one thread gives and in the mode of several threads gives the same result, and the execution speed should decrease in proportion to the number of threads.
I think an ideal candidate or starting point for this is examples/ndpiSimpleIntegration.c
If successful, you can add constraints to the documentation.
The implementation of multithreading does not require additional resources as long as we have an insignificant amount of common modifiable data.
Before commit 730c236, one limitation was significant: one flow is simultaneously processed by one thread, since all the data we change is in the ndpi_flow structure.
After commit 730c236, it is quite easy to restore multithreading:
- replace "& ndpi_struct.packet" with a macro or inline function in the library code.
- add a call to the API that returns the address of the ndpi_packet_struct structure (though I don't know why this is needed in ndpiReader).
- Fix LRU
- replace libcache.
I haven't come across very often implementations of blocking in userspace. You need to understand which locking mechanism to use, provided that there are no lock-free options. So far, I think mutex is the only portable solution. Maybe there are more lightweight implementations of locks similar to the linux kernel RCU?
I agree that "sharing" a ndpi_struct.packet could save memory, but this if and only if all threads are supposed to do the same job (for example you use RSS to share the load across multiple threads). In this case you have a lot of traffic so a few KB extra isn't a problem if this guarantees you consistency. I would spend time to see if the memory could be further reduced by removing unnecessary data structures that might ease this task of re-introducing multithread. As of shared instances such as LRU and libcache, they are places where some synchronization is currently necessary, you are right. I believe that having a per-thread instance is a good solution for avoiding locks in multithreaded environments: what is you view on this?
I started testing the variant with asynchronous work in ndpiReader. The main thread reads the file and distributes the packets among several threads. The only limitation is that one connection can only be processed by one thread at a time. For work, one ndpi_detection_module_struct structure and several threads executing ndpi_detection_process_packet () are used.