cpp_client_telemetry
cpp_client_telemetry copied to clipboard
Use of select() in lib/http/HttpClient_Curl.hpp causes crashes when >1024 filehandles in use
In lib/http/HttpClient_Curl.hpp, WaitOnSocket() uses select(), which is limited to monitoring file descriptors less than 1024. On a system that creates a large number of file handles (e.g. for client sockets), this limit will be exceeded and cause a crash in WaitOnSocket(). While I am using the library on FreeBSD, this problem would also apply to other platforms, namely Linux.
The Linux man page for select() describes this limitation and suggests using poll or epoll instead:
WARNING: select() can monitor only file descriptors numbers that are less than FD_SETSIZE (1024)—an unreasonably low limit for many modern applications—and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.
The crash is also reported for CentOS in the Open Telemetry C++ SDK which shares the same code-base for the curl http-client library with 1ds-cpp: Issue: https://github.com/open-telemetry/opentelemetry-cpp/issues/1220 PR for fix: https://github.com/open-telemetry/opentelemetry-cpp/pull/1410
My colleague created the following simple patch to use poll() instead of select() which we apply in our FreeBSD port build to prevent the crash:
--- lib/http/HttpClient_Curl.hpp.orig 2022-09-28 09:41:36.000000000 -0400
+++ lib/http/HttpClient_Curl.hpp 2022-09-28 09:46:46.000000000 -0400
@@ -10,6 +10,7 @@
#include <cstdio>
#include <cstdlib>
#include <cstdint>
+#include <poll.h>
#include <string.h>
#include <regex>
@@ -440,27 +441,20 @@
*/
static int WaitOnSocket(curl_socket_t sockfd, int for_recv, long timeout_ms)
{
- struct timeval tv;
- fd_set infd, outfd, errfd;
int res;
+ struct pollfd pfd;
- tv.tv_sec = timeout_ms / 1000;
- tv.tv_usec = (timeout_ms % 1000) * 1000;
+ pfd.fd = sockfd;
+ pfd.revents = 0;
- FD_ZERO(&infd);
- FD_ZERO(&outfd);
- FD_ZERO(&errfd);
-
- FD_SET(sockfd, &errfd); /* always check for error */
if(for_recv) {
- FD_SET(sockfd, &infd);
+ pfd.events = POLLIN | POLLERR;
} else {
- FD_SET(sockfd, &outfd);
+ pfd.events = POLLOUT | POLLERR;
}
- /* select() returns the number of signalled sockets or -1 */
- res = select((int)sockfd + 1, &infd, &outfd, &errfd, &tv);
+ res = poll(&pfd, 1, timeout_ms);
return res;
}
@peacenix @lalitb
There was a limit on the total number of concurrent simultaneous HTTPS connections that SDK makes to a single HTTP server in production. This is configured here . I wonder if it's happening in TEST or in production, and if there's a bug in a higher-level abstraction that doesn't properly limit the number of concurrent connections.
While the patch here is valid, I think we need to double-check why the higher-level code allows for more than 4 pending requests. This code is supposed to limit it: https://github.com/microsoft/cpp_client_telemetry/blob/1fd167fa1595853cd423f9d3229915ad0092a09b/lib/tpm/TransmissionPolicyManager.cpp#L120
This is the patch we applied in opentelemetry-cpp to fix the issue for both windows and Linux (poll is posix call, and not supported in windows).
https://github.com/open-telemetry/opentelemetry-cpp/pull/1410/files
But I was not aware of max concurrent connections configuration which @maxgolov mentioned. Need to debug this separately if that too is an issue.
Oh I noticed @peacenix has already shared the opentelemetry-cpp PR. Please ignore my comment on that :)
btw., opentelemetry-cpp now uses curl_multi_poll to poll for multiple handles simultaneously.
I think the fix makes sense in OpenTelemetry. But it's not exactly clear why we don't get the higher-level TPM limit to apply in 1DS. In 1DS C++ SDK the limit should've prevented the bug from happening. Maybe there's something in the HTTP curl client wrapper that doesn't clean and leaking the descriptors after the request is done? i.e. uploadCount() returns 0. But the actual number of fds remains high and is incremented over time.