PyMISP
Add/Allow Specification of tcp_keepalive timers
Environment
PyMISP version: 2.4.159
Python Notebooks running on Azure Databricks Cloud Clusters
MISP Servers (2.4.159) are On-Premise
Issue:
We are unable to query 'real-world' data sets through the MISP server API via PyMISP across certain cloud infrastructures. These environments sit behind proxy servers with TCP session timeouts of 120 seconds, and the parties managing these large infrastructure proxies will not change global session timeouts or allow a bypass.
Regardless of how we paginate our queries, we cannot get responses from our servers/data sets in under 120 seconds. In fact, many real-world scenarios require anywhere from 20 minutes to 2 hours to find and fully return the requisite data.
Note that in directly connected environments these queries succeed, typically with no pagination required, and return millions of 'rows'.
Code Sample
from datetime import datetime

# Set up pagination
pag = 1
pagcnt = 200
pagsize = 10000
attrcnt = 1
#
print('Starting Action: \t', datetime.now())
print('Page Size: ', pagsize)
print('Page Count: ', pagcnt)
# Set up query
while pag < pagcnt:
    print('Starting Loop:', datetime.now(), '\nPage:', pag, ' of: ', pagcnt)
    r = misp.search(controller='attributes',
                    return_format='csv',
                    type_attribute=['ip-src', 'ip-dst'],
                    tags=['nist-cyber-framework:function="protect"'],
                    publish_timestamp='90d',
                    metadata=True,
                    pythonify=True,
                    page=pag,  # missing in the original loop, so each pass re-fetched page 1
                    limit=pagsize)
    for e in r:
        attrcnt = attrcnt + 1
        print(e)
        # pp.pprint(e)
        # pp.pprint(e.value)
    print('Loop Completed:\t', datetime.now(), '\tAttribute Count:\t', attrcnt - 1, '\nPage:', pag, ' of: ', pagcnt)
    pag = pag + 1
print('Ending Action: \t', datetime.now())
Error message received (full text below):
ProxyError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Max retries exceeded with url: /attributes/restSearch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
What we've tried to date:
We've tried setting the following on the Databricks 'client' side
%sh
#==========================
sysctl -w \
net.ipv4.tcp_keepalive_time=60 \
net.ipv4.tcp_keepalive_intvl=60 \
net.ipv4.tcp_keepalive_probes=9
#==========================
We've also tried variations of:
import socket

# Apply a 2-hour default timeout for new sockets (this is a module-level
# function, not a socket method as in our original snippet).
socket.setdefaulttimeout(7200)

def setkeepalives(sck):
    sck.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sck.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 1)
    sck.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)
    sck.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
Full text of the error message received after ~120 seconds:
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
669 # Make the request on the httplib connection object.
--> 670 httplib_response = self._make_request(
671 conn,
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
425 # Otherwise it looks like a bug in the code.
--> 426 six.raise_from(e, None)
427 except (SocketTimeout, BaseSSLError, SocketError) as e:
/databricks/python/lib/python3.8/site-packages/urllib3/packages/six.py in raise_from(value, from_value)
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
420 try:
--> 421 httplib_response = conn.getresponse()
422 except BaseException as e:
/usr/lib/python3.8/http/client.py in getresponse(self)
1347 try:
-> 1348 response.begin()
1349 except ConnectionError:
/usr/lib/python3.8/http/client.py in begin(self)
315 while True:
--> 316 version, status, reason = self._read_status()
317 if status != CONTINUE:
/usr/lib/python3.8/http/client.py in _read_status(self)
276 def _read_status(self):
--> 277 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
278 if len(line) > _MAXLINE:
/usr/lib/python3.8/socket.py in readinto(self, b)
668 try:
--> 669 return self._sock.recv_into(b)
670 except timeout:
/usr/lib/python3.8/ssl.py in recv_into(self, buffer, nbytes, flags)
1240 self.__class__)
-> 1241 return self.read(nbytes, buffer)
1242 else:
/usr/lib/python3.8/ssl.py in read(self, len, buffer)
1098 if buffer is not None:
-> 1099 return self._sslobj.read(len, buffer)
1100 else:
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
488 if not chunked:
--> 489 resp = conn.urlopen(
490 method=request.method,
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
725
--> 726 retries = retries.increment(
727 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
/databricks/python/lib/python3.8/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
445 if new_retry.is_exhausted():
--> 446 raise MaxRetryError(_pool, url, error or ResponseError(cause))
447
MaxRetryError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Max retries exceeded with url: /attributes/restSearch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
During handling of the above exception, another exception occurred:
ProxyError Traceback (most recent call last)
<command-568716391959277> in <module>
12 while pag < pagcnt:
13 print('Starting Loop:', datetime.now(),'\nPage:', pag,' of: ', pagcnt )
---> 14 r = misp.search(controller='attributes',
15 return_format='csv',
16 type_attribute=['ip-src', 'ip-dst'],
/databricks/python/lib/python3.8/site-packages/pymisp/api.py in search(self, controller, return_format, limit, page, value, type_attribute, category, org, tags, quick_filter, quickFilter, date_from, date_to, eventid, with_attachments, withAttachments, metadata, uuid, publish_timestamp, last, timestamp, published, enforce_warninglist, enforceWarninglist, to_ids, deleted, include_event_uuid, includeEventUuid, include_event_tags, includeEventTags, event_timestamp, sg_reference_only, eventinfo, searchall, requested_attributes, include_context, includeContext, headerless, include_sightings, includeSightings, include_correlations, includeCorrelations, include_decay_score, includeDecayScore, object_name, exclude_decayed, pythonify, **kwargs)
2558 response = self._prepare_request('POST', url, data=query, output_type='xml')
2559 else:
-> 2560 response = self._prepare_request('POST', url, data=query)
2561
2562 if return_format == 'csv':
/databricks/python/lib/python3.8/site-packages/pymisp/api.py in _prepare_request(self, request_type, url, data, params, kw_params, output_type, content_type)
3587 settings = self.__session.merge_environment_settings(req.url, proxies=self.proxies or {}, stream=None,
3588 verify=self.ssl, cert=self.cert)
-> 3589 return self.__session.send(prepped, timeout=self.timeout, **settings)
3590
3591 def _csv_to_dict(self, csv_content: str) -> List[dict]:
/databricks/python/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
699
700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
702
703 # Total elapsed time of the request (approximately)
/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
557
558 if isinstance(e.reason, _ProxyError):
--> 559 raise ProxyError(e, request=request)
560
561 if isinstance(e.reason, _SSLError):
Just making sure before I investigate further: have you tried passing a timeout value in the PyMISP call?
See: https://github.com/MISP/PyMISP/blob/bb9f053b6892503851451a5eb94a7127afd6ec8a/pymisp/api.py#L141
Thanks for the timely response. The issue exists with the SSL Proxy between the client and server.
I've tried setting Timeout previously (and again this AM), but it has no impact.
misp = PyMISP(misp_url, misp_key, misp_verifycert, debug=False, timeout=7200)
Starting Action: 2022-07-18 14:45:32.483944
Page Size: 80000
Page Count: 200
Starting Loop: 2022-07-18 14:45:32.484146
Page: 1 of: 200
ProxyError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Max retries exceeded with url: /attributes/restSearch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
Command took 2.01 minutes -- by [email protected] at 7/18/2022, 10:45:32 AM on TISS Cluster-01
I'm trying to enable/set TCP keepalive 'tickles' (probes) every 60 seconds to keep the proxy connection from timing out.
To show how a "timeout" issue would manifest itself in our environment, I set timeout=30:
ReadTimeout: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Read timed out. (read timeout=30)
Full error log:
---------------------------------------------------------------------------
timeout Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
425 # Otherwise it looks like a bug in the code.
--> 426 six.raise_from(e, None)
427 except (SocketTimeout, BaseSSLError, SocketError) as e:
/databricks/python/lib/python3.8/site-packages/urllib3/packages/six.py in raise_from(value, from_value)
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
420 try:
--> 421 httplib_response = conn.getresponse()
422 except BaseException as e:
/usr/lib/python3.8/http/client.py in getresponse(self)
1347 try:
-> 1348 response.begin()
1349 except ConnectionError:
/usr/lib/python3.8/http/client.py in begin(self)
315 while True:
--> 316 version, status, reason = self._read_status()
317 if status != CONTINUE:
/usr/lib/python3.8/http/client.py in _read_status(self)
276 def _read_status(self):
--> 277 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
278 if len(line) > _MAXLINE:
/usr/lib/python3.8/socket.py in readinto(self, b)
668 try:
--> 669 return self._sock.recv_into(b)
670 except timeout:
/usr/lib/python3.8/ssl.py in recv_into(self, buffer, nbytes, flags)
1240 self.__class__)
-> 1241 return self.read(nbytes, buffer)
1242 else:
/usr/lib/python3.8/ssl.py in read(self, len, buffer)
1098 if buffer is not None:
-> 1099 return self._sslobj.read(len, buffer)
1100 else:
timeout: The read operation timed out
During handling of the above exception, another exception occurred:
ReadTimeoutError Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
488 if not chunked:
--> 489 resp = conn.urlopen(
490 method=request.method,
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
725
--> 726 retries = retries.increment(
727 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
/databricks/python/lib/python3.8/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
409 if read is False or not self._is_method_retryable(method):
--> 410 raise six.reraise(type(error), error, _stacktrace)
411 elif read is not None:
/databricks/python/lib/python3.8/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
734 raise value.with_traceback(tb)
--> 735 raise value
736 finally:
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
669 # Make the request on the httplib connection object.
--> 670 httplib_response = self._make_request(
671 conn,
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
427 except (SocketTimeout, BaseSSLError, SocketError) as e:
--> 428 self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
429 raise
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _raise_timeout(self, err, url, timeout_value)
334 if isinstance(err, SocketTimeout):
--> 335 raise ReadTimeoutError(
336 self, url, "Read timed out. (read timeout=%s)" % timeout_value
ReadTimeoutError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Read timed out. (read timeout=30)
During handling of the above exception, another exception occurred:
ReadTimeout Traceback (most recent call last)
<command-568716391959277> in <module>
12 while pag < pagcnt:
13 print('Starting Loop:', datetime.now(),'\nPage:', pag,' of: ', pagcnt )
---> 14 r = misp.search(controller='attributes',
15 #return_format='csv',
16 type_attribute=['ip-src', 'ip-dst'],
/databricks/python/lib/python3.8/site-packages/pymisp/api.py in search(self, controller, return_format, limit, page, value, type_attribute, category, org, tags, quick_filter, quickFilter, date_from, date_to, eventid, with_attachments, withAttachments, metadata, uuid, publish_timestamp, last, timestamp, published, enforce_warninglist, enforceWarninglist, to_ids, deleted, include_event_uuid, includeEventUuid, include_event_tags, includeEventTags, event_timestamp, sg_reference_only, eventinfo, searchall, requested_attributes, include_context, includeContext, headerless, include_sightings, includeSightings, include_correlations, includeCorrelations, include_decay_score, includeDecayScore, object_name, exclude_decayed, pythonify, **kwargs)
2558 response = self._prepare_request('POST', url, data=query, output_type='xml')
2559 else:
-> 2560 response = self._prepare_request('POST', url, data=query)
2561
2562 if return_format == 'csv':
/databricks/python/lib/python3.8/site-packages/pymisp/api.py in _prepare_request(self, request_type, url, data, params, kw_params, output_type, content_type)
3587 settings = self.__session.merge_environment_settings(req.url, proxies=self.proxies or {}, stream=None,
3588 verify=self.ssl, cert=self.cert)
-> 3589 return self.__session.send(prepped, timeout=self.timeout, **settings)
3590
3591 def _csv_to_dict(self, csv_content: str) -> List[dict]:
/databricks/python/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
699
700 # Send the request
--> 701 r = adapter.send(request, **kwargs)
702
703 # Total elapsed time of the request (approximately)
/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
576 raise SSLError(e, request=request)
577 elif isinstance(e, ReadTimeoutError):
--> 578 raise ReadTimeout(e, request=request)
579 elif isinstance(e, _InvalidHeader):
580 raise InvalidHeader(e, request=request)
ReadTimeout: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Read timed out. (read timeout=30)
Right, the session is killed by the proxy. I had a quick look, and it seems that Python relies on the system config for the TCP keepalive probes; you can configure it this way: https://webhostinggeeks.com/howto/configure-linux-tcp-keepalive-setting/
More precisely, this one: net.ipv4.tcp_keepalive_intvl
But it seems your proxy is killing the session after 2 min, which isn't the default on Linux (it should be 75s). Could it be that your proxy ignores the probes and kills the session anyway? Can you tcpdump the link and see whether the probes are sent?
Please have a look earlier in the thread:
What we've tried to date:
We've tried setting the following on the DataBrick 'client' side
%sh
#==========================
sysctl -w \
net.ipv4.tcp_keepalive_time=60 \
net.ipv4.tcp_keepalive_intvl=60 \
net.ipv4.tcp_keepalive_probes=9
#==========================
Question:
Do I need to enable keepalive explicitly in order for it to take effect?
Note that the above values are accepted on the Linux server that executes PyMISP:
%sh
sysctl -a 2>/dev/null | grep -i keepalive
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 60
Command took 0.06 seconds -- by [email protected] at 7/18/2022, 11:40:48 AM on TISS Cluster-01
pcaps will be a little tricky - let me see what I can do...
Alright, sorry, I missed that.
It seems the keepalive config needs to be enabled manually: https://github.com/psf/requests/issues/5430#issuecomment-773738453
In theory, if you add the code below before you make the PyMISP calls, it could (should?) work.
import socket
from urllib3.connection import HTTPConnection
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
    (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
]
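To sanity-check that an override like this actually takes effect, the same option tuples can be set on a plain socket and read back with getsockopt. This is a minimal verification sketch, not part of the original snippet; the constants are Linux-specific and the timer values are illustrative:

```python
import socket

# Same option tuples as the urllib3 override above (illustrative timer values).
KEEPALIVE_OPTS = [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),   # idle seconds before first probe
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),  # seconds between probes
    (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),     # failed probes before reset
]

def apply_keepalive(sock, opts=KEEPALIVE_OPTS):
    """Set each keepalive option, then read it back for verification."""
    applied = {}
    for level, optname, value in opts:
        sock.setsockopt(level, optname, value)
        applied[optname] = sock.getsockopt(level, optname)
    return applied

if __name__ == '__main__':
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(apply_keepalive(s))
    s.close()
```

If the returned values match what was set, the kernel accepted the per-socket keepalive configuration, and the remaining question is whether the proxy honors the probes on the wire.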
If it works, I'll add it by default in the library, I think it makes sense.
No impact - same error.
Update: I made some changes to do pcaps and restarted the Databricks cluster. So far it seems to be working (at least gets past 2 minutes without failing)
Oh, good news. Are you running the script with the code above? Or did you change something on the Databricks side?
There's a secondary issue at play here: you shouldn't be seeing such long queries either way. We can mark this as resolved if you agree @packet-rat, but in the meantime we have also identified the root cause of the slow paginated queries.
Basically: whenever you run a paginated query, we also run a COUNT(*) query with the same parameters but without the pagination filters, in order to set the x-result-count header. This usually ends up being excessively slow (close to just running the query unpaginated). So we're thinking about how to avoid that (for example with a header that just signals there's more data left to grab, rather than an exact count); hopefully with some upcoming changes that bottleneck will disappear.
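A count-free variant of this can also be approximated on the client side today: keep requesting pages until one comes back shorter than the page size, instead of depending on a precomputed total. A hedged sketch, where fetch_page is a hypothetical stand-in for a wrapped misp.search(..., page=..., limit=...) call:

```python
def fetch_all(fetch_page, page_size=10000):
    """Pull pages until a short (or empty) page signals the end.

    `fetch_page(page, limit)` is a placeholder for whatever actually
    performs the query, e.g. a wrapped misp.search() call.
    """
    results = []
    page = 1
    while True:
        batch = fetch_page(page=page, limit=page_size)
        results.extend(batch)
        if len(batch) < page_size:  # no full page left: done, no COUNT(*) needed
            return results
        page += 1

# Toy usage against an in-memory data set instead of a live server.
data = list(range(25))

def fake_fetch(page, limit):
    return data[(page - 1) * limit:page * limit]

print(len(fetch_all(fake_fetch, page_size=10)))  # → 25
```

The trade-off is one extra (empty or short) request at the end of the data set, in exchange for never paying the count query's cost.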
Update:
- The queries are still timing out. Some scenarios do run beyond 2 minutes, but those ultimately turned out to involve only a small set of data being returned. :-(
- Yes, there are a few performance issues.
(1) We only added the pagination when our 'normal' scripts failed in this new Azure Databricks environment. The code snippet/test above removed pagination.
(2) We are seeing really poor performance when queries are based on tags that are applied to millions of attributes. For example, we were applying NIST Cybersecurity Framework tags at the Event level (where these assertions were global to that Event's Objects/Attributes). We changed these large data sets to Attribute-level tags to improve/extend our scoring/aging algorithms.
I was planning to open a separate issue for this tag performance problem.
Hmm, the tag performance is an interesting pointer. It probably runs a naive query with all the enumerated tag occurrences used as a WHERE clause.
Okay, so I tested with some sample code, and this code snippet (https://github.com/MISP/PyMISP/issues/848#issuecomment-1187666336) does send TCP keepalive packets; tcpdump confirms it.
That would also be consistent with what you're saying about the requests that sometimes keep going for more than 2 min.
And one more thing regarding your snippet: I'd recommend avoiding the pythonify switch for huge requests like that, as it is going to use a lot of RAM.
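One lighter-weight pattern, sketched here with an illustrative payload rather than a live MISP response: request return_format='csv' without pythonify and stream rows through the stdlib csv module, instead of materialising a PyMISP object per attribute. The column names below are assumptions for illustration only:

```python
import csv
import io

# Illustrative stand-in for the text a misp.search(return_format='csv')
# call would return; real column names may differ.
csv_payload = (
    "uuid,type,value\n"
    "c5a9c9f0-0000-0000-0000-000000000001,ip-src,198.51.100.7\n"
    "c5a9c9f0-0000-0000-0000-000000000002,ip-dst,203.0.113.9\n"
)

def iter_values(csv_text):
    """Yield one attribute value at a time instead of building a big object list."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row["value"]

print(list(iter_values(csv_payload)))  # → ['198.51.100.7', '203.0.113.9']
```

Because the generator yields row by row, peak memory stays proportional to one row rather than to the whole result set.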
Thanks - Great TIP!