
Add/Allow Specification of tcp_keepalive timers

Status: Open • packet-rat opened this issue 1 year ago • 16 comments

Environment

PyMISP version: 2.4.159
Client: Python notebooks running on Azure Databricks cloud clusters
MISP servers (2.4.159): on-premise

Issue:

We are unable to query 'real-world' data sets through the MISP server API with PyMISP across certain cloud infrastructures. These environments sit behind proxy servers whose TCP session timeout is set to 120 seconds, and the parties managing these large infrastructure proxies will not change the global session timeout or allow a bypass.

Regardless of how we paginate our queries, we cannot get responses from our servers/data sets in under 120 seconds; many real-world scenarios take anywhere from 20 minutes to 2 hours to find and fully return the requisite data.

Note that in directly connected environments these queries succeed, typically with no pagination required, and return millions of 'rows'.

Code Sample

from datetime import datetime

# 'misp' is an authenticated PyMISP instance (see constructor below)

# Set up pagination
pag = 1
pagcnt = 200
pagsize = 10000
attrcnt = 0

print('Starting Action: \t', datetime.now())
print('Page Size: ', pagsize)
print('Page Count: ', pagcnt)

# Set up query
while pag < pagcnt:
  print('Starting Loop:', datetime.now(), '\nPage:', pag, ' of: ', pagcnt)
  r = misp.search(controller='attributes',
    return_format='csv',
    type_attribute=['ip-src', 'ip-dst'],
    tags=['nist-cyber-framework:function="protect"'],
    publish_timestamp='90d',
    metadata=True,
    pythonify=True,
    page=pag,  # pass the current page; without this every iteration re-fetches page 1
    limit=pagsize)
  for e in r:
    attrcnt += 1
    print(e)
  print('Loop Completed:\t', datetime.now(), '\tAttribute Count:\t', attrcnt, '\nPage:', pag, ' of: ', pagcnt)
  pag += 1
print('Ending Action: \t', datetime.now())

Error message received (full text below):

ProxyError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Max retries exceeded with url: /attributes/restSearch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))

What we've tried to date:

We've tried setting the following on the Databricks 'client' side:

%sh
#==========================
sysctl -w \
     net.ipv4.tcp_keepalive_time=60 \
     net.ipv4.tcp_keepalive_intvl=60 \
     net.ipv4.tcp_keepalive_probes=9
#==========================

We've also tried variations of:

import socket

def setkeepalives(sck):
  sck.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
  sck.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 1)
  sck.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)
  sck.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

socket.setdefaulttimeout(7200)  # module-level function, not a socket method

Full text of Error Message received after ~120 seconds


---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    669             # Make the request on the httplib connection object.
--> 670             httplib_response = self._make_request(
    671                 conn,

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    425                     # Otherwise it looks like a bug in the code.
--> 426                     six.raise_from(e, None)
    427         except (SocketTimeout, BaseSSLError, SocketError) as e:

/databricks/python/lib/python3.8/site-packages/urllib3/packages/six.py in raise_from(value, from_value)

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    420                 try:
--> 421                     httplib_response = conn.getresponse()
    422                 except BaseException as e:

/usr/lib/python3.8/http/client.py in getresponse(self)
   1347             try:
-> 1348                 response.begin()
   1349             except ConnectionError:

/usr/lib/python3.8/http/client.py in begin(self)
    315         while True:
--> 316             version, status, reason = self._read_status()
    317             if status != CONTINUE:

/usr/lib/python3.8/http/client.py in _read_status(self)
    276     def _read_status(self):
--> 277         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    278         if len(line) > _MAXLINE:

/usr/lib/python3.8/socket.py in readinto(self, b)
    668             try:
--> 669                 return self._sock.recv_into(b)
    670             except timeout:

/usr/lib/python3.8/ssl.py in recv_into(self, buffer, nbytes, flags)
   1240                   self.__class__)
-> 1241             return self.read(nbytes, buffer)
   1242         else:

/usr/lib/python3.8/ssl.py in read(self, len, buffer)
   1098             if buffer is not None:
-> 1099                 return self._sslobj.read(len, buffer)
   1100             else:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    488             if not chunked:
--> 489                 resp = conn.urlopen(
    490                     method=request.method,

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    725 
--> 726             retries = retries.increment(
    727                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]

/databricks/python/lib/python3.8/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    445         if new_retry.is_exhausted():
--> 446             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    447 

MaxRetryError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Max retries exceeded with url: /attributes/restSearch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))

During handling of the above exception, another exception occurred:

ProxyError                                Traceback (most recent call last)
<command-568716391959277> in <module>
     12 while pag < pagcnt:
     13   print('Starting Loop:', datetime.now(),'\nPage:', pag,' of: ', pagcnt )
---> 14   r = misp.search(controller='attributes',
     15     return_format='csv',
     16     type_attribute=['ip-src', 'ip-dst'],

/databricks/python/lib/python3.8/site-packages/pymisp/api.py in search(self, controller, return_format, limit, page, value, type_attribute, category, org, tags, quick_filter, quickFilter, date_from, date_to, eventid, with_attachments, withAttachments, metadata, uuid, publish_timestamp, last, timestamp, published, enforce_warninglist, enforceWarninglist, to_ids, deleted, include_event_uuid, includeEventUuid, include_event_tags, includeEventTags, event_timestamp, sg_reference_only, eventinfo, searchall, requested_attributes, include_context, includeContext, headerless, include_sightings, includeSightings, include_correlations, includeCorrelations, include_decay_score, includeDecayScore, object_name, exclude_decayed, pythonify, **kwargs)
   2558             response = self._prepare_request('POST', url, data=query, output_type='xml')
   2559         else:
-> 2560             response = self._prepare_request('POST', url, data=query)
   2561 
   2562         if return_format == 'csv':

/databricks/python/lib/python3.8/site-packages/pymisp/api.py in _prepare_request(self, request_type, url, data, params, kw_params, output_type, content_type)
   3587         settings = self.__session.merge_environment_settings(req.url, proxies=self.proxies or {}, stream=None,
   3588                                                              verify=self.ssl, cert=self.cert)
-> 3589         return self.__session.send(prepped, timeout=self.timeout, **settings)
   3590 
   3591     def _csv_to_dict(self, csv_content: str) -> List[dict]:

/databricks/python/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
    699 
    700         # Send the request
--> 701         r = adapter.send(request, **kwargs)
    702 
    703         # Total elapsed time of the request (approximately)

/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    557 
    558             if isinstance(e.reason, _ProxyError):
--> 559                 raise ProxyError(e, request=request)
    560 
    561             if isinstance(e.reason, _SSLError):

packet-rat avatar Jul 17 '22 21:07 packet-rat

Just making sure before I investigate further: have you tried passing a timeout value in the PyMISP call?

See: https://github.com/MISP/PyMISP/blob/bb9f053b6892503851451a5eb94a7127afd6ec8a/pymisp/api.py#L141

Rafiot avatar Jul 17 '22 22:07 Rafiot

Thanks for the timely response. The issue lies with the SSL proxy between the client and the server.
I've tried setting timeout previously (and again this morning), but it has no impact.

misp = PyMISP(misp_url, misp_key, misp_verifycert, debug=False, timeout=7200)

Starting Action: 	 2022-07-18 14:45:32.483944
Page Size:  80000
Page Count:  200
Starting Loop: 2022-07-18 14:45:32.484146 
Page: 1  of:  200
ProxyError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Max retries exceeded with url: /attributes/restSearch (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
Command took 2.01 minutes -- by [email protected] at 7/18/2022, 10:45:32 AM on TISS Cluster-01

I'm trying to enable TCP keepalive 'tickles' every 60 seconds to keep the proxy connection from timing out.

packet-rat avatar Jul 18 '22 14:07 packet-rat

To show how a "timeout" issue would manifest itself in our environment, I set timeout=30:

ReadTimeout: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Read timed out. (read timeout=30)

Full error log:

---------------------------------------------------------------------------
timeout                                   Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    425                     # Otherwise it looks like a bug in the code.
--> 426                     six.raise_from(e, None)
    427         except (SocketTimeout, BaseSSLError, SocketError) as e:

/databricks/python/lib/python3.8/site-packages/urllib3/packages/six.py in raise_from(value, from_value)

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    420                 try:
--> 421                     httplib_response = conn.getresponse()
    422                 except BaseException as e:

/usr/lib/python3.8/http/client.py in getresponse(self)
   1347             try:
-> 1348                 response.begin()
   1349             except ConnectionError:

/usr/lib/python3.8/http/client.py in begin(self)
    315         while True:
--> 316             version, status, reason = self._read_status()
    317             if status != CONTINUE:

/usr/lib/python3.8/http/client.py in _read_status(self)
    276     def _read_status(self):
--> 277         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    278         if len(line) > _MAXLINE:

/usr/lib/python3.8/socket.py in readinto(self, b)
    668             try:
--> 669                 return self._sock.recv_into(b)
    670             except timeout:

/usr/lib/python3.8/ssl.py in recv_into(self, buffer, nbytes, flags)
   1240                   self.__class__)
-> 1241             return self.read(nbytes, buffer)
   1242         else:

/usr/lib/python3.8/ssl.py in read(self, len, buffer)
   1098             if buffer is not None:
-> 1099                 return self._sslobj.read(len, buffer)
   1100             else:

timeout: The read operation timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    488             if not chunked:
--> 489                 resp = conn.urlopen(
    490                     method=request.method,

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    725 
--> 726             retries = retries.increment(
    727                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]

/databricks/python/lib/python3.8/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    409             if read is False or not self._is_method_retryable(method):
--> 410                 raise six.reraise(type(error), error, _stacktrace)
    411             elif read is not None:

/databricks/python/lib/python3.8/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    734                 raise value.with_traceback(tb)
--> 735             raise value
    736         finally:

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    669             # Make the request on the httplib connection object.
--> 670             httplib_response = self._make_request(
    671                 conn,

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    427         except (SocketTimeout, BaseSSLError, SocketError) as e:
--> 428             self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
    429             raise

/databricks/python/lib/python3.8/site-packages/urllib3/connectionpool.py in _raise_timeout(self, err, url, timeout_value)
    334         if isinstance(err, SocketTimeout):
--> 335             raise ReadTimeoutError(
    336                 self, url, "Read timed out. (read timeout=%s)" % timeout_value

ReadTimeoutError: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

ReadTimeout                               Traceback (most recent call last)
<command-568716391959277> in <module>
     12 while pag < pagcnt:
     13   print('Starting Loop:', datetime.now(),'\nPage:', pag,' of: ', pagcnt )
---> 14   r = misp.search(controller='attributes',
     15     #return_format='csv',
     16     type_attribute=['ip-src', 'ip-dst'],

/databricks/python/lib/python3.8/site-packages/pymisp/api.py in search(self, controller, return_format, limit, page, value, type_attribute, category, org, tags, quick_filter, quickFilter, date_from, date_to, eventid, with_attachments, withAttachments, metadata, uuid, publish_timestamp, last, timestamp, published, enforce_warninglist, enforceWarninglist, to_ids, deleted, include_event_uuid, includeEventUuid, include_event_tags, includeEventTags, event_timestamp, sg_reference_only, eventinfo, searchall, requested_attributes, include_context, includeContext, headerless, include_sightings, includeSightings, include_correlations, includeCorrelations, include_decay_score, includeDecayScore, object_name, exclude_decayed, pythonify, **kwargs)
   2558             response = self._prepare_request('POST', url, data=query, output_type='xml')
   2559         else:
-> 2560             response = self._prepare_request('POST', url, data=query)
   2561 
   2562         if return_format == 'csv':

/databricks/python/lib/python3.8/site-packages/pymisp/api.py in _prepare_request(self, request_type, url, data, params, kw_params, output_type, content_type)
   3587         settings = self.__session.merge_environment_settings(req.url, proxies=self.proxies or {}, stream=None,
   3588                                                              verify=self.ssl, cert=self.cert)
-> 3589         return self.__session.send(prepped, timeout=self.timeout, **settings)
   3590 
   3591     def _csv_to_dict(self, csv_content: str) -> List[dict]:

/databricks/python/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
    699 
    700         # Send the request
--> 701         r = adapter.send(request, **kwargs)
    702 
    703         # Total elapsed time of the request (approximately)

/databricks/python/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    576                 raise SSLError(e, request=request)
    577             elif isinstance(e, ReadTimeoutError):
--> 578                 raise ReadTimeout(e, request=request)
    579             elif isinstance(e, _InvalidHeader):
    580                 raise InvalidHeader(e, request=request)

ReadTimeout: HTTPSConnectionPool(host='misp.cso.att.com', port=443): Read timed out. (read timeout=30)

packet-rat avatar Jul 18 '22 15:07 packet-rat

Right, the session is killed by the proxy. I had a quick look and it seems that python relies on the system config for the TCP keepalive probes, and you can configure it this way: https://webhostinggeeks.com/howto/configure-linux-tcp-keepalive-setting/

More precisely, this one: net.ipv4.tcp_keepalive_intvl

But it seems your proxy is killing the session after 2 min, which isn't the default on Linux (should be 75s). Could it be that your proxy ignores the probes and kills the session anyway? Can you tcpdump the link and see whether the probes are sent?
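If capturing is feasible, a filter along these lines can surface the probes (a sketch only: `eth0` and `PROXY_IP` are placeholders for your interface and proxy address; TCP keepalive probes appear on the wire as ACK-only segments carrying no payload):

```shell
# Watch for zero-payload ACK segments toward the proxy -- the on-wire
# shape of TCP keepalive probes on an otherwise idle connection.
# Replace eth0 and PROXY_IP with the real interface and proxy address.
sudo tcpdump -ni eth0 \
  'host PROXY_IP and tcp port 443
   and tcp[tcpflags] & tcp-ack != 0
   and (ip[2:2] - ((ip[0] & 0xf) << 2) - ((tcp[12] & 0xf0) >> 2)) == 0'
```

If probes show up every `tcp_keepalive_intvl` seconds and the proxy still resets at 120s, the proxy is ignoring them.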

Rafiot avatar Jul 18 '22 15:07 Rafiot

Please have a look earlier in the thread:

What we've tried to date:

We've tried setting the following on the DataBrick 'client' side

%sh
#==========================
sysctl -w \
     net.ipv4.tcp_keepalive_time=60 \
     net.ipv4.tcp_keepalive_intvl=60 \
     net.ipv4.tcp_keepalive_probes=9
#==========================

Question:

Do I need to enable keepalive specifically in order to activate?

packet-rat avatar Jul 18 '22 15:07 packet-rat

Note that the above values are accepted on the Linux server that executes PyMISP:

%sh
sysctl -a 2>/dev/null | grep -i keepalive

net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 60
Command took 0.06 seconds -- by [email protected] at 7/18/2022, 11:40:48 AM on TISS Cluster-01

packet-rat avatar Jul 18 '22 15:07 packet-rat

pcaps will be a little tricky - let me see what I can do...

packet-rat avatar Jul 18 '22 15:07 packet-rat

Alright, sorry, I missed that.

It seems the keepalive config needs to be enabled manually: https://github.com/psf/requests/issues/5430#issuecomment-773738453

In theory, if you add the code below before making the PyMISP calls, it could (should?) work.

import socket
from urllib3.connection import HTTPConnection

HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
    (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
]

If it works, I'll add it by default in the library, I think it makes sense.
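As a sanity check, here is a minimal stdlib-only sketch (the `KEEPALIVE_OPTS` list and `apply_and_verify` helper are illustrative names, using the same values as above) that applies the options to a plain socket and reads them back with `getsockopt`; note the `TCP_KEEP*` constants are Linux-specific:

```python
import socket

# Same option list as the urllib3 snippet, applied to a bare socket so the
# kernel's view of each value can be read back and confirmed.
KEEPALIVE_OPTS = [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 45),
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 10),
    (socket.SOL_TCP, socket.TCP_KEEPCNT, 6),
]

def apply_and_verify(sock):
    for level, optname, value in KEEPALIVE_OPTS:
        sock.setsockopt(level, optname, value)
    # Read each option back; a mismatch means the kernel rejected or clamped it.
    return {optname: sock.getsockopt(level, optname)
            for level, optname, _ in KEEPALIVE_OPTS}
```

If `getsockopt` does not echo the values back, the options never reached the socket, and no amount of sysctl tuning will generate probes on that connection.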

Rafiot avatar Jul 18 '22 15:07 Rafiot

No impact - same error.

packet-rat avatar Jul 18 '22 16:07 packet-rat

Update: I made some changes to capture pcaps and restarted the Databricks cluster. So far it seems to be working (at least it gets past 2 minutes without failing).

packet-rat avatar Jul 18 '22 16:07 packet-rat

Oh, good news. Are you running the script with the code above? Or did you change something on the Databricks side?

Rafiot avatar Jul 18 '22 16:07 Rafiot

There's a secondary issue at play here: you shouldn't be seeing such long-running queries either way. We can mark this as resolved if you agree @packet-rat, but in the meantime we have also identified the root cause of the slow paginated queries.

Basically: whenever we receive a paginated query, we also run a COUNT(*) query with the same parameters minus the pagination filters, in order to set the X-Result-Count header. This usually ends up being almost as slow as running the query unpaginated. So we're thinking about how to avoid that (by having a header that simply signals there is more data left to grab, rather than reporting an exact count) - hopefully with some upcoming changes that bottleneck will disappear.
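The more-data-left idea can be sketched in a few lines of Python (illustrative only: `fetch_page` operates on a plain list as a stand-in for the database query) -- request `limit + 1` rows and trim the extra one, so the has-more flag costs nothing beyond the page itself:

```python
def fetch_page(rows, page, limit):
    """Return one page of rows plus a has_more flag, without a COUNT(*).

    Asking the backend for limit + 1 rows and discarding the extra row
    reveals whether another page exists.
    """
    start = (page - 1) * limit
    chunk = rows[start:start + limit + 1]  # stand-in for `... LIMIT limit+1`
    return chunk[:limit], len(chunk) > limit
```

A client can then loop until the flag goes false, with no expensive count query on any page.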

iglocska avatar Jul 20 '22 07:07 iglocska

Update:

  • The queries are still timing out. Some scenarios do run beyond 2 minutes, but those ultimately turned out to involve only a small set of returned data. :-(


  • Yes, there are a few performance issues.

(1) We only added the pagination when our 'normal' scripts failed in this new Azure Databricks environment. The code snippet/test above removed pagination.

(2) We are seeing really poor performance when queries are based on tags that are applied to millions of attributes. For example, we were applying NIST Cybersecurity Action tags at the Event level (where these assertions were global to that Event's Objects/Attributes). We changed these large data sets to Attribute-level tags to improve/extend our scoring/aging algorithms.

I was planning to open a separate issue for this Tag performance issue.

packet-rat avatar Jul 20 '22 14:07 packet-rat

Hmm, the tag performance is an interesting pointer. It probably runs a stupid query with all enumerated tag occurrences used as a WHERE clause.

iglocska avatar Jul 20 '22 14:07 iglocska

Okay, so I tested with sample code, and this snippet (https://github.com/MISP/PyMISP/issues/848#issuecomment-1187666336) does send TCP keepalive packets; tcpdump confirms it.

That would also be consistent with what you're seeing: requests that sometimes keep going for more than 2 minutes.

One more thing about your snippet: I'd recommend avoiding the pythonify switch for huge requests like that, as it will use a lot of RAM.
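To expand on that tip: when `return_format='csv'` is used without pythonify, the response is plain text that can be streamed through `csv.DictReader` one row at a time. A hedged sketch (`sample_response` is a stand-in for what `misp.search()` would return):

```python
import csv
import io

# Stand-in for the raw text a CSV-format search returns; in real use this
# would come from misp.search(controller='attributes', return_format='csv').
sample_response = (
    "uuid,value,type\n"
    "11111111-aaaa,1.2.3.4,ip-src\n"
    "22222222-bbbb,5.6.7.8,ip-dst\n"
)

def iter_attributes(csv_text):
    # Yields one small dict per row, instead of materialising millions of
    # MISPAttribute objects in memory at once.
    yield from csv.DictReader(io.StringIO(csv_text))

values = [row["value"] for row in iter_attributes(sample_response)]
```

For multi-million-row results, this keeps per-row memory roughly constant instead of growing with the result set.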

Rafiot avatar Jul 20 '22 14:07 Rafiot

Thanks - Great TIP!

packet-rat avatar Jul 20 '22 19:07 packet-rat