mwdb-core
Karton reanalysis API is slow
I'm not sure whether this is an issue with the API server or the MWDB client. I'm using the following code to re-analyze all samples matching a query:
```python
from typing import Iterator, Optional

from loguru import logger
from mwdblib import MWDB, MWDBFile, MWDBObject
from retry import retry
from tqdm import tqdm

# Retry settings: the original snippet omitted these; 3 tries matches
# the "Max number of retries (3)" log messages below.
retry_opts = {"tries": 3, "delay": 1}


@retry(**retry_opts)
def get_count(mwdb: MWDB, q: str) -> int:
    logger.info("Counting files matching '{}'", q)
    return mwdb.count_files(q)


@retry(**retry_opts)
def fetch_files(mwdb: MWDB, q: str) -> Iterator[MWDBFile]:
    return mwdb.search_files(q)


@retry(**retry_opts)
def reanalyze(obj: MWDBObject) -> None:
    obj.reanalyze()


def _reanalyze(obj: MWDBObject) -> str:
    try:
        reanalyze(obj)
        return obj.sha256
    except Exception:
        logger.opt(exception=True).error(
            "{} max retries limit exceeded. Skipping.", obj.sha256
        )
        # Return a string either way so bar.write() below never gets None.
        return "{} (skipped)".format(obj.sha256)


@retry(logger=logger)
def do_work(q: str, n_procs: int):
    mwdb = MWDB()
    total: Optional[int] = None
    files: Optional[Iterator[MWDBFile]] = None
    try:
        total = get_count(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[get_count] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching the number of files")
    if total == 0:
        return 0
    logger.info("Found {} files matching the query", total)
    try:
        files = fetch_files(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[fetch_files] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching files")
    with tqdm(total=total) as bar:
        for obj in files:
            bar.write(_reanalyze(obj))
            bar.update()
    return 0


def main():
    do_work('NOT tag:"foo"', 1)


if __name__ == "__main__":
    main()
```
Each iteration takes 5-10 seconds, which is a lot. The MWDB API is deployed with default options, using the recommended Docker Compose file, so there is one Nginx frontend and 4 uWSGI backends. The machine is otherwise idle and not under any load.
I'm trying to understand whether the bottleneck is iterating over the `files` iterator or the way I submit files. I see that the iteration ultimately does a `self.api.get(object_type.URL_TYPE, params=params)`, so that may be the bottleneck. But why is it so slow?
I guess there are no bulk methods in the API, right?
After some investigation: if I comment out the `obj.reanalyze()` call, I can confirm that the iteration itself doesn't take much time.
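For reference, this is roughly how I separated the two costs (a minimal sketch, assuming the same default `MWDB()` client as above; the per-iteration fetch time includes the lazy pagination GETs):

```python
# Minimal timing probe: measure the iterator advance separately from
# the reanalyze() POST for each object.
import time

from mwdblib import MWDB

mwdb = MWDB()

start = time.perf_counter()
for obj in mwdb.search_files('NOT tag:"foo"'):
    fetch_s = time.perf_counter() - start
    start = time.perf_counter()
    obj.reanalyze()
    post_s = time.perf_counter() - start
    print(f"{obj.sha256}: fetch {fetch_s:.2f}s, POST {post_s:.2f}s")
    start = time.perf_counter()
```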
The bottleneck seems to be here:
```python
def reanalyze(
    self, arguments: Optional[Dict[str, Any]] = None
) -> "MWDBKartonAnalysis":
    """
    Submits new Karton analysis for given object.

    Requires MWDB Core >= 2.3.0.

    :param arguments: |
        Optional, additional arguments for analysis.
        Reserved for future functionality.

    .. versionadded:: 4.0.0
    """
    from .karton import MWDBKartonAnalysis

    arguments = {"arguments": arguments or {}}
    analysis = self.api.post(
        "object/{id}/karton".format(**self.data), json=arguments
    )
    self._expire("analyses")
    return MWDBKartonAnalysis(self.api, analysis)
```
That is, the POST request is blocking.
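Since each POST blocks, a client-side workaround could be to issue the requests concurrently, which is presumably what the unused `n_procs` parameter above was meant for. A sketch (not an mwdblib feature, just `concurrent.futures` around the helpers defined earlier):

```python
# Sketch: run the blocking reanalyze() POSTs from a thread pool.
# Reuses the _reanalyze() helper defined above; n_procs concurrent requests.
# Note: Executor.map() consumes the `files` iterable eagerly, so all
# result pages are fetched from the API before the POSTs finish.
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable

from mwdblib import MWDBFile


def reanalyze_all(files: Iterable[MWDBFile], n_procs: int) -> None:
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        for sha256 in pool.map(_reanalyze, files):
            print(sha256)
```

That only hides client-side latency, though; if the cost is server-side, concurrency just shifts the load onto the backend.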
I think the bottleneck is on the API side: gathering metadata about the created analysis (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/resources/karton.py#L130), including `status`, `last_update` and `processing_in` (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/model/karton.py#L61). And here comes the huge weakness of the current model: to check the metadata about a task tree, we need to iterate over all tasks currently being processed in Karton (`get_karton_state`). That's why massive reanalysis gets slower and slower.
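To illustrate the pattern (a simplified paraphrase, not the actual mwdb-core code; see the links above for the real implementation):

```python
# Simplified illustration of why the status lookup scales with the total
# number of in-flight Karton tasks: every reanalysis response walks the
# full task list just to find its own task tree.
def get_analysis_status(analysis_uid: str, all_tasks: list) -> str:
    # all_tasks holds every task currently tracked by Karton, across
    # all analyses -- this global scan is the O(N) part.
    tree = [t for t in all_tasks if t.root_uid == analysis_uid]
    if any(t.status == "Crashed" for t in tree):
        return "crashed"
    return "running" if tree else "finished"
```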
That problem is already referenced in another issue in Karton itself: https://github.com/CERT-Polska/karton/issues/178
So there are two possible solutions:
- speed up the analysis (task tree) status inspection in Karton
- don't return the analysis status in the reanalysis endpoint response and just return 200 OK if the reanalysis was spawned correctly (see the sketch below)
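A rough sketch of the second option (hypothetical endpoint code, not what mwdb-core currently does):

```python
# Hypothetical: acknowledge the spawned analysis without computing its
# status, so the response no longer depends on global Karton state.
from flask import jsonify


def post_object_karton(obj):
    analysis_uid = spawn_karton_task(obj)  # hypothetical helper
    # No get_karton_state() call here, so response time stays constant
    # regardless of how many tasks are in flight.
    return jsonify({"id": analysis_uid, "status": "Spawned"}), 200
```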
We're actually going to speed up analysis status inspection soon: https://github.com/CERT-Polska/karton/pull/207