
Karton reanalysis API is slow

Open phretor opened this issue 3 years ago • 3 comments

I'm not sure whether this is an issue with the API server or the MWDB client. I'm using the following code to re-analyze all samples matching a query:


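# Imports were omitted from the original post; the packages below are assumed.
from typing import Iterator, Optional

from loguru import logger
from mwdblib import MWDB, MWDBFile, MWDBObject
from retry import retry
from tqdm import tqdm

# retry_opts was not shown in the original post; 3 tries is assumed
# from the "Max number of retries (3)" log messages below.
retry_opts = {"tries": 3, "logger": logger}


@retry(**retry_opts)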
def get_count(mwdb: MWDB, q: str) -> int:
    logger.info("Counting files matching '{}'", q)
    return mwdb.count_files(q)


@retry(**retry_opts)
def fetch_files(mwdb: MWDB, q: str) -> Iterator[MWDBFile]:
    return mwdb.search_files(q)


@retry(**retry_opts)
def reanalyze(obj: MWDBObject) -> None:
    obj.reanalyze()


def _reanalyze(obj: MWDBObject) -> str:
    try:
        reanalyze(obj)
        return obj.sha256
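    except Exception:
        logger.opt(exception=True).error(
            "{} max retries limit exceeded. Skipping.", obj.sha256
        )
        # Return a string so the caller's bar.write() never receives None.
        return "{} skipped".format(obj.sha256)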


@retry(logger=logger)
def do_work(q: str, n_procs: int):
    mwdb = MWDB()
    total: Optional[int] = None
    files: Optional[Iterator[MWDBFile]] = None

    try:
        total = get_count(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[get_count] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching the number of files")

    if total == 0:
        return 0

    logger.info("Found {} files matching the query", total)

    try:
        files = fetch_files(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[fetch_files] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching files")

    with tqdm(total=total) as bar:
        for obj in files:
            bar.write(_reanalyze(obj))
            bar.update()

    return 0


def main():
    do_work('NOT tag:"foo"', 1)


if __name__ == "__main__":
    main()

I see it takes 5-10 seconds to do one iteration, which is a lot. The MWDB API is deployed with default options, using the recommended Docker Compose file, so it's one Nginx frontend and 4 uWSGI backends. The machine is doing nothing and is not experiencing load.

I'm trying to understand whether the bottleneck is the iteration over the files iterator or the way I submit files. The iteration ultimately calls self.api.get(object_type.URL_TYPE, params=params), so that may be the bottleneck. But why is it so slow?

I guess there are no bulk methods in the API, right?

phretor, Aug 04 '22 10:08

After some investigation, if I comment out the obj.reanalyze() I can confirm that the iteration itself doesn't take a lot of time.
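For anyone wanting to repeat that check, here is a minimal timing sketch. `time_loop` is a hypothetical helper, not part of mwdblib; it consumes any iterable and optionally calls an action on each item, so the same loop can be timed with and without the per-item call.

```python
import time
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def time_loop(
    items: Iterable[T], action: Optional[Callable[[T], object]] = None
) -> Tuple[int, float]:
    """Consume `items`, optionally calling `action` on each one.

    Returns (number of items seen, elapsed seconds).
    """
    start = time.perf_counter()
    count = 0
    for item in items:
        if action is not None:
            action(item)
        count += 1
    return count, time.perf_counter() - start
```

With the code from the first post, this would presumably be called once as `time_loop(mwdb.search_files(q))` and once as `time_loop(mwdb.search_files(q), lambda obj: obj.reanalyze())`, and the two elapsed times compared.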

The bottleneck seems to be here:

    def reanalyze(
        self, arguments: Optional[Dict[str, Any]] = None
    ) -> "MWDBKartonAnalysis":
        """
        Submits new Karton analysis for given object.

        Requires MWDB Core >= 2.3.0.

        :param arguments: |
            Optional, additional arguments for analysis.
            Reserved for future functionality.

        .. versionadded:: 4.0.0
        """
        from .karton import MWDBKartonAnalysis

        arguments = {"arguments": arguments or {}}
        analysis = self.api.post(
            "object/{id}/karton".format(**self.data), json=arguments
        )
        self._expire("analyses")
        return MWDBKartonAnalysis(self.api, analysis)

that is, the POST request is blocking.
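One client-side workaround for a blocking call is to overlap several of them on a thread pool. The sketch below is generic: `run_concurrently` is a hypothetical helper, and whether a single MWDB client instance can safely be shared across threads is an assumption I have not verified. It also only hides latency on the client; it does nothing if the server itself is the slow part.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_concurrently(
    items: Iterable[T], action: Callable[[T], object], max_workers: int = 4
) -> List[Tuple[T, Optional[BaseException]]]:
    """Call `action(item)` for every item on a thread pool.

    Returns (item, exception) pairs in completion order;
    the exception is None when the call succeeded.
    """
    results: List[Tuple[T, Optional[BaseException]]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All items are submitted up front, so a very large iterator
        # is fully consumed before the first result is collected.
        futures = {pool.submit(action, item): item for item in items}
        for future in as_completed(futures):
            results.append((futures[future], future.exception()))
    return results
```

Here `action` would be something like `lambda obj: obj.reanalyze()`, with `max_workers` kept at or below the number of uWSGI backends.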

phretor, Aug 04 '22 10:08

I think the bottleneck is the API gathering metadata about the created analysis (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/resources/karton.py#L130), including status, last_update and processing_in (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/model/karton.py#L61). This exposes a major weakness of the current model: we need to iterate over all tasks currently being processed in Karton (get_karton_state) to check the metadata about the task tree. That's why massive reanalysis gets slower and slower as it goes.

That problem is already referenced in another issue in Karton itself: https://github.com/CERT-Polska/karton/issues/178
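To illustrate why the slowdown compounds, here is a toy model. It is not Karton's or MWDB's real code: `ToyQueue` is a hypothetical stand-in in which every submission is followed by a scan over all pending tasks, and no task finishes in the meantime.

```python
from typing import List


class ToyQueue:
    """Toy stand-in for a task queue; not Karton's real data model."""

    def __init__(self) -> None:
        self.tasks: List[int] = []
        self.tasks_scanned = 0  # total tasks visited by all status checks

    def submit_and_inspect(self, analysis_id: int) -> int:
        """Add a task, then scan every pending task looking for ours."""
        self.tasks.append(analysis_id)
        matches = 0
        for task in self.tasks:
            self.tasks_scanned += 1
            if task == analysis_id:
                matches += 1
        return matches


def total_scans(n: int) -> int:
    """Total tasks visited when n analyses are submitted back to back."""
    queue = ToyQueue()
    for analysis_id in range(n):
        queue.submit_and_inspect(analysis_id)
    return queue.tasks_scanned
```

Under those assumptions the k-th submission scans k tasks, so n submissions visit n(n+1)/2 tasks in total: `total_scans(10)` is 55 and `total_scans(1000)` is 500500. The per-request cost grows linearly and the whole batch grows quadratically.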

So there are two possible solutions:

  • speed up the analysis (task tree) status inspection in Karton
  • don't return the analysis status in the reanalysis endpoint response; just return 200 OK if the reanalysis was spawned correctly.
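A rough sketch of what the second option could look like from the caller's side: the spawn step returns only an identifier, and the expensive status lookup happens later, only if someone asks for it. `AnalysisHandle` and `fetch_status` are hypothetical names for illustration, not part of MWDB or mwdblib.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AnalysisHandle:
    """Hypothetical handle returned right after an analysis is spawned."""

    analysis_id: str
    # Callable that performs the expensive status lookup (assumed).
    fetch_status: Callable[[str], str]
    _status: Optional[str] = field(default=None, repr=False)

    @property
    def status(self) -> str:
        # The lookup runs on first access only and is cached afterwards.
        if self._status is None:
            self._status = self.fetch_status(self.analysis_id)
        return self._status
```

Creating many handles in a bulk reanalysis would then cost nothing on the status side; only callers that actually read `handle.status` pay for the inspection.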

psrok1, Aug 19 '22 11:08

We're actually going to speed up analysis status inspection soon: https://github.com/CERT-Polska/karton/pull/207

psrok1, Mar 16 '23 18:03