mwdb-core
Karton reanalysis API is slow
I'm not sure whether this is an issue with the API server or the MWDB client. I'm using the following code to re-analyze all samples matching a query:
```python
from typing import Iterator, Optional

from loguru import logger
from mwdblib import MWDB, MWDBFile, MWDBObject
from retry import retry
from tqdm import tqdm

# Retry settings: the original snippet omitted these; 3 tries matches
# the "Max number of retries (3)" log messages below.
retry_opts = {"tries": 3, "delay": 1}


@retry(**retry_opts)
def get_count(mwdb: MWDB, q: str) -> int:
    logger.info("Counting files matching '{}'", q)
    return mwdb.count_files(q)


@retry(**retry_opts)
def fetch_files(mwdb: MWDB, q: str) -> Iterator[MWDBFile]:
    return mwdb.search_files(q)


@retry(**retry_opts)
def reanalyze(obj: MWDBObject) -> None:
    obj.reanalyze()


def _reanalyze(obj: MWDBObject) -> str:
    try:
        reanalyze(obj)
        return obj.sha256
    except Exception:
        logger.opt(exception=True).error(
            "{} max retries limit exceeded. Skipping.", obj.sha256
        )
        # Return a string either way so bar.write() below never gets None.
        return "{} (skipped)".format(obj.sha256)


@retry(logger=logger)
def do_work(q: str, n_procs: int):
    mwdb = MWDB()
    total: Optional[int] = None
    files: Optional[Iterator[MWDBFile]] = None
    try:
        total = get_count(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[get_count] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching the number of files")
    if total == 0:
        return 0
    logger.info("Found {} files matching the query", total)
    try:
        files = fetch_files(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[fetch_files] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching files")
    with tqdm(total=total) as bar:
        for obj in files:
            bar.write(_reanalyze(obj))
            bar.update()
    return 0


def main():
    do_work('NOT tag:"foo"', 1)


if __name__ == "__main__":
    main()
```
Each iteration takes 5-10 seconds, which is a lot. The MWDB API is deployed with default options, using the recommended Docker Compose file, so there is one Nginx frontend and 4 uWSGI backends. The machine is otherwise idle and not under any load.
I'm trying to understand whether the bottleneck is iterating over the `files` iterator or the way I submit files. I see that the iteration ultimately does a `self.api.get(object_type.URL_TYPE, params=params)`, so that may be the bottleneck. But why is it so slow?
I guess there are no bulk methods in the API, right?
After some investigation: if I comment out the `obj.reanalyze()` call, I can confirm that the iteration itself doesn't take much time.
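For reference, this is roughly how I separated the two costs (a minimal sketch, assuming the same default `MWDB()` client as above; the per-iteration fetch time includes the lazy pagination GETs):

```python
# Minimal timing probe: measure the iterator advance separately from
# the reanalyze() POST for each object.
import time

from mwdblib import MWDB

mwdb = MWDB()

start = time.perf_counter()
for obj in mwdb.search_files('NOT tag:"foo"'):
    fetch_s = time.perf_counter() - start
    start = time.perf_counter()
    obj.reanalyze()
    post_s = time.perf_counter() - start
    print(f"{obj.sha256}: fetch {fetch_s:.2f}s, POST {post_s:.2f}s")
    start = time.perf_counter()
```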
The bottleneck seems to be here:
```python
def reanalyze(
    self, arguments: Optional[Dict[str, Any]] = None
) -> "MWDBKartonAnalysis":
    """
    Submits new Karton analysis for given object.

    Requires MWDB Core >= 2.3.0.

    :param arguments: |
        Optional, additional arguments for analysis.
        Reserved for future functionality.

    .. versionadded:: 4.0.0
    """
    from .karton import MWDBKartonAnalysis

    arguments = {"arguments": arguments or {}}
    analysis = self.api.post(
        "object/{id}/karton".format(**self.data), json=arguments
    )
    self._expire("analyses")
    return MWDBKartonAnalysis(self.api, analysis)
```
That is, the POST request is blocking.
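Since each POST blocks, a client-side workaround could be to issue the requests concurrently, which is presumably what the unused `n_procs` parameter above was meant for. A sketch (not an mwdblib feature, just `concurrent.futures` around the helpers defined earlier):

```python
# Sketch: run the blocking reanalyze() POSTs from a thread pool.
# Reuses the _reanalyze() helper defined above; n_procs concurrent requests.
# Note: Executor.map() consumes the `files` iterable eagerly, so all
# result pages are fetched from the API before the POSTs finish.
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable

from mwdblib import MWDBFile


def reanalyze_all(files: Iterable[MWDBFile], n_procs: int) -> None:
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        for sha256 in pool.map(_reanalyze, files):
            print(sha256)
```

That only hides client-side latency, though; if the cost is server-side, concurrency just shifts the load onto the backend.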
I think the bottleneck is on the API side: gathering metadata about the created analysis (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/resources/karton.py#L130), including `status`, `last_update` and `processing_in` (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/model/karton.py#L61). And here comes the huge weakness of the current model: to check the metadata about a task tree, we need to iterate over all tasks currently being processed in Karton (`get_karton_state`). That's why massive reanalysis gets slower and slower.
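To illustrate the pattern (a simplified paraphrase, not the actual mwdb-core code; see the links above for the real implementation):

```python
# Simplified illustration of why the status lookup scales with the total
# number of in-flight Karton tasks: every reanalysis response walks the
# full task list just to find its own task tree.
def get_analysis_status(analysis_uid: str, all_tasks: list) -> str:
    # all_tasks holds every task currently tracked by Karton, across
    # all analyses -- this global scan is the O(N) part.
    tree = [t for t in all_tasks if t.root_uid == analysis_uid]
    if any(t.status == "Crashed" for t in tree):
        return "crashed"
    return "running" if tree else "finished"
```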
That problem is already referenced in another issue in Karton itself: https://github.com/CERT-Polska/karton/issues/178
So there are two possible solutions:
- speed up the analysis (task tree) status inspection in Karton
- don't return the analysis status in the reanalysis endpoint response and just return 200 OK if the reanalysis was spawned correctly (see the sketch below)
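A rough sketch of the second option (hypothetical endpoint code, not what mwdb-core currently does):

```python
# Hypothetical: acknowledge the spawned analysis without computing its
# status, so the response no longer depends on global Karton state.
from flask import jsonify


def post_object_karton(obj):
    analysis_uid = spawn_karton_task(obj)  # hypothetical helper
    # No get_karton_state() call here, so response time stays constant
    # regardless of how many tasks are in flight.
    return jsonify({"id": analysis_uid, "status": "Spawned"}), 200
```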
We're actually going to speed up analysis status inspection soon: https://github.com/CERT-Polska/karton/pull/207