
Unnecessary repeated iteration in `split_dataframe_by_prefix`

Open ptgolden opened this issue 6 months ago • 7 comments

I've been trying to track some performance issues in the mondo pipeline, and one bottleneck I've identified is in a function that calls sssom.parsers.split_dataframe_by_prefix. Specifically, this part here:

https://github.com/mapping-commons/sssom-py/blob/cf75d07cfea233718331365675a27dfda7bf80ba/src/sssom/parsers.py#L980-L1022

(full implementation here)

The issue was a bit clearer before the refactor in 1fc45eceb776546c6ea028c81f37ca30cc6a8710:

    for pre_subj in subject_prefixes:
        for pre_obj in object_prefixes:
            for rel in relations:
...
                if df is not None:
                    dfs = df[
                        (df[SUBJECT_ID].str.startswith(pre_subj + ":"))
                        & (df[PREDICATE_ID] == rel)
                        & (df[OBJECT_ID].str.startswith(pre_obj + ":"))
                    ]

The issue is that the entire MappingSetDataFrame is scanned once for every element of the product len(subject_prefixes) * len(object_prefixes) * len(relations). Each scan creates a new string and calls .startswith twice per row. These are small operations, but they add up (~10 minutes in the case I'm looking at). There might be a big performance gain from going through the MappingSetDataFrame only once.
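As an illustration of the single-pass idea, the same split can be expressed with vectorized pandas operations: derive each row's subject and object prefix once, then bucket rows with one groupby. This is only a sketch, not sssom-py's actual API; the column names `subject_id`/`predicate_id`/`object_id` and the function name `split_once` are assumptions here.

```python
import pandas as pd

# Assumed column names, standing in for sssom's SUBJECT_ID etc. constants.
SUBJECT_ID, PREDICATE_ID, OBJECT_ID = "subject_id", "predicate_id", "object_id"


def split_once(df, subject_prefixes, object_prefixes, relations):
    """Bucket rows by (subject prefix, relation, object prefix) in one pass."""
    # Extract each row's prefixes once, vectorized, instead of once per combination.
    sub_pre = df[SUBJECT_ID].str.split(":", n=1).str[0]
    obj_pre = df[OBJECT_ID].str.split(":", n=1).str[0]
    # Keep only rows whose triple falls inside the requested S x P x O product...
    mask = (
        sub_pre.isin(subject_prefixes)
        & df[PREDICATE_ID].isin(relations)
        & obj_pre.isin(object_prefixes)
    )
    # ...and group them by their triple in a single scan of the frame.
    grouped = df[mask].groupby([sub_pre[mask], df.loc[mask, PREDICATE_ID], obj_pre[mask]])
    return {key: group for key, group in grouped}
```

The total work is then proportional to the number of rows, not rows times combinations.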

A quickly written attempt that I can formalize as a PR if the approach looks OK:

    @dataclass(unsafe_hash=True)  # hashable, so instances can key the dict below
    class SSSOMTriple:
        subject_prefix: str
        object_prefix: str
        relation: str

        def __post_init__(self):
            self.relation_prefix, self.relation_id = self.relation.split(":", 1)

        def as_identifier(self):
            return f"{self.subject_prefix.lower()}_{self.relation_id.lower()}_{self.object_prefix.lower()}"

    splits_by_triple: dict[SSSOMTriple, list[dict]] = {}
    split_to_msdf: dict = {}  # identifier -> MappingSetDataFrame

    # Still iterate through the product of SxPxO initially to pre-populate a dict
    for subject_prefix, object_prefix, relation in itt.product(
        subject_prefixes, object_prefixes, relations
    ):
        triple = SSSOMTriple(subject_prefix, object_prefix, relation)
        split = triple.as_identifier()
        if subject_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
            continue
        if object_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing object prefix - {object_prefix}")
            continue
        splits_by_triple[triple] = []

    # Iterate through `msdf.df` once, building up a list of matches in the pre-populated dict
    for entry in df.itertuples(index=False):
        _entry = entry._asdict()
        subject_prefix = _entry[SUBJECT_ID].split(":")[0]
        object_prefix = _entry[OBJECT_ID].split(":")[0]
        relation = _entry[PREDICATE_ID]
        triple = SSSOMTriple(subject_prefix, object_prefix, relation)
        if triple in splits_by_triple:
            splits_by_triple[triple].append(_entry)

    # Iterate through the prepopulated dict
    for triple, values in splits_by_triple.items():
        split = triple.as_identifier()
        if len(values) == 0:
            logging.debug(f"{split} - No matches (0 matches found)")
            continue
        subconverter = msdf.converter.get_subconverter(
            [triple.subject_prefix, triple.object_prefix, triple.relation_prefix]
        )
        split_to_msdf[split] = from_sssom_dataframe(
            pd.DataFrame(values), prefix_map=dict(subconverter.bimap), meta=meta
        )

    return split_to_msdf

This would drastically reduce the runtime of the algorithm as the size of msdf.df increases (when the subject_prefixes/object_prefixes/relations iterables are non-trivially large). In my case, a similar approach went from ~10m to nearly instant. The only added complexity is constructing an SSSOMTriple object for every row in the dataframe and for every element of the SxPxO product, but that is cheap; it should not add a noticeable performance penalty in any case.

ptgolden avatar Sep 04 '25 16:09 ptgolden

@ptgolden would you be willing to make a PR? THANKS!

matentzn avatar Sep 04 '25 17:09 matentzn

Happy to. I'm nervous that there aren't unit tests for sssom.parsers.split_dataframe_by_prefix or sssom.parsers.split_dataframe. The test for sssom.cli.split should cover a good chunk, but may miss some edge cases.
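One way to pin the behavior down before refactoring is a test asserting that the splits partition the matching rows. The snippet below is only a sketch: `naive_split` is a stand-in reimplementation of the pre-refactor filter loop quoted in the issue, not the real sssom.parsers function, and the column names are assumed.

```python
import pandas as pd


def naive_split(df, subject_prefixes, object_prefixes, relations):
    # Stand-in for split_dataframe_by_prefix: one filter per S x O x R
    # combination, mirroring the pre-refactor loop quoted above.
    out = {}
    for s in subject_prefixes:
        for o in object_prefixes:
            for r in relations:
                part = df[
                    df["subject_id"].str.startswith(s + ":")
                    & (df["predicate_id"] == r)
                    & df["object_id"].str.startswith(o + ":")
                ]
                if len(part):
                    out[f"{s.lower()}_{r.split(':')[1].lower()}_{o.lower()}"] = part
    return out


def test_split_partitions_rows():
    df = pd.DataFrame(
        {
            "subject_id": ["MONDO:1", "MONDO:2", "HP:3"],
            "predicate_id": ["skos:exactMatch"] * 3,
            "object_id": ["DOID:1", "MESH:2", "DOID:3"],
        }
    )
    splits = naive_split(df, ["MONDO", "HP"], ["DOID", "MESH"], ["skos:exactMatch"])
    # Every input row lands in exactly one split, and no rows are invented.
    assert sum(len(p) for p in splits.values()) == len(df)
    assert set(splits) == {
        "mondo_exactmatch_doid",
        "mondo_exactmatch_mesh",
        "hp_exactmatch_doid",
    }
```

An optimized implementation could then be checked against the same invariants.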

ptgolden avatar Sep 04 '25 17:09 ptgolden

If you could add a test, I will do my best to review it for potential pitfalls!

matentzn avatar Sep 04 '25 17:09 matentzn

hi both, I've been thinking about how to generalize CURIE- and prefix-filtering operations in dataframes for some time, and have opened up a parallel PR in the curies package (https://github.com/biopragmatics/curies/pull/179), where I will try and make more efficient implementations of these operations that can be reused in SSSOM.

Though, please don't let this get in the way of making your own PR to SSSOM @ptgolden, that would be highly appreciated!

cthoyt avatar Sep 05 '25 09:09 cthoyt

Very useful! Thanks @cthoyt and @ptgolden!

matentzn avatar Sep 05 '25 11:09 matentzn

PR in #609. This removes basically all of the overhead in the chunk I identified in the issue. Almost all of the compute time left is in parsers.from_sssom_dataframe, which is what it is.

Nice, @cthoyt! I imagine this function might be taken out in the future and replaced with your implementation.

ptgolden avatar Sep 05 '25 15:09 ptgolden

Current status on this issue is that there's an open PR (https://github.com/mapping-commons/sssom-py/pull/611) comparing several implementations, waiting on Patrick to address feedback.

cthoyt avatar Oct 09 '25 10:10 cthoyt