Unnecessary repeated iteration in `split_dataframe_by_prefix`
I've been trying to track down some performance issues in the mondo pipeline, and one bottleneck I've identified is in a function that calls `sssom.parsers.split_dataframe_by_prefix`. Specifically, this part here:
https://github.com/mapping-commons/sssom-py/blob/cf75d07cfea233718331365675a27dfda7bf80ba/src/sssom/parsers.py#L980-L1022
(full implementation here)
The issue was clearer before the refactor in 1fc45eceb776546c6ea028c81f37ca30cc6a8710:
```python
for pre_subj in subject_prefixes:
    for pre_obj in object_prefixes:
        for rel in relations:
            ...
            if df is not None:
                dfs = df[
                    (df[SUBJECT_ID].str.startswith(pre_subj + ":"))
                    & (df[PREDICATE_ID] == rel)
                    & (df[OBJECT_ID].str.startswith(pre_obj + ":"))
                ]
```
The issue is: the entire MappingSetDataFrame is scanned once for every element of the product `len(subject_prefixes) * len(object_prefixes) * len(relations)`. Each scan allocates a new string and calls `.startswith` twice per row. These are small operations individually, but they add up: in the case I'm looking at, it takes ~10 minutes. There might be a big performance gain from going through the MappingSetDataFrame only once.
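To illustrate the idea on toy data (column names and prefixes here are illustrative stand-ins, not the actual sssom constants): instead of re-scanning the frame for every combination, extract each row's prefixes once and bucket rows by their `(subject_prefix, object_prefix, relation)` triple in a single pass.

```python
import itertools
from collections import defaultdict

import pandas as pd

# Toy inputs; real code uses sssom's SUBJECT_ID/PREDICATE_ID/OBJECT_ID columns
subject_prefixes = ["MONDO", "DOID"]
object_prefixes = ["HP", "MP"]
relations = ["skos:exactMatch"]
df = pd.DataFrame(
    {
        "subject_id": ["MONDO:1", "MONDO:2", "DOID:3"],
        "predicate_id": ["skos:exactMatch"] * 3,
        "object_id": ["HP:1", "MP:2", "HP:3"],
    }
)

# Old approach: one full scan of df per (S, O, R) combination -> O(S*O*R*N)
nested = {}
for s, o, r in itertools.product(subject_prefixes, object_prefixes, relations):
    nested[(s, o, r)] = df[
        df["subject_id"].str.startswith(s + ":")
        & (df["predicate_id"] == r)
        & df["object_id"].str.startswith(o + ":")
    ]

# Single-pass alternative: read each row's prefixes once, bucket by triple -> O(N + S*O*R)
buckets: dict[tuple, list[dict]] = defaultdict(list)
for row in df.to_dict("records"):
    key = (
        row["subject_id"].split(":")[0],
        row["object_id"].split(":")[0],
        row["predicate_id"],
    )
    buckets[key].append(row)

# Every non-empty split agrees between the two strategies
for (s, o, r), sub in nested.items():
    assert len(sub) == len(buckets.get((s, o, r), []))
```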
A quickly written attempt that I can formalize as a PR if the approach looks OK:
```python
# frozen=True makes the dataclass hashable, so it can be used as a dict key
# (the default mutable dataclass sets __hash__ to None)
@dataclass(frozen=True)
class SSSOMTriple:
    subject_prefix: str
    object_prefix: str
    relation: str

    @property
    def relation_prefix(self) -> str:
        return self.relation.split(":")[0]

    @property
    def relation_id(self) -> str:
        return self.relation.split(":")[1]

    def as_identifier(self) -> str:
        return f"{self.subject_prefix.lower()}_{self.relation_id.lower()}_{self.object_prefix.lower()}"


splits_by_triple: dict[SSSOMTriple, list[dict]] = {}
split_to_msdf: dict[str, MappingSetDataFrame] = {}

# Still iterate through the product of SxPxO initially to pre-populate a dict
for subject_prefix, object_prefix, relation in itt.product(
    subject_prefixes, object_prefixes, relations
):
    triple = SSSOMTriple(subject_prefix, object_prefix, relation)
    split = triple.as_identifier()
    if subject_prefix not in msdf.converter.bimap:
        logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
        continue
    if object_prefix not in msdf.converter.bimap:
        logging.warning(f"{split} - missing object prefix - {object_prefix}")
        continue
    splits_by_triple[triple] = []

# Iterate through `msdf.df` once, building up a list of matches in the pre-populated dict
for entry in df.itertuples(index=False):
    _entry = entry._asdict()
    subject_prefix = _entry[SUBJECT_ID].split(":")[0]
    object_prefix = _entry[OBJECT_ID].split(":")[0]
    relation = _entry[PREDICATE_ID]
    triple = SSSOMTriple(subject_prefix, object_prefix, relation)
    if triple in splits_by_triple:
        splits_by_triple[triple].append(_entry)

# Iterate through the pre-populated dict
for triple, values in splits_by_triple.items():
    split = triple.as_identifier()
    if len(values) == 0:
        logging.debug(f"{split} - No matches (0 matches found)")
        continue
    subconverter = msdf.converter.get_subconverter(
        [triple.subject_prefix, triple.object_prefix, triple.relation_prefix]
    )
    split_to_msdf[split] = from_sssom_dataframe(
        pd.DataFrame(values), prefix_map=dict(subconverter.bimap), meta=meta
    )

return split_to_msdf
```
This would drastically reduce the runtime of the algorithm as the size of msdf.df increases (assuming non-trivially sized subject_prefixes/object_prefixes/relations iterables). In my case, a similar approach brought it from ~10 minutes to nearly instant. The only added complexity is constructing an SSSOMTriple object for every row in the dataframe and for every element of the SxPxO product, but that is computationally cheap and should not add a noticeable performance penalty in any case.
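A rough way to see the scaling difference on synthetic data (toy sizes and prefixes of my own choosing, not the mondo pipeline inputs); exact numbers will vary by machine, but the gap grows with the number of prefix combinations:

```python
import itertools
import timeit

import pandas as pd

# Synthetic frame: 20 x 17 prefix pairs spread over 5000 rows
prefixes = [f"P{i}" for i in range(20)]
df = pd.DataFrame(
    {
        "subject_id": [f"P{i % 20}:{i}" for i in range(5000)],
        "object_id": [f"P{i % 17}:{i}" for i in range(5000)],
    }
)

def nested_scan():
    # 20 * 20 = 400 full passes over the frame, two .startswith calls each
    return {
        (s, o): df[
            df["subject_id"].str.startswith(s + ":")
            & df["object_id"].str.startswith(o + ":")
        ]
        for s, o in itertools.product(prefixes, prefixes)
    }

def single_pass():
    # Prefixes extracted once, then a single grouping pass over the frame
    keys = [
        df["subject_id"].str.split(":").str[0],
        df["object_id"].str.split(":").str[0],
    ]
    return {key: group for key, group in df.groupby(keys)}

# Sanity check: the non-empty splits are identical between the two strategies
slow = {k: v for k, v in nested_scan().items() if len(v)}
fast = single_pass()
assert slow.keys() == fast.keys()

print("nested scan:", timeit.timeit(nested_scan, number=1))
print("single pass:", timeit.timeit(single_pass, number=1))
```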
@ptgolden would you be willing to make a PR? THANKS!
Happy to. I'm nervous that there aren't unit tests for sssom.parsers.split_dataframe_by_prefix or sssom.parsers.split_dataframe. The test for sssom.cli.split should cover a good chunk, but may miss some edge cases.
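One property such a test could check, sketched here against a stand-in splitter rather than the real `sssom.parsers` API (the function and column names below are hypothetical):

```python
import pandas as pd

def split_by_prefix(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in: bucket rows by (subject_prefix, relation, object_prefix)."""
    keys = [
        df["subject_id"].str.split(":").str[0],
        df["predicate_id"],
        df["object_id"].str.split(":").str[0],
    ]
    return {triple: group for triple, group in df.groupby(keys)}

def test_splits_partition_the_frame():
    df = pd.DataFrame(
        {
            "subject_id": ["MONDO:1", "MONDO:2", "DOID:3"],
            "predicate_id": ["skos:exactMatch"] * 3,
            "object_id": ["DOID:1", "HP:2", "MONDO:3"],
        }
    )
    splits = split_by_prefix(df)
    # Edge case a cli-level test may miss: the splits must partition the frame,
    # with no row dropped or duplicated
    assert sum(len(g) for g in splits.values()) == len(df)
    recombined = pd.concat(splits.values()).sort_index()
    assert recombined.equals(df)
```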
If you could add a test, I will do my best to review it for potential pitfalls!
hi both, I've been thinking about how to generalize CURIE- and prefix-filtering operations in dataframes for some time, and have opened up a parallel PR in the curies package (https://github.com/biopragmatics/curies/pull/179), where I will try and make more efficient implementations of these operations that can be reused in SSSOM.
Though, please don't let this get in the way of making your own PR to SSSOM @ptgolden, that would be highly appreciated!
Very useful! Thanks @cthoyt and @ptgolden!
PR in #609. This removes basically all of the overhead in the chunk I identified in the issue. Almost all of the compute time left is in parsers.from_sssom_dataframe, which is what it is.
Nice, @cthoyt! I imagine this function might be taken out in the future and replaced with your implementation.
Current status on this issue: there's an open PR (https://github.com/mapping-commons/sssom-py/pull/611) comparing several implementations, waiting on Patrick to address feedback.