sssom-py icon indicating copy to clipboard operation
sssom-py copied to clipboard

`.clean_prefix_map()`: `remove_unused_builtins` param

Open joeflack4 opened this issue 1 year ago • 6 comments

Overview

Given that built-in prefix maps are not required unless they appear somewhere in the mapping set, it would be nice to have a way to give users the ability to create an msdf and an SSSOM TSV that leaves these out.

joeflack4 avatar May 23 '24 23:05 joeflack4

I would also like to have this option. Otherwise nearly every SSSOM TSV I create is more prefix map than anything else, by lines.

caufieldjh avatar May 30 '24 21:05 caufieldjh

@caufieldjh you must have small mapping sets then - there are only about 6 built in prefixes.. I do agree though with this issue, and happy to incorporate a parameter like this.

matentzn avatar May 31 '24 09:05 matentzn

I was observing the same effect @joeflack4 did - when instantiating a MappingSetDataFrame without a Converter, the included prefix map included all ~1.5K prefixes and the object takes some time to create. If I do it more like this, instantiation is instantaneous:

    metadata = get_default_metadata()
    metadata['curie_map'] = {}
    converter = Converter.from_prefix_map(metadata['curie_map'])
    msdf = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)

caufieldjh avatar May 31 '24 13:05 caufieldjh

@caufieldjh this is a slightly different ask: you are saying that when you instantiate with just a converter, you would like by default to run "clean prefix map".

I think I can see that could be a good idea - but its not the same as the parameter being asked for here.

matentzn avatar May 31 '24 14:05 matentzn

I still see this as related - the call to clean_prefix_map doesn't have to be a default, as there are certainly cases in which it's useful to start with a prefix map including builtins, add/merge new mappings into the msdf, then do the final cleanup step to remove unused prefixes (working under the assumption that some of those prefixes are now present in the mappings, too).

caufieldjh avatar May 31 '24 20:05 caufieldjh

I guess it is slightly different but related. I think that actually what @caufieldjh is referring to is actually just a bug, because there are 1,547 prefixes included in the output when that happens.

In any case, just linking this related stuff here:

  • The 1,547 (not just built-ins) prefixes issue @caufieldjh describes is outlined in sub-task 2.2 (point 3) of: #513
  • I believe it also appears during one of the test scenarios in: #533

joeflack4 avatar May 31 '24 20:05 joeflack4