behavior of on=[]

Open gfudenberg opened this issue 3 years ago • 2 comments

How should we infer the space of all possible values for the columns passed to the on=[] argument? For example, this arises when implementing complement(..., on=['strand']), which subtract relies on.

The simplest way to infer all possibilities is to look at the unique values in these columns. This raises two questions:

  1. We need to know the space of all possibilities, including combinations of ['chrom']+on that are not represented by any interval in the input dataframe. Thus we need a way to specify this space.
  2. We need to specify the behavior for pd.NA values in columns passed to on.

Potential solutions for (1):

  • require formatting the column as a categorical with all desired possibilities before passing to bioframe functions (as they call groupby). We could provide a utility function to parse/cast strand column as a categorical.
  • develop a new input format, e.g. pass a dictionary: on={'strand': ('-', '+', pd.NA)}
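A minimal sketch of the first option, assuming plain pandas only. The helper name cast_strand_categorical is hypothetical (it is not part of bioframe), and the category list ('+', '-', '.') is an assumption. With categorical group keys and observed=False, groupby reports every combination of categories, including those with no intervals:

```python
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [0, 10, 5],
    "end": [5, 20, 15],
    "strand": ["+", "+", "-"],
})

def cast_strand_categorical(df, categories=("+", "-", ".")):
    """Hypothetical utility: declare every allowed strand value up front."""
    out = df.copy()
    out["strand"] = pd.Categorical(out["strand"], categories=list(categories))
    return out

df_cat = cast_strand_categorical(df)
# Assumption: chrom is also cast, so the space of chromosomes is explicit too.
df_cat["chrom"] = pd.Categorical(df_cat["chrom"], categories=["chr1", "chr2"])

# observed=False keeps unobserved category combinations as empty groups.
sizes = df_cat.groupby(["chrom", "strand"], observed=False).size()
print(sizes)
```

Here the result covers all 2 x 3 combinations of chrom and strand, with a size of 0 for combinations that have no intervals.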

For (2), there are three options for handling missing values in columns passed to on. We could let the user select one of them with a flag.

  • drop any intervals with pd.NA in the on column from the operation
  • add any intervals with pd.NA to each group
  • treat pd.NA as a separate category for groupby
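The three options could be sketched as follows. This is only an illustration in plain pandas: split_by_on, the na_policy flag, its values, and the "NA" group label are all hypothetical names, not an existing bioframe API.

```python
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [0, 10, 30],
    "end": [5, 20, 40],
    "strand": ["+", "-", pd.NA],
})

NA_KEY = "NA"  # hypothetical label for the group of missing values

def split_by_on(df, col, na_policy="drop"):
    """Hypothetical helper: split df into groups by col under one NA policy."""
    is_na = df[col].isna()
    # Groups built from rows whose value in col is known.
    groups = {key: grp for key, grp in df[~is_na].groupby(col)}
    if na_policy == "drop":
        # Option 1: intervals with a missing value take no part in the operation.
        return groups
    if na_policy == "broadcast":
        # Option 2: intervals with a missing value are added to every group.
        return {key: pd.concat([grp, df[is_na]]) for key, grp in groups.items()}
    if na_policy == "separate":
        # Option 3: missing values form their own group.
        if is_na.any():
            groups[NA_KEY] = df[is_na]
        return groups
    raise ValueError(f"unknown na_policy: {na_policy!r}")

for policy in ("drop", "broadcast", "separate"):
    result = split_by_on(df, "strand", policy)
    print(policy, {key: len(grp) for key, grp in result.items()})
```

In pandas itself, the first option matches the default dropna=True of groupby, and the third is close to what dropna=False provides.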

gfudenberg avatar Aug 26 '21 20:08 gfudenberg

For the strand column, pd.NA here should actually be '.' according to the bioframe specs: https://bioframe.readthedocs.io/en/latest/guide-specifications.html This does not change the logic for other, arbitrary columns, though.

agalitsyna avatar Nov 08 '22 03:11 agalitsyna

For the behavior in (2), I'd try to align as closely as possible with the native behavior of applying df.groupby() to a categorical column where some instances of an allowed categorical value are missing.
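For reference, a small sketch of that native behavior with default groupby settings, on toy data: rows whose categorical value is missing are left out of every group, while an allowed category with no rows still appears as an empty group when observed=False.

```python
import pandas as pd

# "-" is an allowed category with no rows; the third row has a missing strand.
strand = pd.Categorical(["+", "+", None], categories=["+", "-"])
df = pd.DataFrame({"start": [0, 10, 30], "end": [5, 20, 40], "strand": strand})

# Default dropna=True: the row with a missing strand is excluded.
all_categories = df.groupby("strand", observed=False).size()
only_observed = df.groupby("strand", observed=True).size()

print(all_categories.to_dict())  # {'+': 2, '-': 0}
print(only_observed.to_dict())   # {'+': 2}
```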

nvictus avatar Apr 03 '23 21:04 nvictus