bioframe
bioframe copied to clipboard
default dtypes & core.construction
looking at tests, there is a lot of boilerplate that could be reduced, and tests could be made more readable, if we could specify dtypes for functions in core.construction (including from_any
, from_list
, and from_series
).
for example:
df1 = pd.DataFrame(
[
['chr1', 1, 1]
],
columns=['chrom','start','end']
).astype({"start": pd.Int64Dtype(), "end": pd.Int64Dtype()})
would become
df1 = bf.from_any(['chr1', 1, 1])
We provide a dictionary for default columns names in core.specs
, however there does not seem to be a dictionary (or other specification) for default dtypes.
One option would be to add them right after the default column names in core.specs
:
https://github.com/open2c/bioframe/blob/main/bioframe/core/specs.py#L11C1-L12C1
If added, should they be int
, pd.Int64Dtype()
, or something else for start
and end
?
@nvictus @golobor any suggestions? I guess the idea would be:
_rc = {"colnames": {"chrom": "chrom", "start": "start", "end": "end"},
"col_dtpyes": {"chrom": str, "start": pd.Int64Dtype(), "end": pd.Int64Dtype()}
}
and something like:
def from_any(regions, fill_null=False, name_col="name", cols=None, col_dtypes=None):
ck1_dtype, sk1_dtype, ek1_dtype = _get_default_col_dtypes() if cols is None else cols
Why col_dtypes
as opposed to dtypes
?
I guess we could say the same thing about colnames
vs cols
... as keys in the _rc dictionary (which could perhaps also get a less cryptic name)?
rc is borrowed from matplotlib which borrowed it from Linux: https://stackoverflow.com/questions/11030552/what-does-rc-mean-in-dot-files . It doesn't have to be named like that, can be, for example, _conf.
pd.Int64Dtype() is probably better than int, right? I think we use it quite extensively throughout the library, so we might as well make it the default - this way we'll have more consistent NaNs.
names/dtypes is fine with me! We can also use a nested dict or a SimpleNamespace/NamedTuple (not bad, since we only need to read/modify variables with existing keys):
_conf = dict(
names = dict(chrom='chrom', start='start', end='end'),
dtypes = dict(...)
)
or:
import types
_conf = types.SimpleNamespace()
_conf.col = types.SimpleNamespace()
_conf.col.names = types.SimpleNamespace()
_conf.col.names.chrom=chrom
...