default dtypes & core.construction

Open smitkadvani opened this issue 1 year ago • 4 comments

Looking at the tests, there is a lot of boilerplate that could be reduced, and the tests could be made more readable, if we could specify dtypes for the functions in core.construction (including from_any, from_list, and from_series).

For example:

    df1 = pd.DataFrame(
        [
            ['chr1', 1, 1]
        ],
        columns=['chrom','start','end']
    ).astype({"start": pd.Int64Dtype(), "end": pd.Int64Dtype()})

would become

df1 = bf.from_any(['chr1', 1, 1])

We provide a dictionary of default column names in core.specs; however, there does not seem to be a dictionary (or other specification) for default dtypes.

One option would be to add them right after the default column names in core.specs: https://github.com/open2c/bioframe/blob/main/bioframe/core/specs.py#L11C1-L12C1

If added, should they be int, pd.Int64Dtype(), or something else for start and end?

smitkadvani avatar Feb 12 '24 23:02 smitkadvani

@nvictus @golobor any suggestions? I guess the idea would be:

_rc = {
    "colnames": {"chrom": "chrom", "start": "start", "end": "end"},
    "col_dtypes": {"chrom": str, "start": pd.Int64Dtype(), "end": pd.Int64Dtype()},
}

and something like:

def from_any(regions, fill_null=False, name_col="name", cols=None, col_dtypes=None):
    # Fall back to the configured default dtypes when none are passed in.
    ck1_dtype, sk1_dtype, ek1_dtype = _get_default_col_dtypes() if col_dtypes is None else col_dtypes

gfudenberg avatar Mar 13 '24 11:03 gfudenberg

Why col_dtypes as opposed to dtypes?

nvictus avatar Mar 20 '24 18:03 nvictus

I guess we could say the same thing about colnames vs cols... as keys in the _rc dictionary (which could perhaps also get a less cryptic name)?

gfudenberg avatar Mar 20 '24 18:03 gfudenberg

rc is borrowed from matplotlib, which borrowed it from Linux: https://stackoverflow.com/questions/11030552/what-does-rc-mean-in-dot-files . It doesn't have to be named that; it could be, for example, _conf.

pd.Int64Dtype() is probably better than int, right? I think we use it quite extensively throughout the library, so we might as well make it the default. That way missing values are handled more consistently (pd.NA in an integer column rather than a cast to float NaN).
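
A small illustration of the difference, using only standard pandas behavior: with default inference a missing coordinate turns the whole column into floats, while the nullable `Int64` dtype keeps integers and marks the gap as `pd.NA`. The sample values are made up:

```python
import pandas as pd

starts = [1, None, 3]

# Default inference: the missing value forces float64, so 1 becomes 1.0.
inferred = pd.Series(starts)

# Nullable integer dtype: values stay integers, the gap is pd.NA.
nullable = pd.Series(starts, dtype=pd.Int64Dtype())

print(inferred.dtype)  # float64
print(nullable.dtype)  # Int64
```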

names/dtypes is fine with me! We can also use a nested dict or a SimpleNamespace/NamedTuple (not bad, since we only need to read/modify variables with existing keys):

_conf = dict(
    names = dict(chrom='chrom', start='start', end='end'),
    dtypes = dict(...)
)

or:

import types
_conf = types.SimpleNamespace()
_conf.col = types.SimpleNamespace()
_conf.col.names = types.SimpleNamespace()
_conf.col.names.chrom = 'chrom'
...
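
To compare the two options side by side, here is a small sketch of reading and overriding a value with each. The layout (`names`, `dtypes`) and the dtype strings are illustrative assumptions, not bioframe's actual configuration:

```python
import types

# Option 1: nested dict.
conf_dict = dict(
    names=dict(chrom="chrom", start="start", end="end"),
    dtypes=dict(chrom="object", start="Int64", end="Int64"),
)

# Option 2: nested SimpleNamespace, built in one expression.
conf_ns = types.SimpleNamespace(
    names=types.SimpleNamespace(chrom="chrom", start="start", end="end"),
    dtypes=types.SimpleNamespace(chrom="object", start="Int64", end="Int64"),
)

# Reading: key lookup vs attribute access.
start_dtype_from_dict = conf_dict["dtypes"]["start"]
start_dtype_from_ns = conf_ns.dtypes.start

# Overriding an existing entry looks similar in both.
conf_dict["names"]["chrom"] = "chr"
conf_ns.names.chrom = "chr"

# vars() turns one namespace level back into a plain dict when needed.
names_as_dict = vars(conf_ns.names)
```

Neither container rejects a misspelled key on assignment (a typo silently adds a new entry), so a validating setter would be needed if that matters.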

golobor avatar Mar 20 '24 21:03 golobor