How do I add a new sample or variant data variable from a file?

Open hammer opened this issue 2 years ago • 1 comments

We have "How do I define a new variable based on others?", and we have "Adding column fields" in the GWAS docs. I think we could expand this documentation to cover, for example, the case where a sample annotation file doesn't contain a matching index. I always forget how dims and coords work, and it would also be useful to show how to add data variables as Dask arrays whose chunking matches that of the other data variables.

hammer avatar Dec 02 '23 22:12 hammer

Here's an example that took me way too long to figure out.

I wanted to add one column of sample metadata to my dataset. The file with the new column is very simple: it has one three-letter sample ancestry code per line, with no header and no sample ID, e.g.

AFR
AFR
AFR
AMR
AMR
AMR
EAS
EAS
...

In the GWAS tutorial, we have a full dataframe with sample id as index and several new columns, so we make a new dataset and use merge.
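For comparison, here is a minimal sketch of that merge-style approach on a toy dataset. It is not the tutorial's exact code: the tiny `ds`, the sample IDs, and the `sample_ancestry` column are made up for illustration, and it assumes only xarray, pandas, and NumPy.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for an sgkit dataset: a "samples" dimension with sample IDs.
ds = xr.Dataset({"sample_id": ("samples", np.array(["S1", "S2", "S3"]))})

# Hypothetical annotation table keyed by sample ID, in a different row order.
df = pd.DataFrame(
    {"sample_ancestry": ["EAS", "AFR", "AMR"]},
    index=pd.Index(["S3", "S1", "S2"], name="samples"),
)

# The DataFrame index becomes a "samples" coordinate on the new Dataset.
ds_annotations = df.to_xarray()

# Label the "samples" dimension with sample IDs so merge aligns by ID, not position.
ds_labeled = ds.assign_coords(samples=ds["sample_id"].values)

# join="left" keeps the sample order of the original dataset.
merged = ds_labeled.merge(ds_annotations, join="left")

print(merged["sample_ancestry"].values)
```

The key point is that alignment happens on the coordinate labels, which is why this route needs a sample ID column in the annotation file.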

In this case, we don't need a full merge. Here's what I ended up doing:

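import dask.array as da
import pandas as pd

# file_base is defined earlier in the session; reading gs:// paths with pandas requires gcsfs
ancestry_file = 'gs://hapnest/example/' + file_base + '.sample'
df = pd.read_csv(ancestry_file, header=None)

# Make a Dask array from the single column of df. The chunk size (600) is meant to
# match the chunking of the other data variables along the "samples" dimension.
# Is there a better dtype to use than "object"?
ancestry_da = da.from_array(df[0].values, chunks=(600,))

# Add it to ds as a new data variable called "sample_ancestry" on the "samples" dimension
ds = ds.assign(sample_ancestry=('samples', ancestry_da))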

What made this tricky for me:

  1. Making the data variable a Dask array with the same chunking as the other data variables. This is quite easy, but it's something new users might not know about.
  2. Figuring out the assign syntax that aligns the new data variable with the samples dimension.
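Both points can be sketched on a toy dataset. This is only an illustration under assumptions: the small `call_dosage` array stands in for real data variables, and the chunk sizes are read from the dataset via `ds.chunksizes` instead of being hard-coded. It also checks the length up front, since without an index the values are matched purely by position.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Toy dataset with one Dask-backed variable chunked along "samples" in blocks of 3.
ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), da.zeros((4, 6), chunks=(2, 3)))}
)

# Values read from a file with no header and no sample ID (made up here).
ancestry = np.array(["AFR", "AFR", "AMR", "AMR", "EAS", "EAS"])

# With no index to align on, the order must already match the samples dimension,
# so at least confirm the lengths agree.
assert len(ancestry) == ds.sizes["samples"]

# Reuse the existing chunking along "samples" rather than hard-coding a number.
sample_chunks = ds.chunksizes["samples"]
ancestry_da = da.from_array(ancestry, chunks=(sample_chunks,))

# The (dimension name, data) tuple tells assign which dimension the values lie along.
ds = ds.assign(sample_ancestry=("samples", ancestry_da))

print(ds["sample_ancestry"].chunks)
```

Building the array from plain strings gives NumPy a fixed-width Unicode dtype here, which may be an alternative to "object" for short codes like these.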

Ultimately I think this was hard because the xarray docs for assign don't have enough examples. We can add more examples in our docs and maybe make some changes upstream.

hammer avatar Dec 03 '23 02:12 hammer