Handling multiple datasets in FCS file

Open nwzy opened this issue 5 years ago • 8 comments

Hi @whitews, of all the cytometry libs available, I appreciate how straightforward and friendly yours is.

Do you support handling multiple datasets in FlowKit? I know that other projects (fcs/CytoFlow) allow this, but their script truncates the last dataset when reading.

Thanks!

nwzy avatar May 17 '19 20:05 nwzy

@nwzy Thanks for the kind words. FlowKit is a relatively new project and one that I have wanted to develop for several years. Regarding multiple data sets, yes I do plan on supporting this. In fact, the recent Session class is intended to be the programmatic equivalent of a FlowJo workspace, though it isn't yet finished...I am currently swamped with other projects. However, I do plan on getting back to FlowKit in the next week or so. I'm curious to know what use cases and functionality you would like to have in the library?

Also, I'm not sure what you mean that other libraries truncate the last sample, can you elaborate on that problem?

Kind regards, Scott

whitews avatar May 17 '19 21:05 whitews

You're welcome! I'd love to contribute where I can

I'm curious to know what use cases and functionality you would like to have in the library?

We routinely use Merck's easyCyte high-throughput flow cytometer, which takes a 96-well plate. Some of the manual things that I see others do (that could be streamlined) are:

  • Scrolling through samples to manually compare histograms
  • Screenshot/crop the samples for presentations
  • Compare FCS files of the same samples from different days

The principal scientist and I have managed to at least get Jupyter notebooks into use, but we have to write the code out in pre-made cells and narrow the inputs down to the files/dirs that the users are interested in.

Also, I'm not sure what you mean that other libraries truncate the last sample, can you elaborate on that problem?

Sure; when grepping an FCS file with multiple samples, all the samples are there. However, when using the CytoFlow lib (which depends on another lib, fcsparser), for some reason it doesn't recognize the last sample. For example:

Grepping the raw FCS as a binary:

(cyto) nwzy@server:~/flow_data$ grep -ao '$SAMPLEID[^$SMNO]*' /my/sample/file.PRO.FCS
$SAMPLEID/Sample0/
$SAMPLEID/Sample1/
$SAMPLEID/Sample2/

But when using the CytoFlow lib:

input:

import cytoflow as flow

tube = flow.Tube(file='/my/sample/file.PRO.FCS')
import_op = flow.ImportOp(data_set=2, tubes=[tube])
ex = import_op.apply(metadata_only=True)

md = ex.metadata['fcs_metadata']['/my/sample/file.PRO.FCS']['GTI$SAMPLEID']

md

output:

CytoFlow: data_set=2 does not exist

The fastest way around this was to just capture an extra dummy sample at the end

Hope that helps out, Nic

nwzy avatar May 20 '19 21:05 nwzy

Ahh, I misread the issue title. I thought you were asking about handling multiple FCS files under the same workflow, but you are referring to multi-sample FCS files. I have known about these for quite some time, as they are referenced in the FCS specification, but have never run across one. I've wondered if the cytometers we have in our lab can produce these, but they don't allow me to touch them ;) Would be very interested in getting one of these files to provide support for them. Could you send me one?

whitews avatar May 20 '19 22:05 whitews

Here's a good example FCS from the Flow Repository; it has all the standard stuff you'd expect in a multi-sample FCS 3.0 file.

It's not a perfect example since it seems that each software writes out the metadata just a tiny bit differently.

My understanding is that ID (at the beginning of the line containing the metadata) and $NEXTDATA are used by the bundled programs to cycle through the data, and GTI$SAMPLEID is the user-typed name of the sample.
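
To make the $NEXTDATA part concrete, here is a rough sketch of walking the data sets in one file. It assumes the layout from the FCS 3.x spec as I understand it: each data set starts with its own 58-byte HEADER, the TEXT offsets in that HEADER are relative to the start of the data set, and $NEXTDATA is the byte offset from the start of the current data set to the next one (0 meaning there are no more). The function names and the synthetic demo file are made up for illustration, and escaped (doubled) delimiters are not handled, so treat it as a sketch rather than a real parser.

```python
HEADER_LEN = 58  # version (6) + spaces (4) + six 8-byte ASCII offsets


def parse_text_segment(text: bytes) -> dict:
    """Split a TEXT segment into a keyword -> value dict.

    The first byte of TEXT is the delimiter. Escaped (doubled)
    delimiters inside values are NOT handled in this sketch.
    """
    if not text:
        return {}
    delimiter = text[:1]
    tokens = text[1:].split(delimiter)
    if tokens and tokens[-1] == b"":
        tokens.pop()  # drop the empty token after the closing delimiter
    return {
        key.decode("ascii").upper(): value.decode("ascii", "replace")
        for key, value in zip(tokens[0::2], tokens[1::2])
    }


def iter_data_sets(raw: bytes):
    """Yield (byte_offset, keywords) per data set by following $NEXTDATA."""
    offset = 0
    visited = set()  # guards against a $NEXTDATA chain that loops
    while offset not in visited and offset + HEADER_LEN <= len(raw):
        visited.add(offset)
        header = raw[offset:offset + HEADER_LEN]
        text_start = int(header[10:18].decode("ascii"))
        text_end = int(header[18:26].decode("ascii"))  # inclusive offset
        text = raw[offset + text_start:offset + text_end + 1]
        keywords = parse_text_segment(text)
        yield offset, keywords
        next_data = int(keywords.get("$NEXTDATA", "0").strip() or 0)
        if next_data == 0:
            break  # last data set in the file
        offset += next_data


def make_demo_data_set(sample_id: str, next_data: int, size: int = 256) -> bytes:
    """Build a tiny synthetic data set (HEADER + TEXT only), padded to `size`."""
    text = f"/$NEXTDATA/{next_data}/GTI$SAMPLEID/{sample_id}/".encode("ascii")
    text_start, text_end = HEADER_LEN, HEADER_LEN + len(text) - 1
    offsets = "".join(f"{n:>8}" for n in (text_start, text_end, 0, 0, 0, 0))
    header = b"FCS3.0" + b" " * 4 + offsets.encode("ascii")
    return (header + text).ljust(size, b" ")


demo = (
    make_demo_data_set("Sample0", 256)
    + make_demo_data_set("Sample1", 256)
    + make_demo_data_set("Sample2", 0)
)
for offset, keywords in iter_data_sets(demo):
    print(offset, keywords["GTI$SAMPLEID"])
# prints:
# 0 Sample0
# 256 Sample1
# 512 Sample2
```

For a real file you would pass in the bytes read with open(path, "rb"). A reader that stops one data set early is probably mishandling the final link in this chain, though I haven't confirmed that is what happens in fcsparser.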

I've wondered if the cytometers we have in our lab can produce these, but they don't allow me to touch them ;)

That's just cruel

nwzy avatar May 20 '19 23:05 nwzy

Thanks for the link, will add this issue to the next milestone.

I've wondered if the cytometers we have in our lab can produce these, but they don't allow me to touch them ;)

That's just cruel

Maybe, but I also don't accept pull requests from the biologists ;)

whitews avatar May 20 '19 23:05 whitews

@nwzy I've had some time to look at the file you linked to, and it seems like the file might not be a valid FCS file. There are odd XML fragments in the text section, which might be okay, but they are oddly out of order as if the file's text section has been rewritten by a program that jumbled it. Do you have another example of a multi-data FCS file?

whitews avatar Aug 19 '19 17:08 whitews

Hey @whitews hmm, that's strange...

Let me see if we have some data from an open whitepaper that we can share. I will tag you when I find something.

nwzy avatar Aug 21 '19 17:08 nwzy

Reviving this issue as I now have an example file. This will be supported in FlowIO 1.1.

The way this will work is that the FlowIO FlowData class will raise an error upon reading a multi-data file. That error will indicate to use a new utility function in FlowIO that returns a list of FlowData instances, one for each data set in the file. The FlowKit Sample class will check for this error and indicate a similar workflow. There will be an analogous pass-through utility function in FlowKit to return a list of Sample instances when given a multi-data file.
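
Roughly, the calling pattern would look like the sketch below. Every name in it (MultipleDataSetsFound, read_single_data_set, read_all_data_sets, load_data_sets) is a hypothetical stand-in and not the actual FlowIO or FlowKit API, and the "file" is just a list of keyword dicts, one per data set, so that the example runs on its own.

```python
class MultipleDataSetsFound(Exception):
    """Hypothetical error: the strict reader found more than one data set."""


def read_single_data_set(parsed_file: list) -> dict:
    """Strict reader: returns the only data set, or raises with a pointer
    to the multi-data-set utility function."""
    if len(parsed_file) > 1:
        raise MultipleDataSetsFound(
            f"file contains {len(parsed_file)} data sets; "
            "use read_all_data_sets() instead"
        )
    return parsed_file[0]


def read_all_data_sets(parsed_file: list) -> list:
    """Utility reader: returns one entry per data set in the file."""
    return list(parsed_file)


def load_data_sets(parsed_file: list) -> list:
    """Caller-side workflow: try the strict reader first, then fall back."""
    try:
        return [read_single_data_set(parsed_file)]
    except MultipleDataSetsFound:
        return read_all_data_sets(parsed_file)


multi_file = [{"GTI$SAMPLEID": "Sample0"}, {"GTI$SAMPLEID": "Sample1"}]
print(len(load_data_sets(multi_file)))      # 2
print(len(load_data_sets(multi_file[:1])))  # 1
```

Raising in the single-data-set reader keeps existing code from silently reading only the first data set, while the separate utility function makes the multi-data case an explicit choice by the caller.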

whitews avatar Jun 04 '22 01:06 whitews