list_samples.py

Open mweberr opened this issue 1 year ago • 5 comments

Hi, can you please help me understand lines 91-92 in list_samples.py:

files = list(listfiles(str(data_fp/filename_fmt)))
Samples = {t[1]['sample']: {} for t in files}

The snakemake utils function returns a list of tuples. Then you use a dictionary comprehension to build a dictionary by accessing the sample entry of each tuple. Question: if I screen for fastq files, I don't get a sample entry in the tuple. Why?

Thank you for any explanation! I am trying to debug my code and want to find out how sample lists are generated.

Best, Michael

mweberr avatar Jul 15 '22 16:07 mweberr

Hi Michael,

These lines are actually pretty deceptive because of some of the weird data structures used. I'm just gonna start by listing out data types of everything involved then see if I can help with why it's not picking up fastq files:

  • data_fp: Path
  • listfiles: generator object (yields instead of returning, this is why it's listified)
  • files: list
  • t: tuple
  • t[1]: snakemake.io.Namedlist (haven't looked into why this isn't just a dict, but in this case it seems to behave the same)
  • t[1]['sample']: str

When I run this on a directory with fastq samples it seems to work (sunbeam init --data_fp ../dummy-samples/ ./ with samples named Sample1_R1.fastq, Sample1_R2.fastq, Sample2_R1.fastq, ...). It should output something like this:

Guessing sample name format from files in /path/to/dummy-samples...
  Best guess: {sample}_R{rp}.fastq

Try opening up a Python session, running from snakemake.utils import listfiles, and then calling list(listfiles("/path/to/dummy-samples/{sample}_R{rp}.fastq")). That should give you a list of tuples that look like this:

('/path/to/dummy-samples/Sample1_R1.fastq', ['Sample1', '1'])

If it's not working there's probably something wrong with the path or the inferred sample format that you'll need to fix. Let me know if this helps.
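If you want to see the mechanics without Snakemake installed, here is a minimal sketch of the same idea using only the standard library. The helpers `wildcard_pattern_to_regex` and `list_matching_files` are hypothetical stand-ins written for this example, not Snakemake or Sunbeam code, and they return a plain dict of wildcard values where listfiles yields a Namedlist. The sketch assumes each `{name}` wildcard matches one or more arbitrary characters.

```python
import re
import tempfile
from pathlib import Path


def wildcard_pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '{sample}_R{rp}.fastq' into a regex with one named group per wildcard."""
    # With one capture group, re.split alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    return re.compile(regex + "$")


def list_matching_files(data_fp: Path, filename_fmt: str):
    """Yield (path, wildcards) tuples, loosely mimicking the shape listfiles yields."""
    regex = wildcard_pattern_to_regex(filename_fmt)
    for path in sorted(data_fp.iterdir()):
        match = regex.match(path.name)
        if match:
            yield str(path), match.groupdict()


with tempfile.TemporaryDirectory() as tmp:
    data_fp = Path(tmp)
    # Dummy files: three that fit the pattern and one that does not.
    for name in ["Sample1_R1.fastq", "Sample1_R2.fastq", "Sample2_R1.fastq", "notes.txt"]:
        (data_fp / name).touch()

    files = list(list_matching_files(data_fp, "{sample}_R{rp}.fastq"))
    # Same comprehension as in list_samples.py: one key per distinct sample value.
    samples = {t[1]["sample"]: {} for t in files}

print(samples)  # {'Sample1': {}, 'Sample2': {}}
```

If a file is missing from the result, it means its name did not match the pattern at all, so the pattern (or the path) is the first thing to check.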

Charlie

Ulthran avatar Jul 15 '22 18:07 Ulthran

Hi Charlie, thanks for this great help. It really took a while to digest the different variable types. But now I found the issue. For me I would like to guess ID from fastq files formatted like INFO_ID_LIBID_NUMBERS_EXT.fastq.gz

To achieve this with the Snakemake pattern matching I would need to define something like {A}_{sample}_{B}_{C}_{ext}.fastq.gz

However, this is not straightforward, because every wildcard needs a well-defined string, and no numbers or special characters are allowed. I am wondering whether this is possible with the Snakemake wildcards and the listfiles function. What do you think?

Best, Michael

mweberr avatar Jul 18 '22 12:07 mweberr

Are all of those fields necessary to uniquely identify each sample? i.e. do your files look like this:

1_A_1_1_R1.fastq, 1_A_1_1_R2.fastq, 2_A_1_1_R..., 2_A_2_1_R..., 2_A_2_2_R..., ...

or like this:

info_A_1_123_R1.fastq, info_A_1_123_R2.fastq, info_B_1_123_R1.fastq, info_B_1_123_R2.fastq, info_C_..., ...

or some middle ground between these two? In the first case, sunbeam init should guess something like {sample}_R{rp}.fastq as the pattern and just lump everything before the read pair together. In the second case, it should come out as info_{sample}_1_123_R{rp}.fastq, pulling out only the unique part for the sample id.
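To make the difference between the two cases concrete, here is a small sketch of what `{sample}` would capture under each pattern. It uses hand-written regular expressions from Python's `re` module in place of the Snakemake patterns, on the assumption that each wildcard matches one or more arbitrary characters. It does not reproduce how sunbeam init guesses the pattern.

```python
import re

# Case 1: pattern {sample}_R{rp}.fastq
# Everything before the read pair ends up in the sample name.
lumped = re.compile(r"(?P<sample>.+)_R(?P<rp>.+)\.fastq$")
case1 = lumped.match("1_A_1_1_R1.fastq").groupdict()
print(case1)  # {'sample': '1_A_1_1', 'rp': '1'}

# Case 2: pattern info_{sample}_1_123_R{rp}.fastq
# The constant fields are literal text, so only the varying part is captured.
unique_part = re.compile(r"info_(?P<sample>.+)_1_123_R(?P<rp>.+)\.fastq$")
names = [
    "info_A_1_123_R1.fastq",
    "info_A_1_123_R2.fastq",
    "info_B_1_123_R1.fastq",
    "info_B_1_123_R2.fastq",
]
sample_ids = sorted({unique_part.match(n)["sample"] for n in names})
print(sample_ids)  # ['A', 'B']
```

Either way, each sample still gets a unique key, which is all the sample list needs.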

If this is missing the point of your question, could you provide a list of example sample file names?

Charlie

Ulthran avatar Jul 18 '22 14:07 Ulthran

Here is an example list of sample names:

NA-27750_ID1200_lib8888_7504_1_1.fastq.gz
NA-27750_ID1200_lib8888_7504_1_2.fastq.gz
NA-27790_ID1201_lib9999_7504_1_2.fastq.gz
NA-27790_ID1201_lib9999_7504_1_2.fastq.gz
NA-27790_ID1202_lib9999_7504_1_2.fastq.gz
NA-27790_ID1202_lib9999_7504_1_2.fastq.gz

{A}_{sample}_{lib}_{B}_{C}_{rep}.fastq.gz

The A part varies in combination with C, while I have not figured out the meaning of A and C.
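For reference, a pattern like this can be checked against the file names with plain regular expressions. The following is a minimal sketch, not Snakemake code: `pattern_to_regex` is a hypothetical helper that assumes a bare `{name}` wildcard matches one or more arbitrary characters (digits and hyphens included) and that `{name,regex}` restricts a wildcard to the given regular expression, which is my understanding of Snakemake's wildcard syntax. The specific constraints (`ID\d+`, `lib\d+`, `[12]`) are guesses based on the names listed above.

```python
import re

# Matches '{name}' or '{name,constraint}'; constraints containing braces are not handled.
WILDCARD = re.compile(r"\{(?P<name>\w+)(?:,(?P<constraint>[^{}]+))?\}")


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Build a regex with one named group per wildcard in the pattern."""
    regex, last = "", 0
    for wc in WILDCARD.finditer(pattern):
        regex += re.escape(pattern[last:wc.start()])  # literal text between wildcards
        regex += f"(?P<{wc['name']}>{wc['constraint'] or '.+'})"
        last = wc.end()
    return re.compile(regex + re.escape(pattern[last:]) + "$")


# Three of the file names from the list above.
names = [
    "NA-27750_ID1200_lib8888_7504_1_1.fastq.gz",
    "NA-27750_ID1200_lib8888_7504_1_2.fastq.gz",
    "NA-27790_ID1201_lib9999_7504_1_2.fastq.gz",
]

# Unconstrained wildcards: the underscores alone are enough to split these names.
plain = pattern_to_regex("{A}_{sample}_{lib}_{B}_{C}_{rep}.fastq.gz")
fields = plain.match(names[0]).groupdict()
print(fields["sample"], fields["rep"])  # ID1200 1

# Constrained wildcards make the intent explicit and reject unexpected names.
constrained = pattern_to_regex(
    r"{A}_{sample,ID\d+}_{lib,lib\d+}_{B}_{C}_{rep,[12]}.fastq.gz"
)
samples = {constrained.match(n)["sample"]: {} for n in names}
print(samples)  # {'ID1200': {}, 'ID1201': {}}
```

Under these assumptions the numbers and the hyphen in the first field are not a problem, because the split is decided by the literal underscores between the wildcards.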

mweberr avatar Jul 18 '22 15:07 mweberr

What happens when you run sunbeam init --data_fp /path/to/files /project/fp on these?

Ulthran avatar Jul 18 '22 16:07 Ulthran

Hi Michael,

I'm going to close out this issue. I'd advise just using the default pattern matching from sunbeam init --data_fp, and if you need to rename products, that's easy enough to do at the end. There's no need to force pattern matching on a specific part of the sample name if the whole name also uniquely identifies it.

Thanks, Charlie

Ulthran avatar Sep 30 '22 17:09 Ulthran