
Custom error messages for file schema validation

Open ewels opened this issue 3 years ago • 9 comments

Working on adding the ability for pipeline devs to write nicer, more user-friendly error message strings for the validation.

ewels avatar May 12 '21 14:05 ewels

nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit 835c3e2

+| ✅ 115 tests passed       |+
#| ❔  17 tests were ignored |#
!| ❗  78 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

  • files_exist - File not found: environment.yml
  • files_exist - File not found: Dockerfile
  • nextflow_config - Config variable not found: process.container
  • nextflow_config - Config manifest.version should end in dev: '3.1'
  • params_used - Config variable not found in main.nf: params.input
  • params_used - Config variable not found in main.nf: params.skip_sra_fastq_download
  • params_used - Config variable not found in main.nf: params.genome
  • params_used - Config variable not found in main.nf: params.transcript_fasta
  • params_used - Config variable not found in main.nf: params.additional_fasta
  • params_used - Config variable not found in main.nf: params.splicesites
  • params_used - Config variable not found in main.nf: params.gtf_extra_attributes
  • params_used - Config variable not found in main.nf: params.gtf_group_features
  • params_used - Config variable not found in main.nf: params.featurecounts_feature_type
  • params_used - Config variable not found in main.nf: params.featurecounts_group_type
  • params_used - Config variable not found in main.nf: params.gencode
  • params_used - Config variable not found in main.nf: params.save_reference
  • params_used - Config variable not found in main.nf: params.with_umi
  • params_used - Config variable not found in main.nf: params.umitools_extract_method
  • params_used - Config variable not found in main.nf: params.umitools_bc_pattern
  • params_used - Config variable not found in main.nf: params.save_umi_intermeds
  • params_used - Config variable not found in main.nf: params.clip_r1
  • params_used - Config variable not found in main.nf: params.clip_r2
  • params_used - Config variable not found in main.nf: params.three_prime_clip_r1
  • params_used - Config variable not found in main.nf: params.three_prime_clip_r2
  • params_used - Config variable not found in main.nf: params.trim_nextseq
  • params_used - Config variable not found in main.nf: params.save_trimmed
  • params_used - Config variable not found in main.nf: params.skip_trimming
  • params_used - Config variable not found in main.nf: params.remove_ribo_rna
  • params_used - Config variable not found in main.nf: params.save_non_ribo_reads
  • params_used - Config variable not found in main.nf: params.ribo_database_manifest
  • params_used - Config variable not found in main.nf: params.aligner
  • params_used - Config variable not found in main.nf: params.pseudo_aligner
  • params_used - Config variable not found in main.nf: params.seq_center
  • params_used - Config variable not found in main.nf: params.bam_csi_index
  • params_used - Config variable not found in main.nf: params.star_ignore_sjdbgtf
  • params_used - Config variable not found in main.nf: params.hisat2_build_memory
  • params_used - Config variable not found in main.nf: params.stringtie_ignore_gtf
  • params_used - Config variable not found in main.nf: params.min_mapped_reads
  • params_used - Config variable not found in main.nf: params.save_merged_fastq
  • params_used - Config variable not found in main.nf: params.save_unaligned
  • params_used - Config variable not found in main.nf: params.save_align_intermeds
  • params_used - Config variable not found in main.nf: params.skip_markduplicates
  • params_used - Config variable not found in main.nf: params.skip_alignment
  • params_used - Config variable not found in main.nf: params.skip_qc
  • params_used - Config variable not found in main.nf: params.skip_bigwig
  • params_used - Config variable not found in main.nf: params.skip_stringtie
  • params_used - Config variable not found in main.nf: params.skip_fastqc
  • params_used - Config variable not found in main.nf: params.skip_preseq
  • params_used - Config variable not found in main.nf: params.skip_dupradar
  • params_used - Config variable not found in main.nf: params.skip_qualimap
  • params_used - Config variable not found in main.nf: params.skip_rseqc
  • params_used - Config variable not found in main.nf: params.skip_biotype_qc
  • params_used - Config variable not found in main.nf: params.skip_deseq2_qc
  • params_used - Config variable not found in main.nf: params.skip_multiqc
  • params_used - Config variable not found in main.nf: params.deseq2_vst
  • params_used - Config variable not found in main.nf: params.rseqc_modules
  • params_used - Config variable not found in main.nf: params.outdir
  • params_used - Config variable not found in main.nf: params.publish_dir_mode
  • params_used - Config variable not found in main.nf: params.multiqc_config
  • params_used - Config variable not found in main.nf: params.multiqc_title
  • params_used - Config variable not found in main.nf: params.email
  • params_used - Config variable not found in main.nf: params.email_on_fail
  • params_used - Config variable not found in main.nf: params.max_multiqc_email_size
  • params_used - Config variable not found in main.nf: params.plaintext_email
  • params_used - Config variable not found in main.nf: params.monochrome_logs
  • params_used - Config variable not found in main.nf: params.help
  • params_used - Config variable not found in main.nf: params.tracedir
  • params_used - Config variable not found in main.nf: params.validate_params
  • params_used - Config variable not found in main.nf: params.enable_conda
  • params_used - Config variable not found in main.nf: params.singularity_pull_docker_container
  • params_used - Config variable not found in main.nf: params.hostnames
  • params_used - Config variable not found in main.nf: params.config_profile_description
  • params_used - Config variable not found in main.nf: params.config_profile_contact
  • params_used - Config variable not found in main.nf: params.config_profile_url
  • params_used - Config variable not found in main.nf: params.max_memory
  • params_used - Config variable not found in main.nf: params.max_cpus
  • params_used - Config variable not found in main.nf: params.max_time
  • readme - README did not have a Nextflow minimum version badge.

:grey_question: Tests ignored:

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 1.14
  • Run at 2021-05-12 22:14:21

github-actions[bot] avatar May 12 '21 14:05 github-actions[bot]

Ok, I think I'm about there. We can now set custom error messages in the schema and the error message output formatting is much nicer.

After this PR:

image

And if there is no custom error message set in the schema (worst case, as this is a complex validation error):

image

The verbose Nextflow log always gives the full validation errors for debugging. From the top screenshot:

May-12 21:34:09.352 [main] DEBUG nextflow.Nextflow - #: 4 schema violations found
May-12 21:34:09.358 [main] DEBUG nextflow.Nextflow - #/1: 2 schema violations found
May-12 21:34:09.358 [main] DEBUG nextflow.Nextflow - #/1/fastq_1: string [h ttps://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz] does not match pattern ^\S+\.f(ast)?q\.gz$
May-12 21:34:09.359 [main] DEBUG nextflow.Nextflow - #/1/sample: string [WT REP1] does not match pattern ^\S+$
May-12 21:34:09.359 [main] DEBUG nextflow.Nextflow - #/2/fastq_2: #: no subschema matched out of the total 2 subschemas
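The two pattern failures in the log above can be reproduced outside Nextflow. This is a minimal Python sketch, assuming the schema patterns are exactly those printed in the log. The `check` helper, the `CUSTOM_MESSAGES` mapping and the message wording are hypothetical; they only illustrate falling back to the raw validator message when no custom one is set, and are not the PR's actual implementation.

```python
import re

# Patterns copied from the debug log above.
PATTERNS = {
    "fastq_1": r"^\S+\.f(ast)?q\.gz$",
    "sample": r"^\S+$",
}

# Hypothetical custom messages; the real schema key and wording are not shown in this thread.
CUSTOM_MESSAGES = {
    "sample": "Sample name must be provided and cannot contain spaces",
}


def check(field, value):
    """Return None if value matches the field's pattern, else an error string."""
    pattern = PATTERNS[field]
    if re.search(pattern, value):
        return None
    # Fall back to a raw validator-style message when no custom one exists.
    default = f"string [{value}] does not match pattern {pattern}"
    return CUSTOM_MESSAGES.get(field, default)


bad_url = "h ttps://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz"
print(check("sample", "WT REP1"))              # custom message
print(check("fastq_1", bad_url))               # raw fallback message
print(check("fastq_1", bad_url.replace(" ", "")))  # None: value is valid
```

Both failing values contain whitespace, which `\S+` cannot match, so the space in `h ttps` and in `WT REP1` is what triggers the violations.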

ewels avatar May 12 '21 19:05 ewels

Test failures seem unrelated..?

ewels avatar May 12 '21 19:05 ewels

Testing with this samplesheet:

sampe,ft_1,_2,strdedness
WT_REP1,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz,reverse
WT_REP2,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz,reverse

I get the following:

image

Questions:

  • Doesn't it make sense to only look in the first line for column headers? I realise that this isn't always the case, but given that we are already customising some stuff... I found it a bit confusing that the same error message is repeated.
  • Can we change "key" to "header column name" or something similar in the message, to make it clearer what is missing?
  • I tried to get the verbose log with -v but that didn't work. Maybe I misunderstood, but it might be nice to point directly to .nextflow.log if that is the intended pointer?
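The first question above could be sketched as a header-only check that reports missing columns once, instead of once per row. This is an illustrative Python sketch, not the validation code in the PR: the `missing_columns` helper is hypothetical, and the required column names are assumed from the correctly formed samplesheet used elsewhere in this thread.

```python
import csv
import io

# Assumed required columns, taken from the well-formed samplesheet in this thread.
REQUIRED = ["sample", "fastq_1", "fastq_2", "strandedness"]


def missing_columns(samplesheet_text):
    """Return required column names absent from the header (first) line."""
    reader = csv.reader(io.StringIO(samplesheet_text))
    header = next(reader, [])  # empty list if the file has no lines
    return [name for name in REQUIRED if name not in header]


sheet = "sampe,ft_1,_2,strdedness\nWT_REP1,a_1.fastq.gz,a_2.fastq.gz,reverse\n"
missing = missing_columns(sheet)
if missing:
    # Report once, using "header column name" rather than "key".
    print("Missing header column name(s): " + ", ".join(missing))
```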

drpatelh avatar May 12 '21 21:05 drpatelh

Testing with this samplesheet:

sample,fastq_1,fastq_2,strandedness
WT_REP1,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz,reverse
WT_REP2,https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.faq.gz,,reverse

I get the following:

image

Questions:

  • The offending line is actually Row 2. You will notice the file extension is faq.gz there? It would be good to test this a little to make sure we are reporting the correct rows in the error messages, unless this is expected behaviour? In which case it is quite confusing 😅

drpatelh avatar May 12 '21 22:05 drpatelh

Testing with this samplesheet:

"sample","fastq_1","fastq_2","strandedness","accession","altitude","assembly_quality","assembly_software","base_count","binning_software","bio_material","broker_name
","cell_line","cell_type","center_name","checklist","collected_by","collection_date","completeness_score","contamination_score","country","cram_index_aspera","cram_i
ndex_ftp","cram_index_galaxy","cultivar","culture_collection","depth","description","dev_stage","ecotype","elevation","environment_biome","environment_feature","envi
ronment_material","environmental_package","environmental_sample","experiment_accession","experiment_alias","experiment_title","experimental_factor","fastq_aspera","f
astq_bytes","fastq_ftp","fastq_galaxy","fastq_md5","first_created","first_public","germline","host","host_body_site","host_genotype","host_gravidity","host_growth_co
nditions","host_phenotype","host_sex","host_status","host_tax_id","identified_by","instrument_model","instrument_platform","investigation_type","isolate","isolation_
source","last_updated","lat","library_layout","library_name","library_selection","library_source","library_strategy","location","lon","mating_type","nominal_length",
"nominal_sdev","parent_study","ph","project_name","protocol_label","read_count","run_accession","run_alias","salinity","sample_accession","sample_alias","sample_capt
ure_status","sample_collection","sample_description","sample_material","sample_title","sampling_campaign","sampling_platform","sampling_site","scientific_name","seco
ndary_sample_accession","secondary_study_accession","sequencing_method","serotype","serovar","sex","specimen_voucher","sra_aspera","sra_bytes","sra_ftp","sra_galaxy"
,"sra_md5","strain","study_accession","study_alias","study_title","sub_species","sub_strain","submission_accession","submitted_aspera","submitted_bytes","submitted_f
ormat","submitted_ftp","submitted_galaxy","submitted_host_sex","submitted_md5","submitted_sex","target_gene","tax_id"
"SRX7777164","./results/public_data/SRX7777164_T1_1.fastq.gz","./results/public_data/SRX7777164_T1_2.fastq.gz","unstranded","SAMN14154203","","","","159508570","",""
,"","","","SUB6993965","","Wisconsin State Lab of Hygiene","2020-02-14","","","USA: Wisconsin, Madison","","","","","","","Illumina MiSeq sequencing; SARS-CoV-2 vero
E6_illumina","","","","","","","","false","SRX7777164","veroE6_illumina","Illumina MiSeq sequencing; SARS-CoV-2 veroE6_illumina","","fasp.sra.ebi.ac.uk:/vol1/fastq/S
RR111/046/SRR11140746/SRR11140746.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR111/046/SRR11140746/SRR11140746_1.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR111/046/SRR
11140746/SRR11140746_2.fastq.gz","61294;23319229;25016500","ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/046/SRR11140746/SRR11140746.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR
111/046/SRR11140746/SRR11140746_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/046/SRR11140746/SRR11140746_2.fastq.gz","ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/046/SRR11
140746/SRR11140746.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/046/SRR11140746/SRR11140746_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/046/SRR11140746/SRR1114074
6_2.fastq.gz","7daeede4030f4da9f424dcea8cbd39f9;a15660f9cecf8344b74cf2c9564866b7;c89d1a727dee49cd33db6e509c975e6f","2020-02-26","2020-02-26","false","Homo sapiens","
","","","","","","","9606","","Illumina MiSeq","ILLUMINA","","2019-nCoV/USA-WI1/2020 Illumina replicate - Vero E6","passage","2020-02-26","43.0731","PAIRED","veroE6_
illumina","RANDOM PCR","METAGENOMIC","WGS","43.0731 N 89.4012 W","-89.4012","","","","PRJEB39908","","","","358971","SRR11140746","veroE6_Illumina.fastq","","SAMN141
54203","veroE6_illumina","","","This sample has been submitted by pda|gkmoreno on 2020-02-22; Severe acute respiratory syndrome coronavirus 2","","This sample has be
en submitted by pda|gkmoreno on 2020-02-22; Severe acute respiratory syndrome coronavirus 2","","","","Severe acute respiratory syndrome coronavirus 2","SRS6189919",
"SRP250294","","","","","","","","","","","2019-nCoV/USA-WI1/2020","PRJNA607948","PRJNA607948","SARS-CoV-2 parallel sequencing by Illumina and Oxford Nanopore Techno
logies","","","SRA1045991","","","","","","","","","","2697049"
"SRX7777166","./results/public_data/SRX7777166_T1_1.fastq.gz","./results/public_data/SRX7777166_T1_2.fastq.gz","unstranded","SAMN14154205","","","","226957916","",""
,"","","","SUB6993965","","Wisconsin State Lab of Hygiene","2020-02-14","","","USA: Wisconsin, Madison","","","","","","","Illumina MiSeq sequencing; SARS-CoV-2 vero
STAT-1KO_illumina","","","","","","","","false","SRX7777166","veroSTAT-1KO_illumina","Illumina MiSeq sequencing; SARS-CoV-2 veroSTAT-1KO_illumina","","fasp.sra.ebi.a
c.uk:/vol1/fastq/SRR111/044/SRR11140744/SRR11140744.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR111/044/SRR11140744/SRR11140744_1.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fa
stq/SRR111/044/SRR11140744/SRR11140744_2.fastq.gz","44060;31411553;33762390","ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/044/SRR11140744/SRR11140744.fastq.gz;ftp.sra.ebi.ac
.uk/vol1/fastq/SRR111/044/SRR11140744/SRR11140744_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/044/SRR11140744/SRR11140744_2.fastq.gz","ftp.sra.ebi.ac.uk/vol1/fast
q/SRR111/044/SRR11140744/SRR11140744.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/044/SRR11140744/SRR11140744_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/044/SRR1
1140744/SRR11140744_2.fastq.gz","547ec7ddd4d59cfbf19246c6db6907be;439cc63471f416970ba002d75b4c0039;ddac5adc21bd8bee3054c886884c454b","2020-02-26","2020-02-26","false
","Homo sapiens","","","","","","","","9606","","Illumina MiSeq","ILLUMINA","","2019-nCoV/USA-WI1/2020 Illumina replicate - Vero STAT-1 KO","passage","2020-02-26","4
3.0731","PAIRED","veroSTAT-1KO_illumina","RANDOM PCR","METAGENOMIC","WGS","43.0731 N 89.4012 W","-89.4012","","","","PRJEB39908","","","","503344","SRR11140744","ver
oSTAT-1KO_Illumina.fastq","","SAMN14154205","veroSTAT-1KO_illumina","","","This sample has been submitted by pda|gkmoreno on 2020-02-22; Severe acute respiratory syn
drome coronavirus 2","","This sample has been submitted by pda|gkmoreno on 2020-02-22; Severe acute respiratory syndrome coronavirus 2","","","","Severe acute respir
atory syndrome coronavirus 2","SRS6189924","SRP250294","","","","","","","","","","","2019-nCoV/USA-WI1/2020","PRJNA607948","PRJNA607948","SARS-CoV-2 parallel sequen
cing by Illumina and Oxford Nanopore Technologies","","","SRA1045991","","","","","","","","","","2697049"

I get the following:

image

Questions:

  • Can we make it work? This is the format of the auto-generated samplesheet when running the SRA download workflow in the pipeline. We need to quote everything because the SRA metadata fields can contain commas too 👀

drpatelh avatar May 12 '21 22:05 drpatelh

How does the pipeline handle the quotes? Are they stripped off somewhere downstream? We're parsing the file in exactly the same way as the pipeline here so the quotes are being loaded as part of the strings..

Need to think about the level of customisation with the error reports. Should be doable, but we now need to specify whether the file is tabular and whether it has a header row. If it's a YAML file, for example, it wouldn't make sense to add 1 to the entry index and call it rows / columns.
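One way to picture that distinction is a small formatter that only uses row / column wording for tabular input. This is a hypothetical Python sketch: the `describe_location` function, its parameters and the assumption that the validator's entry index is 0-based are all illustrative and not taken from the PR.

```python
def describe_location(entry_index, field, tabular=True, has_header=True):
    """Turn a validator entry index (assumed 0-based) into a readable location."""
    if not tabular:
        # For YAML/JSON-like input, report the entry index unchanged.
        return f"entry {entry_index}, key '{field}'"
    row = entry_index + 1                  # 1-based data row
    line = row + (1 if has_header else 0)  # physical line in the file
    return f"row {row} (file line {line}), column '{field}'"


print(describe_location(1, "fastq_1"))
print(describe_location(1, "fastq_1", tabular=False))
```

Reporting both the data row and the file line is one way to avoid the ambiguity raised earlier in the thread about which row an error message refers to.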

ewels avatar May 13 '21 04:05 ewels

The Python script takes the user-provided samplesheet as input and re-formats it before it is loaded in NF. This has allowed validation and tweaking of the samplesheet, e.g. removing quotes like here: https://github.com/nf-core/rnaseq/blob/596499865e31d79c225d9aee6a7bd8a8a8f63615/bin/check_samplesheet.py#L51

drpatelh avatar May 13 '21 08:05 drpatelh

I think this is best handled by stripping the quotes before passing to validation. Shame the Nextflow function doesn't have an option for this..
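As an illustration of stripping quotes before validation, Python's standard `csv` module removes enclosing double quotes and keeps commas embedded in quoted fields. This is only a sketch: the `read_rows` helper is hypothetical, the three-column sheet is a cut-down stand-in for the SRA samplesheet above, and it is not necessarily how the pipeline's script handles quotes.

```python
import csv
import io


def read_rows(samplesheet_text):
    """Parse CSV text into dicts; enclosing quotes are stripped, embedded commas kept."""
    return list(csv.DictReader(io.StringIO(samplesheet_text)))


# Cut-down stand-in for the fully quoted SRA samplesheet above.
sheet = (
    '"sample","fastq_1","country"\n'
    '"SRX7777164","./results/public_data/SRX7777164_T1_1.fastq.gz","USA: Wisconsin, Madison"\n'
)
rows = read_rows(sheet)
print(rows[0]["sample"])   # value without surrounding quotes
print(rows[0]["country"])  # the comma inside the field is preserved
```

The unquoted values could then be passed on to schema validation, so patterns such as `^\S+$` are tested against the bare strings.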

ewels avatar May 13 '21 20:05 ewels

Closing this as we now have the nf-validation plugin which takes care of all of this and more!

drpatelh avatar Jan 04 '24 10:01 drpatelh