Generated summary file missing required column

Open skchronicles opened this issue 1 year ago • 0 comments

Hello @a-slide,

I hope you are having a great day! I was testing out pycoQC and ran into an issue after generating a summary file with Fast5_to_seq_summary.

Describe the bug

The summary file produced by Fast5_to_seq_summary was passed to pycoQC, which raised the following error:

Traceback (most recent call last):
  File "/usr/local/bin/pycoQC", line 8, in <module>
    sys.exit(main_pycoQC())
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 115, in main_pycoQC
    pycoQC (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC.py", line 120, in pycoQC
    parser = pycoQC_parse (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 96, in __init__
    summary_reads_df = self._parse_summary()
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 136, in _parse_summary
    df = self._select_df_columns (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/pycoQC_parse.py", line 397, in _select_df_columns
    raise pycoQCError("Column {} not found in the provided sequence_summary file".format(col))
pycoQC.common.pycoQCError: Column read_len not found in the provided sequence_summary file

To Reproduce

Steps to reproduce the behavior:

  1. Fast5_to_seq_summary command to generate the summary file:
$ Fast5_to_seq_summary --threads 8 -f sample/fast5 -s summary.tsv --verbose 2

Here are the first few lines of the output summary.tsv file:

read_id	run_id	channel	start_time
000a1b52-fad6-4d6f-b113-c4b24013fcf9	8d6deda632c3a7303f91016b7707e7310e0bc054	256	42618
0026ba30-0061-401d-8dc1-3cb556d71cb9	8d6deda632c3a7303f91016b7707e7310e0bc054	133	29349
000d264a-1a98-4a55-beb5-9f02dd42fce2	8d6deda632c3a7303f91016b7707e7310e0bc054	170	42809
001ddc14-ccb8-42c3-9fd3-74db3c431a75	8d6deda632c3a7303f91016b7707e7310e0bc054	110	42649
0048af85-5c18-4745-b51e-2fab957aceab	8d6deda632c3a7303f91016b7707e7310e0bc054	61	42292
00519880-3d53-4ee3-8528-7a388ad69b24	8d6deda632c3a7303f91016b7707e7310e0bc054	198	42873

As you can see here, there is no column containing sequence/read length information.

  2. pycoQC command to generate the report:
$ pycoQC -f summary.tsv -o  test.html -j test.json --verbose
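
The missing length column noted above can be confirmed with a small header check. This is only a minimal sketch: the helper name `find_length_columns` is hypothetical, and the candidate names are simply the length-related column names mentioned in this issue, not a confirmed list from the pycoQC source.

```python
import csv
import io

# Length-related column names mentioned in this issue. Whether pycoQC accepts
# exactly this set is an assumption, not something confirmed from its source.
LENGTH_COLUMNS = (
    "read_len",
    "sequence_length_template",
    "sequence_length_2d",
    "sequence_length",
)


def find_length_columns(handle):
    """Return the length-like column names found in the header of a TSV file."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [col for col in header if col in LENGTH_COLUMNS]


# Header copied from the summary.tsv shown above
sample = io.StringIO("read_id\trun_id\tchannel\tstart_time\n")
print(find_length_columns(sample))  # prints []
```

For a file on disk, the same function can be called with `open("summary.tsv", newline="")` as the handle.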

Expected behavior

I was expecting the summary file generated by Fast5_to_seq_summary to be compatible with pycoQC. I also tried re-running Fast5_to_seq_summary with the following --fields option (to include every available field):

--fields barcode_arrangement barcode_full_arrangement barcode_score calibration_strand_end calibration_strand_genome_template calibration_strand_identity calibration_strand_start called_events channel channel_digitisation channel_offset channel_range channel_sampling_rate device_id duration flow_cell_id mean_qscore_template protocol_run_id read_id read_number run_id sample_id sequence_length_template skip_prob start_mux start_time stay_prob step_prob strand_score

However, that did not help, and I still get the same error message.

I can see in your parser that you look for the following columns, rename them, and then check that they exist:

[screenshot of the column-renaming code in the parser]

However, if I try to pass sequence_length_2d or sequence_length to the --fields option of Fast5_to_seq_summary, it errors out:

Check input data and options
Traceback (most recent call last):
  File "/usr/local/bin/Fast5_to_seq_summary", line 8, in <module>
    sys.exit(main_Fast5_to_seq_summary())
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/__main__.py", line 168, in main_Fast5_to_seq_summary
    Fast5_to_seq_summary (
  File "/usr/local/lib/python3.10/dist-packages/pycoQC/Fast5_to_seq_summary.py", line 119, in __init__
    raise pycoQCError ("Field {} is not valid, please choose among the following valid fields: {}".format(field, ",".join(self.attrs_grp_dict.keys())))
pycoQC.common.pycoQCError: Field sequence_length_2d is not valid, please choose among the following valid fields: mean_qscore_template,sequence_length_template,called_events,skip_prob,stay_prob,step_prob,strand_score,read_id,start_time,duration,start_mux,read_number,channel,channel_digitisation,channel_offset,channel_range,channel_sampling_rate,run_id,sample_id,device_id,protocol_run_id,flow_cell_id,calibration_strand_genome_template,calibration_strand_end,calibration_strand_start,calibration_strand_identity,barcode_arrangement,barcode_full_arrangement,barcode_score
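
To make the mismatch explicit, here is a minimal sketch that checks requested field names against the valid-field list printed in the error message above. The names `VALID_FIELDS` and `split_fields` are hypothetical helpers for illustration and are not part of pycoQC.

```python
# Valid field names copied from the Fast5_to_seq_summary error message above
VALID_FIELDS = set(
    "mean_qscore_template,sequence_length_template,called_events,skip_prob,"
    "stay_prob,step_prob,strand_score,read_id,start_time,duration,start_mux,"
    "read_number,channel,channel_digitisation,channel_offset,channel_range,"
    "channel_sampling_rate,run_id,sample_id,device_id,protocol_run_id,"
    "flow_cell_id,calibration_strand_genome_template,calibration_strand_end,"
    "calibration_strand_start,calibration_strand_identity,barcode_arrangement,"
    "barcode_full_arrangement,barcode_score".split(",")
)


def split_fields(requested):
    """Split requested field names into (valid, invalid), preserving order."""
    valid = [name for name in requested if name in VALID_FIELDS]
    invalid = [name for name in requested if name not in VALID_FIELDS]
    return valid, invalid


valid, invalid = split_fields(
    ["sequence_length_template", "sequence_length_2d", "sequence_length"]
)
print(valid)    # prints ['sequence_length_template']
print(invalid)  # prints ['sequence_length_2d', 'sequence_length']
```

Of the three length-related names, only sequence_length_template appears in the list of fields that Fast5_to_seq_summary accepts.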

Desktop:

  • OS: Ubuntu 20.04
  • pycoQC Version: v2.5.2, installed from PyPI

If you need anything else, please let me know.

Best Regards, @skchronicles

skchronicles • Feb 09 '23 16:02