
Dorado Basecaller filters 90% of reads on Hac or Sup models (but not on Fast model)

Open · SouvilleL opened this issue 1 year ago

Issue Report

Please describe the issue:

Hello, as suggested by ONT support, I am submitting an issue encountered while using dorado basecaller.

When basecalling a particular dataset, Dorado filters a large number of reads, but only when using the hac or sup models. The fast model is unaffected. As I understand it, this behaviour is not normal: Dorado is not supposed to filter that many reads.

The data come from a run that crashed during sequencing due to a lack of computer resources. Sequencing was done on a MinION Mk1B, on a Windows 10 machine.

In general, ~90% of reads are filtered with the hac and sup models. Changing the Dorado version does not affect the number of filtered reads, and the number does not vary between repeated basecalling runs.

| Model | Reads basecalled | Reads filtered |
|-------|------------------|----------------|
| Fast  | 7 826 867        | 4141           |
| Hac   | 480 717          | 7 346 151      |
| Sup   | 716 401          | 7 110 470      |
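
For reference, a minimal Python sketch (an illustration only; the counts are copied directly from the table above) that recomputes the filtered fraction per model:

# Quick sanity check of the filtered fraction per model, using the counts
# reported in the table above.
counts = {
    "Fast": (7_826_867, 4_141),
    "Hac": (480_717, 7_346_151),
    "Sup": (716_401, 7_110_470),
}

for model, (basecalled, filtered) in counts.items():
    total = basecalled + filtered
    print(f"{model}: {filtered / total:.1%} of {total:,} reads filtered")

# Prints roughly 0.1% for Fast, ~94% for Hac, and ~91% for Sup.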

Steps to reproduce the issue:

Run the command: dorado basecaller [email protected] ./dataset

Run environment:

  • Dorado version: 0.6.2 / 0.7.3

  • Dorado commands:
    - dorado basecaller [email protected] ./pod5_folder
    - dorado basecaller [email protected] ./pod5_folder
    - dorado basecaller [email protected] ./pod5_folder

  • Operating system: Ubuntu 22.04.4 LTS

  • Hardware (CPUs, Memory, GPUs): Intel Xeon w9-3495X, 2x RTX A6000, 512 GB DDR5

  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5

  • Source data location (on device or networked drive - NFS, etc.): On device

  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
    - Flow cell: FLO-MIN114
    - Read length: N50 1.21 kb
    - Number of reads: 7.84 M
    - Dataset size: 8.09 Gb, 94.3 GB

  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): A data subset of 7GB is available to be shared.

Logs

Head and tail of the basecalling log for this dataset. The complete log is available to be shared but is too large to upload here.

Head

[2024-08-21 11:55:33.089] [info] Running: "basecaller" "[email protected]" "-vv" "2024A92_ET_2024A93/20240612_1029_MN35911_FAZ25801_b2e80db0/recovered_reads_sample1.1/"
[2024-08-21 11:55:33.098] [trace] Model option: '[email protected]' unknown - assuming path
[2024-08-21 11:55:33.098] [info] > Creating basecall pipeline
[2024-08-21 11:55:33.833] [debug] cuda:0 memory available: 49.66GB
[2024-08-21 11:55:33.833] [debug] cuda:0 memory limit 48.66GB
[2024-08-21 11:55:33.833] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 7296
[2024-08-21 11:55:33.833] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 14656
[2024-08-21 11:55:33.833] [debug] Auto batchsize cuda:0: testing up to 10240 in steps of 64
[2024-08-21 11:55:33.913] [debug] cuda:1 memory available: 50.36GB
[2024-08-21 11:55:33.913] [debug] cuda:1 memory limit 49.36GB

Tail

[2024-08-21 11:55:33.089] [info] Running: "basecaller" "[email protected]" "-vv" "2024A92_ET_2024A93/20240612_1029_MN35911_FAZ25801_b2e80db0/recovered_reads_sample1.1/"
[2024-08-21 11:55:33.098] [trace] Model option: '[email protected]' unknown - assuming path
[2024-08-21 11:55:33.098] [info] > Creating basecall pipeline
[2024-08-21 11:55:33.833] [debug] cuda:0 memory available: 49.66GB
[2024-08-21 11:55:33.833] [debug] cuda:0 memory limit 48.66GB
[2024-08-21 11:55:33.833] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 7296
[2024-08-21 11:55:33.833] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 14656
[2024-08-21 11:55:33.833] [debug] Auto batchsize cuda:0: testing up to 10240 in steps of 64
[2024-08-21 11:55:33.913] [debug] cuda:1 memory available: 50.36GB
[2024-08-21 11:55:33.913] [debug] cuda:1 memory limit 49.36GB

SouvilleL avatar Sep 02 '24 16:09 SouvilleL

Hi @SouvilleL, Apologies for the delayed reply.

We've seen other issues when basecalling recovered data from failed / interrupted runs and we're looking into this.

However, could you update the initial message to reformat and clarify the values in this table:

| Model | Reads basecalled | Reads filtered |
|-------|------------------|----------------|
| Fast  | 7 826 867        | 4141           |
| Hac   | 480 717          | 7 346 151      |
| Sup   | 716 401          | 7 110 470      |

Also, the second log excerpt, marked Tail, contains the same information as Head.

Kind regards, Rich

HalfPhoton avatar Sep 18 '24 11:09 HalfPhoton

Hi @HalfPhoton, my apologies for the delay. I have edited the formatting and the tail log message.

SouvilleL avatar Oct 15 '24 12:10 SouvilleL

I have the same issue with the version 5.0.0 models. I have a sample that I previously basecalled with dorado 0.4.x using the sup model, yielding 3.5 million reads. Now 94% of the reads are lost when I rebasecall with the sup model. Could you add a flag that skips the filtering entirely?

dorado --version: 0.8.3

SebastianDall avatar Nov 19 '24 11:11 SebastianDall

Tested this on a small set of 10000 reads with dorado v0.8.2 and 3 different sup model versions (4.2.0, 4.3.0, and 5.0.0).

Reads filtered: 0, 456, and 9379, respectively.

Commands used to run dorado are below; only the version of the sup model changed.

❯ dorado basecaller --device "cuda:0" /data/software/dorado-0.8.2-linux-x64/models/[email protected]/ pod5s/ > calls.bam
[2024-11-19 13:57:52.633] [info] Running: "basecaller" "--device" "cuda:0" "/data/software/dorado-0.8.2-linux-x64/models/[email protected]/" "pod5s/"
[2024-11-19 13:57:52.728] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 13:57:52.728] [info] Normalised: overlap 500 -> 498
[2024-11-19 13:57:52.728] [info] > Creating basecall pipeline
[2024-11-19 13:57:54.984] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/software/dorado-0.8.2-linux-x64/models/[email protected]/. Full benchmarking will run for this device, which may take some time.
[2024-11-19 13:58:06.385] [info] cuda:0 using chunk size 9996, batch size 1088
[2024-11-19 13:58:07.418] [info] cuda:0 using chunk size 4998, batch size 2048
[2024-11-19 13:58:29.926] [info] > Simplex reads basecalled: 10000
[2024-11-19 13:58:29.926] [info] > Basecalled @ Samples/s: 7.957384e+06
[2024-11-19 13:58:29.929] [info] > Finished
❯ dorado basecaller --device "cuda:0" /data/software/dorado-0.8.2-linux-x64/models/[email protected]/ pod5s/ > calls.bam
[2024-11-19 13:54:41.138] [info] Running: "basecaller" "--device" "cuda:0" "/data/software/dorado-0.8.2-linux-x64/models/[email protected]/" "pod5s/"
[2024-11-19 13:54:41.224] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 13:54:41.224] [info] Normalised: overlap 500 -> 498
[2024-11-19 13:54:41.224] [info] > Creating basecall pipeline
[2024-11-19 13:54:43.279] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/software/dorado-0.8.2-linux-x64/models/[email protected]/. Full benchmarking will run for this device, which may take some time.
[2024-11-19 13:54:53.069] [info] cuda:0 using chunk size 9996, batch size 960
[2024-11-19 13:54:54.028] [info] cuda:0 using chunk size 4998, batch size 1920
[2024-11-19 13:55:17.834] [info] > Simplex reads basecalled: 9545
[2024-11-19 13:55:17.834] [info] > Simplex reads filtered: 456
[2024-11-19 13:55:17.834] [info] > Basecalled @ Samples/s: 7.401725e+06
[2024-11-19 13:55:17.838] [info] > Finished
❯ dorado basecaller --device "cuda:0" /data/software/dorado-0.8.2-linux-x64/models/[email protected]/ pod5s/ > calls.bam
[2024-11-19 13:53:18.393] [info] Running: "basecaller" "--device" "cuda:0" "/data/software/dorado-0.8.2-linux-x64/models/[email protected]/" "pod5s/"
[2024-11-19 13:53:18.481] [info] > Creating basecall pipeline
[2024-11-19 13:53:23.605] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 4090" and model /data/software/dorado-0.8.2-linux-x64/models/[email protected]/. Full benchmarking will run for this device, which may take some time.
[2024-11-19 13:53:29.421] [info] cuda:0 using chunk size 12288, batch size 224
[2024-11-19 13:53:30.078] [info] cuda:0 using chunk size 6144, batch size 224
[2024-11-19 13:54:23.312] [info] > Simplex reads basecalled: 621
[2024-11-19 13:54:23.312] [info] > Simplex reads filtered: 9379
[2024-11-19 13:54:23.312] [info] > Basecalled @ Samples/s: 3.193841e+06
[2024-11-19 13:54:23.314] [info] > Finished

Kirk3gaard avatar Nov 19 '24 13:11 Kirk3gaard

@Kirk3gaard could you share these reads with us please? We would like to reproduce this issue internally asap.

vellamike avatar Nov 19 '24 13:11 vellamike

@Kirk3gaard is this on RTX 4090? @SebastianDall what GPU did you see this on?

iiSeymour avatar Nov 19 '24 13:11 iiSeymour

10k read test data from @SebastianDall : https://www.dropbox.com/scl/fi/9f51sxtd5mmxxxpjcg8gm/10k.pod5?rlkey=b6sbf4gzipg3b830yusxra4mf&dl=0

Yes, we have seen the issue on RTX 4090 and A10.

Kirk3gaard avatar Nov 19 '24 13:11 Kirk3gaard

I am also experiencing this issue on a test set of 24000 recovered reads, with dorado v0.7.2 running on a Tesla V100S PCIe.

| Model | Reads basecalled | Reads filtered |
|-------|------------------|----------------|
| [email protected] | 23976 | 24 |
| [email protected] | 902 | 23098 |
| [email protected] | 0 | 24000 |
>dorado basecaller -r -x "cuda:all" --no-trim --verbose dorado_models/[email protected]/ pod5/ > calls.bam
[2024-11-19 07:54:22.275] [info] Running: "basecaller" "-r" "-x" "cuda:all" "--no-trim" "--verbose" "dorado_models/[email protected]/" "pod5/"
[2024-11-19 07:54:22.319] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 07:54:22.319] [info] Normalised: overlap 500 -> 498
[2024-11-19 07:54:22.319] [info] > Creating basecall pipeline
[2024-11-19 07:56:56.619] [info] > Simplex reads basecalled: 23976
[2024-11-19 07:56:56.619] [info] > Simplex reads filtered: 24
[2024-11-19 07:56:56.619] [info] > Basecalled @ Samples/s: 6.513212e+06
[2024-11-19 07:56:56.635] [info] > Finished
>dorado basecaller -r -x "cuda:all" --no-trim --verbose dorado_models/[email protected]/ pod5/ > calls.bam
[2024-11-19 07:38:53.450] [info] Running: "basecaller" "-r" "-x" "cuda:all" "--no-trim" "--verbose" "dorado_models/[email protected]/" "pod5/"
[2024-11-19 07:38:53.501] [info] Normalised: chunksize 10000 -> 9996
[2024-11-19 07:38:53.501] [info] Normalised: overlap 500 -> 498
[2024-11-19 07:38:53.501] [info] > Creating basecall pipeline
[2024-11-19 07:41:33.760] [info] > Simplex reads basecalled: 902
[2024-11-19 07:41:33.760] [info] > Simplex reads filtered: 23098
[2024-11-19 07:41:33.760] [info] > Basecalled @ Samples/s: 5.525874e+06
[2024-11-19 07:41:33.788] [info] > Finished
>dorado basecaller -r -x "cuda:all" --no-trim --verbose dorado_models/[email protected]/ pod5/ > calls.bam
[2024-11-19 07:58:17.100] [info] Running: "basecaller" "-r" "-x" "cuda:all" "--no-trim" "--verbose" "dorado_models/[email protected]/" "pod5/"
[2024-11-19 07:58:17.143] [info] > Creating basecall pipeline
[2024-11-19 08:00:44.156] [info] > Simplex reads filtered: 24000
[2024-11-19 08:00:44.156] [info] > Basecalled @ Samples/s: 4.063903e+06
[2024-11-19 08:00:44.159] [info] > Finished

lerminin avatar Nov 19 '24 14:11 lerminin

@Kirk3gaard @lerminin @SouvilleL,

Thank you for your patience with this matter.

The script below should repair pod5 files whose calibration values were set incorrectly during recovery, which we believe is the cause of this issue.

This script can be run from a Python environment with the pod5 API installed:

python fix_calibration_scale.py --source /path/to/input --output /path/to/output 

If the calibration issue is not detected in a file, that file is skipped. Otherwise, records are written to a new file in the output directory with the same name as the source file plus an optional suffix.
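
As a quick way to see what the script keys on (a minimal sketch only, assuming the pod5 Python API is installed and using a placeholder file path), you can print the calibration values of the first read in a file; the script skips files whose first-read calibration scale is already below 1:

# Print the calibration of the first read in a pod5 file.
# "path/to/example.pod5" is a placeholder - use one of your own files.
import pod5 as p5

with p5.Reader("path/to/example.pod5") as reader:
    first = next(iter(reader))
    print("read_id:", first.read_id)
    print("scale:  ", first.calibration.scale)
    print("offset: ", first.calibration.offset)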

Please give the script a try on one of your pod5 files to see if it resolves your issue.

Best regards, Rich


from argparse import ArgumentParser
from pathlib import Path
import pod5 as p5
from pod5.pod5_types import Calibration, Read
from pod5.tools.utils import collect_inputs
from tqdm import tqdm


def fix_calibration_scale(record: p5.ReadRecord) -> Read:
    """Fix the calibration scale for a read record by dividing by the digitisation"""
    if record.calibration.scale < 1:
        calibration = record.calibration
    else:
        calibration = Calibration(
            offset=record.calibration.offset,
            scale=record.calibration.scale / record.calibration_digitisation,
        )

    return Read(
        read_id=record.read_id,
        pore=record.pore,
        calibration=calibration,
        median_before=record.median_before,
        end_reason=record.end_reason,
        read_number=record.read_number,
        run_info=record.run_info,
        start_sample=record.start_sample,
        num_minknow_events=record.num_minknow_events,
        tracked_scaling=record.tracked_scaling,
        predicted_scaling=record.predicted_scaling,
        num_reads_since_mux_change=record.num_reads_since_mux_change,
        time_since_mux_change=record.time_since_mux_change,
        signal=record.signal,
    )


def argparser() -> ArgumentParser:
    parser = ArgumentParser(
        "fix_calibration_scale",
        description="Repairs the calibration scale value corrupted during file recovery.",
    )
    parser.add_argument(
        "-s",
        "--source",
        type=Path,
        default=Path.cwd(),
        help="Source directory of pod5 files. Does not search recursively.",
    )
    parser.add_argument(
        "-o",
        "--output",
        type=Path,
        default=Path.cwd(),
        help="Output directory of pod5 files. Never overwrites files.",
    )
    parser.add_argument(
        "-x",
        "--suffix",
        type=str,
        default="",
        help="Suffix written to output pod5 file name as <source><.suffix>.pod5. Default: ``",
    )
    parser.add_argument(
        "-d",
        "--deep",
        action="store_true",
        help="Stops skipping files which appear normal.",
    )
    return parser


def main(source: Path, output: Path, suffix: str, deep: bool):
    """
    Fix calibration scale value in recovered pod5 files in source dir and write to
    output dir with optional filename suffix.
    """

    def out_path(src: Path, output: Path):
        if not suffix:
            return output / src.name
        return output / src.with_suffix(f".{suffix}.pod5").name

    print(f"Searching {source.resolve()} for pod5 files.")

    source_pod5s = collect_inputs([source], False, "*.pod5", threads=1)
    if not source_pod5s:
        raise ValueError(f"Found no pod5 files in: {source.resolve()}")
    else:
        print(f"Found {len(source_pod5s)} pod5 files to recover.")

    # Check output exists or make it
    if not output.exists():
        print(f"Creating output directory: {output.resolve()}")
        output.mkdir(parents=True)

    if not output.is_dir():
        raise NotADirectoryError(f"Output is not a directory: {output.resolve()}")

    # Check no overwrite
    existing = set(
        [out_path(s, output).name for s in source_pod5s if out_path(s, output).exists()]
    )
    if existing:
        raise FileExistsError(
            f"Found {len(existing)} output files which will not be overwritten."
        )

    # Iterate over all input files
    for s in tqdm(
        source_pod5s,
        desc="Processing Files",
        total=len(source_pod5s),
        dynamic_ncols=True,
        ascii=True,
        unit="file",
        position=0,
    ):
        with p5.Reader(s) as reader:
            # Skip empty files
            if reader.num_reads < 1:
                tqdm.write(f"zero reads in file: {s.resolve()}")
                continue

            if not deep:
                # Skip files that look normal
                first_record = next(reader.__iter__())
                if first_record.calibration.scale < 1:
                    tqdm.write(f"Calibration scale is normal in: {s.resolve()}")
                    continue

            with p5.Writer(out_path(s, output)) as writer:
                for record in tqdm(
                    reader,
                    "Fixing Reads",
                    total=reader.num_reads,
                    dynamic_ncols=True,
                    ascii=True,
                    unit="read",
                    smoothing=0,
                    mininterval=0.1,
                    position=1,
                    leave=False,
                ):

                    writer.add_read(fix_calibration_scale(record))

    print("Done")


if __name__ == "__main__":
    parser = argparser()
    args = parser.parse_args()
    main(**vars(args))

HalfPhoton avatar Nov 29 '24 14:11 HalfPhoton

This improved the results for me:

| Model | Reads basecalled | Reads filtered |
|-------|------------------|----------------|
| [email protected] | 23976 | 24 |
| [email protected] | 23997 | 4 |
| [email protected] | 24000 | 0 |

lerminin avatar Nov 29 '24 18:11 lerminin

Awesome work! So far for the 10k reads this worked like a charm! Processing the full sample now!

Thank you so much!

SebastianDall avatar Nov 30 '24 07:11 SebastianDall

Which versions of pod5 are susceptible to this recovery bug?

sashajenner avatar Dec 01 '24 23:12 sashajenner

@sashajenner, this isn't an issue with POD5. The upstream issue should also be resolved.

HalfPhoton avatar Dec 02 '24 10:12 HalfPhoton

It doesn't seem to have worked for my data.

Data + test sample (pre recalibration): https://filesender.renater.fr/?s=download&token=f6bb3b36-cf9f-46a7-b09b-972b1c95f11e

After recalibration:

| Model | Reads basecalled | Reads filtered |
|-------|------------------|----------------|
| Fast v5.0.0 | 6 764 825 | 1 062 040 |
| Fast v4.3.0 | 7 821 459 | 5408 |
| Hac v4.3.0  | 480 715   | 7 346 153 |
| Hac v5.0.0  | 481 081   | 7 345 787 |
| Sup v4.3.0  | 716 187   | 7 110 684 |
| Sup v5.0.0  | 480 338   | 7 346 531 |

SouvilleL avatar Dec 06 '24 16:12 SouvilleL

Hi @SouvilleL,

Your files appear to have an additional issue, in that the values of adc_min and adc_max are both zero:

pod5 inspect read recovered_94079d95-f423-4080-9336-46b4f2c825d3_101.pod5 67f3ee4c-bf52-4fe3-bcac-fca326fc321c | grep adc_
        adc_max: 0
        adc_min: 0

This leads to the record.calibration_digitisation value being 1, so your reads are effectively left unchanged. If you replace this variable with 8192 (based on a quick perusal of some other MinION data) in the script above, this should fix your data.
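
For example (a sketch only; the 8192 value is an assumption about the MinION ADC range taken from other runs, so please verify it against your own flow cell metadata), the script could fall back to a fixed digitisation when the computed value is 1:

# Fallback for files where adc_min == adc_max == 0, which makes
# record.calibration_digitisation equal to 1.
ASSUMED_DIGITISATION = 8192  # assumed MinION ADC range - verify for your data

def effective_digitisation(record) -> float:
    digitisation = record.calibration_digitisation
    return digitisation if digitisation > 1 else ASSUMED_DIGITISATION

# In fix_calibration_scale(), divide by effective_digitisation(record)
# instead of record.calibration_digitisation.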

malton-ont avatar Dec 18 '24 16:12 malton-ont