Segmentation fault (core dumped) error

hnnd opened this issue 2 years ago · 22 comments

The genome is about 3 Gb, and the run ends with Segmentation fault (core dumped). Command:

  chromap --preset hic -x reference/genome.index --SAM -1 fastq/HIC508_raw_R1.fastq.gz -2 fastq/HIC508_raw_R2.fastq.gz -r reference/genome.fa -t 20 -o hic.sam

Preset parameters for Hi-C are used.
Start to map reads.
Parameters: error threshold: 4, min-num-seeds: 2, max-seed-frequency: 500,1000, max-num-best-mappings: 1, max-insert-size: 1000, MAPQ-threshold: 1, min-read-length: 30, bc-error-threshold: 1, bc-probability-threshold: 0.90
Number of threads: 20
Analyze bulk data.
Won't try to remove adapters on 3'.
Won't remove PCR duplicates after mapping.
Will remove PCR duplicates at bulk level.
Won't allocate multi-mappings after mapping.
Only output unique mappings after mapping.
Only output mappings of which barcodes are in whitelist.
Allow split alignment.
Output mappings in SAM format.
Reference file: reference/genome.fa
Index file: reference/genome.index
1th read 1 file: fastq/HIC508_raw_R1.fastq.gz
1th read 2 file: fastq/HIC508_raw_R2.fastq.gz
Output file: hic.sam
Loaded all sequences successfully in 4.68s, number of sequences: 291, number of bases: 3047273682.
Kmer size: 17, window size: 7.
Lookup table size: 285958605, occurrence table size: 566349725.
Loaded index successfully in 8.29s.
Mapped 500000 read pairs in 11.97s.
Mapped 500000 read pairs in 7.75s.
Mapped 500000 read pairs in 8.23s.
Mapped 500000 read pairs in 7.61s.
Mapped 500000 read pairs in 7.69s.
Mapped 500000 read pairs in 6.89s.
Mapped 500000 read pairs in 6.80s.
Mapped 500000 read pairs in 38.93s.
Mapped 500000 read pairs in 7.27s.
Mapped 500000 read pairs in 6.59s.
Mapped 500000 read pairs in 6.84s.
Mapped 500000 read pairs in 6.91s.
Mapped 500000 read pairs in 6.99s.
Mapped 500000 read pairs in 6.80s.
Mapped 500000 read pairs in 61.25s.
Mapped 500000 read pairs in 7.04s.
Mapped 500000 read pairs in 6.70s.
Mapped 500000 read pairs in 6.60s.
Mapped 500000 read pairs in 6.54s.
Mapped 500000 read pairs in 6.55s.
Mapped 500000 read pairs in 6.45s.
Mapped 500000 read pairs in 53.85s.
Mapped 500000 read pairs in 6.85s.
Mapped 500000 read pairs in 6.44s.
Mapped 500000 read pairs in 6.32s.
Mapped 500000 read pairs in 6.24s.
Mapped 500000 read pairs in 5.89s.
Mapped 500000 read pairs in 6.31s.
Mapped 500000 read pairs in 45.04s.
Mapped 500000 read pairs in 7.13s.
Mapped 500000 read pairs in 6.83s.
Mapped 500000 read pairs in 6.72s.
Mapped 500000 read pairs in 6.88s.
Mapped 500000 read pairs in 6.82s.
Mapped 500000 read pairs in 6.98s.
Mapped 500000 read pairs in 36.04s.
Mapped 500000 read pairs in 7.41s.
Mapped 500000 read pairs in 6.97s.
Mapped 500000 read pairs in 6.87s.
Mapped 500000 read pairs in 6.77s.
Mapped 500000 read pairs in 6.76s.
Mapped 500000 read pairs in 6.64s.
Mapped 500000 read pairs in 38.23s.
Segmentation fault (core dumped)

hnnd · Nov 25 '21 13:11

Thanks for trying Chromap. I need a bit more information to reproduce the error.

How many reads do you have, and what is the read length? Can you also tell us how much memory you have and whether you observed any out-of-memory issues? Moreover, it would be good to let us know your OS and how you installed Chromap (compiled yourself or installed with conda).
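
For reference, these details can be gathered with standard commands; a minimal sketch (none of these are Chromap-specific):

  free -h                 # total and available memory
  lsb_release -a          # OS and distribution
  dmesg | tail -n 20      # recent kernel messages; the OOM killer logs here if it fired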

If this dataset is public, the easiest way for us to look into this might be to run Chromap on it ourselves. If so, let us know.

haowenz · Nov 25 '21 14:11

Thanks for the quick reply! The total read count is over 100M, but the segmentation fault appears after about 1M reads have been mapped. The read length is 150 bp, and the machine has 1 TB of memory. I tried both compiling locally and installing with conda; the fault is the same.

hnnd · Nov 26 '21 04:11

Can you try generating output in pairs format (by not specifying --SAM) and see if the problem is still there? And would it be possible for us to get a sample of your read dataset so we can reproduce the error and fix it?
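
For reference, that would be the original command with --SAM dropped and a pairs output name; a sketch based on the command quoted above:

  chromap --preset hic -x reference/genome.index -r reference/genome.fa \
      -1 fastq/HIC508_raw_R1.fastq.gz -2 fastq/HIC508_raw_R2.fastq.gz \
      -t 20 -o hic.pairs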

haowenz · Nov 26 '21 04:11

Same error with the pairs format. The dataset is too big to share. I've tried a small subset with 20,000,000 reads without error, but when I increase the read count beyond 20 million, the error appears.

hnnd · Nov 27 '21 06:11

I just tried the latest Chromap on a dataset with 913,515,598 paired-end reads, and I was able to generate the output in pairs format. It failed when generating SAM output because memory usage exceeded the amount I had requested for the job on the server.

Based on the information you provided, I guess either some unusual reads triggered a corner case we forgot to handle in Chromap, or you had to request a certain amount of memory and Chromap used more than that.
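
One way to check the memory hypothesis is to rerun the job under GNU time, which prints peak memory use afterwards; a minimal sketch, assuming /usr/bin/time is the GNU version rather than the shell built-in:

  /usr/bin/time -v chromap --preset hic -x reference/genome.index \
      -r reference/genome.fa -1 fastq/HIC508_raw_R1.fastq.gz \
      -2 fastq/HIC508_raw_R2.fastq.gz -t 20 --SAM -o hic.sam

Look for "Maximum resident set size" in the report and compare it with the memory limit of the job or machine.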

Did you try mapping the same reads with other tools before?

haowenz · Nov 27 '21 15:11

@hnnd Could you try chromap-0.1.3-asan_x64-linux.tar.bz2 from the download page? The precompiled executable is a debug build that may report extra information on segfault. That executable runs slower and uses more RAM, though.
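
For reference, the debug build is used like the regular binary; a minimal sketch, where the archive layout and the path of the extracted executable are assumptions:

  tar -xjf chromap-0.1.3-asan_x64-linux.tar.bz2
  ./chromap --preset hic -x reference/genome.index -r reference/genome.fa \
      -1 fastq/HIC508_raw_R1.fastq.gz -2 fastq/HIC508_raw_R2.fastq.gz \
      -t 20 --SAM -o hic.sam 2> asan.log    # AddressSanitizer writes its report to stderr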

lh3 · Nov 28 '21 18:11

Hi, I get the same "Segmentation fault (core dumped)" message when processing the 400-million-read demo dataset from the Dovetail Omni-C webpage (https://omni-c.readthedocs.io/en/latest/data_sets.html). Analysis of this dataset completed successfully with bwa-mem2, and we are looking to see whether chromap can replace bwa-mem2 at a lower time cost.

Command run on an AWS machine with 16 CPUs + 128 GB RAM:

  chromap --preset hic -x reference/GRCh38.index -1 OmniC_400M_R1.fastq -2 OmniC_400M_R2.fastq -r reference/GRCh38.fa -t 16 -o test.sam

Chromap was installed via conda and should be v0.1.15.

Please advise if there is any issue with the setup. Of note, the inputs are the raw FASTQ files provided by Omni-C rather than the usual gzipped files; is that a problem for Chromap?

Solyris83 · Jan 21 '22 05:01

Did you use --SAM to generate SAM output and forget to include it in the command line you showed, or were you actually generating pairs output? It seems this problem only happens on specific machines with certain configurations. Since this dataset is publicly available, I will run Chromap on it and see if I can reproduce the error.

haowenz · Jan 21 '22 05:01

@haowenz I have tried both with --SAM and without, and it gave the same segmentation fault on this data.

On a side note, I assume the default pairs output is the input for Juicer?

Solyris83 · Jan 21 '22 06:01

Did you get the segfault after mapping some of the reads, or all of them? Is the Chromap log available?

I am not familiar with Juicer, but we use pairtools (https://github.com/open2c/pairtools) to process our pairs files when necessary.

haowenz · Jan 21 '22 15:01

The segfault happens after mapping some reads. I do not see any logs; do you mean the stderr output? If so, it is attached here as errorr.txt. The segmentation fault message is not printed to stderr but to my terminal (in a screen session). My command is below for clarity:

  time chromap -t 16 -r genome/GRCh38.primary_assembly.genome.fa.gz -x genome/GRCh38.chromap.index -1 fastq/OmniC_400M_R1.fastq -2 fastq/OmniC_400M_R2.fastq -o chromap/OmniC_400M --preset hic > log.txt 2> errorr.txt

Segmentation fault (core dumped)
real    617m3.928s
user    264m59.301s
sys     1m11.830s
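
Incidentally, the "Segmentation fault (core dumped)" line is printed by the shell that launched chromap, not by chromap itself, which is why 2>errorr.txt does not capture it. Wrapping the command in a subshell should capture the shell's notice as well; a sketch:

  ( time chromap -t 16 -r genome/GRCh38.primary_assembly.genome.fa.gz \
        -x genome/GRCh38.chromap.index \
        -1 fastq/OmniC_400M_R1.fastq -2 fastq/OmniC_400M_R2.fastq \
        -o chromap/OmniC_400M --preset hic ) > log.txt 2> errorr.txt
  # errorr.txt now also receives the timing output and the shell's signal notice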

Solyris83 · Jan 25 '22 01:01

I was able to map the whole dataset you used on a machine I have access to without any issue. However, I do plan to reproduce the error and fix it. To do that, can you give me more details about the AWS machine you were using? What is its configuration (e.g. OS)? I will see if I can get the same AWS instance type, map this dataset with Chromap on it, and debug.

Another temporary workaround would be to split the data into several smaller datasets. After mapping, you can merge the results using pairtools.
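
A sketch of that workaround, assuming seqkit is available for the paired-end split (any splitter that keeps R1 and R2 in sync works) and that the chunks follow seqkit's part_NNN naming; the file names are illustrative:

  # split both mates into 8 synchronized chunks
  seqkit split2 -1 OmniC_400M_R1.fastq -2 OmniC_400M_R2.fastq -p 8 -O chunks/

  # map each chunk to pairs format, then sort it (pairtools merge expects sorted input)
  for i in 001 002 003 004 005 006 007 008; do
      chromap --preset hic -x reference/GRCh38.index -r reference/GRCh38.fa \
          -1 chunks/OmniC_400M_R1.part_${i}.fastq \
          -2 chunks/OmniC_400M_R2.part_${i}.fastq \
          -t 16 -o part_${i}.pairs
      pairtools sort -o part_${i}.sorted.pairs.gz part_${i}.pairs
  done

  # merge the per-chunk results into one pairs file
  pairtools merge -o OmniC_400M.pairs.gz part_*.sorted.pairs.gz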

haowenz · Jan 25 '22 02:01

That is great to hear. My configuration is below:

  1. AWS C5A.4XLARGE
  2. Ubuntu Server 18.04 LTS

Solyris83 · Jan 25 '22 02:01

Hi @haowenz, you mentioned that you managed to map the Omni-C 400M-read data successfully on your machine; might I ask what your machine's OS and specs are? Any GPU or other specifics that help with mapping speed would also be much appreciated.

Solyris83 · Jan 26 '22 06:01

I used 24 threads on dual Intel(R) Xeon(R) Gold 6226 CPUs @ 2.70GHz and requested 64 GB of memory. The OS is as follows:

lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.6 (Maipo)
Release:        7.6

haowenz · Jan 26 '22 15:01

Hi developers,

I also encountered a similar "Segmentation fault". My .fastq.gz is ~4 GB in size, and I am running Ubuntu 16.04 with 60 GB of memory.

The error pops up after some of the mappings have completed:

Loaded sequence batch successfully in 0.78s, number of sequences: 500000, number of bases: 8000000.
Loaded sequence batch successfully in 0.78s, number of sequences: 500000, number of bases: 8000000.
Loaded sequence batch successfully in 0.78s, number of sequences: 500000, number of bases: 8000000.
Compute barcode abundance using 20121261 in 60.65s.
Mapped 500000 read pairs in 14.08s.
Mapped 500000 read pairs in 12.72s.
Segmentation fault (core dumped)

Please let me know if you have any solutions or if you would like me to share my data and chromap commands used.

Thanks!

hukai916 · Jul 15 '22 20:07

Can you share the command line you used? If possible, can you also share the data? Which version did you use? The reasons for a segfault usually differ across use cases. It would be better to open a new issue for your problem, which would be easier for us to track.

haowenz · Jul 15 '22 21:07

> Hi @haowenz, you mentioned that you managed to map the Omni-C 400M-read data successfully on your machine; might I ask what your machine's OS and specs are? Any GPU or other specifics that help with mapping speed would also be much appreciated.

I forgot to reply to this issue. I was trying to get an AWS machine, but I did not have enough credits for one with enough memory. However, we have fixed some related bugs since v0.1.5, and I think the problem has very likely been resolved already. If you have time, you may try the latest version of Chromap.

haowenz · Jul 15 '22 21:07

Thanks for the prompt reply!

I am using chromap v0.1.4-r284. The command line is as below:

  chromap --preset atac -t 4 -x chromap_index_arabidopsis_thaliana -r Genome.primary.chrPrefixed.fa.gz -1 rep2_merge_read1.fastq.gz -2 rep2_merge_read2.fastq.gz -o chromap_fragment_rep2.bed -b rep2_merge_barcode.fastq.gz --barcode-whitelist whitelist_rep2.txt

All relevant files are being uploaded to Dropbox: https://www.dropbox.com/sh/qbfz00hfykg6mlq/AABVDUY74OLGD3HraBFxnemVa?dl=0

It may take a few hours to sync. Please let me know if you can't see the data by tomorrow.

BTW, I am also testing with the latest chromap release, will report back.

Thanks!

hukai916 · Jul 15 '22 21:07

Thank you for sharing the data. As I mentioned, we have fixed several bugs since that old version, and it is very likely that the bug you hit has already been resolved in the latest version of Chromap. Can you try the latest version? I would like to help if you still have this problem with the latest Chromap.

haowenz · Jul 15 '22 21:07

Will do.

hukai916 · Jul 15 '22 21:07

Testing with v0.2.3; so far so good. I will report back if anything goes awry.

hukai916 · Jul 15 '22 22:07