
Segmentation fault on MashMap step of generateDecoyTranscriptome.sh

Open jaclyn-taroni opened this issue 6 years ago • 14 comments

Hi all,

I get Segmentation fault (core dumped) on step 3 of generateDecoyTranscriptome.sh.

I've filed https://github.com/marbl/MashMap/issues/21 upstream with more detailed information. I wanted to file an issue here in case you have any insight or I am using the script improperly.

Here's how I'm using this:

bash scripts/generateDecoyTranscriptome.sh \
    -j 8 \
    -g Homo_sapiens.GRCh38.dna.toplevel.fa \
    -t Homo_sapiens.GRCh38.cdna.all.fa \
    -a Homo_sapiens.GRCh38.96.gtf \
    -o "${human_output}"

I realize you have gentrome.fa and decoys.txt for human here: https://github.com/COMBINE-lab/salmon#pre-computed-decoy-transcriptomes

I'm interested in generating this for zebrafish and happened to run into this problem with human first, before I found that section of the Salmon README.

Thank you!

jaclyn-taroni avatar Jun 05 '19 14:06 jaclyn-taroni

Hi @jaclyn-taroni,

Thanks for raising this issue; one other user is facing a similar issue with the human genome. While the MashMap folks and we look for the cause and a solution, if you can send me links to the zebrafish genome and GTF, I can run the script on our system and send you the decoy sequences.

k3yavi avatar Jun 06 '19 19:06 k3yavi

Hi @k3yavi,

Thanks for the quick reply and the offer. I was planning on using the most recent Ensembl release for zebrafish. Here are the relevant links:

ftp://ftp.ensembl.org/pub/release-96/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.toplevel.fa.gz
ftp://ftp.ensembl.org/pub/release-96/gtf/danio_rerio/Danio_rerio.GRCz11.96.gtf.gz

Thanks again!

jaclyn-taroni avatar Jun 07 '19 11:06 jaclyn-taroni

Hi @jaclyn-taroni,

@k3yavi has built the decoy transcriptome for zebrafish; you can grab it from the link in the salmon README.

--Rob

rob-p avatar Jun 07 '19 19:06 rob-p

Thank you very much @k3yavi and @rob-p!

jaclyn-taroni avatar Jun 09 '19 15:06 jaclyn-taroni

hi @k3yavi

I'm getting the same error with data from a tick species -- any chance you'd be willing to run this for me, too?

The genome is (we use the first one, Ixodes-Scapularis-IES6_...): https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=340&field_download_file_type_tid%5B%5D=457&field_download_file_format_tid=All&field_status_value=Current

The .gtf (ISE6, same as above): https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=340&field_download_file_type_tid%5B%5D=412&field_download_file_format_tid=473&field_status_value=Current

There is also a transcriptome that is as yet unpublished and unposted -- I'd have to send it.

cmatKhan avatar Jun 26 '19 01:06 cmatKhan

Hi @cmatKhan, Ixodes_scapularis.tar.gz should do it.

k3yavi avatar Jun 26 '19 02:06 k3yavi

Very much appreciated.

I realized after I hit send that there is a transcriptome on vectorbase -- I assume that's what you used?

choulabucsf avatar Jun 26 '19 03:06 choulabucsf

Actually, I just used the GTF and the genome to extract the transcriptome.
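The thread does not say which tool was used for this extraction. As one possible sketch, gffread can write spliced transcript sequences from a genome FASTA and a GTF. The function name `extract_transcriptome` and the file names in the usage comment are hypothetical, and gffread is assumed to be installed and on the PATH.

```shell
#!/usr/bin/env bash
# Sketch only: extract transcript sequences from a genome FASTA plus a GTF.
# Assumes gffread is installed; the thread does not confirm this was the tool used.

extract_transcriptome() {
    local genome="$1" gtf="$2" out="$3" f
    # Fail early with a clear message instead of a cryptic downstream error.
    for f in "$genome" "$gtf"; do
        [ -r "$f" ] || { echo "cannot read input file: $f" >&2; return 1; }
    done
    # -g: genome FASTA, -w: write spliced exon sequences for each transcript
    gffread -w "$out" -g "$genome" "$gtf"
}

# Example call (hypothetical file names, inputs must be uncompressed):
# extract_transcriptome genome.fa annotation.gtf transcripts.fa
```

The resulting FASTA could then serve as the `-t` input to generateDecoyTranscriptome.sh in place of a downloaded cDNA file.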

k3yavi avatar Jun 26 '19 10:06 k3yavi

Hi Guys,

Just a heads-up: we have curated decoy sequences for a subset of model organisms, and they can be found here.

k3yavi avatar Jul 17 '19 01:07 k3yavi

I'm having this issue as well. I've tried it on a couple of machines, although the most RAM so far is 24GB (20 free).

Any chance you could generate decoys for RefSeq human and mouse? RefSeq provides GFF annotation files, which I was feeding directly into step 2 (instead of the exons.bed). Step 2 completes fine, but step 3 fails pretty early with a segmentation fault.

Alternatively, can you give an estimate of how much RAM this script uses on a machine where it completes successfully? Also, how long does it typically take? I've not used MashMap before. I did a trial run with a smaller genome and 10 threads, and while it didn't hit a segmentation fault, I gave up after ~6 hours in step 3: I didn't really need the decoys, but I was surprised at how long it was taking.

Thanks!

doubtfulresearch avatar Jul 24 '19 12:07 doubtfulresearch

Hi, please fill out the decoy generation request form at https://forms.gle/3baJc5SYrkSWb1z48 and we will let you know once we have the decoys.

On our machine it took ~100G of RAM and approximately an hour to run on the human GENCODE data.

Thanks!
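Given that figure, a quick check of available memory before starting step 3 may save a long wait. The following is a minimal sketch for Linux that reads `MemAvailable` from `/proc/meminfo`. The function names are hypothetical, the 100 GiB threshold is simply the human GENCODE number quoted above, and the thread does not establish that the segmentation fault is caused by running out of memory.

```shell
#!/usr/bin/env bash
# Sketch only: compare available memory against a threshold before a heavy run.
# Linux-specific; reads a meminfo-style file (defaults to /proc/meminfo).

# Print MemAvailable in whole GiB (the value in meminfo is reported in kB).
available_gib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed if at least $1 GiB are available.
has_enough_memory() {
    local need_gib="$1" meminfo="${2:-/proc/meminfo}" have_gib
    have_gib="$(available_gib "$meminfo")"
    [ -n "$have_gib" ] && [ "$have_gib" -ge "$need_gib" ]
}

# Example: warn before running the MashMap step on a small machine.
# has_enough_memory 100 || echo "warning: less than 100 GiB available" >&2
```

On a machine like the one described earlier (24GB total, 20 free), this check would fail for the human data.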

k3yavi avatar Jul 25 '19 07:07 k3yavi

Hi guys,

Just wanted to let you know that we recently released a new version of salmon in which you don't have to run the MashMap pipeline explicitly. With v1.0, salmon can consume both the genome and the transcriptome without the need for annotations. Please check out the new preprint or follow this tutorial for reindexing.
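For readers without the linked tutorial at hand, the whole-genome-decoy workflow can be sketched roughly as follows, assuming salmon 1.0 or later and uncompressed FASTA inputs. The function names and file names are hypothetical; `-t`, `-d`, `-i` and `-p` are the `salmon index` options for the reference, the decoy name list, the output directory and the thread count.

```shell
#!/usr/bin/env bash
# Sketch only: build a decoy-aware salmon index using the whole genome as decoy.
# Assumes salmon >= 1.0 is on the PATH and inputs are uncompressed FASTA.

# Print sequence names from a FASTA file: header text up to the first space, without '>'.
decoy_names() {
    grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//'
}

# Write decoys.txt and gentrome.fa in the current directory, then index them.
build_decoy_index() {
    local transcripts="$1" genome="$2" index_dir="$3" threads="${4:-8}"
    decoy_names "$genome" > decoys.txt
    # Transcripts must come first; the decoy (genome) sequences follow them.
    cat "$transcripts" "$genome" > gentrome.fa
    salmon index -t gentrome.fa -d decoys.txt -i "$index_dir" -p "$threads"
}

# Example call (hypothetical file names):
# build_decoy_index transcripts.fa genome.fa salmon_index 8
```

No GTF is involved at any point, which is what removes the dependency on the MashMap step.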

k3yavi avatar Nov 03 '19 22:11 k3yavi

Thank you so much! I asked in the chat, but just in case: do you have an estimate of memory usage during indexing and quantification, assuming a human-genome-like reference? Thanks!

lpantano avatar Nov 04 '19 20:11 lpantano

Hi @lpantano,

Indexing with the entire human genome as decoy and the whole transcriptome (GENCODE v29) as the actual target sequence takes ~20G of RAM in our runs. The final (dense) index size is ~19G, so construction RAM is only a little bit more. Interestingly, while the final index when using the whole genome as decoy is considerably bigger than when using the MashMap decoy sequences, the indexing memory is quite a bit smaller.

rob-p avatar Nov 04 '19 21:11 rob-p