The index version file doesn't seem to exist

Open hariiyer16 opened this issue 4 years ago • 7 comments

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

This is relevant to salmon.

Describe the bug

I have been trying to run the salmon quant command and keep running into the following error:

Exception : [Error: The index version file salmon_index_gencode_vM25/versionInfo.json doesn't seem to exist. Please try re-building the salmon index.]
salmon quant was invoked improperly.
bad

After going through some of the previous reports, I figured that there was a bad memory allocation while creating the index. I increased the memory allocation, which seems to have solved the indexing problem (please see the attached output from the indexing run). The memory allocation was as below:

qrsh -l mem_free=30G,h_vmem=35G,h_stack=256M

I had tried 24G, but that did not work, so I increased it to 30G, and that seems to have worked for indexing.

Although I can see the versionInfo.json file in the index folder, I still get the above error.
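One quick check before rebuilding is whether the index path resolves from the directory where salmon quant is actually launched, and whether versionInfo.json is readable and non-empty there. The following is a minimal sketch only; `check_index_dir` is a hypothetical helper, not part of salmon, and it rules out path or permission mistakes rather than identifying the cause of this report.

```shell
# Hypothetical helper (not part of salmon): report whether an index
# directory resolves from the current working directory and whether
# versionInfo.json inside it is a readable, non-empty file.
check_index_dir() {
    if [ ! -d "$1" ]; then
        echo "not a directory from $(pwd): $1"
        return 1
    fi
    if [ ! -r "$1/versionInfo.json" ]; then
        echo "missing or unreadable: $1/versionInfo.json"
        return 1
    fi
    if [ ! -s "$1/versionInfo.json" ]; then
        echo "empty file: $1/versionInfo.json"
        return 1
    fi
    echo "ok: $1/versionInfo.json"
}

# Example, run from the same directory (and node) as salmon quant:
# check_index_dir salmon_index_gencode_vM25
```

Running it inside the cluster job itself matters, since a relative path such as salmon_index_gencode_vM25 is resolved against the job's working directory.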

I used the method described in the following link to create the index, except that I used mouse gencode release vM25. https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
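For readers who have not followed the linked tutorial, the decoy-aware workflow it describes can be sketched roughly as below. This is an approximation under stated assumptions, not a copy of the commands that were run here: `make_decoys` is a hypothetical helper name, and decoys.txt, gentrome.fa.gz, and the thread count are placeholders.

```shell
# Hypothetical helper: list the sequence names of a gzipped genome FASTA,
# without the leading '>' and without anything after the first space.
# These names form the decoy list passed to the indexer.
make_decoys() {
    gzip -cd "$1" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//'
}

# Usage (not executed here; output names and -p 12 are assumptions):
# make_decoys GRCm38.primary_assembly.genome.fa.gz > decoys.txt
# cat gencode.vM25.transcripts.fa.gz GRCm38.primary_assembly.genome.fa.gz > gentrome.fa.gz
# salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index_gencode_vM25 --gencode
```

The transcriptome goes first in the concatenated file and the genome sequences follow as decoys, which is why the decoy list is taken from the genome FASTA alone.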

Output of the indexing run is attached (indexing_output.sh).

I highly appreciate any help in troubleshooting this issue. Thank you.

To Reproduce Steps and data to reproduce the behavior: I can send the read files I am trying to align/quantify (*.fq.gz files). I am not able to attach them here because of the size.

Specifically, please provide at least the following information:

  • Which version of salmon was used? salmon 1.3.0

  • How was salmon installed (compiled, downloaded executable, through bioconda)?
    wget https://github.com/COMBINE-lab/salmon/releases/download/v1.3.0/salmon-1.3.0_linux_x86_64.tar.gz
    tar xzvf salmon-1.3.0_linux_x86_64.tar.gz
    Directory was relabeled as salmon.

  • Which reference (e.g. transcriptome) was used? Mouse Gencode vM25
    wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.transcripts.fa.gz
    wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz

  • Which read files were used? Files used for salmon quant are attached (*fq.gz)

  • Which program options were used? None were specifically invoked to run salmon. I am running salmon on a cluster (Linux 3.10.0-957.el7.x86_64 x86_64).

Expected behavior A clear and concise description of what you expected to happen.

I expect the salmon quant to align and quantify the reads.

Screenshots If applicable, add screenshots or terminal output to help explain your problem.

  1. Attaching the screenshot in the zipped folder.
  2. Attaching the screenshot of the contents in the folder containing the indexed file. The versionInfo.json file is present in that folder.

Desktop (please complete the following information):

  • OS: [e.g. Ubuntu Linux, OSX] Linux 3.10.0-957.el7.x86_64 x86_64
  • Version [ If you are on OSX, the output of sw_vers. If you are on linux the output of uname -a and lsb_release -a] Linux compute-106.cm.cluster 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

lsb-release:

LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.6.1810 (Core)
Release: 7.6.1810
Codename: Core

Additional context Add any other context about the problem here. Files.zip

hariiyer16 avatar Aug 15 '20 16:08 hariiyer16

Thanks for the detailed report, @hariiyer16! We'll look into this as soon as we can.

rob-p avatar Aug 15 '20 16:08 rob-p

Thank you. BTW, the files used for salmon quant (*fq.gz) were not attached because of their large size.

hariiyer16 avatar Aug 15 '20 16:08 hariiyer16

They should be unnecessary to diagnose, but if you want to extract the first 100k reads or so, we can try and use them for test quantification.
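Extracting the first 100k reads can be done with standard tools. A minimal sketch, assuming the usual four-lines-per-record FASTQ layout; `first_n_reads` and the file names in the example are hypothetical:

```shell
# Hypothetical helper: write the first N reads of a gzipped FASTQ file to
# stdout, gzipped. Assumes each record spans exactly 4 lines.
# $1 = input .fq.gz, $2 = number of reads to keep.
first_n_reads() {
    gzip -cd "$1" | head -n "$(( $2 * 4 ))" | gzip -c
}

# Example (file names are placeholders):
# first_n_reads sample_R1.fq.gz 100000 > sample_R1.100k.fq.gz
# first_n_reads sample_R2.fq.gz 100000 > sample_R2.100k.fq.gz
```

For paired-end data, the same read count should be taken from both mate files so the pairs stay in sync.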

rob-p avatar Aug 15 '20 16:08 rob-p

Hi @hariiyer16 ,

I could not immediately replicate this. When I build an index with those files, I get one that seems to work in terms of mapping and quantifying reads. From the files you shared, it certainly does seem like the index is being created correctly. I'm including here the sha256sum of the index files I get when I build this index on one of our machines. Perhaps we could see if these match:

$:salmon_index [j1] (develop ?) $ sha256sum *
306e9d98b3460859f579059bf876aa3b6e264c8f38c04cde332b03632edc6dfb  complete_ref_lens.bin
28519aac34b84b4d0570c97340815e719511c204e04a240dd43e365d2872eed3  ctable.bin
1c7501deaa4524f4700152713228cb03949775dce481384eac67bb45458508be  ctg_offsets.bin
dbc575fed0d589b4671c26bd8cbcb4b3d52ef41c299a90de978ab940abb751fc  duplicate_clusters.tsv
987050914456cf247a24136429d8faaa293cf5617bfd57166c64976b2778d95b  info.json
0b7e8cb4ebed78513900831c047f0d66589068921c33bb15c49b3567c84e2edc  mphf.bin
117369928fde1bff4ca278246c331e079cc0860c3b415e34cd4b08f588063abc  pos.bin
297492e67d274b2ff8f026d2fbc8045f96e17793a58dd74c19b5ab1b7156df8a  pre_indexing.log
8e665e5fdee5af6fcedabc69fd04eda6e66055ef811ebde6de6f86a66521198a  rank.bin
793c79f5fd6046dfea07bbc9587d2835088e54c78197d652d1b1f205c6b16983  refAccumLengths.bin
92acf575c90c6954ff75be1ea791f822eee05e486c6e86c52943d8bc1a0849ca  ref_indexing.log
b580b9c6257254a018a9ae22291a64892c1a3715c69272637f5c504fc5545a70  reflengths.bin
89679603ac0b28042275e5ff04b222bad3fd431cab573f0c2b61e7455aec43e7  refseq.bin
94cb79a2f4acd811d2164f2926c96869a8103b9118170d0688f57b46e695cd5c  seq.bin
89d56bb135f32c7b5fa337bc3c45814b80c2886a3cccc31ff0533c6324ca11fd  versionInfo.json

I'm also including a link here to a tarball containing the index I built. Could you see if you can perform quantification with this index? Finally, it might be worth checking that nothing strange / unexpected is going on with how libraries are being resolved in the linker path when you are running salmon. Could you share the output of running ldd salmon? If none of those point at anything obvious, I might also suggest seeing if it runs as expected inside a Docker container. You can grab a dockerfile for salmon here.

rob-p avatar Aug 15 '20 22:08 rob-p

Hi Rob, After posting yesterday's message, I generated the vM23 index, and the alignments/quants worked. I had to use mem_free=34G for building the index. Is that expected? I will try building the vM25 index again and post an update. In the meantime, the sha256sum output for the vM25 index I had generated has some mismatches with the one you created. Below is my sha256sum output for the vM25 index:

306e9d98b3460859f579059bf876aa3b6e264c8f38c04cde332b03632edc6dfb  complete_ref_lens.bin
636b3df7e097d58fa846bd85ce650ce5bf72c66dc5b2d7566fc9e3db087c5c9c  ctable.bin
1c7501deaa4524f4700152713228cb03949775dce481384eac67bb45458508be  ctg_offsets.bin
dbc575fed0d589b4671c26bd8cbcb4b3d52ef41c299a90de978ab940abb751fc  duplicate_clusters.tsv
c3ec09a30adc9d47bc95839157cb2ff66530353106a4fd8e75b167ac5db67820  info.json
430be78bae99a4592fcedc5c800a42313f2b1252e3953f89f347779056c1ee5b  mphf.bin
2fb0b5151f9f2544c06a9f95d03075f7af0494d0fe31745504a5a7da43edc1b1  pos.bin
15d3bb6a16bcd8c1a6814852bd3dcfa439b60ec84c706f868ee7ec2d5a90581d  pre_indexing.log
8e665e5fdee5af6fcedabc69fd04eda6e66055ef811ebde6de6f86a66521198a  rank.bin
793c79f5fd6046dfea07bbc9587d2835088e54c78197d652d1b1f205c6b16983  refAccumLengths.bin
c5ea8eccca3fdc299ad7c9d2f07a4ed14c8c830940e83c315e7eaad6905a40aa  ref_indexing.log
b580b9c6257254a018a9ae22291a64892c1a3715c69272637f5c504fc5545a70  reflengths.bin
89679603ac0b28042275e5ff04b222bad3fd431cab573f0c2b61e7455aec43e7  refseq.bin
46bf28001e00d491b68bf8758b99c1f304523c79bd94a97d7797888856594e84  seq.bin
4c7e56ba28383774e786826099ef412761326fe18ce69f29033ad2886542985d  versionInfo.json

The following files differ:

ctable.bin
info.json
mphf.bin
pos.bin
pre_indexing.log
ref_indexing.log
seq.bin
versionInfo.json
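A comparison like this does not have to be done by eye. The sketch below assumes both checksum listings were saved to files in the `<hash>  <filename>` layout that sha256sum prints, and that file names contain no spaces; `diff_checksums`, mine.sha256, and theirs.sha256 are hypothetical names.

```shell
# Hypothetical helper: print the names of files whose checksums differ
# between two sha256sum listings ("<hash>  <filename>" per line).
# $1 = my listing, $2 = the listing to compare against.
diff_checksums() {
    # First pass (second argument) records hash by file name; second pass
    # prints every file name whose hash does not match the recorded one.
    awk 'NR == FNR { sum[$2] = $1; next }
         ($2 in sum) && sum[$2] != $1 { print $2 }' "$2" "$1"
}

# Example, run inside the index directory:
# sha256sum * > ../mine.sha256
# diff_checksums ../mine.sha256 ../theirs.sha256
```

Log files and versionInfo.json can differ between builds for benign reasons (timestamps, paths, version tags), so mismatches in the .bin files are the ones worth following up.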

I will try creating the vM25 index with increased memory. I wonder if it's not building correctly.

Just FYI, my sha256sum output for the vM23 index is:

9788716f4ce42b049fe7e865108f45392bb8a5847cfcd47369512783dc918239  complete_ref_lens.bin
9c2453a47ce1808f54733f049b8c4cf38634c9116eb55ed725b73219caa101c5  ctable.bin
928ba619dc5388ccab6d5c4f8ce162e07a5b5c79028be4aee4d838f43a3b9d92  ctg_offsets.bin
0814d0e7dd8a4b126709c42728816995aefdf5a5bb6337c2d3c048cb0f56094d  duplicate_clusters.tsv
dcbf8e140627b3c99d4dbcdaa585447a691fddb620f137811b669e73800f9b3b  info.json
5959abf5969a26481c6aa20fecbdddf19fa558e949cfbda5760205f38bb907b9  mphf.bin
28460131b85c74ffb7627761a291614757e72b4e3b82971dcc048a50cc8d9e7f  pos.bin
b5eb5e3fb0d03509d9fc90f6b5461c6aecc44423068f3303553cc07fffc7c1b9  pre_indexing.log
eca518136526233f3dc28d9684926793cb84327242d54c1a8a20c66aa1928fad  rank.bin
a990247ba2b351fd0921de6470bf0c3505472d8f463e6f8b9ec7c221b6b56af8  refAccumLengths.bin
436199afbb35045a70fdc7b9e542ef805b57170f41d6bc6a0ba4d88a8ca267fc  ref_indexing.log
65ce60d16b43f9e739cf68edb194daa63562c6d064a6e6bf441f612baec66983  reflengths.bin
4f3fc9b3785f8cd0e1355e31d61df87226eb7e14e4438c0afc68706937df94a3  refseq.bin
075122d399bd2c5cfd2e9e7405b2f2778c45178e9bf3a4a93f17750c808df7e0  seq.bin
4c7e56ba28383774e786826099ef412761326fe18ce69f29033ad2886542985d  versionInfo.json

Summary: vM23 is working and I will proceed with it. In the meantime I will troubleshoot vM25 as well as try your tarball.

Thank you very much for your quick help.

Hari

hariiyer16 avatar Aug 16 '20 13:08 hariiyer16

Hi Hari,

Some thoughts on your questions:

I had to use mem_free=34G for building index. Is that expected?

Index creation should certainly not require 34G of physical memory. When indexing the genome and transcriptome in dense mode, we typically expect it to require <20G of physical RAM (and <4G for just the transcriptome). However, we have noticed some strange behavior in the past in how compute clusters manage process allocation; sometimes, it seems, one must overcommit. Given the diversity of software that different compute clusters run, as well as the manifold ways sysadmins may configure these things, we've not found a universal explanation yet, but it does seem that the resources actually required (e.g. if you run salmon index under /usr/bin/time and look at the resources) are less than what has to be requested.
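Measuring the actual peak memory can be sketched as follows, assuming GNU time is installed at /usr/bin/time (its `-v` flag prints a "Maximum resident set size (kbytes)" line and `-o` writes the report to a file). `peak_rss_gib`, index_time.log, and the salmon index arguments in the example are assumptions for illustration.

```shell
# Hypothetical helper: read a GNU `time -v` report and print the peak
# resident set size in GiB (the report gives it in kilobytes).
# $1 = file written by `/usr/bin/time -v -o FILE ...`.
peak_rss_gib() {
    awk -F': ' '/Maximum resident set size/ { printf "%.1f\n", $2 / 1048576 }' "$1"
}

# Example (not executed here; index arguments are placeholders):
# /usr/bin/time -v -o index_time.log \
#     salmon index -t gentrome.fa.gz -d decoys.txt -i salmon_index_gencode_vM25 --gencode
# peak_rss_gib index_time.log
```

Comparing this number with the mem_free and h_vmem values requested from the scheduler shows how much overcommit a given cluster needs.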

The differences in the sha256 sums are a bit strange and I don't have a great explanation for them. One difference is that I built with the head of the develop branch (which has version tag 1.3.1). That explains the difference in versionInfo.json, but nothing upstream in the index building should have changed, so I am not sure why the other files would have different sha256 sums. I can try with the pre-compiled binary and see if my results match yours.

In the meantime, please keep me posted. If index building ends up working for you with a different configuration, it would be good to check this off our list of TODOs, and if not, it would be good to get to the bottom of it.

Thanks! Rob

rob-p avatar Aug 16 '20 13:08 rob-p

Hi Rob, The cluster behavior/load might explain the indexing behaviour. Will keep you posted as I redo with vM25.

Thank you again. Hari

hariiyer16 avatar Aug 17 '20 13:08 hariiyer16