
kraken2-build error when creating sequence ID to taxonomy ID map

Open joshsimcock opened this issue 2 years ago • 9 comments

Hi,

I'm near the end of the Struo2 pipeline, trying to create a custom kraken2 database using GTDB r207.

However, I've hit a wall at the kraken2-build command, specifically at one spot in the build_kraken2_db.sh script that the command calls. It seems that this section:

echo "Creating sequence ID to taxonomy ID map (step 1)..."
if [ -d "library/added" ]; then
  find library/added/ -name 'prelim_map_*.txt' | xargs cat > library/added/prelim_map.txt
fi
seqid2taxid_map_file=seqid2taxid.map
if [ -e "$seqid2taxid_map_file" ]; then
  echo "Sequence ID to taxonomy ID map already present, skipping map creation."
else
  step_time=$(get_current_time)
  find library/ -maxdepth 2 -name prelim_map.txt | xargs cat > taxonomy/prelim_map.txt
  if [ ! -s "taxonomy/prelim_map.txt" ]; then
    echo "No preliminary seqid/taxid mapping files found, aborting."
    exit 1
  fi
  grep "^TAXID" taxonomy/prelim_map.txt | cut -f 2- > $seqid2taxid_map_file.tmp || true
  if grep "^ACCNUM" taxonomy/prelim_map.txt | cut -f 2- > accmap_file.tmp; then
    if compgen -G "taxonomy/*.accession2taxid" > /dev/null; then
      lookup_accession_numbers accmap_file.tmp taxonomy/*.accession2taxid > seqid2taxid_acc.tmp
      cat seqid2taxid_acc.tmp >> $seqid2taxid_map_file.tmp
      rm seqid2taxid_acc.tmp
    else
      echo "Accession to taxid map files are required to build this DB."
      echo "Run 'kraken2-build --db $KRAKEN2_DB_NAME --download-taxonomy' again?"
      exit 1
    fi
  fi
  rm -f accmap_file.tmp
  finalize_file $seqid2taxid_map_file
  echo "Sequence ID to taxonomy ID map complete. [$(report_time_elapsed $step_time)]"
fi

Produces the error messages:

Accession to taxid map files are required to build this DB.
Run 'kraken2-build --db $KRAKEN2_DB_NAME --download-taxonomy' again?

When I run through this section line by line myself, everything is fine until this command:

lookup_accession_numbers accmap_file.tmp taxonomy/*.accession2taxid > seqid2taxid_acc.tmp

at which point I get the error:

Found 0/1363031 targets...lookup_accession_numbers: unable to open taxonomy/*.accession2taxid: No such file or directory
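
A note on the literal taxonomy/*.accession2taxid in that message: by default, bash passes a glob pattern through unchanged when nothing matches it, so the program receives the pattern itself as a filename. The sketch below reproduces this in a scratch directory. The names show_args and demo_dir are hypothetical and used only for illustration, and the sketch assumes default shell options (no nullglob or failglob).

```shell
#!/usr/bin/env bash
# Illustration only: build a scratch directory that mimics a taxonomy/
# folder containing just names.dmp and nodes.dmp.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/taxonomy"
touch "$demo_dir/taxonomy/names.dmp" "$demo_dir/taxonomy/nodes.dmp"

# Hypothetical helper: print each argument it receives on its own line.
show_args() {
  printf '%s\n' "$@"
}

# Run in a subshell so the caller's working directory is left alone.
(
  cd "$demo_dir" || exit 1
  # No file matches, so the pattern is passed through as a literal string.
  show_args taxonomy/*.accession2taxid
  # Two files match, so the pattern expands to their paths.
  show_args taxonomy/*.dmp
)
```

If the same thing is happening here, the message would mean that no file in taxonomy/ ends in .accession2taxid, rather than that an existing file could not be opened.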

My ./taxonomy/ directory only contains the following:

-rw-r--r--+ 1  names.dmp
-rw-r--r--+ 1  nodes.dmp
drwxr-sr-x+ 2  .
-rw-r--r--+ 1  prelim_map.txt
drwxr-sr-x+ 5  ..
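
Reading the quoted section, the accession2taxid files are only consulted for prelim_map.txt lines that begin with ACCNUM, while lines that begin with TAXID are copied into the map directly. A rough diagnostic, assuming it is run from the database directory, could count both kinds of line and check whether any map files exist. The helpers count_prefix and has_accession_maps are hypothetical and not part of kraken2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kraken2): count the lines in a file
# that start with the given tag, mirroring the grep "^TAG" calls in the
# quoted script. grep -c exits non-zero on zero matches, hence "|| true".
count_prefix() {
  grep -c "^$1" "$2" || true
}

# Hypothetical helper: succeed if the given directory holds at least one
# file whose name ends in .accession2taxid.
has_accession_maps() {
  for f in "$1"/*.accession2taxid; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Assumption: the current directory is the kraken2 database directory.
map=taxonomy/prelim_map.txt
if [ -s "$map" ]; then
  echo "TAXID lines:  $(count_prefix TAXID "$map")"
  echo "ACCNUM lines: $(count_prefix ACCNUM "$map")"
  if has_accession_maps taxonomy; then
    echo "accession2taxid files are present in taxonomy/"
  else
    echo "no accession2taxid files in taxonomy/"
  fi
fi
```

A non-zero ACCNUM count together with no accession2taxid files would match the branch of the script that prints the error above.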

Should there be accession2taxid files in here? If so, when should they have been generated?

Happy to post on the kraken2 GitHub if that is more appropriate, but I figured this may be something that should have been generated elsewhere in the Struo2 pipeline.

Any help much appreciated, thanks!

joshsimcock, May 09 '22 12:05