kraken2 download library fail

Open RJBeng opened this issue 3 years ago • 22 comments

Hello,

I am using Kraken2 v2.1.2. When I try to download a library with ./kraken2-build --download-library viral --db viral_db, I get the following error:

rsync: link_stat "/all/GCF/002/957/295/GCF_002957295.1_ASM295729v1/GCF_002957295.1_ASM295729v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync: link_stat "/all/GCF/006/869/785/GCF_006869785.1_ASM686978v1/GCF_006869785.1_ASM686978v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync: link_stat "/all/GCF/003/034/835/GCF_003034835.1_ASM303483v1/GCF_003034835.1_ASM303483v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync: link_stat "/all/GCF/002/957/515/GCF_002957515.1_ASM295751v1/GCF_002957515.1_ASM295751v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync: link_stat "/all/GCF/003/014/195/GCF_003014195.1_ASM301419v1/GCF_003014195.1_ASM301419v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [generator=3.1.1]
rsync_from_ncbi.pl: rsync error, exiting: 5888
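
A side note on the last number in that log: 5888 appears to be the raw wait status that Perl stores in $? after system(), not rsync's own exit code. On POSIX systems the exit code sits in the high byte, so 5888 corresponds to exit code 23, which matches the "code 23" that rsync prints. A minimal Python sketch of that arithmetic (the function name is only illustrative):

```python
def decode_wait_status(status):
    """Split a POSIX wait status (what Perl stores in $?) into (exit_code, signal)."""
    exit_code = status >> 8   # high byte: the exit code of the child process
    signal = status & 0x7F    # low 7 bits: terminating signal, 0 if it exited normally
    return exit_code, signal

print(decode_wait_status(5888))  # → (23, 0)
```

So both messages describe the same failure: rsync could not transfer some of the requested files.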

Not sure what is going on. I built a library using the exact same command two days ago and it worked fine.

Many thanks, Rebecca

RJBeng avatar Jun 21 '21 16:06 RJBeng

Something must have changed on the NCBI side: the paths are still listed in their viral genome file, but the genomes themselves have been removed. I'll see if there is a fix from our side.

jenniferlu717 avatar Jun 23 '21 18:06 jenniferlu717

Thank you so much for your help :)

RJBeng avatar Jun 24 '21 07:06 RJBeng

The problem stems from files that are listed but do not exist on the NCBI servers. It can be solved by modifying the rsync_from_ncbi.pl file. You only need to reuse the dry-run code from the protein-download condition in the rsync branch, that is, the else that follows the FTP branch:

else {
  # Added: dry run to find files missing on the server and drop them from the manifest
  system("rsync --dry-run --no-motd --files-from=manifest.txt rsync://${SERVER}${SERVER_PATH} . 2> rsync.err");
  open ERR_FILE, "<", "rsync.err"
    or die "$PROG: can't read rsync.err file: $!\n";
  while (<ERR_FILE>) {
    chomp;
    # I really doubt this will work across every version of rsync. :(
    if (/failed: No such file or directory/ && /^rsync: link_stat "\/([^"]+)"/) {
      delete $manifest{$1};
    }
  }
  close ERR_FILE;
  print STDERR "Rsync dry run complete, removing any non-existent files from manifest.\n";

  # Rewrite manifest
  open MANIFEST, ">", "manifest.txt"
    or die "$PROG: can't write manifest: $!\n";
  print MANIFEST "$_\n" for keys %manifest;
  close MANIFEST;
  # End of added code

  print STDERR "Step 1/2: Performing rsync file transfer of requested files\n";
  system("rsync --no-motd --files-from=manifest.txt rsync://${SERVER}${SERVER_PATH}/ .") == 0
    or die "$PROG: rsync error, exiting: $?\n";
  print STDERR "Rsync file transfer complete.\n";
}
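
For readers who do not read Perl, the pruning step above can be sketched in Python as follows. This is only an illustration of the logic, not code from Kraken2: the function name, the second manifest entry, and the taxid values are made up, while the error line is taken from the log at the top of the thread.

```python
import re

# Same pattern as the Perl snippet: capture the path inside link_stat "/..."
LINK_STAT = re.compile(r'^rsync: link_stat "/([^"]+)"')

def prune_manifest(manifest, rsync_stderr):
    """Return a copy of manifest without the files rsync reported as missing."""
    pruned = dict(manifest)
    for line in rsync_stderr.splitlines():
        if "failed: No such file or directory" not in line:
            continue
        match = LINK_STAT.match(line)
        if match:
            # Manifest keys have no leading slash, hence the "/" outside the group
            pruned.pop(match.group(1), None)
    return pruned

missing = "all/GCF/002/957/295/GCF_002957295.1_ASM295729v1/GCF_002957295.1_ASM295729v1_genomic.fna.gz"
manifest = {
    missing: "hypothetical-taxid-1",
    "all/GCF/000/000/000/example/example_genomic.fna.gz": "hypothetical-taxid-2",
}
stderr_text = (
    'rsync: link_stat "/' + missing + '" (in genomes) failed: '
    "No such file or directory (2)\n"
)
print(sorted(prune_manifest(manifest, stderr_text)))
```

As the Perl comment already warns, this depends on the exact wording of rsync's error messages, so it may need adjusting for other rsync versions.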

ATVincent avatar Jun 24 '21 17:06 ATVincent

@ATVincent thank you for your workaround! I am having difficulty identifying where in rsyinc_from_ncbi.pl we should paste your code. Should we just paste it at the end of the file, or do we need to replace some existing lines with your code? Thanks again!

ctuni avatar Jun 29 '21 10:06 ctuni

Here is the complete modified code for the rsync_from_ncbi.pl file. Don't hesitate to ask if there is a problem.

#!/usr/bin/env perl

# Copyright 2013-2021, Derrick Wood <[email protected]>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.

# Reads an assembly_summary.txt file, which indicates taxids and FTP paths for
# genome/protein data.  Performs the download of the complete genomes from
# that file, decompresses, and explicitly assigns taxonomy as needed.

use strict;
use warnings;
use File::Basename;
use Getopt::Std;
use Net::FTP;
use List::Util qw/max/;

my $PROG = basename $0;
my $SERVER = "ftp.ncbi.nlm.nih.gov";
my $SERVER_PATH = "/genomes";
my $FTP_USER = "anonymous";
my $FTP_PASS = "kraken2download";

my $qm_server = quotemeta $SERVER;
my $qm_server_path = quotemeta $SERVER_PATH;

my $is_protein = $ENV{"KRAKEN2_PROTEIN_DB"};
my $use_ftp = $ENV{"KRAKEN2_USE_FTP"};

my $suffix = $is_protein ? "_protein.faa.gz" : "_genomic.fna.gz";

# Manifest hash maps filenames (keys) to taxids (values)
my %manifest;
while (<>) {
  next if /^#/;
  chomp;
  my @fields = split /\t/;
  my ($taxid, $asm_level, $ftp_path) = @fields[5, 11, 19];
  # Possible TODO - make the list here configurable by user-supplied flags
  next unless grep {$asm_level eq $_} ("Complete Genome", "Chromosome");
  next if $ftp_path eq "na";  # Skip if no provided path

  my $full_path = $ftp_path . "/" . basename($ftp_path) . $suffix;
  # strip off server/leading dir name to allow --files-from= to work w/ rsync
  # also allows filenames to just start with "all/", which is nice
  if (! ($full_path =~ s#^ftp://${qm_server}${qm_server_path}/##)) {
    die "$PROG: unexpected FTP path (new server?) for $ftp_path\n";
  }
  $manifest{$full_path} = $taxid;
}

open MANIFEST, ">", "manifest.txt"
  or die "$PROG: can't write manifest: $!\n";
print MANIFEST "$_\n" for keys %manifest;
close MANIFEST;

if ($is_protein && ! $use_ftp) {
  print STDERR "Step 0/2: performing rsync dry run (only protein d/l requires this)...\n";
  # Protein files aren't always present, so we have to do this two-rsync run hack
  # First, do a dry run to find non-existent files, then delete them from the
  # manifest; after this, execution can proceed as usual.
  system("rsync --dry-run --no-motd --files-from=manifest.txt rsync://${SERVER}${SERVER_PATH} . 2> rsync.err");
  open ERR_FILE, "<", "rsync.err"
    or die "$PROG: can't read rsync.err file: $!\n";
  while (<ERR_FILE>) {
    chomp;
    # I really doubt this will work across every version of rsync. :(
    if (/failed: No such file or directory/ && /^rsync: link_stat "\/([^"]+)"/) {
      delete $manifest{$1};
    }
  }
  close ERR_FILE;
  print STDERR "Rsync dry run complete, removing any non-existent files from manifest.\n";

  # Rewrite manifest
  open MANIFEST, ">", "manifest.txt"
    or die "$PROG: can't write manifest: $!\n";
  print MANIFEST "$_\n" for keys %manifest;
  close MANIFEST;
}

sub ftp_connection {
    my $ftp = Net::FTP->new($SERVER, Passive => 1)
        or die "$PROG: FTP connection error: $@\n";
    $ftp->login($FTP_USER, $FTP_PASS)
        or die "$PROG: FTP login error: " . $ftp->message() . "\n";
    $ftp->binary()
        or die "$PROG: FTP binary mode error: " . $ftp->message() . "\n";
    $ftp->cwd($SERVER_PATH)
        or die "$PROG: FTP CD error: " . $ftp->message() . "\n";
    return $ftp;
}

if ($use_ftp) {
  print STDERR "Step 1/2: Performing ftp file transfer of requested files\n";
  open MANIFEST, "<", "manifest.txt"
    or die "$PROG: can't open manifest: $!\n";
  mkdir "all" or die "$PROG: can't create 'all' directory: $!\n";
  chdir "all" or die "$PROG: can't chdir into 'all' directory: $!\n";
  while (<MANIFEST>) {
    chomp;
    my $ftp = ftp_connection();
    my $try = 0;
    my $ntries = 5;
    my $sleepsecs = 3;
    while($try < $ntries) {
        $try++;
        last if $ftp->get($_);
        warn "$PROG: unable to download $_ on try $try of $ntries: ".$ftp->message()."\n";
        last if $try == $ntries;
        sleep $sleepsecs;
        $sleepsecs *= 3;
    }
    die "$PROG: unable to download ftp://${SERVER}${SERVER_PATH}/$_\n" if $try == $ntries;
    $ftp->quit;
  }
  close MANIFEST;
  chdir ".." or die "$PROG: can't return to correct directory: $!\n";
}
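else {
  # Added workaround: do the dry run for every rsync download (not only protein),
  # so that files listed in assembly_summary.txt but missing on the server are
  # dropped from the manifest. For protein downloads this repeats Step 0/2 above.
  system("rsync --dry-run --no-motd --files-from=manifest.txt rsync://${SERVER}${SERVER_PATH} . 2> rsync.err");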
  open ERR_FILE, "<", "rsync.err"
    or die "$PROG: can't read rsync.err file: $!\n";
  while (<ERR_FILE>) {
    chomp;
    # I really doubt this will work across every version of rsync. :(
    if (/failed: No such file or directory/ && /^rsync: link_stat "\/([^"]+)"/) {
      delete $manifest{$1};
    }
  }
  close ERR_FILE;
  print STDERR "Rsync dry run complete, removing any non-existent files from manifest.\n";

  # Rewrite manifest
  open MANIFEST, ">", "manifest.txt"
    or die "$PROG: can't write manifest: $!\n";
  print MANIFEST "$_\n" for keys %manifest;
  close MANIFEST;

  print STDERR "Step 1/2: Performing rsync file transfer of requested files\n";
  system("rsync --no-motd --files-from=manifest.txt rsync://${SERVER}${SERVER_PATH}/ .") == 0
    or die "$PROG: rsync error, exiting: $?\n";
  print STDERR "Rsync file transfer complete.\n";
}
print STDERR "Step 2/2: Assigning taxonomic IDs to sequences\n";
my $output_file = $is_protein ? "library.faa" : "library.fna";
open OUT, ">", $output_file
  or die "$PROG: can't write $output_file: $!\n";
my $projects_added = 0;
my $sequences_added = 0;
my $ch_added = 0;
my $ch = $is_protein ? "aa" : "bp";
my $max_out_chars = 0;
for my $in_filename (keys %manifest) {
  my $taxid = $manifest{$in_filename};
  if ($use_ftp) {  # FTP downloading doesn't create full path locally
    $in_filename = "all/" . basename($in_filename);
  }
  open IN, "gunzip -c $in_filename |" or die "$PROG: can't read $in_filename: $!\n";
  while (<IN>) {
    if (/^>/) {
      s/^>/>kraken:taxid|$taxid|/;
      $sequences_added++;
    }
    else {
      $ch_added += length($_) - 1;
    }
    print OUT;
  }
  close IN;
  unlink $in_filename;
  $projects_added++;
  my $out_line = progress_line($projects_added, scalar keys %manifest, $sequences_added, $ch_added) . "...";
  $max_out_chars = max(length($out_line), $max_out_chars);
  my $space_line = " " x $max_out_chars;
  print STDERR "\r$space_line\r$out_line" if -t STDERR;
}
close OUT;
print STDERR " done.\n" if -t STDERR;

print STDERR "All files processed, cleaning up extra sequence files...";
system("rm -rf all/") == 0
  or die "$PROG: can't clean up all/ directory: $?\n";
print STDERR " done, library complete.\n";

sub progress_line {
  my ($projs, $total_projs, $seqs, $chs) = @_;
  my $line = "Processed ";
  $line .= ($projs == $total_projs) ? "$projs" : "$projs/$total_projs";
  $line .= " project" . ($total_projs > 1 ? 's' : '') . " ";
  $line .= "($seqs sequence" . ($seqs > 1 ? 's' : '') . ", ";
  my $prefix;
  my @prefixes = qw/k M G T P E/;
  while (@prefixes && $chs >= 1000) {
    $prefix = shift @prefixes;
    $chs /= 1000;
  }
  if (defined $prefix) {
    $line .= sprintf '%.2f %s%s)', $chs, $prefix, $ch;
  }
  else {
    $line .= "$chs $ch)";
  }
  return $line;
}

ATVincent avatar Jun 29 '21 16:06 ATVincent

Hi @ATVincent, this script works for me in downloading the viral library. Many thanks.

Just a small correction: the file is rsync_from_ncbi.pl, not rsyinc_from_ncbi.pl as written at the top, in case people have problems finding the file.

choon-sim avatar Jun 30 '21 04:06 choon-sim

Thanks @choon-sim ! I modified the original comment.

ATVincent avatar Jun 30 '21 12:06 ATVincent

Thank you @ATVincent for posting here the full code to the modified script!

ctuni avatar Jun 30 '21 13:06 ctuni

I had the same problem only when downloading viruses. This solved it. Thanks! :-)

mariaasierra avatar Jul 18 '21 21:07 mariaasierra

Good workaround! And if you cannot use rsync and you are using the --use-ftp flag, the following lines do the same trick:

if ($use_ftp) {
  print STDERR "Step 1/2: Performing ftp file transfer of requested files\n";
  open MANIFEST, "<", "manifest.txt"
    or die "$PROG: can't open manifest: $!\n";
  mkdir "all" or die "$PROG: can't create 'all' directory: $!\n";
  chdir "all" or die "$PROG: can't chdir into 'all' directory: $!\n";
  while (<MANIFEST>) {
    chomp;
    my $ftp = ftp_connection();
    my $try = 0;
    my $ntries = 5;
    my $sleepsecs = 3;
    while($try < $ntries) {
        $try++;
        last if $ftp->get($_);
        warn "$PROG: unable to download $_ on try $try of $ntries: ".$ftp->message()."\n";
        if ($try == $ntries) {
            # Give up on this file: drop it from the manifest instead of dying
            delete $manifest{$_};
            print STDERR "$PROG: skipping non-existent file ftp://${SERVER}${SERVER_PATH}/$_\n";
            last;
        }
        sleep $sleepsecs;
        $sleepsecs *= 3;
    }
    $ftp->quit;
  }
  close MANIFEST;
  chdir ".." or die "$PROG: can't return to correct directory: $!\n";

  # Perform the same trick suggested by Anthony Vincent
  # Rewrite manifest as seen in the condition at line 57:  if ($is_protein && ! $use_ftp) {
  # (done after leaving all/ so that the top-level manifest.txt is the one rewritten)
  open MANIFEST, ">", "manifest.txt"
    or die "$PROG: can't write manifest: $!\n";
  print MANIFEST "$_\n" for keys %manifest;
  close MANIFEST;
}
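
The retry logic in this FTP branch (try up to 5 times, triple the wait after each failure, then skip the file rather than abort) can be sketched in Python like this. It is a simplified illustration under my own assumptions: fetch is a hypothetical callable standing in for $ftp->get, and the function names are not part of Kraken2.

```python
import time

def fetch_with_retries(fetch, name, ntries=5, first_sleep=3, sleep=time.sleep):
    """Call fetch(name) up to ntries times; return True on success, False otherwise.

    The wait triples after each failed attempt (3, 9, 27, ... seconds),
    mirroring `$sleepsecs *= 3` in the Perl code.
    """
    delay = first_sleep
    for attempt in range(1, ntries + 1):
        if fetch(name):
            return True
        if attempt < ntries:  # no point in sleeping after the final attempt
            sleep(delay)
            delay *= 3
    return False

def download_all(manifest, fetch, **retry_options):
    """Try every file and keep only the entries that were actually downloaded."""
    kept = {}
    for name, taxid in manifest.items():
        if fetch_with_retries(fetch, name, **retry_options):
            kept[name] = taxid
    return kept

# Demo with a fake fetch (no network): "missing.gz" never downloads.
manifest = {"present.gz": "taxid-a", "missing.gz": "taxid-b"}
kept = download_all(manifest, lambda name: name == "present.gz", sleep=lambda s: None)
print(kept)
```

Passing sleep as a parameter is only there so the demo runs instantly. One trade-off to keep in mind: a file that fails for a transient network reason is skipped in the same way as one that is genuinely missing from the server.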

soda460 avatar Aug 04 '21 02:08 soda460

These errors are caused by a mismatch between assembly_summary.txt and the actual file paths on the NCBI FTP server.

Simply insert this code into the rsync_from_ncbi.pl file, after line 41: next if $ftp_path eq "na"; # Skip if no provided path

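  # Fix NCBI full_path error:
  # (the URL scheme must match the prefix the script strips further down, which is
  # "ftp://" in the version posted above; otherwise it dies with "unexpected FTP path")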
  if ($taxid eq '2053603') {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/957/275/GCF_002957275.1_ASM295727v1";
  }
  if ($taxid eq '1897434') {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/612/345/GCF_002612345.1_ASM261234v1";
  }
  if ($taxid eq "1897641") {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/613/645/GCF_002613645.1_ASM261364v1";
  }
  if ($taxid eq "1897515") {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/709/945/GCF_002709945.1_ASM270994v1";
  }
  if ($taxid eq "764348") {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/887/575/GCF_000887575.6_ASM88757v6";
  }
  if ($taxid eq "2015851") {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/626/265/GCF_002626265.2_ASM262626v2";
  }
  if ($taxid eq "1965361") {
    $ftp_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/619/845/GCF_002619845.1_ASM261984v1";
  }

NOTE: These errors may be fixed by NCBI in the future. Only use the code above if you actually hit these errors.
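
The chain of if statements is essentially a lookup table from taxid to a corrected directory. As a rough sketch of the same idea in Python (the names PATH_OVERRIDES and corrected_path are invented for illustration; the taxids and paths are copied from the snippet above):

```python
BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF"

# Corrected assembly directories, keyed by taxid (values from the Perl snippet)
PATH_OVERRIDES = {
    "2053603": BASE + "/002/957/275/GCF_002957275.1_ASM295727v1",
    "1897434": BASE + "/002/612/345/GCF_002612345.1_ASM261234v1",
    "1897641": BASE + "/002/613/645/GCF_002613645.1_ASM261364v1",
    "1897515": BASE + "/002/709/945/GCF_002709945.1_ASM270994v1",
    "764348": BASE + "/000/887/575/GCF_000887575.6_ASM88757v6",
    "2015851": BASE + "/002/626/265/GCF_002626265.2_ASM262626v2",
    "1965361": BASE + "/002/619/845/GCF_002619845.1_ASM261984v1",
}

def corrected_path(taxid, ftp_path):
    """Return the override for this taxid if there is one, else the original path."""
    return PATH_OVERRIDES.get(str(taxid), ftp_path)

print(corrected_path("764348", "path-from-assembly-summary"))
print(corrected_path("12345", "path-from-assembly-summary"))
```

Like the Perl version, this keys on the taxid alone, so it would rewrite the path of every assembly that shares one of these taxids. Keying on the assembly accession instead might be safer, but that goes beyond what was posted here.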

zer0liu avatar Oct 19 '21 03:10 zer0liu

Hi everyone!!

I followed the recommendations above but still could not resolve the problem. My error is now:

rsync_from_ncbi.pl: unexpected FTP path (new server?) for https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/762/265/GCF_000762265.1_ASM76226v1
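
For context, this message comes from the substitution in the script posted above, which strips the prefix ftp://ftp.ncbi.nlm.nih.gov/genomes/ from each path and dies when the prefix is not there. A path that starts with https:// therefore fails that check. Below is a small Python sketch of a prefix check that accepts both schemes. It is my own illustration, not the official Kraken2 fix, and the function name is hypothetical.

```python
import re

SERVER = "ftp.ncbi.nlm.nih.gov"
SERVER_PATH = "/genomes"

# Accept either scheme in front of the server name and leading directory
PREFIX = re.compile(
    r"^(?:ftp|https)://" + re.escape(SERVER) + re.escape(SERVER_PATH) + "/"
)

def to_manifest_path(full_path):
    """Strip scheme, server and leading directory so the path starts with 'all/'."""
    stripped, count = PREFIX.subn("", full_path, count=1)
    if count == 0:
        raise ValueError("unexpected path (new server?): " + full_path)
    return stripped

directory = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/762/265/GCF_000762265.1_ASM76226v1"
full_path = directory + "/GCF_000762265.1_ASM76226v1_genomic.fna.gz"
print(to_manifest_path(full_path))
```

The equivalent change in the Perl script would be to the pattern in the s#^ftp://...## line. I have not tested that against every Kraken2 version.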

Rohit-Satyam avatar Apr 13 '22 17:04 Rohit-Satyam

Hello !

I'm using kraken2 v2.1.2 on a Linux computer with this command: kraken2-build --standard --use-ftp --db database_kraken2

Whether or not I change "^ftp://" to "^https://" in the "rsync_from_ncbi.pl" file, the same error occurs: "rsync_from_ncbi.pl: unexpected FTP path (new server?)"

The issue persists even if I try to download only the bacteria or viral library.

When I downloaded the taxonomy, no errors happened. Maybe that helps.

If you have any solution, I would be grateful. Thanks!

mnemosymenicolas avatar May 18 '22 05:05 mnemosymenicolas

Hi @Rohit-Satyam @mnemosymenicolas @jenniferlu717 In my department, all ports except the standard default ones (such as 22, 443, etc.) are blocked by IT. Maybe you need to ask IT to open the relevant port for your download (though I'm not sure which port is the relevant one; maybe someone else can comment on this).

A workaround is to download the data to your PC and then upload it to your Linux machine before building the database.
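
On the question of which port: by default the rsync:// protocol uses TCP port 873 and the FTP control connection uses TCP port 21 (passive FTP also opens additional data ports), while HTTPS uses 443. A quick way to see whether a port is reachable from your machine is a plain TCP connection test, sketched here in Python (the function name is illustrative only):

```python
import socket

def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or DNS failure
        return False

# Example (needs network access), checking FTP, HTTPS and rsync in turn:
#   for port in (21, 443, 873):
#       print(port, port_reachable("ftp.ncbi.nlm.nih.gov", port))
```

A False result for 873 or 21 alongside True for 443 would be consistent with a firewall that blocks those ports, although it does not prove it.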

choon-sim avatar May 18 '22 06:05 choon-sim

Did this get solved?

I ran into the same problem this morning after installing Kraken2 from conda:

Step 1/2: Performing rsync file transfer of requested files
rsync: link_stat "/all/GCF/003/143/375/GCF_003143375.1_ASM314337v1/GCF_003143375.1_ASM314337v1_genomic.fna.gz" (in genomes) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1816) [generator=3.2.3]
rsync_from_ncbi.pl: rsync error, exiting: 5888

It sure would be good for kraken2 not to fail on these errors.....

mw55309 avatar Aug 28 '22 10:08 mw55309

For anyone landing here, I found a rather simple solution. I'm not sure if the outcome data is reliable as a result (I'm helping a friend get set up for their work; I'm "just a computer guy"). I just took away the conditional that encapsulates the "Step 0/2" (dry run) step, which already has code to omit missing files from the manifest if rsync complains. Keep the interior of the block and de-dent it after removing the if (...) { (and the closing } at the end of the block), and you should be in business! It's a little slow doing the full dry run(s) but it's at least functioning!

josefdlange avatar Mar 16 '23 23:03 josefdlange

@josefdlange

It would help if you could paste the actual modified script here, or as a GitHub gist.

amizeranschi avatar Mar 26 '23 19:03 amizeranschi

@amizeranschi I've gone ahead and opened a Pull Request with the changes I've made. Please see https://github.com/DerrickWood/kraken2/pull/705 for the changes.

Just to set expectations, I don't intend on maintaining my own fork of the project in the future, so please either manually make your own changes based off mine in your local clone of the repository, or follow progress of the PR to see if @DerrickWood pulls the PR into this main kraken2 repository.

josefdlange avatar Mar 27 '23 16:03 josefdlange

I ran into a different problem that gives the same error message. I'm downloading with rsync, but rsync --dry-run doesn't actually complain about the missing files, so rsync.err doesn't contain those complaints and the missing files aren't excluded from the manifest. I "fixed" this by removing the --dry-run argument, but this makes the "check if the files are there" step very slow, and it would be better to fix this in the actual downloading step.
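
One possible shape for a fix in the downloading step, sketched in Python: let the real transfer run once, treat rsync's exit code 23 (the "some files/attrs were not transferred" code in the logs above) as non-fatal, and then drop every manifest entry whose file did not arrive on disk. This is only an idea under those assumptions, not what Kraken2 does, and the function name is made up.

```python
import os

def drop_undownloaded(manifest, root="."):
    """Keep only the manifest entries whose file actually exists under root."""
    return {
        path: taxid
        for path, taxid in manifest.items()
        if os.path.isfile(os.path.join(root, path))
    }

# Intended use, after the real rsync run has finished:
#   manifest = drop_undownloaded(manifest)
# followed by rewriting manifest.txt from the remaining keys.
```

This avoids parsing rsync's error messages altogether, at the cost of not being able to tell a file that is missing on the server from one that failed for another reason.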

jeffkaufman avatar Jul 14 '23 12:07 jeffkaufman

Hello, I'm using kraken2 version 2.1.1.

Last week, when I ran kraken2-build --download-library viral, it gave me the error below: [image: MicrosoftTeams-image (5)]

However, when I ran the same command today without changing any code, it ran through successfully.

I'm puzzled about the reason. Is it because the NCBI directory https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/ is constantly updated? How do we prevent this error from happening? Is any code change needed in the rsync_from_ncbi.pl script?

FreddieLPF avatar Aug 31 '23 18:08 FreddieLPF

Hi @FreddieLPF, how many errors like rsync: link_stat "/all/GCF/000/846/845/GCF_000846845.1_ViralProj14525/GCF_000846845.1_ViralProj14525_protein.faa.gz" (in genomes) failed: No such file or directory (2) did you get? I got tons, even after applying the modification @ATVincent posted.

Scott-0208 avatar Jul 26 '24 17:07 Scott-0208

Does anyone else get the same error after applying @ATVincent's modified rsync_from_ncbi.pl (the complete script posted above)? Let's figure it out! Thank you!

Scott-0208 avatar Jul 26 '24 17:07 Scott-0208