
Bio::DB issue on Mac using APFS (SSD drive).

Open Juke34 opened this issue 6 years ago • 18 comments

Since I updated my Mac to High Sierra, I cannot index FASTA files properly.

There is a problem with the IDs: when you print them you don't see any difference, but if you check for unprintable characters, there are several million NULL characters (ASCII 0) at the end of each ID.

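use Bio::DB::Fasta;

# $fastafile holds the path to the FASTA file to index
my $db  = Bio::DB::Fasta->new($fastafile);
my @ids = $db->get_all_primary_ids;
foreach my $id (@ids) {
    # Flag any ID containing a non-printable character (e.g. trailing NULs)
    if ($id =~ /[^[:print:]]/) {
        printf("Contains unprintable characters: '%s'\n", $id);
        printf("String length is %d\n", length($id));
    }
}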
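
To put a number on the padding described above, a helper along these lines could count the NUL bytes at the end of each ID and produce a cleaned copy. This is only a diagnostic sketch: the function names `trailing_nul_count` and `strip_trailing_nuls` are made up for illustration and are not part of BioPerl, and stripping the IDs after the fact does not repair the index itself.

```perl
use strict;
use warnings;

# Count the NUL bytes (ASCII 0) at the end of a string.
sub trailing_nul_count {
    my ($id) = @_;
    my ($nuls) = $id =~ /(\0+)\z/;
    return defined $nuls ? length $nuls : 0;
}

# Return a copy of the string with any trailing NUL bytes removed.
sub strip_trailing_nuls {
    my ($id) = @_;
    (my $clean = $id) =~ s/\0+\z//;
    return $clean;
}

# Stand-in for an ID read back from a bad index (hypothetical value).
my $padded = "seq1" . ("\0" x 5);
printf "%d trailing NULs, cleaned ID is '%s'\n",
    trailing_nul_count($padded), strip_trailing_nuls($padded);
```

In the loop above, `trailing_nul_count($id)` could replace the `length($id)` report to show how much of each ID is padding.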

Indexing a 208 KB FASTA file produces a 37.7 MB index file; a 25 MB FASTA file ends up with a 1.52 GB index.

The exact same code with the same BioPerl version and the same perl works perfectly fine on other computers. With High Sierra, Apple introduced the Apple File System (APFS) for systems using an SSD (which is my case). I'm convinced that is the problem, but I'm not sure whether we can fix it from the BioPerl side...
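
The sizes quoted here work out to an index roughly 180 times larger than the FASTA file in the first case and roughly 60 times larger in the second. A rough sanity check could compare the two sizes and flag an index that is out of proportion. The sketch below is an assumption-laden illustration: the function names and the default threshold of 10 are hypothetical, and this thread does not say what a healthy ratio is.

```perl
use strict;
use warnings;

# Ratio of index size to FASTA size, both given in bytes.
sub index_to_fasta_ratio {
    my ($fasta_bytes, $index_bytes) = @_;
    die "FASTA size must be positive\n" unless $fasta_bytes > 0;
    return $index_bytes / $fasta_bytes;
}

# Hypothetical threshold: flag an index more than 10x the size of its FASTA file.
sub index_looks_bloated {
    my ($fasta_bytes, $index_bytes, $max_ratio) = @_;
    $max_ratio //= 10;
    return index_to_fasta_ratio($fasta_bytes, $index_bytes) > $max_ratio ? 1 : 0;
}

# Sizes from the report above, taken as decimal KB/MB. On real files they
# would come from the -s file test on the FASTA file and on its index.
printf "ratio: %.0f, bloated: %d\n",
    index_to_fasta_ratio(208_000, 37_700_000),
    index_looks_bloated(208_000, 37_700_000);
```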

Juke34 avatar Feb 02 '18 11:02 Juke34

I am now unable to get cpanm Bio::Perl to install on OS X, whereas Linux works fine.

The issue is the Bio::DB:: stuff as @cjfields had in https://github.com/bioperl/bioperl-live/issues/264

I am using brewed perl in both cases, and have the latest berkeley-db (and @4.0 version too) installed. I ha

#   Failed test at t/LocalDB/Fasta.t line 160.
#          got: '0'
#     expected: '7'

#   Failed test at t/LocalDB/Fasta.t line 243.
#     Structures begin differing at:
#          $got->[0] = '^@^@^G8^@^@^A^Y^@^@^A^Y^@G^@^C^C0^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
^@^@^@^@^@'
#     expected: '17601976'

#   Failed test at t/LocalDB/Qual.t line 26.
#          got: undef
#     expected: '17601991'
Subroutine Bio::DB::IndexedBase::_strip_crnl redefined at Bio/DB/IndexedBase.pm line 304.

#   Failed test at t/LocalDB/Qual.t line 76.

#   Failed test 'undef isa 'Bio::Seq::PrimaryQual''
#   at t/LocalDB/Qual.t line 77.
#     undef isn't defined

#   Failed test 'undef isa 'Bio::Seq::PrimaryQual''
#   at t/LocalDB/Qual.t line 82.
#     undef isn't defined
(build log file is 16 MB in size!)

I am also on the latest OS X and have the APFS file system now too. I am wondering if there is some 32/64-bit file offset issue going on?

The _strip_crnl warning is odd too: it is Inline::C, which should have a fallback to a Perl version.

Maybe we need to bring Lincoln back to fix this :-P

tseemann avatar Feb 15 '18 06:02 tseemann

grr, that _strip_crnl should really be using something more consistent like the implementation in Bio::Root::IO. Short gain of speed for a significant gain in pain.

@tseemann I should add, I'm using perlbrew installations.

cjfields avatar Feb 16 '18 02:02 cjfields

@tseemann I happened to have an older version of perl installed (pre-High Sierra), which hangs forever (my guess is that it's building the index) and then fails, though not with the same errors (they are similar enough). The error isn't due to Inline::C; I get this with or without it installed:

#   Failed test at t/LocalDB/Fasta.t line 243.
#     Structures begin differing at:
#          $got->[0] = '8G0'
#     $expected->[0] = ''
ok 92
ok 93 - An object of class 'Bio::PrimarySeq::Fasta' isa 'Bio::PrimarySeqI'
ok 94 - Make multiple IDs, bug \#3389
not ok 95

#   Failed test at t/LocalDB/Fasta.t line 254.
#     Structures begin differing at:
#          $got->[0] = '�G�0'
#     $expected->[0] = ''
ok 96
ok 97 - An object of class 'Bio::PrimarySeq::Fasta' isa 'Bio::PrimarySeqI'
ok 98 - Index a set of files
ok 99
ok 100
not ok 101

#   Failed test at t/LocalDB/Fasta.t line 266.
#     Structures begin differing at:
#          $got->[0] = '1ok 102
'
#     $expected->[0] = '0'

Also, turns out there is a pure-perl fallback:

C-based: https://github.com/bioperl/bioperl-live/blob/master/Bio/DB/IndexedBase.pm#L250

Pure-perl: https://github.com/bioperl/bioperl-live/blob/master/Bio/DB/IndexedBase.pm#L273

I can also see why they didn't use the Bio::Root::IO path either, since it's effectively stripping out any newlines (not the buffering in the Bio::Root version).

Are you using a version of perl compiled from Mac OS X 10.12, carried over to the latest version?

cjfields avatar Feb 16 '18 03:02 cjfields

I'm pure brew AFAIK. I use cpanm for everything, from brew also.

brew info perl
perl: stable 5.26.1 (bottled), HEAD
Highly capable, feature-rich programming language
https://www.perl.org/
/usr/local/Cellar/perl/5.26.1 (3,009 files, 61.7MB) *
  Poured from bottle on 2017-10-16 at 10:56:02


$ perl -v
This is perl 5, version 26, subversion 1 (v5.26.1) built for darwin-thread-multi-2level


$ which perl
/usr/local/bin/perl


$ perl -V

  Platform:
    osname=darwin
    osvers=17.0.0

  @INC:
    /usr/local/Cellar/perl/5.26.1/lib/perl5/site_perl/5.26.1/darwin-thread-multi-2level
    /usr/local/Cellar/perl/5.26.1/lib/perl5/site_perl/5.26.1
    /usr/local/Cellar/perl/5.26.1/lib/perl5/5.26.1/darwin-thread-multi-2level
    /usr/local/Cellar/perl/5.26.1/lib/perl5/5.26.1
    /usr/local/lib/perl5/site_perl/5.26.1

$ find /usr/local -name SeqIO.pm
/usr/local/Cellar/perl/5.26.0/lib/perl5/site_perl/5.26.0/Bio/SeqIO.pm

tseemann avatar Feb 18 '18 23:02 tseemann

@cjfields Sorry, but all the Macs I have access to are High Sierra now.

tseemann avatar Mar 02 '18 23:03 tseemann

@tseemann I'll see if I can debug this over the weekend. Seems like it's creating a corrupt index.

The older version I mentioned above seems to have a very similar issue, primarily that the final index is quite large. I thought this might be an issue with DB_File or the AnyDBM backend, so I think the fix was a full brew upgrade after an update (which had libdb), then reinstalling the relevant modules (I think DB_File) in my perlbrew. I also tried a clean perlbrew installation, which also worked. Can you try that on your end?

cjfields avatar Mar 03 '18 01:03 cjfields

Some related issues

https://github.com/GMOD/jbrowse/issues/946

https://github.com/GMOD/Apollo/issues/1820

The workaround suggested by JBrowse is to use a perl other than the one shipped with macOS, e.g. run

brew install berkeley-db; brew install --build-from-source perl

So that perlbrew step should fix it too @cjfields

cmdcolin avatar May 08 '18 17:05 cmdcolin

@cmdcolin I am not using macperl. I use brew install perl cpan-minus and use cpanm to install everything in the brew hierarchy. So is it something else?

tseemann avatar May 13 '18 02:05 tseemann

@tseemann I encountered this when carrying over a compiled perl/bdb from OS X 10.12; I had to reinstall berkeley-db, then reinstall perl and the Bioperl dependencies. I used brew install berkeley-db but then used perlbrew for the rest, which does a source-based installation.

We could add in a check similar to the one @rbuels made for JBrowse above.

cjfields avatar May 13 '18 14:05 cjfields

Does this mean using system perl on a Mac is no longer feasible for bioperl users? I haven't been able to install/use bioperl for months now, except on an obsolete Mac without an SSD/APFS.

enozkan avatar Jun 21 '18 05:06 enozkan

@enozkan system perl shouldn't be affected unless you are using an older version of the perl/bdb library. Are you seeing the same issue reported here?

cjfields avatar Jun 21 '18 11:06 cjfields

I have been having the same issues: enormous index files, and installing the bioperl module fails in cpan (same issues with fink packages as well). I need to figure out where the berkeley-db libraries are coming from, then (I think I have the same issue with or without the fink package for berkeley-db, so that's not likely it).

enozkan avatar Jun 21 '18 15:06 enozkan

@enozkan I found this when I upgraded but retained my perl installation along with precompiled libraries. So the problem won't go away if you merely update berkeley-db; you also need to re-install DB_File etc so that it recompiles against the updated version.
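
Before and after recompiling, it may help to confirm which DB_File a given perl actually loads. The following is a minimal sketch, not a BioPerl feature: the helper name `describe_db_file` is hypothetical, and it assumes that `$DB_File::db_ver` carries the Berkeley DB version, which is how I read the DB_File documentation.

```perl
use strict;
use warnings;

# Report which DB_File is loaded and what it says about Berkeley DB.
# Returns undef if DB_File cannot be loaded for this perl.
sub describe_db_file {
    return undef unless eval { require DB_File; 1 };
    no warnings 'once';
    return {
        module_path => $INC{'DB_File.pm'},
        module_ver  => $DB_File::VERSION,
        # Assumption: $DB_File::db_ver holds the Berkeley DB version.
        bdb_ver     => $DB_File::db_ver // 'unknown',
    };
}

my $info = describe_db_file();
if ($info) {
    print "$_: $info->{$_}\n" for sort keys %$info;
} else {
    print "DB_File is not installed for this perl\n";
}
```

If `module_path` points into a library directory carried over from before the OS upgrade, that would be consistent with the stale-build explanation given here.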

cjfields avatar Jun 21 '18 15:06 cjfields

Yikes. This is somewhat going over my head, but I think I understand the problem. Thank you. This thread gives it some more context: https://discussions.apple.com/thread/8125401

enozkan avatar Jun 21 '18 15:06 enozkan

@enozkan Yep, that's precisely the issue, thanks for that link! I don't think this is an easy one to work around, but we can maybe try to catch it. It very well may be due to APFS changes, and though Apple may fix their system berkeley-db it would still require recompiling DB_File (or any other library that used it).

cjfields avatar Jun 21 '18 15:06 cjfields

I finally had to fix this. Using a completely fresh Mac OS 10.13.6 install did not fix this, so (if I understand this correctly) the system berkeley-db/Perl DB_File are still broken. I bit the bullet and installed berkeley-db with Homebrew, compiled perl 5.28.0 through perlbrew, and after a few cycles of failed tests and installing more dependency modules with cpanm, I got Bioperl back. Thank you.

enozkan avatar Aug 16 '18 05:08 enozkan

Updating to macOS Mojave (10.14 (18A391)) has fixed the issue for me.

Juke34 avatar Dec 03 '18 14:12 Juke34

About the only way we can address this internally is to try detecting a defunct DB_File, then point to this thread. File system updates suck.
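
One possible shape for such a detection is a round-trip probe: write a few keys through DB_File, reopen the file, and check that the keys come back byte for byte. This is a sketch under stated assumptions, not an existing BioPerl check. The name `db_file_roundtrip_ok` is hypothetical, and nothing in this thread confirms that a small probe like this reproduces the NUL-padded keys seen with real indexes.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

# Write a few keys through DB_File, reopen the file, and compare the keys.
# Returns 1 if they match, 0 if not (or if tie fails), undef if DB_File
# is not available for this perl.
sub db_file_roundtrip_ok {
    return undef unless eval { require DB_File; 1 };
    no warnings 'once';

    my $dir  = tempdir(CLEANUP => 1);
    my $path = "$dir/probe.db";
    my @keys = map { "seq$_" } 1 .. 5;

    # First pass: create the database and store the keys.
    tie my %out, 'DB_File', $path, O_RDWR | O_CREAT, 0644, $DB_File::DB_HASH
        or return 0;
    $out{$_} = length $_ for @keys;
    untie %out;

    # Second pass: reopen and read the keys back.
    tie my %in, 'DB_File', $path, O_RDWR, 0644, $DB_File::DB_HASH
        or return 0;
    my @got = sort keys %in;
    untie %in;

    # Padded or corrupted keys make the joined strings differ.
    return "@got" eq join(' ', sort @keys) ? 1 : 0;
}

my $ok = db_file_roundtrip_ok();
print defined $ok
    ? ($ok ? "DB_File round trip OK\n" : "DB_File round trip FAILED\n")
    : "DB_File not available\n";
```

On a failure, the caller could print a pointer to this thread, as suggested above.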

cjfields avatar Dec 03 '18 14:12 cjfields