
md5checksum shows example dataset analysis fails

Open RishiDeKayne opened this issue 3 years ago • 12 comments

Hi, I've been trying to use dentist on the provided example dataset, but a number of the MD5 checksums fail after it finishes running, with no other errors that I can find.

I installed snakemake v6.0.0 and singularity v3.6.3 through conda and ran the example dataset as follows:

wget https://bds.mpi-cbg.de/hillerlab/DENTIST/dentist-example.v1.0.1.tar.gz
tar -xzf ./dentist-example.v1.0.1.tar.gz
cd dentist-example

# run the workflow
SKIP_LACHECK=1 snakemake --configfile=snakemake.yaml --use-singularity --cores=4 

# validate the files
md5sum -c checksum.md5

but the checksum output was as follows:

gap-closed.fasta: FAILED
workdir/.assembly-test.bps: OK
workdir/.assembly-test.dentist-reads.anno: OK
workdir/.assembly-test.dentist-reads.data: OK
workdir/.assembly-test.dentist-self.anno: OK
workdir/.assembly-test.dentist-self.data: OK
workdir/.assembly-test.dust.anno: OK
workdir/.assembly-test.dust.data: OK
workdir/.assembly-test.hdr: OK
workdir/.assembly-test.idx: OK
workdir/.assembly-test.tan.anno: OK
workdir/.assembly-test.tan.data: OK
workdir/.gap-closed-preliminary.bps: FAILED
workdir/.gap-closed-preliminary.dentist-self.anno: FAILED
workdir/.gap-closed-preliminary.dentist-self.data: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.anno: FAILED
workdir/.gap-closed-preliminary.dentist-weak-coverage.data: FAILED
workdir/.gap-closed-preliminary.dust.anno: FAILED
workdir/.gap-closed-preliminary.dust.data: FAILED
workdir/.gap-closed-preliminary.hdr: OK
workdir/.gap-closed-preliminary.idx: FAILED
workdir/.gap-closed-preliminary.tan.anno: FAILED
workdir/.gap-closed-preliminary.tan.data: FAILED
workdir/.reads.bps: OK
workdir/.reads.idx: OK
workdir/assembly-test.assembly-test.las: OK
workdir/assembly-test.dam: OK
workdir/assembly-test.reads.las: OK
workdir/gap-closed-preliminary.dam: FAILED
workdir/gap-closed-preliminary.fasta: FAILED
workdir/gap-closed-preliminary.gap-closed-preliminary.las: FAILED
workdir/gap-closed-preliminary.reads.las: FAILED
workdir/reads.db: OK
md5sum: WARNING: 15 computed checksums did NOT match
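
As an aside, md5sum -c reads a manifest of "digest  path" lines and recomputes each digest. If md5sum is unavailable, the same check can be reproduced with Python's standard hashlib; this is a minimal sketch, not part of the DENTIST workflow, and the file names are illustrative:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def check(manifest_path):
    """Mimic `md5sum -c`: each manifest line is '<digest>  <path>'.

    Prints OK/FAILED per file and returns the list of failing paths.
    """
    failures = []
    with open(manifest_path) as fh:
        for line in fh:
            digest, path = line.strip().split(None, 1)
            status = "OK" if md5_of(path) == digest else "FAILED"
            print(f"{path}: {status}")
            if status == "FAILED":
                failures.append(path)
    return failures
```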

Any advice on how to get the example dataset running would be greatly appreciated. Thanks, Rishi

RishiDeKayne avatar Mar 16 '21 14:03 RishiDeKayne

Hi Rishi, could you share one of the logs/process.*.log files? Someone else experienced failing md5sums like yours, and the reason was that one of the auxiliary tools crashed in most of its calls for a yet-unknown reason. Could you also share some more information about your system?

lsb_release -a
free -h

a-ludi avatar Mar 16 '21 20:03 a-ludi

Sure, I have attached process.1.log and the system info is as follows:

lsb_release -a

output:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
free -h

output:

              total        used        free      shared  buff/cache   available
Mem:           995G        1.8G        229G        324K        764G        988G
Swap:          8.0G        1.5G        6.5G

process.1.log

RishiDeKayne avatar Mar 16 '21 20:03 RishiDeKayne

As I suspected, it is the same memory-associated error:

$ jq 'select((.exitStatus // 0) != 0)' process.1.log | head -n50
{
  "thread": 140513968151344,
  "logLevel": "diagnostic",
  "state": "post",
  "command": [
    "computeintrinsicqv",
    "-d19",
    "/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.db",
    "/tmp/dentist-processPileUps-OeaddP/pileup-55b-56f.pileup-55b-56f-chained-filtered.las"
  ],
  "output": [
    "allocation failure: Invalid argument cachelinesize=0 requested size is 24",
    "AutoArray<unsigned long,alloc_type_memalign_cacheline> failed to allocate 3 elements (24 bytes)",
    "current total allocation 467987",
    "",
    ""
  ],
  "exitStatus": 1,
  "timestamp": 637514242850290800,
  "action": "execute",
  "type": "command"
}
... (many more instances with the same signature)
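
For readers without jq, the filter above can be mirrored in plain Python; this is a sketch assuming the log contains one JSON object per line, with field names matching the excerpt above:

```python
import json

def failed_commands(log_lines):
    """Yield log entries whose exitStatus is present and nonzero,
    mirroring: jq 'select((.exitStatus // 0) != 0)'."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("exitStatus", 0) != 0:
            yield entry

def summarize(log_lines):
    """Count failures per tool (first element of the command array)."""
    counts = {}
    for entry in failed_commands(log_lines):
        tool = entry.get("command", ["<unknown>"])[0]
        counts[tool] = counts.get(tool, 0) + 1
    return counts
```

Feeding it the lines of process.1.log would show how many calls of each tool failed, e.g. whether computeintrinsicqv accounts for all failures.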

The problem is clearly not related to a lack of memory. Since I have no in-depth knowledge of computeintrinsicqv, I will ask the author for help.

In the meantime, you may try running it on a different machine.

a-ludi avatar Mar 17 '21 08:03 a-ludi

Information from other user:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        15Gi       2.5Gi       4.1Gi       989Gi       982Gi
Swap:            0B          0B          0B
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"

BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

a-ludi avatar Mar 17 '21 08:03 a-ludi

Hi again. Oddly, I reran the example set on each of our computing nodes: it failed on every one of our big-memory machines but ran on our regular machines. I did the same system checks as above but can't find anything obviously different between the two, so I'm still not sure what could be causing it. In case it is helpful:

$ lsb_release -a

##WORKED - regular 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

##FAILED - big-memory 
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

$ free -h

##WORKED - regular
              total        used        free      shared  buff/cache   available
Mem:           360G        5.2G        332G        4.4M         21G        354G
Swap:          8.0G        8.0G         88K

##FAILED - big memory  
              total        used        free      shared  buff/cache   available
Mem:           995G        1.8G        229G        324K        764G        988G
Swap:          8.0G        1.5G        6.5G

On the regular machines, all checksum outputs now say 'OK'.

RishiDeKayne avatar Mar 19 '21 12:03 RishiDeKayne

Hmm, interesting. I will try running the example on a 1TB memory machine as well. Maybe there is some bug related to large pointers.

a-ludi avatar Mar 22 '21 11:03 a-ludi

Hi, I have the same issue. md5sum -c checksum.md5 failed (15 cases). I am using a machine with 2 TB RAM (Ubuntu).

shri1984 avatar Apr 06 '21 06:04 shri1984

I tried it on one of our big memory machines and it worked as expected:

# submit job with 8 cores
$ sbatch -c8 -pbigmem --wrap='snakemake --configfile=snakemake.yaml --use-singularity --cores=$SLURM_JOB_CPUS_PER_NODE'
# memory information about the machine
$ ssh r01n03 free -h
              total        used        free      shared  buff/cache   available
Mem:           1.0T        964G         40G        1.6G        2.5G         39G
Swap:            0B          0B          0B
# OS information about the machine
$ ssh r01n03 lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:        7.4.1708
Codename:       Core

So I conjecture (:smile:) that it is not just the amount of total or available memory that causes the bug. But I still have no clue what's going on. Also, I have not heard anything from the author of daccord (see this issue). I will keep digging.

a-ludi avatar Apr 19 '21 07:04 a-ludi

@shri1984 @RishiDeKayne I hope you are still interested in DENTIST after all this time but I think I have fixed the bug (25f96d2161e3345283553e51671b702fcf73ce45). I would be very happy if you could test the example again and see if it works.

The issue (likely) was that I used Alpine Linux in the container, which has its own libc implementation (musl) that is not 100% compatible with the glibc used in common distros like Ubuntu. I switched to Ubuntu and the error went away on one of my machines.
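
One quick way to see which C library a container image actually ships is to ask Python inside it; platform.libc_ver() reports a glibc version on glibc systems and typically an empty string on musl-based images such as Alpine. A sketch for diagnosis only, not a definitive libc detector:

```python
import platform

def describe_libc():
    """Return a human-readable guess at the C library in use.

    platform.libc_ver() inspects the running executable and reports
    ('glibc', '<version>') on glibc systems; on musl (e.g. Alpine)
    it typically returns ('', ''), which we report as unknown.
    """
    lib, version = platform.libc_ver()
    if lib:
        return f"{lib} {version}"
    return "unknown (possibly musl)"
```

Running this inside the old and new container images would make the glibc/musl difference visible directly.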

a-ludi avatar Jun 22 '21 12:06 a-ludi

Thanks @a-ludi. The example dataset ran fine, including the md5sum check. The latest version helped. I am trying dentist on my Hi-C-scaffolded HiFi assembly and will post an update here.

shri1984 avatar Aug 11 '21 19:08 shri1984

> @shri1984 @RishiDeKayne I hope you are still interested in DENTIST after all this time but I think I have fixed the bug (25f96d2). I would be very happy if you could test the example again and see if it works.
>
> The issue (likely) was that I used Alpine Linux in the Container which has its own libc implementation that is not 100% compatible with glibc used in common Distros like Ubuntu. I switched to Ubuntu and the error went away on one of my machines.

Thanks for your work, but I get the same issue with the example data on the latest version (v4.0.0): md5sum -c checksum.md5 failed (15 cases). The information about the machine is:

              total        used        free      shared  buff/cache   available
Mem:           2.0T        535G        1.4T         56M        5.7G        1.4T
Swap:          4.0G        2.1G        1.9G

LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.5.1804 (Core) 
Release:	7.5.1804
Codename:	Core

lizhao007 avatar Nov 26 '22 09:11 lizhao007

Hi @lizhao007 ,

could you please share the list of files that failed the checksum test? I need it to get an idea of what went wrong.

a-ludi avatar Dec 05 '22 14:12 a-ludi