grabix icon indicating copy to clipboard operation
grabix copied to clipboard

incorrect output

Open drmjc opened this issue 11 years ago • 11 comments

Hi Aaron, Whilst on the way to getting up to speed on gemini, I've had a play with grabix. Unfortunately it's not playing nicely on our CentOS systems. Following the examples in the README.md, 'grab' always returns the first row in BED file, N times & 'random' gives me a seg fault:

$ bgzip simrep.chr1.bed $ zcat simrep.chr1.bed.gz | head -n 10 chr1 10000 10468 trf 789 chr1 10627 10800 trf 346 chr1 10757 10997 trf 434 chr1 11225 11447 trf 273 chr1 11271 11448 trf 187 chr1 11283 11448 trf 199 chr1 19305 19443 trf 242 chr1 20828 20863 trf 70 chr1 30862 30959 trf 79 chr1 44835 44876 trf 73

$ ./grabix index simrep.chr1.bed.gz -- worked fine

-- this reports the 1st line, not the 100th $ ./grabix grab simrep.chr1.bed.gz 100 chr1 10000 10468 trf 789

-- this reports the first line 10x $ ./grabix grab simrep.chr1.bed.gz 100 110 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789 chr1 10000 10468 trf 789

$ ./grabix random simrep.chr1.bed.gz 10 Segmentation fault

some debugging info

$ uname -a Linux gamma01.local 2.6.32-220.13.1.el6.x86_64 #1 SMP Tue Apr 17 23:56:34 BST 2012 x86_64 x86_64 x86_64 GNU/Linux $ ldd grabix linux-vdso.so.1 => (0x00007fff57000000) libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000032cac00000) libz.so.1 => /lib64/libz.so.1 (0x00000032c8c00000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032ca000000) libc.so.6 => /lib64/libc.so.6 (0x00000032c7c00000) libm.so.6 => /lib64/libm.so.6 (0x00000032c8000000) /lib64/ld-linux-x86-64.so.2 (0x00000032c7800000) $ cat /etc/centos-release CentOS release 6.2 (Final)

My C's a little rusty, so hopefully you'll know what's going on far faster than me cheers, Mark

drmjc avatar Jul 31 '13 12:07 drmjc

I am having precisely the same problem as Mark

$> uname -a Linuxn0640.edu 2.6.32-358.6.2.el6.x86_64 #1 SMP Tue May 14 15:48:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux $> ldd grabix linux-vdso.so.1 => (0x00007fffd9bff000) libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000384ae00000) libz.so.1 => /lib64/libz.so.1 (0x00000038fc600000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000384a600000) libc.so.6 => /lib64/libc.so.6 (0x0000003844200000) libm.so.6 => /lib64/libm.so.6 (0x0000003844600000) /lib64/ld-linux-x86-64.so.2 (0x0000003843a00000)

gr8ape avatar Sep 03 '13 18:09 gr8ape

Mark,

A colleague of mine, Daniel Kinnamon, was gracious enough to walk through the gdb debgugger with me, and, after some iniitial frustration, we think we found the problem. A simple adjustment fixes the incorrect output issue under CentOSx64. Seems to be an issue with what C/C++ standard is used during compilation that impacts the grabix index building process. But instead of trying to figure out what compilation options would get it to work, we just adjusted the source to circumvent the issue. (presumably, there is a compiler option that would alter how default declarations work.)

The line immediately below in grabix.cpp intends to set both prev_offset and offset variables to 0. The first variable is set via defaulting to zero, and the second by a direct declaration. But this only happens as intended on some (apparently most) platforms, depending on compiler settings. When we stepped through, prev_offset just contained trash on our system. So changing the line below so that both variables are independently set to zero fixes the issue.

int64_t prev_offset, offset = 0;

while ((status = bgzf_getline(bgzf_fp, '\n', line)) >= 0)
{
    offset = bgzf_tell (bgzf_fp);
    if (line->s[0] != '#')
        break;
    prev_offset = offset;
}
index_file << prev_offset << endl;

Replacing that first line above with the following does the trick. Although note that I wasn't certain if offset was also intended to be of type int64_t as well. The adjustment below seems to work, in any case.

int64_t prev_offset = 0; int64_t offset = 0;

Hopefully this will help others who experienced similar frustrations trying to get gemini up and running...

Cheers, Dale

gr8ape avatar Sep 04 '13 18:09 gr8ape

Thanks all, I will try to incorporate fixes for this and another outstanding bug soon. I have been horribly bogged down with other deadlines. Thanks for the patience.

arq5x avatar Sep 05 '13 01:09 arq5x

Thanks Dale & Aaron, great stuff. that fix sounds like it would fix the random subset mode, but perhaps not the 'always selecting the same first rown N times bug'?

M

drmjc avatar Sep 05 '13 02:09 drmjc

Mark, it actually fixes the "always selecting the same row" issue. it's an odd side-effect of the index being mucked up. I hadn't checked the random pull function b/c I wasn't making use of it. just checked, and this does not fix it. However, I suspect it's a similar declaration issue in a different part of the code. will try to take a look tomorrow

gr8ape avatar Sep 05 '13 02:09 gr8ape

ah ok, sorry I should have looked at the context of your fix!

drmjc avatar Sep 05 '13 10:09 drmjc

I just pushed a change that should address these issues.

arq5x avatar Sep 06 '13 17:09 arq5x

Thanks Aaron

gr8ape avatar Sep 06 '13 18:09 gr8ape

Hi Aaron, thanks for pushing those changes, much appreciated! They've fixed the 'grab' command, but i'm still segfaulting on the random command:

$ grabix random simrep.chr1.bed.gz 10
Segmentation fault: 11

drmjc avatar Sep 08 '13 10:09 drmjc

hi mark, aaron,

I took a quick look through, but I have little experience debugging c code. I tried the previous initialization trick, but no luck this time.

The segfault in random (run on the included sample file and requesting 10 lines) occurs immediately following the last line below within the bgzf.c library, although I'm assuming the actual problem is somewhere upstream and it already has bad memory locations to work with at this point.

beyond my know-how to fix. could be another default initialization issue somewhere else in the code, but I haven't found it. otherwise not sure why this would be specific to Mark and I's systems.

In any event, hope this helps narrow down the problem.

Best, Dale

int bgzf_getline(BGZF *fp, int delim, kstring_t *str)
{
    int l;
    l = 0;
    int state;
    state = 0;
    unsigned char *buf = (unsigned char*)fp->uncompressed_block;
    str->l = 0;
    do {
        if (fp->block_offset >= fp->block_length) {
            if (bgzf_read_block(fp) != 0) { state = -2; break; }
            if (fp->block_length == 0) { state = -1; break; }
        }
        for (l = fp->block_offset; l < fp->block_length && buf[l] != delim; ++l);
        if (l < fp->block_length) state = 1;
        l -= fp->block_offset;
        if (str->l + l + 1 >= str->m) {
            str->m = str->l + l + 2;
            kroundup32(str->m);
        str->s = (char*)realloc(str->s, str->m);

}
        memcpy(str->s + str->l, buf + fp->block_offset, l);

gr8ape avatar Sep 08 '13 17:09 gr8ape

FYI I installed grabix with conda install -c conda-forge -c bioconda grabix on CentOS7 and I cannot reproduce the issue. It's either been fixed or the bioconda folks use a good compiler. Either way, it can solve your problem.

janxkoci avatar Apr 01 '22 12:04 janxkoci