gatb-core icon indicating copy to clipboard operation
gatb-core copied to clipboard

Question regarding queryAbundance(node) function.

Open ddurai opened this issue 6 years ago • 8 comments

Hello all, I am using GATB library for my work. I have a naive question which I have not been able to find an answer for. While using one of the functions to find the abundance of a node (queryAbundance(node)) in the initial dataset, I always get the abundance as 255 for any kmer having abundance greater than 255. Upon looking at the gatb-core source code I found the following template in Graph.cpp:

template<typename Node, typename Edge, typename GraphDataVariant> struct queryAbundance_visitor : public boost::static_visitor { Node& node; queryAbundance_visitor (Node& node) : node(node){} template<size_t span> int operator() (const GraphData& data) const { unsigned long hashIndex = getNodeIndex(data, node); if(hashIndex == ULLONG_MAX) return 0; // node was not found in the mphf unsigned char value = (*(data._abundance)).at(hashIndex); return value; } };

The template is returning an unsigned char value which has a range of 0 to 255. This might be a reason why the value is always 255. But even if I change it to "unsigned int", I still get the same result. Inspired from the snippet given in GATB library (deBruijn26.cpp), I am using the query abundance function as follows:

std::string s = model.toString (itKmer->value()); //itkmer->value is the kmer of length 21 from a read const char* sq = s.c_str(); Node node = graph.buildNode(sq); auto abund = graph.queryAbundance(node);

Am I making a mistake here in my above code?

regards Dilip

ddurai avatar Jan 17 '18 12:01 ddurai

Hi Dilip! yep, thanks for reporting it (and seen your email); will reply shortly

rchikhi avatar Jan 17 '18 15:01 rchikhi

Thank you. (I apologies for bothering you with multiple mails )

ddurai avatar Jan 18 '18 09:01 ddurai

hi Dilip! should be fixed now in the latest master commit (https://github.com/GATB/gatb-core/commit/687629e82c42cdd5c928dcfb3b13823a8abc1789). Previously, abundances in the Graph were always recorded with values at most 255. Guillaume had partly implemented quantized abundance encoding, so that abundances larger than 255 are now encoded with some fixed error. I've activated this feature in the Graph.

If you really need high precision abundances: right now the abundance is fixed at 8-bit precision but I could point to you what to change in the code to have it at a higher precision (it's basically a matter of changing a uint8 to a uint16 somewhere).

In addition, I added a unit test (TestDebruijn::debruijn_large_abundance_query which can be run with CPPUNIT_VERBOSE=1 bin/gatb-core-cppunit TestDebruijn)

rchikhi avatar Jan 25 '18 23:01 rchikhi

Thank you for the fix rayan. Some more details about how abundance is encoded with this 8-bit scheme :

  • from 0 to 70 : step = 1
  • from 70 to 100 : step = 2
  • from 100 to 500 : step = 10
  • from 500 to 1000 : step = 20
  • from 1000 to 5000 : step = 100
  • from 5000 to 10000 : step = 200
  • from 10000 to 50000 : step = 1000

Up to 70 the value returned is exact. Then there is some error. i.e. for example if the code gives you the value 410, the real abundance value is in the interval [405 ; 415] The max is now 50000.

rizkg avatar Jan 26 '18 10:01 rizkg

Hey Guys, Thanks a lot for the fix and the explanation. Now its more clear to me how the abundance function is working. Also at later stages of my work I would be working with abundances of high precision. So it would be good to know where to change in the code

Thanks a lot once again

ddurai avatar Jan 26 '18 14:01 ddurai

Regarding adapting the code to higher precision:

actually since discretization is in place and cannot be disabled, it is a bit harder to change the code to have precise high abundances.

  1. one would need to change these two functions

https://github.com/GATB/gatb-core/blob/963db0289bf228ce004d0cf1521913d22e28151d/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L213

https://github.com/GATB/gatb-core/blob/963db0289bf228ce004d0cf1521913d22e28151d/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L217

to only do return data[hash[key]] and return data[code], respectively.

  1. and that line:

https://github.com/GATB/gatb-core/blob/1682fededda092947dd59a9c18cc0b4f349d5ed9/gatb-core/src/gatb/kmer/impl/MPHFAlgorithm.hpp#L77

Abundance_t=u_int16_t

If it doesn't work, let us know.

rchikhi avatar Mar 12 '18 21:03 rchikhi

Hello,

I too got this problem and tried to solve as described. I created a fork (https://github.com/leoisl/gatb-core) since I did some other changes, but I did not manage to make it work (querying the node's abundance with Node_t::abundance gives the correct abundance, but with GraphTemplate::queryAbundance, no). For example, in TestDebruijn::debruijn_large_abundance_query, this is the output I get (added back some debug stuff):

TestDebruijn::debruijn_large_abundance_query
AAAGAACATGTGAGCAACGGGATAACGCAGG test printing node abundance with it.item().abundance: 999
AACATGTGAGCAACGGGATAACGCAGGAAAG test printing node abundance with it.item().abundance: 999
AACGCAGGAAAGAACATGTGAGCAACGGGAT test printing node abundance with it.item().abundance: 999
AACGGGATAACGCAGGAAAGAACATGTGAGC test printing node abundance with it.item().abundance: 999
AAGAACATGTGAGCAACGGGATAACGCAGGA test printing node abundance with it.item().abundance: 999
ACGCAGGAAAGAACATGTGAGCAACGGGATA test printing node abundance with it.item().abundance: 999
ACGGGATAACGCAGGAAAGAACATGTGAGCA test printing node abundance with it.item().abundance: 999
ATAACGCAGGAAAGAACATGTGAGCAACGGG test printing node abundance with it.item().abundance: 999
AGAACATGTGAGCAACGGGATAACGCAGGAA test printing node abundance with it.item().abundance: 999
AGGAAAGAACATGTGAGCAACGGGATAACGC test printing node abundance with it.item().abundance: 999
CAACGGGATAACGCAGGAAAGAACATGTGAG test printing node abundance with it.item().abundance: 999
CAGGAAAGAACATGTGAGCAACGGGATAACG test printing node abundance with it.item().abundance: 999
CCGTTGCTCACATGTTCTTTCCTGCGTTATC test printing node abundance with it.item().abundance: 999
CTGCGTTATCCCGTTGCTCACATGTTCTTTC test printing node abundance with it.item().abundance: 999
CGCAGGAAAGAACATGTGAGCAACGGGATAA test printing node abundance with it.item().abundance: 999
CGTTGCTCACATGTTCTTTCCTGCGTTATCC test printing node abundance with it.item().abundance: 999
CGGGATAACGCAGGAAAGAACATGTGAGCAA test printing node abundance with it.item().abundance: 1000
TAACGCAGGAAAGAACATGTGAGCAACGGGA test printing node abundance with it.item().abundance: 999
TTTCCTGCGTTATCCCGTTGCTCACATGTTC test printing node abundance with it.item().abundance: 999
TGCGTTATCCCGTTGCTCACATGTTCTTTCC test printing node abundance with it.item().abundance: 999
GCAGGAAAGAACATGTGAGCAACGGGATAAC test printing node abundance with it.item().abundance: 999
GTTGCTCACATGTTCTTTCCTGCGTTATCCC test printing node abundance with it.item().abundance: 999
ACATGTTCTTTCCTGCGTTATCCCGTTGCTC test printing node abundance with it.item().abundance: 999
ATGTTCTTTCCTGCGTTATCCCGTTGCTCAC test printing node abundance with it.item().abundance: 999
ATGTGAGCAACGGGATAACGCAGGAAAGAAC test printing node abundance with it.item().abundance: 999
AGCAACGGGATAACGCAGGAAAGAACATGTG test printing node abundance with it.item().abundance: 999
CATGTTCTTTCCTGCGTTATCCCGTTGCTCA test printing node abundance with it.item().abundance: 999
CATGTGAGCAACGGGATAACGCAGGAAAGAA test printing node abundance with it.item().abundance: 999
TCACATGTTCTTTCCTGCGTTATCCCGTTGC test printing node abundance with it.item().abundance: 999
TGTTCTTTCCTGCGTTATCCCGTTGCTCACA test printing node abundance with it.item().abundance: 999
ACATGTGAGCAACGGGATAACGCAGGAAAGA test printing node abundance with it.item().abundance: 999

TTGCTCACATGTTCTTTCCTGCGTTATCCCG test printing node abundance with graph.queryAbundance(node): 150. Expected abundance:1000

The first abundances were queried using Node_t::abundance and are correct. However, GraphTemplate::queryAbundance return 150 instead of 1000 for kmer TTGCTCACATGTTCTTTCCTGCGTTATCCCG. I tried to debug this, and when graph.queryAbundance(node) is called, it eventually reaches this line in MapMPHF.hpp (that I slightly modified): https://github.com/leoisl/gatb-core/blob/master/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L227

Upon inspecting the values of data when gdb gets to this line, they are incorrectly set:

(gdb) p data
$4 = std::vector of length 31, capacity 31 = {149, 149, 149, 149, 149, 149, 149, 150, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149}

However, I don't think I can progress much from here... Do you have some ideas on how to correctly fill data?

Thanks!

leoisl avatar Mar 15 '18 23:03 leoisl

Hi Leandro,

Sorry for letting this question linger for so long. '149' is actually exactly the discretized abundance code for a real abundance of 999: 999=701+152+4010+2420+19, as per https://github.com/GATB/gatb-core/issues/19#issuecomment-360741787, and summing the terms in bold, 70+15+40+24=149. So the values in data are correctly set. If you're getting 150 instead of 1000 (where 150 is actually the correct encoding for 1000), something is wrong with the decoding of the discretized abundance. If you're still interested in figuring out this bug, I'd recommend checking for wrong memory accesses using valgrind.

rchikhi avatar Dec 25 '18 20:12 rchikhi