fastDNA
print-word-vectors on kallisto branch
Hi,
I am trying your software on the kallisto branch (which seems very promising) and have a couple of questions:
- At the moment, the need to build a kallisto index and load it into memory for training can (very) quickly become impractical for larger DBs due to RAM limitations. Do you have any plans to improve or change that part?
- Whereas most of the fastdna methods take a loadIndex parameter (on the kallisto branch), print-word-vectors does not. I just want to make sure that the embeddings output by this method are the contig embeddings presented in the related paper.
Thanks
Hello @manock,
Those are two very good points! Indeed, the kallisto index becomes very large for larger DBs; in fact, on the large dataset of the paper I was not able to build the index for 17- and 19-mers. Did you manage to build one for k=31? The index fits in memory once built, but RAM overflows while building it. One solution might be to build several de Bruijn graphs on chunks of the data and then merge them. I am currently trying this out with larger datasets; if I find a solution I will let you know.
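As a rough illustration of the chunk-and-merge idea, here is a minimal toy sketch in Python. It is not kallisto's index format or build procedure: the graph representation (a dict from each (k-1)-mer to the set of its successors), the helper names and the example sequences are all assumptions made for illustration. It only shows that graphs built on separate chunks can be merged into the same graph as a single build. In this toy version everything stays in memory, so any real memory saving would depend on writing chunk graphs to disk and merging them incrementally.

```python
from itertools import islice


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(sequences, k):
    """Toy de Bruijn graph: map each (k-1)-mer node to the set of its successors."""
    graph = {}
    for seq in sequences:
        for kmer in kmers(seq, k):
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph


def merge_graphs(graphs):
    """Merge chunk graphs by taking the union of successor sets per node."""
    merged = {}
    for graph in graphs:
        for node, successors in graph.items():
            merged.setdefault(node, set()).update(successors)
    return merged


def chunked(items, size):
    """Split an iterable into lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


# Made-up sequences, for illustration only.
sequences = ["ACGTACGT", "CGTACGGA", "TTACGTAC", "GGACGTTA"]
k = 4

# Build one graph per chunk of two sequences, then merge the chunk graphs.
merged = merge_graphs(build_graph(chunk, k) for chunk in chunked(sequences, 2))

# The merged graph matches the graph built from all sequences at once.
print(merged == build_graph(sequences, k))
```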
Indeed the print-word-vectors function is not yet implemented on the kallisto branch. I will try to push something in the next couple of days.
Thank you for the feedback,
Romain
Hi,
> Did you manage to build one for k=31?
I gave up building an index on bacterial genomes. I am now trying virus genomes, which are much smaller. However, the predictions are always the same, with probabilities that are very close to one another (within a range of about 1e-5). I tried removing the predicted species and changing some parameters (k), but the problem persists. When I tried the print-word-vectors function, I noticed that the embeddings it returned were all very close to each other. Could that be related to this problem?
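One way to quantify "all very close" is to look at the range of pairwise cosine similarities between the embeddings. The sketch below is a generic check in plain Python, not part of fastDNA: the function names and the example vectors are made up for illustration, and it assumes the embeddings have already been parsed into lists of floats.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_range(vectors):
    """Return (min, max) cosine similarity over all pairs of vectors."""
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return min(sims), max(sims)


# Made-up vectors: one set that is nearly identical, one that is spread out.
collapsed = [[1.0, 2.0, 3.0], [1.0001, 2.0, 3.0], [1.0, 2.0001, 3.0]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

print("collapsed:", similarity_range(collapsed))
print("spread:", similarity_range(spread))
```

If every pair has a similarity very close to 1, the embeddings carry almost no discriminative information, which would be consistent with near-identical prediction probabilities.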
> I will try to push something in the next couple of days.
Great, I have a good use for embeddings.
Thanks.