Library and Public Headers

Open Robin-Rounthwaite opened this issue 1 year ago • 12 comments

Hello! Would you be able to add a library and public headers to kalign, so that I can use your aligner from within my cpp project?

I'm developing a tool within VG that simplifies certain kinds of complex snarls by extracting the haplotypes they represent and realigning them. The goal is to improve mapping accuracy and variant calling in these snarls.

Kalign seems like a good fit for the job, since there are many of these snarls, and there are frequently >100 unique haplotypes in a single MSA.

But the only practical way to integrate kalign's MSA into VG requires kalign to be a library. Would that be something you are willing to add?

Robin-Rounthwaite avatar Aug 09 '22 22:08 Robin-Rounthwaite

Hi, I believe this should be fairly straightforward to implement. The main library function would take as input a char** for the sequences, an int* for the sequence lengths and optionally a DNA / Protein argument, and return the aligned sequences (char**) and the alignment length. Is this all that you would need?

TimoLassmann avatar Aug 10 '22 00:08 TimoLassmann

It is! That would be excellent, thank you.

Robin-Rounthwaite avatar Aug 10 '22 18:08 Robin-Rounthwaite

This is exactly what I would need. Unfortunately, my project is currently blocked by this task. Is this something you would be able to implement in the next week or two? Or would I need to set this up myself?

Robin-Rounthwaite avatar Aug 10 '22 18:08 Robin-Rounthwaite

Oh dear deadlines ;) I should be able to pull something together within the next week. I might take this opportunity to switch to cmake...

TimoLassmann avatar Aug 11 '22 00:08 TimoLassmann

Thank you!

Robin-Rounthwaite avatar Aug 11 '22 21:08 Robin-Rounthwaite

Hi, I added a new branch "cmake" and added a kalign library with the function: int kalign(char **seq, int *len, int numseq, char ***aligned, int *out_aln_len); A small test program illustrating how to use it (lib_test.c) is also included.

Two small remaining issues:

  1. kalign uses lower terminal gap extension penalties - for your application it might be more reasonable to set these equal to the internal penalties
  2. currently no threads are used when calling kalign via the library function.

Hope this addresses your query.
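For a C++ caller, the sequences usually live in std::string objects, so they first have to be exposed as the char** / int* pair that the signature above takes. The following is only a sketch of that marshalling step: RawSeqs and to_raw are hypothetical helper names, not part of kalign, and the actual library call is left as a comment because its return value and the ownership of the output buffer are not described in this thread (lib_test.c is the place to check).

```cpp
#include <string>
#include <vector>

// Raw views over a set of sequences, in the shape the library call expects.
// The pointers refer into the caller's strings, so those strings must outlive
// this struct and must not be resized while it is in use.
struct RawSeqs {
    std::vector<char*> seq;  // one pointer per sequence; seq.data() is a char**
    std::vector<int> len;    // matching lengths; len.data() is an int*
};

// Hypothetical helper (not part of kalign): build the raw views.
inline RawSeqs to_raw(std::vector<std::string>& seqs) {
    RawSeqs raw;
    raw.seq.reserve(seqs.size());
    raw.len.reserve(seqs.size());
    for (std::string& s : seqs) {
        raw.seq.push_back(&s[0]);  // mutable, NUL-terminated buffer (C++11)
        raw.len.push_back(static_cast<int>(s.size()));
    }
    return raw;
}

// Intended use, assuming the five-argument signature quoted above:
//   char** aligned = nullptr;
//   int aln_len = 0;
//   int rc = kalign(raw.seq.data(), raw.len.data(),
//                   static_cast<int>(raw.seq.size()), &aligned, &aln_len);
```

Keeping the pointers and lengths in two vectors avoids manual allocation on the input side; only the output buffer returned by the library needs separate handling.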

TimoLassmann avatar Aug 22 '22 00:08 TimoLassmann

Thank you for your kind help. I believe my query is addressed.

Robin-Rounthwaite avatar Aug 23 '22 18:08 Robin-Rounthwaite

Great. Let me know what works, what doesn't and if there is a need to change the default gap extension parameters. Happy aligning! /T

TimoLassmann avatar Aug 24 '22 00:08 TimoLassmann

Hi Dr. Timo Lassmann, I realize that I do in fact need to change something like the default gap extension parameters. In short, I need to somehow prohibit gaps at the ends of alignments.

For example,

GATTACA
GATTA--

is forbidden, but

GATTACA
GATT--A

is fine.

Similarly,

GAGTACA
--GTACA

is forbidden, but

GAGTACA
G--TACA

is fine.

There are a couple of ways of doing this. Being able to raise the terminal gap extension cost prohibitively high should work great, and it sounds like there's an easy way to do that.

Alternatively, I could replace the start and end of the string with special characters that have an extremely high match score, e.g. the input strings GATTACA GATTA become XATTACX XATTX, and are forced to align as XATTACX XATT - - X . But I don't think I'm permitted to make character-specific match scores right now.

Would you be willing to enable me to use one of these methods?

Robin-Rounthwaite avatar Aug 27 '22 18:08 Robin-Rounthwaite

Multithreading the alignment may also be important for my project. It would certainly be helpful. I'm currently switching from Seqan/T-Coffee to Kalign because the former is too slow for my purposes.

I should be close to fixing some issues I had with linking Kalign to VG. Once I'm done with that, I'll measure how important the speedup from multithreading would be - whether it would be essential or merely helpful.

Robin-Rounthwaite avatar Aug 27 '22 18:08 Robin-Rounthwaite

Hi, I just saw your messages. I can address these queries by next week without too much hassle. Thanks, T

TimoLassmann avatar Sep 01 '22 01:09 TimoLassmann

Thank you!

After poking around with integrating liblibkalign into VG, I have some general feedback for the library. Also, I just recently realized that I don't necessarily need a prohibition on the end-gaps. Having an equivalent terminal gap score to the internal gap score would be fine.

(This is because I can simply crop off the first and last character of the two strings being aligned, and then re-add them after the fact. This preserves terminal characters on the ends of the strings.)
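The crop-and-restore idea can be sketched as two small helpers. This is an illustration only: crop_ends and restore_ends are hypothetical names, the aligner call between them is omitted, and the sketch assumes every sequence has at least two characters and that the aligner returns rows in the same order as its input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Remove the first and last character of each sequence before alignment.
// Assumes every sequence has at least two characters.
inline std::vector<std::string> crop_ends(const std::vector<std::string>& seqs) {
    std::vector<std::string> inner;
    inner.reserve(seqs.size());
    for (const std::string& s : seqs) {
        assert(s.size() >= 2);
        inner.push_back(s.substr(1, s.size() - 2));
    }
    return inner;
}

// Re-attach the original first and last characters to each aligned row.
// Any gaps at the ends of the inner alignment become internal gaps.
inline std::vector<std::string> restore_ends(
        const std::vector<std::string>& originals,
        const std::vector<std::string>& aligned_inner) {
    assert(originals.size() == aligned_inner.size());
    std::vector<std::string> rows;
    rows.reserve(originals.size());
    for (std::size_t i = 0; i < originals.size(); ++i) {
        rows.push_back(originals[i].front() + aligned_inner[i] + originals[i].back());
    }
    return rows;
}
```

With GATTACA and GATTA, cropping gives ATTAC and ATT; if the aligner returns ATTAC and ATT--, restoring the ends yields GATTACA and GATT--A. Because the trailing gap of the inner alignment ends up inside the final row, it is enough for terminal gaps to cost the same as internal ones.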

With regard to liblibkalign:

  • There are a couple of problems with including the library.
  1. The first has to do with including C code in a cpp project, and is easy enough to fix. Basically, placing the first snippet below before the library's function declarations and the second one after them will make using the functions seamless:
#ifdef __cplusplus
extern "C" {
#endif

#ifdef __cplusplus
}
#endif
It's also fairly easy to work around. If the snippets are not already in the library, I just have to add similar code wherever I call your function.
  2. Secondly, it looks like you've created 7 independent .a files for all the kalign subdirectories that the user would have to link to get it to work. Unless you want to ship a bunch of little libraries like libmodule_msa.a, it would be better to have the CMake setup use an object file collection library to make one actually-useful liblibkalign.a with all the pieces I really need, rather than bringing all 7 .a files up through to vg.

    To do this, you would go to all the individual subdirectory CMake files and change the library definitions for those little modules to use OBJECT instead of STATIC, I think.

    My thanks to Dr. Adam Novak for helping me figure out this feedback.

  • To reiterate what I've already said in previous messages:
    • I will greatly appreciate the ability to set the terminal gap extension cost to be the same as the internal gap extension cost.
    • Adding multithreaded alignment will also be helpful.

Wishing you well.

- Robin

Robin-Rounthwaite avatar Sep 01 '22 17:09 Robin-Rounthwaite

Hi,

I re-organized the kalign code in the cmake branch.

There is now a shared kalign library which can be installed and used like this:

if (NOT TARGET kalign::kalign)
  find_package(kalign)
endif()

Alternatively you can add the library as a module like this:

if (NOT TARGET kalign::kalign)
  add_subdirectory(${PROJECT_SOURCE_DIR}/lib build EXCLUDE_FROM_ALL)
endif()

I tested the above and a minimal working example is in tests/kalign_lib_testCXX.cpp (my first c++ program!).

The main library interface function is:

int kalign(
           char **seq,
           int *len,
           int numseq,
           int n_threads,
           int type,        /* see below */
           float gpo,
           float gpe,
           float tgpe,
           char ***aligned,
           int *out_aln_len);

The type argument can be used to set some standard alignment parameters:

  • KALIGN_DNA: match: 5, mismatch: -4, gap open 8, gap extension 6, terminal gaps 0
  • KALIGN_DNA_INTERNAL: match: 5, mismatch: -4, gap open 8, gap extension 6, terminal gaps 8
  • KALIGN_RNA: default parameters for aligning RNA sequences

You can also set these penalties by passing non-negative numbers to gpo, gpe and tgpe arguments. Setting these parameters will override penalties set via the alignment types.

You can enable multi-threading by setting the 'n_threads' argument. However, I would recommend first testing whether kalign with a single thread is fast enough for your purposes.
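The override rule described above (type presets first, then any non-negative gpo, gpe or tgpe replacing the preset) can be pictured with a small stand-alone model. This is not kalign's code: AlnType, GapParams, defaults_for and resolve are hypothetical names, only the two DNA presets listed above are modelled, and treating a negative argument as "keep the preset" is an assumption inferred from the wording about non-negative numbers.

```cpp
// Hypothetical stand-ins for two of the alignment types named above;
// these are not kalign's own constants.
enum class AlnType { Dna, DnaInternal };

struct GapParams {
    float gpo;   // gap open penalty
    float gpe;   // gap extension penalty
    float tgpe;  // terminal gap extension penalty
};

// Presets copied from the values listed in the post.
inline GapParams defaults_for(AlnType type) {
    switch (type) {
        case AlnType::DnaInternal: return {8.0f, 6.0f, 8.0f};
        case AlnType::Dna:
        default:                   return {8.0f, 6.0f, 0.0f};
    }
}

// Start from the preset, then let any non-negative argument override it.
// Assumption: a negative value means "use the preset for this type".
inline GapParams resolve(AlnType type, float gpo, float gpe, float tgpe) {
    GapParams p = defaults_for(type);
    if (gpo >= 0.0f)  p.gpo = gpo;
    if (gpe >= 0.0f)  p.gpe = gpe;
    if (tgpe >= 0.0f) p.tgpe = tgpe;
    return p;
}
```

Under this model, resolve(AlnType::Dna, -1, -1, 6) keeps the DNA gap open and extension penalties and raises the terminal gap extension penalty to 6, which matches the request earlier in the thread for terminal gaps to cost the same as internal ones.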

Let me know if this works for you and if you have any other comments or suggestions.

Thanks, T

TimoLassmann avatar Sep 08 '22 09:09 TimoLassmann

Thank you! This sounds excellent. I'll keep you updated. And congrats on your C++. It's a fun language! A nice middle ground of convenience and efficacy.

Robin-Rounthwaite avatar Sep 08 '22 22:09 Robin-Rounthwaite