[REVIEW]: btllib: A C++ library with Python interface for efficient genomic sequence processing
Submitting author: @vlad0x00 (Vladimir Nikolic) Repository: https://github.com/bcgsc/btllib Branch with paper.md (empty if default branch): paper Version: v1.4.4 Editor: @fboehm Reviewers: @pjb7687, @MaximLippeveld Archive: Pending
Status
Status badge code:
HTML: <a href="https://joss.theoj.org/papers/685dffcfe8c91237f914eed277c7a31a"><img src="https://joss.theoj.org/papers/685dffcfe8c91237f914eed277c7a31a/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/685dffcfe8c91237f914eed277c7a31a/status.svg)](https://joss.theoj.org/papers/685dffcfe8c91237f914eed277c7a31a)
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@pjb7687 & @MaximLippeveld, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:
@editorialbot generate my checklist
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fboehm know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Software report:
```
github.com/AlDanial/cloc v 1.88  T=0.40 s (912.4 files/s, 248611.4 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
C++                              38           5059            612          47852
HTML                            206            921            796          31496
C/C++ Header                     24            787           1856           4320
CSS                               3            342             54           1709
JavaScript                       57            104            145           1278
YAML                              4             26              7            368
Bourne Again Shell               13             83             15            315
SWIG                              3             49              1            269
Python                            4             99             35            198
Meson                             6             48             16            163
Markdown                          3             22              0            137
SVG                               3              1              1            129
TeX                               1             15              0            125
--------------------------------------------------------------------------------
SUM:                            365           7556           3538          88359
--------------------------------------------------------------------------------
```
gitinspector failed to run statistical information for the repository
Wordcount for paper.md is 1023
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/bioinformatics/btp163 is OK
- 10.1093/bioinformatics/btw397 is OK
- 10.1073/pnas.1903436117 is OK
- 10.1093/bioinformatics/btaa253 is OK
- 10.1186/s12859-021-04451-7 is OK
- 10.3390/dna2020009 is OK
- 10.1093/gigascience/giz109 is OK
MISSING DOIs
- 10.1016/j.jbiotec.2017.07.017 may be a valid DOI for title: The SeqAn C++ template library for efficient sequence analysis: A resource for programmers
INVALID DOIs
- None
:page_facing_up: Download article proof · View article proof on GitHub :page_facing_up:
Review checklist for @MaximLippeveld
Conflict of interest
- [x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.
Code of Conduct
- [x] I confirm that I read and will adhere to the JOSS code of conduct.
General checks
- [x] Repository: Is the source code for this software available at https://github.com/bcgsc/btllib?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
- [x] Contribution and authorship: Has the submitting author (@vlad0x00) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
- [x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?
- [x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
- [x] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
- [x] Human and animal research: If the paper contains original research on human subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.
Functionality
- [x] Installation: Does installation proceed as outlined in the documentation?
- [x] Functionality: Have the functional claims of the software been confirmed?
- [x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
Documentation
- [x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
- [x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
- [x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
- [x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
- [x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
- [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Software paper
- [x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
- [x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
- [x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
- [x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
- [x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
Review checklist for @pjb7687
Conflict of interest
- [x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.
Code of Conduct
- [x] I confirm that I read and will adhere to the JOSS code of conduct.
General checks
- [x] Repository: Is the source code for this software available at https://github.com/bcgsc/btllib?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
- [x] Contribution and authorship: Has the submitting author (@vlad0x00) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
- [x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?
- [x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
- [x] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
- [x] Human and animal research: If the paper contains original research on human subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.
Functionality
- [x] Installation: Does installation proceed as outlined in the documentation?
- [x] Functionality: Have the functional claims of the software been confirmed?
- [x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
Documentation
- [x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
- [x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
- [x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
- [x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
- [x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
- [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Software paper
- [x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
- [x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
- [x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
- [x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
- [x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
I have finished my initial review. I like the overall quality of the software and the paper :+1: I did not check the 'reproducibility' and 'performance' boxes yet.
Concerning the reproducibility, I was not able to find how to reproduce figure 1A and 1B from the paper. Are there any instructions I missed?
Concerning the performance claims, it is stated that ntHash implementation is "A very efficient DNA/RNA rolling hash function, an order of magnitude faster than the best performing alternatives in typical use cases". No evidence seems to be given to support this claim.
I have also finished my initial review. To me, the library seems to be a dedicated library for internal use rather than a general bioinformatics library. That said, the code quality is overall very good and the source code is well maintained on GitHub. The package will be useful for some use cases, e.g. sequencing-data I/O or algorithms based on Bloom filters, and other groups could be interested in using it, at least thanks to its fast I/O speed.
However, there are several checkboxes I didn't tick. Below I explain the concerns.
- One of the main improvements the authors claim is the Python wrapper, but there seems to be no Python API documentation. (Functionality documentation)
- Not all functions seem to be exported to Python. For example, I can't find CountingBloomFilter in the Python library. (Functionality)
- The authors claim that btllib is faster than Heng Li's kseq (figure panel B). That's nice; it would be good if the authors provided the code used to test it. (Reproducibility and Performance)
I also have a minor concern:
- There are only 3 examples available. It would be really nice if the authors added more.
@vlad0x00 - the reviewers have completed their initial reviews. Please feel free to reply to their concerns in this thread.
Thank you @MaximLippeveld and @pjb7687 for your reviews! Here are my responses to your concerns:
Concerning the reproducibility, I was not able to find how to reproduce figure 1A and 1B from the paper. Are there any instructions I missed?
I am currently in the process of uploading the necessary code to reproduce the figures; I'll post here once the code is available in the btllib repository.
Concerning the performance claims, it is stated that ntHash implementation is "A very efficient DNA/RNA rolling hash function, an order of magnitude faster than the best performing alternatives in typical use cases". No evidence seems to be given to support this claim.
The claim is from the ntHash publication (https://academic.oup.com/bioinformatics/article/32/22/3492/2525588), which has been cited in the btllib manuscript. The btllib library currently has the latest implementation of the ntHash function. This hash function is still actively developed and there have been further speed improvements in the latest publication: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac564/6674501?login=false
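As background on why a rolling hash is fast: each window's hash is derived from the previous one in O(1) instead of rehashing all k bases. The sketch below is a generic polynomial rolling hash in Python, purely illustrative of the idea; it is not ntHash's algorithm (ntHash uses a different, DNA-specific scheme described in the cited paper):

```python
# Illustrative polynomial rolling hash over DNA k-mers -- NOT ntHash,
# just a sketch of why rolling updates are O(1) per window.
BASE = 4
MOD = (1 << 61) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def roll_hashes(seq, k):
    """Yield a hash for every k-mer of seq, updating in O(1) per step."""
    if len(seq) < k:
        return
    h = 0
    for ch in seq[:k]:                       # hash the first window from scratch
        h = (h * BASE + CODE[ch]) % MOD
    yield h
    top = pow(BASE, k - 1, MOD)              # weight of the outgoing base
    for i in range(k, len(seq)):
        h = (h - CODE[seq[i - k]] * top) % MOD   # drop the outgoing base
        h = (h * BASE + CODE[seq[i]]) % MOD      # add the incoming base
        yield h

hashes = list(roll_hashes("ACGTACGT", 4))    # 5 k-mers; first and last are both "ACGT"
```

Recomputing each k-mer from scratch would cost O(k) per window, which is where the order-of-magnitude difference on long sequences comes from.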
One of the main improvements that the authors claimed is the Python wrapper, but there seems to be no Python API documentation. (Functionality documentation)
In the btllib repository README, we state: "The wrappers correspond one-to-one with C++ code so any functions and classes can be used under the same name. The only exceptions are nested classes which are prefixed with outer class name (e.g. btllib::SeqReader::Flag in C++ versus btllib.SeqReaderFlag in Python), and (Kmer)CountingBloomFilter which provides CountingBloomFilter8, CountingBloomFilter16, CountingBloomFilter32, KmerCountingBloomFilter8, KmerCountingBloomFilter16, KmerCountingBloomFilter32 with counters 8, 16, and 32 bits wide." So the idea is to expose the same functionality under the same names as in C++, and this has guided the naming of classes and functions in the library.
Not all functions seem to be exported to Python. For example, I can't find "CountingBloomFilter" in Python library. (Functionality)
Thank you for spotting this. I looked into it and noticed counting Bloom filters weren't properly wrapped. This has been fixed in the latest (v1.4.5) version.
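For readers unfamiliar with the data structure: a counting Bloom filter replaces a Bloom filter's bits with small saturating counters, so it can approximate element multiplicities rather than just membership. A minimal illustrative sketch follows; this is not btllib's implementation (btllib offers fixed 8/16/32-bit counter variants), and the class and parameter names here are hypothetical:

```python
import hashlib

class TinyCountingBloomFilter:
    """Minimal counting Bloom filter sketch (not btllib's implementation).

    Insert bumps k counters chosen by k hash functions; a count query
    returns the minimum of those k counters (an upper bound on the
    true count, never an undercount)."""

    def __init__(self, m=1024, k=3, counter_max=255):  # 255 mirrors 8-bit counters
        self.m, self.k, self.counter_max = m, k, counter_max
        self.counters = [0] * m

    def _indices(self, item):
        for i in range(self.k):
            # derive k independent hashes by salting blake2b
            digest = hashlib.blake2b(item.encode(),
                                     salt=i.to_bytes(16, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def insert(self, item):
        for idx in self._indices(item):
            if self.counters[idx] < self.counter_max:  # saturate, don't overflow
                self.counters[idx] += 1

    def count(self, item):
        return min(self.counters[idx] for idx in self._indices(item))

cbf = TinyCountingBloomFilter()
cbf.insert("ACGT")
cbf.insert("ACGT")
```

The choice of counter width (8, 16, or 32 bits in btllib's variants) trades memory for the maximum multiplicity that can be recorded before counters saturate.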
The authors claim that btllib is faster than Heng Li's kseq (Figure panel B). That's nice, would be nice if the authors provide the code used to test it. (Reproducibility and Performance)
I will be uploading the code to reproduce the figures in the manuscript. Once it's there, I'll make another comment.
There are only 3 examples available. It would be really nice if the authors add more.
We will be adding more examples as we develop the library further. Currently, users can refer to the documentation of the classes/functions to learn how to use them.
Thank you, @vlad0x00 . @pjb7687 and @MaximLippeveld - please feel free to continue the conversation with the author in this thread. Thanks again!
Hi!
I've added the code to reproduce the figures in the paper branch of btllib repo: https://github.com/bcgsc/btllib/tree/paper
The code can be found in the benchmarks directory. It was tested on a Linux system and requires a recent gcc compiler, plus Python and R to plot the figures. The code tests btllib with up to 128 threads, so to fully reproduce the results your machine would need at least that many CPUs. If you're unable to test up to that many threads, you can edit the benchmarks/run-test-cbf and benchmarks/run-test-input scripts and remove the thread counts you don't want to test. To run the code, run make from within the benchmarks dir. This will generate the data and plot the input_plot.png (SeqReader vs kseq) and cbf_plot.png (CountingBloomFilter) figures.
Since my last comment, I've also updated the figures slightly. The previously generated figures used local data, so I found an online dataset and regenerated them. This dataset is downloaded automatically by the Makefile in the benchmarks dir (it requires about 13 GiB). This has slightly changed the numbers, but the conclusions are the same. SeqReader and kseq now run a simulated workload of 2 ms per read in order to properly highlight the differences. I've also reduced the number of loaded reads for CountingBloomFilter from 250,000 to 50,000, since the previous test was taking too long, and reduced the size of the CountingBloomFilter from 10 GiB to 3 GiB.
Just a quick comment: I updated the scripts in the paper branch to download the latest btllib version. The version downloaded previously had some missing files which could've caused compile issues. If you successfully compiled with the previous paper branch, that's alright; the code should've worked fine, as it was only a compilation-related issue.
@MaximLippeveld and @pjb7687 - once you've had a chance to check the revisions from the authors, please feel free to update the review checklists and to comment here with any additional revisions needed. Thanks again!
I've run the benchmarks.
- I had to change the following rule in the Makefile:

```
test_kseq: test_kseq.cpp
	g++ -std=c++17 -fopenmp -pthread -march=native -O3 -flto -lz -lm $< -o $@ -I/projects/btl/vnikolic/miniconda/include -L/projects/btl/vnikolic/miniconda/lib
```

to:

```
test_kseq: test_kseq.cpp
	g++ -std=c++17 -fopenmp -pthread -march=native -O3 -flto $< -lz -lm -o $@
```

so that it works properly. I've also installed all the required Python and R packages.
- Unfortunately, downloading ERR6668574 from SRA took forever, so I had to run the benchmarks with a FASTQ file that I already have.
- I mainly focused on reproducing the second panel of the figure in the paper, which shows the performance improvement of btllib over kseq. (Result plots from the first and second rounds not shown.) It seems that I couldn't reproduce the figure; I ran the benchmarks twice and got the same result both times.
- The benchmarks seem to measure the total running time of the benchmark programs, not solely the I/O time taken by each algorithm. I noticed that the kseq benchmark code is longer, and I suspect that this was the main reason btllib seemed faster in the benchmark performed by the authors. In my case it was probably negligible, since I use quite fast CPUs (2x Intel Xeon 5220R).
- Finally, I don't fully understand why it is important to test file I/O with so many threads. In my opinion, file I/O is usually capped by the disk's I/O speed, not by the number of threads.
Hi @pjb7687 ,
Thank you for additional feedback! To address your concerns:
I had to change below in Makefile:
Thank you for spotting this; I've removed the misplaced include and library paths from the Makefile in the paper branch.
Unfortunately, it takes forever for me to download ERR6668574 from SRA, I had to run the benchmarks with a FASTQ file that I have.
I've noticed that downloading sequences from there is sometimes slow and sometimes fast.
It seems that I couldn't reproduce the figure. I have run the benchmarks twice and have the same result
If your sequences aren't particularly long, the effect won't be visible, since the time spent in the critical section performing I/O will be short. kseq is very efficient (hence its use for comparison), so when working with short sequences kseq and SeqReader perform about the same. Are you able to calculate the N50 of the sequences and post it here? E.g., you can use QUAST. The sequences used to generate the figure in the manuscript have an N50 of 41,459.
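For reference, N50 can also be computed without QUAST. Below is a short illustrative sketch (not btllib code): N50 is the largest read length L such that reads of length at least L together cover half the total bases.

```python
def n50(lengths):
    """N50: the largest length L such that reads of length >= L
    together cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # this read pushes us past the halfway point
            return length
    return 0

# e.g. lengths 2, 2, 2, 3, 3, 4, 8, 8 total 32 bases;
# sorted descending, 8 + 8 = 16 covers half, so N50 = 8
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```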
The benchmarks seem to measure the total running time of the benchmark programs, not solely the I/O time taken by each algorithm. I noticed that the kseq benchmark code is longer, and I suspect that it was the main reason the btllib seemed to be faster in the benchmark performed by the authors. I think in my case it was negligible since I use quite fast CPUs (2x Intel Xeon 5220R).
SeqReader alone is likely on par with kseq in raw I/O performance, since kseq is implemented very efficiently. What SeqReader does have is an efficient implementation of thread safety when multiple threads read from it. kseq is not thread-safe by default, which is why it needs a bit more code than SeqReader for multiple threads to be able to read from it. A modern bioinformatics application that does a lot of processing on large amounts of data is likely to benefit greatly from using multiple threads, especially if it wants to be scalable.
Finally I don't fully understand why it is important to test file I/O with such a lot of threads. In my opinion, usually file I/O is capped by the disk I/O speed not the number of threads.
Often processing sequences can be parallelized at the level of each sequence, i.e. each sequence is assigned to a thread. A common pattern in my lab's code is to have multiple threads load sequences and then, e.g., store them in a Bloom filter or extract information that might be used for assembly or scaffolding. SeqReader allows you to do this efficiently and, maybe more importantly, with very little code.
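The access pattern described above (several worker threads pulling records from one shared reader, with the fetch step as the critical section) can be sketched in plain Python with a lock-guarded shared iterator. This is an illustration of the pattern only, not btllib's implementation:

```python
import threading

def parallel_consume(records, worker, n_threads=4):
    """Several worker threads pull records from one shared iterator;
    a lock guards the critical section (fetching the next record),
    while per-record work runs in parallel."""
    it = iter(records)
    fetch_lock = threading.Lock()
    results = []
    results_lock = threading.Lock()

    def run():
        while True:
            with fetch_lock:            # critical section: get next record
                rec = next(it, None)
            if rec is None:
                return
            out = worker(rec)           # per-record work, outside the lock
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=run) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# e.g. count G/C bases of each read in parallel
reads = ["ACGT", "GGGG", "ATAT"]
gc_counts = parallel_consume(reads, lambda s: sum(b in "GC" for b in s))
```

With btllib's SeqReader the lock management is handled inside the reader itself, which is what keeps the caller's code short.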
I would like to add, however, that performance is not the most important part of SeqReader. Its most important benefits are ease of use (intuitive syntax and out-of-the-box thread safety) and flexibility (it can read many compression formats and multiple sequence formats: FASTA, FASTQ, SAM). If you think this figure does not add much, I might as well remove it from the manuscript.
@MaximLippeveld and @pjb7687 - Thank you for the thorough discussions! Please see the replies from @vlad0x00 above and feel free to continue the conversation. Please let me know if there is anything that I can do to help. Thanks again!
Thanks for uploading the code to reproduce the figures. I am able to reproduce figure 1A up to 64 threads. I don't have access to a bigger machine.
Figure 1B is also not showing a significant advantage for SeqReader for me, but I would like to rerun it to make sure. In any case, I think figure 1B should stay in the paper, as it at least shows that SeqReader is on par with kseq, which is also good to know.
Since you mention performance is not the most important part, maybe you could also provide a code sample in the paper to show the intuitive syntax? @fboehm are code samples common in JOSS papers?
I've noticed that downloading sequences from there sometimes takes a bit and is fast at other times.
I tried it again yesterday and successfully downloaded the fastq file.
If your sequences aren't particularly long, the effects won't be visible since the time spent in the critical section performing I/O will be short. kseq is very efficient (hence why it's used for comparison) and so when working with short sequences kseq and SeqReader will perform about the same. Are you able to calculate N50 of the sequences and post it here? E.g. you can use QUAST. The sequences used to generate the figure in the manuscript have an N50 of 41,459.
Now I clearly understand the point. My fastq file only contains short reads as it is conventional Illumina sequencing data. I'll test it with the downloaded sequencing data.
I would like to add, however, that the performance aspect is not the most important part of SeqReader. Its most important benefit is ease of use (intuitive syntax and out-of-the-box thread safety), and flexibility (can read many compressed formats and multiple sequence formats (FASTA, FASTQ, SAM). If you think this figure does not add much, I might as well remove it from the manuscript.
Now I believe that it is actually a pretty good point, and the figure should be included in the manuscript. I also would like to ask the authors to add just a short sentence that clearly describes this point, as SeqReader can be particularly useful when one has to deal with long-read data in many threads.
I will post again after I finish the test.
@MaximLippeveld - yes, short examples of code are fairly regular features in JOSS articles.

Yes, now I can reproduce the figure.
As I mentioned in the above comment, I would like to kindly ask the authors to add just one sentence to clarify this. I do not have any further concerns.
Hi!
I've added an example of how to use SeqReader in the manuscript, as well as two examples for ntHash and Counting Bloom Filter. Additionally, I've included a sentence that clarifies SeqReader's use case:
"SeqReader performs particularly well if used by multiple threads when loading long reads, as their length increases the time spent in I/O and thus in the critical section as well."
The paper branch has the latest manuscript, and the following commit shows the changes I've made: https://github.com/bcgsc/btllib/commit/1ad3717c08b85ce26eb5a78b106320943525672a
Let me know if you require any other changes.
Cheers! Vladimir
@MaximLippeveld - please feel free to continue the conversation once you've reviewed the latest changes from the authors.
Thanks again!
@editorialbot generate pdf
:page_facing_up: Download article proof · View article proof on GitHub :page_facing_up:
Thanks for adding the code samples. I have one final remark concerning the Design & Implementation section. It is not very clear to me whether the listed algorithms and data structures are reimplementations of existing software or all original pieces of code. I assume the Bloom and counting Bloom filters and the sequence I/O are original implementations, and ntHash, the multi-index Bloom filter, and Indexlr are reimplementations. Is this correct? If they are reimplementations, were the algorithms only ported from one language to another, or were they also optimized for scalability?
Except for the sequence I/O, all of the code in btllib started out in different projects in our lab. To make it easier to reuse, we moved the commonly used code into btllib, which now holds the reference/latest implementations of the listed algorithms. The code wasn't just moved, though; most of it received improvements in efficiency, portability, ease of use, or readability. For some of the algorithms/data structures, the original project is cited in the btllib manuscript (ntHash, multi-index Bloom filter, Indexlr), whereas for others that weren't published it's not. E.g., the counting Bloom filter wasn't published; our lab simply kept it in its own repository until it was moved to btllib.
Let me know if you need further explanation or if I should modify the manuscript text to reflect any of this.
Ok, thanks for clarifying. Yes, it would be nice if this is reflected in the manuscript. Perhaps you could add in the Design & Implementation section that implementations for ntHash, Multi-index Bloom filter and Indexlr were adapted from previously existing work. Other than that, I have no further remarks!
Hi!
I've updated the paper branch to clarify code origin from the respective publications. You can see the changes in the following commit: https://github.com/bcgsc/btllib/commit/8eb7f38501145e166b35b536ddd90d80c7f219ed
Let me know if this is satisfactory and if there are other changes I need to make.
Cheers! Vladimir