
dbBact's wordclouds for Qiita

Open sjanssen2 opened this issue 5 months ago • 1 comments

In my collaborations, I often encounter situations where PhD candidates are volunteered by their PIs to also handle amplicon analysis but are total microbiome newbies. The situation can be further complicated because others might have done the sample collection, sequencing was outsourced, ... Without any experience, they now have to sanity check whether the sequencing was successful, and PIs too often love to take shortcuts, like expired flowcells, excessive multiplexing, ... Sanity checking is extremely hard, if not impossible, without having seen many OK-ish datasets.

I was recently contacted by Amnon and colleagues, who finally published dbBact. In a nutshell: they collect expert knowledge for individual ASV sequences. I found their wordclouds (i.e. enriched terms of the ASVs in a feature table, or more precisely its rep set) extremely helpful for characterizing a sequencing run / prep without relying on the metadata, which are sadly too often wrong or incomplete. It is quite easy to see if a prep holds samples from e.g. mouse or soil or ...

Therefore, I'd like to integrate these wordclouds into Qiita and wonder what the best strategy is? Here are my thoughts:

  • In principle, I assume minimal knowledge / experience on the user's side and therefore intend to present these images prominently, without much action required from the user.
  • dbBact is a database and will change over time; how can we ensure reproducibility?
  • Amnon already created an API endpoint to which we can send a set of ASV sequences and receive F-scores for the terms that make up the word cloud. This is handled by their server. Is it OK to rely on this external resource, or is it better to have a DB dump or flat file in a plugin?
https://dbbact.org/sequences_fscores

To use it, just supply the JSON parameter 'sequences', which is a list of sequences (ACGT strings that start at one of the supported dbBact regions).

example:
import requests

seqs = ['TACGGAGGGTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTTTAAGTCTGGGGTGAAAGCCCGTTGCTCAACAACGGAACTGCCCTGGAAACTGGAGAGCTTGAGTACAGACGAGGGTGGCGGAATGGACGG']
res = requests.get('https://dbbact.org/sequences_fscores', json={'sequences': seqs})
print(res.json())
  • we could extend the qp-deblur plugin to also produce word clouds, but I am rather hesitant, as they are not really a processing result
  • if we implement the word clouds as a new plugin, how can we ensure that users actually perform this action? Maybe as part of a default workflow?
  • is it worth caching the generated images, or would on-demand API calls suffice?
  • where in Qiita should we present the results? Here are three suggestions:
    1. show a word cloud prominently on the study summary page for every 16S/18S prep that has been processed with deblur (mockup screenshot)
    2. show one word cloud below every 16S/18S prep that has been processed with deblur (mockup screenshot)
    3. strictly stick to the plugin architecture and show it as a visualization summary for the according artifact (mockup screenshot)
  • how would we handle database updates?
    • deprecate the existing word clouds and automatically generate new ones?
    • present a sync option to the user to manually trigger updates?
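To make the idea above more concrete, here is a rough sketch of how the F-scores from the endpoint could be turned into a word-cloud image. The exact shape of the JSON response (a term-to-F-score mapping) is an assumption based on the description above, and the `wordcloud` library is just one possible renderer, not part of Qiita or dbBact:

```python
# Sketch only: turn dbBact term F-scores into a word-cloud image.
# Assumes /sequences_fscores returns a JSON dict of term -> F-score.


def build_frequencies(fscores: dict, min_f: float = 0.0) -> dict:
    """Pure helper: drop terms whose F-score is at or below `min_f`."""
    return {term: score for term, score in fscores.items() if score > min_f}


def render_dbbact_wordcloud(sequences, out_png='wordcloud.png', min_f=0.0):
    # Third-party imports are local so the pure helper above can be used
    # without `requests` or `wordcloud` installed.
    import requests
    from wordcloud import WordCloud

    res = requests.get('https://dbbact.org/sequences_fscores',
                       json={'sequences': sequences})
    res.raise_for_status()
    freqs = build_frequencies(res.json(), min_f=min_f)
    wc = WordCloud(width=800, height=400, background_color='white')
    wc.generate_from_frequencies(freqs)  # term weights drive font sizes
    wc.to_file(out_png)
    return freqs
```

Splitting the filtering out as a pure function would also make it easy to unit-test the plugin without hitting the dbBact server.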
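One way to address reproducibility, caching, and database updates at once could be a disk cache keyed by both the query and the dbBact database version: identical queries against the same dbBact version reuse the stored result, while a version bump naturally produces new word clouds. This is a hypothetical sketch; the function names and the assumption that dbBact exposes a version string are mine, not part of any existing API:

```python
# Sketch only: file-based cache for dbBact F-score responses, keyed by
# the ASV set and a (assumed) dbBact database version string.
import hashlib
import json
from pathlib import Path


def cache_key(sequences, dbbact_version: str) -> str:
    """Deterministic key: same ASVs + same dbBact version -> same key."""
    payload = json.dumps({'v': dbbact_version, 'seqs': sorted(sequences)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cached_fscores(sequences, dbbact_version, fetch, cache_dir='fscore_cache'):
    """Return cached F-scores, or call `fetch(sequences)` and store the result."""
    path = Path(cache_dir) / f'{cache_key(sequences, dbbact_version)}.json'
    if path.exists():
        return json.loads(path.read_text())
    fscores = fetch(sequences)  # e.g. the API call shown above
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(fscores))
    return fscores
```

With such a scheme, "deprecate and regenerate" versus "manual sync" becomes a question of when to re-run the fetch for a newer version, while the old cached images remain reproducible.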

I'd be happy to know your opinion @antgonza before I start implementing. Thank you!

sjanssen2, Mar 03 '24 21:03