single-cell-best-practices
Changed regex for calculation of percent hemoglobin genes
Dear Theis lab,
thank you a lot for your very helpful book and tutorials.
I am currently performing my first analysis of scRNAseq data. During step 6.3 (filtering low quality reads) I wanted to understand the regex for filtering hemoglobin genes ("^HB[^(P)]"). I noticed that this regex matches not only hemoglobin genes but also the genes HBEGF and HBS1L; of the non-hemoglobin genes starting with "HB", it only excludes HBP1.
I was trying to find a more specific regex to match only the hemoglobin genes, with some help from stackoverflow. I'd suggest `"^HB(?!EGF|S1L|P1).+"`, which I changed in the Jupyter notebook; an alternative might be `"^HB[^(P|S)]($|[^G])"`.
This applies to human data; however, we briefly confirmed that these regexes are also applicable (with lowercase characters) to mouse data.
Please correct me if I am wrong and the original regex performs in the way intended by you. In this case, I would suggest extending the documentation for clarification.
Best,
Kristina
edit: added code backticks to the suggested regexes for correct display
Dear @KriBaLin ,
thank you!
`^HB(?!EGF|S1L|P1).+` seems a bit specific, and I'm worried that there might be other genes that we're not excluding as false positives here. Is this an unjustified fear on my part?
So `^HB[^(P|S)]` (which I think is equivalent to `^HB[^PS]`?) might be a more appealing option if this is the case. Note that this would still match `HBEGF`...
What do you think?
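On the equivalence question: inside square brackets, `(`, `|` and `)` are literal characters rather than grouping or alternation, so `[^(P|S)]` excludes `P`, `S` and those three symbols. Gene symbols should not contain them, so for this purpose it behaves like `[^PS]`. A small sketch with Python's `re` module, using gene names from this thread and assuming search semantics:

```python
import re

# Three non-hemoglobin genes from this thread plus one hemoglobin gene.
genes = ["HBEGF", "HBS1L", "HBP1", "HBB"]

with_parens = re.compile(r"^HB[^(P|S)]")
plain = re.compile(r"^HB[^PS]")

for gene in genes:
    a = bool(with_parens.search(gene))
    b = bool(plain.search(gene))
    print(f"{gene}: [^(P|S)] -> {a}, [^PS] -> {b}")

# The two classes only differ on the literal characters "(", "|" and ")".
# "HB|X" is an artificial string, not a gene symbol.
print(bool(with_parens.search("HB|X")), bool(plain.search("HB|X")))
```

Both patterns reject HBS1L and HBP1 but still accept HBEGF, which is why the longer variant adds `($|[^G])`.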
Dear @Zethson,
thank you for your fast reply.
Sorry, there was a formatting mistake in my first post that turned the suggested `"^HB[^(P|S)]($|[^G])"` into a wrong `"^HB[^(P|S)]"`. I have edited the post now.
Regarding the expression being too specific, I'm honestly not experienced enough to judge this with respect to future changes of gene annotations or the like. Currently, when I search the 36601 genes of my human data set for genes starting with "HB", I get 13 hits: HBEGF, HBS1L, HBP1, HBB, HBD, HBG1, HBG2, HBE1, HBZ, HBM, HBA2, HBA1, HBQ1.
The first 3 don't seem to be hemoglobin genes. The regex `"^HB[^(P)]"` only excludes HBP1, whilst `"^HB(?!EGF|S1L|P1).+"` and `"^HB[^(P|S)]($|[^G])"` exclude all three. The former regex might be a bit easier to understand.
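The comparison can be reproduced with a short sketch. It assumes the patterns are applied with search semantics (as pandas' `str.contains` does) and uses only the 13 gene symbols listed above. The dictionary keys and helper names are made up for illustration.

```python
import re

# The 13 "HB"-prefixed gene symbols reported above.
hb_prefixed = ["HBEGF", "HBS1L", "HBP1", "HBB", "HBD", "HBG1", "HBG2",
               "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1"]
non_hemoglobin = {"HBEGF", "HBS1L", "HBP1"}

patterns = {
    "original": r"^HB[^(P)]",
    "lookahead": r"^HB(?!EGF|S1L|P1).+",
    "char_class": r"^HB[^(P|S)]($|[^G])",
}

def false_positives(pattern):
    """Non-hemoglobin genes that the pattern still matches."""
    regex = re.compile(pattern)
    return sorted(g for g in hb_prefixed
                  if g in non_hemoglobin and regex.search(g))

def missed(pattern):
    """Hemoglobin genes that the pattern fails to match."""
    regex = re.compile(pattern)
    return sorted(g for g in hb_prefixed
                  if g not in non_hemoglobin and not regex.search(g))

for name, pattern in patterns.items():
    print(name, "false positives:", false_positives(pattern),
          "missed:", missed(pattern))
```

On this gene list, only the original pattern yields false positives (HBEGF and HBS1L), and none of the three patterns misses a hemoglobin gene.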
A (maybe more robust?) option could be to explicitly check for a list of hemoglobin genes - as suggested by Konrad Rudolph on stackoverflow.
Guess one could look at Ensembl gene symbols to see how this regex would affect them. A list of genes is also possible but then we'd need the list ^_^
I agree with @klmr that an explicit list is preferable over a regex. Not sure what a "trusted source" of hemoglobin genes would be, but results 1-10 from this GeneCards search are probably a good start. At least the list certainly doesn't include anything unexpected.
Thank you very much @grst for the link! We'll make the changes accordingly using the list.
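For anyone wanting to try the explicit-list approach, here is a minimal sketch. The list below is simply the ten hemoglobin genes among the 13 "HB"-prefixed symbols reported earlier in this thread; it is an assumption that this matches the GeneCards results, and `is_hemoglobin` is a hypothetical helper name.

```python
# Assumed list: the ten hemoglobin genes from the 13 "HB" hits in this thread.
HEMOGLOBIN_GENES = frozenset({
    "HBA1", "HBA2", "HBB", "HBD", "HBE1",
    "HBG1", "HBG2", "HBM", "HBQ1", "HBZ",
})

def is_hemoglobin(symbols):
    """Return one boolean per gene symbol: True if it is in the explicit list."""
    return [symbol in HEMOGLOBIN_GENES for symbol in symbols]

gene_symbols = ["HBEGF", "HBS1L", "HBP1", "HBB", "HBA1", "ACTB"]
mask = is_hemoglobin(gene_symbols)
print(dict(zip(gene_symbols, mask)))
```

With an AnnData object, something along the lines of `adata.var_names.isin(sorted(HEMOGLOBIN_GENES))` should give the same kind of boolean mask, although that is untested here.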
I found this helpful, but would like to note for anyone else looking around for this that the pseudogenes in the mouse genome are denoted with a -p (e.g., Hba-ps4). I went with `"^Hb[abdegmqz]-(?!p)|Hb[abdegmqz][0-9][a-z]"` to match this list.
I am planning to use `"^HB[ABDEGMQZ]\d*(?!\w)"` to match human. It seems that hemoglobin gene names use a fixed set of letters (standing for Greek letters), possibly followed by a number.
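A quick check of both patterns, as a sketch under a few assumptions: search semantics, the human symbols from the 13 hits listed earlier in this thread, and a handful of mouse symbols chosen for illustration. Only Hba-ps4 comes from this thread. Hba-a1, Hbb-bs and Hbq1a are assumed examples of mouse hemoglobin genes, and the lowercase non-hemoglobin names follow the earlier remark about mouse data.

```python
import re

human = re.compile(r"^HB[ABDEGMQZ]\d*(?!\w)")
mouse = re.compile(r"^Hb[abdegmqz]-(?!p)|Hb[abdegmqz][0-9][a-z]")

# The 13 human "HB" hits from earlier in this thread.
human_symbols = ["HBEGF", "HBS1L", "HBP1", "HBB", "HBD", "HBG1", "HBG2",
                 "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1"]

# Illustrative mouse symbols (assumed, except Hba-ps4 which is quoted above).
mouse_symbols = ["Hba-a1", "Hbb-bs", "Hbq1a", "Hba-ps4",
                 "Hbegf", "Hbs1l", "Hbp1"]

human_hits = [s for s in human_symbols if human.search(s)]
mouse_hits = [s for s in mouse_symbols if mouse.search(s)]

print("human:", human_hits)
print("mouse:", mouse_hits)
```

On these inputs the human pattern keeps the ten hemoglobin genes and drops HBEGF, HBS1L and HBP1, and the mouse pattern drops Hba-ps4 along with the three non-hemoglobin genes. Note that the second alternative of the mouse pattern has no `^` anchor, so under search semantics it can match in the middle of a symbol; wrapping both alternatives in a group after a single `^` would avoid that.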