mixcr
mixcr copied to clipboard
How to use tag pattern with unknown linker sequence?
Hi,
I am trying to use the tag-pattern option and I would like to to something like the following:
--tag-pattern '^(CELL:N{8})NNUC:N{49}(UMI:N{12})(R1:)^(R2:)'
Where NNUC is made of 49 nucleotides. In reality it is made of the repetition of a linker and the cell barcode, but to simplify I would like to use it as a sequence of 49 nucleotides.
Is there a way to do this?
Thanks a lot in advance
best, Mathilde
Hi Mathilde,
There are two ways to do that depending of what do you want to get at the end:
-
Simply specify 49 nucleotides in between
CELL
andUMI
without being able to export these nucleotides:--tag-pattern '^(CELL:N{8})N{49}(UMI:N{12})(R1:*)\^(R2:*)'
-
If you want to have an access to this group at the end, since you have mentioned that this group is related to CELL barcode, you can specify it as another
CELL
barcode.--tag-pattern '^(CELL:N{8})(CELL2:N{49})(UMI:N{12})(R1:*)\^(R2:*)'
Note: the name of the group should start with
CELL
, and that is the only constrain here. So you can name it:(CELL_NNUC:N{49})
as well.PS. Since this added group
(CELL2:N{49})
is denoted as aCELL
barcode it will be used as one in the analysis. But if I understand you correctly it is in fact aCELL
barcode (meaning its sequence differs between cells and should be equal within a cell) so that should not be a problem. Nevertheless, in one of the upcoming updates we will add a feature that will allow to specify any other named group which will not be used by any algorithms but will be available for export in the output.
Hi,
Thanks a lot for you answer this is very helpful! Actually the exact pattern would be
--tag-pattern '^(CELL:N{8})N{16}(CELL:N{8})N{16}(CELL:N{8})N{1}(UMI:N{12})(R1:*)\^(R2:*)'
Do you think I can use the same group name 'CELL' several times? I do not have more information about the R1, and I guessed it should be the same CELL barcode repeated several times.
But to be safer and to make it easier for you algorithm I thought I could simplify the option with:
--tag-pattern '^(CELL:N{8})N{49}(UMI:N{12})(R1:*)\^(R2:*)'
I'll give it a try with this one and see what I get.
But now I am thinking I could also directly do
--tag-pattern '^(CELL:N{57})(UMI:N{12})(R1:*)\^(R2:*)'
Thanks a lot!!
You can't use the same name several times, but you can do CELL1,CELL2,CELL3. Although if it's the exact same sequence using 3 cell barcodes is redundant.
Sorry for bothering again, but now I got an error:
Tags will be extracted using the following pattern: ^(CELL:N{8})N{49}(UMI:N{12})(R1:*)\^(R2:*) The following tags and their roles were recognised: Payload tags: R1, R2 Molecule tags: UMI Cell tags: CELL Alignment: 0% picocli.CommandLine$ExecutionException: Error while running command (com.milaboratory.mixcr.cli.CommandAlign@541179e7): java.lang.NullPointerException: Cannot invoke "com.milaboratory.core.motif.BitapPattern.getAverageMismatchPenalty()" because "this.bitapPattern" is null
Hi, Mathilde!
Sorry, this is an internal limitation in current implementation of the pattern matching algorithm, the problem is that
8 + 49 + 12 = 69 > 64
and we use 64-bit-wide variables deep in the algo for speeding up the search.
In your case, with N letters only, without other wildcards or specific nucleotides, this limitation is rather artificial, and we will fix this behavior in 4.1. I will let you know when the pre-release build with this fix will be available. It will happen closer to October, with our current backlog, sorry.
All the best, Dmitry.
Hi Dmitry! Ok, no problem, I'll wait then. Looking forward to test your pipeline with my data. Thank you very much for the feedback!
Best, Mathilde
Let's leave this issue open until we fix it.
Hi there,
thank you for the new release! I just tried the following mixcr analyze command
mixcr analyze generic-bcr-amplicon-umi --species hsa --rna --force-overwrite --tag-pattern '^(CELL:N{57})(UMI:N{12})(R1:*)\^(R2:*)' --rigid-left-alignment-boundary --floating-right-alignment-boundary C fastq_files/myTestData_R1.fastq.gz fastq_files/myTestData_R2.fastq.gz result
But it seems i got the same error than last time
Alignment: 0% picocli.CommandLine$ExecutionException: Error while running command align java.lang.NullPointerException: Cannot invoke "com.milaboratory.core.motif.BitapPattern.getAverageMismatchPenalty()" because "this.bitapPattern" is null at com.milaboratory.mixcr.cli.Main.mkCmd$lambda-21(Main.kt:318)
Any way how to fix it? Thank you very much in advance
Mathilde
Hi Mathilde,
this should be fixed now. Please, try dev version, which you can download from:
https://link.milaboratory.com/software/mixcr/mixcr-develop.zip
Also we are adding dedicated presets for different protocols including 10x, BD, ParseBio etc, so if you use a public protocol and can point us to the source, we will be happy to add a dedicated analysis preset for convenience.
Hi Stanislav,
Thanks a lot for the update!
I'm now running mixcr as follow
~/TOOLS/mixcr-4.1.1-4/mixcr analyze generic-bcr-amplicon-umi --species hsa --rna \ --force-overwrite \ --tag-pattern '^(CELL1:N{8})N{16}(CELL2:N{8})N{16}(CELL3:N{8})N{1}(UMI:N{12})(R1:*)T{18}\^(R2:*)' \ --rigid-left-alignment-boundary --floating-right-alignment-boundary C \ fastq_files/myTestData_R1.fastq.gz fastq_files/myTestData_R2.fastq.gz .//S3/06.full_vdj/result
I got the following error:
Processing UMI: 92.9% ETA: 00:01:57 Processing UMI: 99.9% ETA: 00:00:01 picocli.CommandLine$ExecutionException: Error while running command refineTagsAndSort java.lang.IllegalArgumentException: Unexpected filter tag request. Requested UMI but next level is CELL1. at com.milaboratory.mixcr.cli.Main.mkCmd$lambda-21(Main.kt:322) at picocli.CommandLine.execute(CommandLine.java:2088) at com.milaboratory.mixcr.cli.CommandAnalyze$Cmd.run0(CommandAnalyze.kt:255) at com.milaboratory.mixcr.cli.MiXCRCommand.run(MiXCRCommand.kt:42) at picocli.CommandLine.executeUserObject(CommandLine.java:1939) at picocli.CommandLine.access$1300(CommandLine.java:145) at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358) at picocli.CommandLine$RunLast.handle(CommandLine.java:2352) at picocli.CommandLine$RunLast.handle(CommandLine.java:2314) at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179) at picocli.CommandLine$RunLast.execute(CommandLine.java:2316) at picocli.CommandLine.execute(CommandLine.java:2078) at com.milaboratory.mixcr.cli.Main.main(Main.kt:76)
FYI I have Singleron data (GEXSCOPE Single cell VDJ kit which gives only CDR3 as output) and I am trying to get some full length BCRs out of it. Thank you in advance for your help
Mathilde
Thank you Mathilde, can you please send us the link to the protocol? We will check and add a dedicated preset for this.
Hi,
there is not a detailed protocol: https://singleron.bio/product/common/assets/upload/2022/0803/164340qg.pdf The reads are PE 150bp and I managed to find some info about read 1 pattern here: https://github.com/singleron-RD/CeleScope/blob/master/docs/tools/barcode.md
Hi Mathilde,
Analysis of single-cell data is generally a bit more tricky, compared to just having UMIs, because cells add a whole new layer of complexity to error correction and filtering, and optimal analysis parameters vary greatly from one kit and machine manufacturer to another. Because of that, we don't have generic preset for single-cell, like we have for UMIs (i.e. generic-bcr-amplicon-umi
, the one you are using here).
We are searching for the good example dataset for the Singleron's kit with V(D)J probes, seems there are some in SRA. As soon as we have it, we will create a dedicated preset in several days. I will let you know!
All the best, Dmitry.
Regarding Dmitry's reply, if you can share some sample data, we'll provide you solution quickly. If that is possible, please, text us to [email protected] and we'll provide instructions for secure transfer.
Hi Dmitry, Stanislav,
I just send you some test fastq files (TCR) that I found on Singleron website or celescope github some months ago. Thank you so much!
Mathilde
Hi Mathilde,
please use this preset: https://docs.milaboratories.com/mixcr/reference/overview-built-in-presets/#singleron
It is available in the develop version (see this doc how to get it).
GEXSCOPE® Single Cell V(D)J Kit doesn't cover full-length. Apart from CDR3 it partially covers FR3 and FR4, so if you want to recover those pieces, then you can check raw alignments with exportAlignmentsPretty
to get idea how much of the regions around CDR3 you can get, and then use --assemble-clonotypes-by YOUR_FEATURE
to assemble clones by the feature larger than CDR3.
Let us know if we can assist you further.
Best, Stan