mixcr icon indicating copy to clipboard operation
mixcr copied to clipboard

How to use tag pattern with unknown linker sequence?

Open MathildeFogPerez opened this issue 2 years ago • 7 comments

Hi,

I am trying to use the tag-pattern option and I would like to to something like the following:

--tag-pattern '^(CELL:N{8})NNUC:N{49}(UMI:N{12})(R1:)^(R2:)'

Where NNUC is made of 49 nucleotides. In reality it is made of the repetition of a linker and the cell barcode, but to simplify I would like to use it as a sequence of 49 nucleotides.

Is there a way to do this?

Thanks a lot in advance

best, Mathilde

MathildeFogPerez avatar Sep 15 '22 07:09 MathildeFogPerez

Hi Mathilde,

There are two ways to do that depending of what do you want to get at the end:

  1. Simply specify 49 nucleotides in between CELL and UMI without being able to export these nucleotides:

     --tag-pattern '^(CELL:N{8})N{49}(UMI:N{12})(R1:*)\^(R2:*)'
    
  2. If you want to have an access to this group at the end, since you have mentioned that this group is related to CELL barcode, you can specify it as another CELL barcode.

      --tag-pattern '^(CELL:N{8})(CELL2:N{49})(UMI:N{12})(R1:*)\^(R2:*)'
    

    Note: the name of the group should start with CELL, and that is the only constrain here. So you can name it: (CELL_NNUC:N{49}) as well.

    PS. Since this added group (CELL2:N{49}) is denoted as a CELL barcode it will be used as one in the analysis. But if I understand you correctly it is in fact a CELL barcode (meaning its sequence differs between cells and should be equal within a cell) so that should not be a problem. Nevertheless, in one of the upcoming updates we will add a feature that will allow to specify any other named group which will not be used by any algorithms but will be available for export in the output.

mizraelson avatar Sep 15 '22 16:09 mizraelson

Hi,

Thanks a lot for you answer this is very helpful! Actually the exact pattern would be

--tag-pattern '^(CELL:N{8})N{16}(CELL:N{8})N{16}(CELL:N{8})N{1}(UMI:N{12})(R1:*)\^(R2:*)'

Do you think I can use the same group name 'CELL' several times? I do not have more information about the R1, and I guessed it should be the same CELL barcode repeated several times.

But to be safer and to make it easier for you algorithm I thought I could simplify the option with:

--tag-pattern '^(CELL:N{8})N{49}(UMI:N{12})(R1:*)\^(R2:*)'

I'll give it a try with this one and see what I get.

But now I am thinking I could also directly do

--tag-pattern '^(CELL:N{57})(UMI:N{12})(R1:*)\^(R2:*)'

Thanks a lot!!

MathildeFogPerez avatar Sep 16 '22 08:09 MathildeFogPerez

You can't use the same name several times, but you can do CELL1,CELL2,CELL3. Although if it's the exact same sequence using 3 cell barcodes is redundant.

mizraelson avatar Sep 16 '22 08:09 mizraelson

Sorry for bothering again, but now I got an error:

Tags will be extracted using the following pattern: ^(CELL:N{8})N{49}(UMI:N{12})(R1:*)\^(R2:*) The following tags and their roles were recognised: Payload tags: R1, R2 Molecule tags: UMI Cell tags: CELL Alignment: 0% picocli.CommandLine$ExecutionException: Error while running command (com.milaboratory.mixcr.cli.CommandAlign@541179e7): java.lang.NullPointerException: Cannot invoke "com.milaboratory.core.motif.BitapPattern.getAverageMismatchPenalty()" because "this.bitapPattern" is null

MathildeFogPerez avatar Sep 16 '22 08:09 MathildeFogPerez

Hi, Mathilde!

Sorry, this is an internal limitation in current implementation of the pattern matching algorithm, the problem is that 8 + 49 + 12 = 69 > 64 and we use 64-bit-wide variables deep in the algo for speeding up the search.

In your case, with N letters only, without other wildcards or specific nucleotides, this limitation is rather artificial, and we will fix this behavior in 4.1. I will let you know when the pre-release build with this fix will be available. It will happen closer to October, with our current backlog, sorry.

All the best, Dmitry.

dbolotin avatar Sep 16 '22 09:09 dbolotin

Hi Dmitry! Ok, no problem, I'll wait then. Looking forward to test your pipeline with my data. Thank you very much for the feedback!

Best, Mathilde

MathildeFogPerez avatar Sep 16 '22 09:09 MathildeFogPerez

Let's leave this issue open until we fix it.

dbolotin avatar Sep 16 '22 09:09 dbolotin

Hi there,

thank you for the new release! I just tried the following mixcr analyze command

mixcr analyze generic-bcr-amplicon-umi --species hsa --rna --force-overwrite --tag-pattern '^(CELL:N{57})(UMI:N{12})(R1:*)\^(R2:*)' --rigid-left-alignment-boundary --floating-right-alignment-boundary C fastq_files/myTestData_R1.fastq.gz fastq_files/myTestData_R2.fastq.gz result

But it seems i got the same error than last time

Alignment: 0% picocli.CommandLine$ExecutionException: Error while running command align java.lang.NullPointerException: Cannot invoke "com.milaboratory.core.motif.BitapPattern.getAverageMismatchPenalty()" because "this.bitapPattern" is null at com.milaboratory.mixcr.cli.Main.mkCmd$lambda-21(Main.kt:318)

Any way how to fix it? Thank you very much in advance

Mathilde

MathildeFogPerez avatar Oct 21 '22 08:10 MathildeFogPerez

Hi Mathilde,

this should be fixed now. Please, try dev version, which you can download from:

https://link.milaboratory.com/software/mixcr/mixcr-develop.zip

Also we are adding dedicated presets for different protocols including 10x, BD, ParseBio etc, so if you use a public protocol and can point us to the source, we will be happy to add a dedicated analysis preset for convenience.

PoslavskySV avatar Nov 13 '22 19:11 PoslavskySV

Hi Stanislav,

Thanks a lot for the update! I'm now running mixcr as follow ~/TOOLS/mixcr-4.1.1-4/mixcr analyze generic-bcr-amplicon-umi --species hsa --rna \ --force-overwrite \ --tag-pattern '^(CELL1:N{8})N{16}(CELL2:N{8})N{16}(CELL3:N{8})N{1}(UMI:N{12})(R1:*)T{18}\^(R2:*)' \ --rigid-left-alignment-boundary --floating-right-alignment-boundary C \ fastq_files/myTestData_R1.fastq.gz fastq_files/myTestData_R2.fastq.gz .//S3/06.full_vdj/result

I got the following error:

Processing UMI: 92.9% ETA: 00:01:57 Processing UMI: 99.9% ETA: 00:00:01 picocli.CommandLine$ExecutionException: Error while running command refineTagsAndSort java.lang.IllegalArgumentException: Unexpected filter tag request. Requested UMI but next level is CELL1. at com.milaboratory.mixcr.cli.Main.mkCmd$lambda-21(Main.kt:322) at picocli.CommandLine.execute(CommandLine.java:2088) at com.milaboratory.mixcr.cli.CommandAnalyze$Cmd.run0(CommandAnalyze.kt:255) at com.milaboratory.mixcr.cli.MiXCRCommand.run(MiXCRCommand.kt:42) at picocli.CommandLine.executeUserObject(CommandLine.java:1939) at picocli.CommandLine.access$1300(CommandLine.java:145) at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358) at picocli.CommandLine$RunLast.handle(CommandLine.java:2352) at picocli.CommandLine$RunLast.handle(CommandLine.java:2314) at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179) at picocli.CommandLine$RunLast.execute(CommandLine.java:2316) at picocli.CommandLine.execute(CommandLine.java:2078) at com.milaboratory.mixcr.cli.Main.main(Main.kt:76)

FYI I have Singleron data (GEXSCOPE Single cell VDJ kit which gives only CDR3 as output) and I am trying to get some full length BCRs out of it. Thank you in advance for your help

Mathilde

MathildeFogPerez avatar Nov 14 '22 15:11 MathildeFogPerez

Thank you Mathilde, can you please send us the link to the protocol? We will check and add a dedicated preset for this.

PoslavskySV avatar Nov 14 '22 15:11 PoslavskySV

Hi,

there is not a detailed protocol: https://singleron.bio/product/common/assets/upload/2022/0803/164340qg.pdf The reads are PE 150bp and I managed to find some info about read 1 pattern here: https://github.com/singleron-RD/CeleScope/blob/master/docs/tools/barcode.md

MathildeFogPerez avatar Nov 14 '22 15:11 MathildeFogPerez

Hi Mathilde,

Analysis of single-cell data is generally a bit more tricky, compared to just having UMIs, because cells add a whole new layer of complexity to error correction and filtering, and optimal analysis parameters vary greatly from one kit and machine manufacturer to another. Because of that, we don't have generic preset for single-cell, like we have for UMIs (i.e. generic-bcr-amplicon-umi, the one you are using here).

We are searching for the good example dataset for the Singleron's kit with V(D)J probes, seems there are some in SRA. As soon as we have it, we will create a dedicated preset in several days. I will let you know!

All the best, Dmitry.

dbolotin avatar Nov 14 '22 18:11 dbolotin

Regarding Dmitry's reply, if you can share some sample data, we'll provide you solution quickly. If that is possible, please, text us to [email protected] and we'll provide instructions for secure transfer.

PoslavskySV avatar Nov 15 '22 00:11 PoslavskySV

Hi Dmitry, Stanislav,

I just send you some test fastq files (TCR) that I found on Singleron website or celescope github some months ago. Thank you so much!

Mathilde

MathildeFogPerez avatar Nov 15 '22 07:11 MathildeFogPerez

Hi Mathilde,

please use this preset: https://docs.milaboratories.com/mixcr/reference/overview-built-in-presets/#singleron

It is available in the develop version (see this doc how to get it).

GEXSCOPE® Single Cell V(D)J Kit doesn't cover full-length. Apart from CDR3 it partially covers FR3 and FR4, so if you want to recover those pieces, then you can check raw alignments with exportAlignmentsPretty to get idea how much of the regions around CDR3 you can get, and then use --assemble-clonotypes-by YOUR_FEATURE to assemble clones by the feature larger than CDR3.

Let us know if we can assist you further.

Best, Stan

PoslavskySV avatar Dec 21 '22 21:12 PoslavskySV