Issue with sample names

Open mldmort opened this issue 4 years ago • 7 comments

Hi,

I'm running SequelTools on 8 CLR samples. I'm giving the samples with the -u subFiles.txt option; subFiles.txt contains the paths to the BAM files. This is my command:

SequelTools.sh -t Q -u subFiles.txt -n 12 -p a -g a -o $OUT_DIR

I am getting odd plots for my stats, with the same name for every BAM file. A sample plot is attached. The summaryTable.txt also looks like this, with the same numbers for all samples:

SMRTcell	numReadsSubread	numReadsLongestSub	totalBasesSubread	totalBasesLongestSub	meanReadLenSubread	meanReadLenLongestSub	medianReadLenSubread	medianReadLenLongestSub	n50Subread	n50LongestSub	l50Subread	l50LongestSub	PSR	ZOR
oasis	1320271	181528	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.137
oasis	2578421	377887	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.147
oasis	2252172	320325	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.142
oasis	2320629	335461	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.145
oasis	2266229	324966	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.143
oasis	2165289	302979	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.140
oasis	4398328	638727	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.145
oasis	2499748	348122	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.139

Would you let me know what's wrong? Thanks.

Attachment: n50s.pdf https://github.com/ISUgenomics/SequelTools/files/5412390/n50s.pdf

mldmort avatar Oct 21 '20 00:10 mldmort

Hello,

Thank you for using SequelTools! Subfiles.txt should be a file-of-filenames, which it sounds like it is in your case. These filenames determine the name of each SMRTcell in the output. Are your files all named oasis.bam? If so, changing those names to unique identifiers should resolve the issue. Let me know if that works for you.

Best, Dr. David E. Hufnagel

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ISUgenomics/SequelTools/issues/7, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQPE3LRDVARAAX6TUBYSYLSLYTGRANCNFSM4SZBSWCA .

DavidEHufnagel avatar Oct 21 '20 14:10 DavidEHufnagel
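
A quick way to answer the question above (do several entries share one file name?) is sketched below. It assumes, as the reply says, that the output name is taken from the file name. duplicate_basenames is a made-up helper for illustration and is not part of SequelTools; the /data/... paths are hypothetical.

```shell
# Print every file name (basename) that occurs more than once
# in a file-of-filenames read from standard input.
duplicate_basenames() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        printf '%s\n' "${path##*/}"         # strip the directory part
    done | sort | uniq -d
}

# Hypothetical example: two different directories holding the same file name.
printf '%s\n' /data/run1/oasis.bam /data/run2/oasis.bam /data/run3/other.bam |
    duplicate_basenames
# prints: oasis.bam
```

On a real list this would be run as duplicate_basenames < subFiles.txt; no output means every file name is unique.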

Hi,

My subfiles.txt contains:

/projects/long_reads/HS_founders/pacbio/demux/lima.bc1001--bc1001.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1002--bc1002.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1003--bc1003.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1008--bc1008.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1009--bc1009.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1010--bc1010.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1011--bc1011.bam
/projects/long_reads/HS_founders/pacbio/demux/lima.bc1012--bc1012.bam

I thought the names came from the BAM files, but that doesn't seem to be the case. The name oasis appears in the output directory given to the -o option:

-o /oasis/scratch/comet/temp_project/RAT_DATA/HS_FOUNDERS/Pacbio_multiplex_all/QC/SequelToolsResults

I don't know why oasis is chosen as the name for all the files, or why the stats of the last file are used for every sample. I checked, and the stats in summaryTable.txt for all samples correspond to the last file.

Any idea why this happens? Thanks,

mldmort avatar Oct 21 '20 17:10 mldmort
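
To see exactly which columns of a table like summaryTable.txt are identical in every row, a small awk sketch can help. constant_columns is a hypothetical helper, not part of SequelTools, and it assumes a tab-separated table with a single header line. The demo input reuses three columns and two rows from the table posted above.

```shell
# Print the header of each column whose value is the same in every data row.
# Input: tab-separated text on standard input, first line is the header.
constant_columns() {
    awk -F'\t' '
        NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
        NR == 2 {
            # Remember the first data row as the reference.
            for (i = 1; i <= n; i++) { first[i] = $i; same[i] = 1 }
            next
        }
        {
            # Compare as strings so that values are matched exactly.
            for (i = 1; i <= n; i++) if (($i "") != (first[i] "")) same[i] = 0
        }
        END { for (i = 1; i <= n; i++) if (same[i]) print name[i] }
    '
}

printf 'SMRTcell\tnumReadsSubread\ttotalBasesSubread\noasis\t1320271\t21082583975\noasis\t2578421\t21082583975\n' |
    constant_columns
# prints:
# SMRTcell
# totalBasesSubread
```

On the real file this would be run as constant_columns < summaryTable.txt.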

Hey Arun,

I hope you can see the whole conversation here. I'm a little perplexed by this problem. Do you have some ideas as to what's causing these issues?

Let me know, Best, David

DavidEHufnagel avatar Oct 21 '20 19:10 DavidEHufnagel

@mldmort at first glance, it looks like the -- in the file names is causing something unintended. Can you please try once more after renaming the BAM files so that they don't contain a double dash?

aseetharam avatar Oct 21 '20 19:10 aseetharam
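
One way to try the suggestion above without touching the original files is to derive dash-free names and link to them. The sketch below only computes the new name. dashless_name and the "_" replacement character are illustrative choices, not something SequelTools prescribes, and whether links alone are enough for SequelTools is not confirmed here.

```shell
# Map a BAM path to a file name in which each run of two or more
# dashes is collapsed into a single underscore.
dashless_name() {
    printf '%s\n' "${1##*/}" | sed 's/---*/_/g'
}

dashless_name /projects/long_reads/HS_founders/pacbio/demux/lima.bc1001--bc1001.bam
# prints: lima.bc1001_bc1001.bam
```

A link could then be created per file with ln -s "$bam" "$(dashless_name "$bam")", and the link names listed in the file-of-filenames.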

Did this resolve the issue, mldmort?

DavidEHufnagel avatar Oct 22 '20 16:10 DavidEHufnagel

Hi David,

No. I used symbolic links pointing to my BAM files to see if that solves the problem. My new subfiles.txt file looks like this:

ACI.bam
BN.bam
BUF.bam
F344.bam
MR.bam
MS20.bam
WKY.bam
WN.bam

And the files link to the original bam files like:

ACI.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1001--bc1001.bam
BN.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1008--bc1008.bam
BUF.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1003--bc1003.bam
F344.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1010--bc1010.bam
MR.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1002--bc1002.bam
MS20.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1009--bc1009.bam
WKY.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1011--bc1011.bam
WN.bam -> /projects/long_reads/HS_founders/pacbio/demux/lima.bc1012--bc1012.bam

I don't know whether linking is sufficient; maybe the next step is to change the original file names? But the name oasis that appears in the plots most probably comes from the -o option:

-o /oasis/scratch/comet/temp_project/RAT_DATA/HS_FOUNDERS/Pacbio_multiplex_all/QC/SequelToolsResults

That's the only place the name oasis appears. Also the summaryTable.txt is still flawed with the same numbers for each row:

SMRTcell	numReadsSubread	numReadsLongestSub	totalBasesSubread	totalBasesLongestSub	meanReadLenSubread	meanReadLenLongestSub	medianReadLenSubread	medianReadLenLongestSub	n50Subread	n50LongestSub	l50Subread	l50LongestSub	PSR	ZOR
oasis	1320271	181528	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.137
oasis	2320629	335461	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.145
oasis	2252172	320325	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.142
oasis	2165289	302979	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.140
oasis	2578421	377887	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.147
oasis	2266229	324966	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.143
oasis	4398328	638727	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.145
oasis	2499748	348122	21082583975	3794848484	8434	10901	8317	9856	9304	11125	885174	122214	0.180	0.139

Any suggestions? Thanks,

mldmort avatar Oct 22 '20 16:10 mldmort

Yes, I believe you will have to change the original names. I am doing additional testing for a demonstration of SequelTools that I will be giving next week, and unfortunately I'm finding that the required format for the names of the input files is quite rigid. It has to be something like "ID.scraps.bam" or "ID.subreads.bam", where ID is usually something like "m54138_180610_050652". That has been the structure of all the files I've seen come directly from PacBio sequencing machines. This software was published just this month, and we are now getting lots of feedback on issues we did not come across before. You can expect updates in the next few weeks to make SequelTools more flexible and to resolve identified bugs and issues.

Best, David

DavidEHufnagel avatar Oct 22 '20 16:10 DavidEHufnagel
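
The naming rule described above can be checked before running the tool. This is a minimal sketch under stated assumptions: has_expected_suffix and suggest_subreads_name are hypothetical helpers, not part of SequelTools; the check only looks at the ".subreads.bam" / ".scraps.bam" suffix; and the suggested name assumes the file holds subreads. Whether a renamed file is then accepted by SequelTools is not verified here.

```shell
# Succeed if the file name looks like "ID.subreads.bam" or "ID.scraps.bam"
# with a non-empty ID.
has_expected_suffix() {
    case "${1##*/}" in
        ?*.subreads.bam|?*.scraps.bam) return 0 ;;
        *) return 1 ;;
    esac
}

# Propose "ID.subreads.bam" for a file named "ID.bam".
# Assumption: the file contains subreads.
suggest_subreads_name() {
    base="${1##*/}"
    printf '%s.subreads.bam\n' "${base%.bam}"
}

for f in ACI.bam m54138_180610_050652.subreads.bam; do
    if has_expected_suffix "$f"; then
        echo "ok $f"
    else
        echo "rename $f -> $(suggest_subreads_name "$f")"
    fi
done
# prints:
# rename ACI.bam -> ACI.subreads.bam
# ok m54138_180610_050652.subreads.bam
```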