
Offending sequences

Open kjkjindal opened this issue 5 years ago • 1 comments

Hi, I am trying to run starcode sphere clustering on a set of sequences. These sequences contain certain (non-DNA) prefixes that I need to retain. I notice that starcode aborts when it encounters non-DNA characters in a sequence. Is this constraint essential to its (or specifically the sphere clustering algorithm's) function?

Thanks!

kjkjindal, Oct 21 '20 02:10

Hi! The issue is not sphere clustering per se but sequence clustering itself. If two identical sequences have different non-DNA tags, how would you suggest grouping them in the same cluster?

I am not sure what your biological problem is, but I would recommend approaching it this way:

  1. Extract the pure DNA suffixes into a separate file, keeping them in the same line order as the original file.
  2. Run starcode on the DNA suffixes and use the flag --seq-id.
  3. Use the row numbers in the output to get the clusters from the original file.
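
The three steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions, not starcode's own tooling: the DNA part is taken to be the trailing run of uppercase A/C/G/T/N on each line, the `--seq-id` output is assumed to be tab-separated with the comma-separated, 1-based input row numbers in the last column, and the helper names (`extract_dna_suffixes`, `parse_seq_id_output`, `clusters_from_original`) are made up for this example. Check the column layout against your own starcode output before relying on it.

```python
import re

# Assumption: the DNA suffix is the trailing run of uppercase A/C/G/T/N.
DNA_SUFFIX = re.compile(r"[ACGTN]+$")


def extract_dna_suffixes(lines):
    """Step 1: return the DNA suffix of each line, in the original order."""
    suffixes = []
    for number, line in enumerate(lines, start=1):
        match = DNA_SUFFIX.search(line.strip())
        if match is None:
            raise ValueError(f"no DNA suffix found on line {number}: {line!r}")
        suffixes.append(match.group(0))
    return suffixes


def parse_seq_id_output(starcode_output):
    """Step 2 (after running starcode with --seq-id on the suffix file).

    Assumption: each row is tab-separated, the first column is the
    canonical sequence, and the last column lists 1-based input row
    numbers separated by commas.
    """
    clusters = []
    for row in starcode_output.splitlines():
        if not row.strip():
            continue
        fields = row.split("\t")
        row_numbers = [int(x) for x in fields[-1].split(",")]
        clusters.append((fields[0], row_numbers))
    return clusters


def clusters_from_original(original_lines, clusters):
    """Step 3: use the row numbers to pull the tagged lines back out."""
    return {
        centroid: [original_lines[n - 1] for n in row_numbers]
        for centroid, row_numbers in clusters
    }


# Hypothetical input: non-DNA tags followed by the DNA sequence.
original = ["tagA_ACGT", "tagB_ACGT", "tagC_TTTT"]
suffixes = extract_dna_suffixes(original)

# Hypothetical starcode output, written by hand for illustration only.
example_output = "ACGT\t2\t1,2\nTTTT\t1\t3\n"
grouped = clusters_from_original(original, parse_seq_id_output(example_output))
```

In practice you would write `suffixes` to a file, one per line, run starcode on that file with `--seq-id`, and pass the contents of the resulting output file to `parse_seq_id_output` instead of the hand-written `example_output`.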

gui11aume, Oct 21 '20 15:10