starcode Feature Request: Tidy Output

I was wondering if it would be possible to output clusters in a tidy format rather than the existing wide format. For example:

> cat test.tsv
GGGG	50
AGGG	10
AAGG	5
AAAG	20
TGGG	20
TTTT	100

> starcode -q -d1 --sphere --print-clusters -i test.tsv
TTTT	100	TTTT
GGGG	80	GGGG,AGGG,TGGG
AAAG	25	AAAG,AAGG

> starcode -q -d1 --sphere --print-tidy-clusters -i test.tsv
TTTT	100	TTTT	100
GGGG	80	GGGG	50
GGGG	80	AGGG	10
GGGG	80	TGGG	20
AAAG	25	AAAG	20
AAAG	25	AAGG	5
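
Until such an option exists, the tidy table above can be approximated by post-processing the existing --print-clusters output. The following is a minimal Python sketch, not part of starcode: the function name tidy_clusters is hypothetical, and it assumes the input file has one "sequence, count" pair per line and that the cluster output has the three columns shown above (centroid, cluster count, comma-separated members).

```python
def tidy_clusters(input_lines, cluster_lines):
    """Return (centroid, cluster_count, member, member_count) tuples.

    Hypothetical helper: assumes whitespace-separated columns as in the
    example above, and that every cluster member appears in the input.
    """
    # Look up each sequence's own count from the original input file.
    counts = {}
    for line in input_lines:
        if not line.strip():
            continue
        seq, n = line.split()
        counts[seq] = int(n)

    # Expand each wide cluster line into one row per member.
    rows = []
    for line in cluster_lines:
        if not line.strip():
            continue
        centroid, total, members = line.split()
        for member in members.split(","):
            rows.append((centroid, int(total), member, counts[member]))
    return rows


input_lines = ["GGGG\t50", "AGGG\t10", "AAGG\t5", "AAAG\t20", "TGGG\t20", "TTTT\t100"]
cluster_lines = ["TTTT\t100\tTTTT", "GGGG\t80\tGGGG,AGGG,TGGG", "AAAG\t25\tAAAG,AAGG"]

for row in tidy_clusters(input_lines, cluster_lines):
    print("\t".join(map(str, row)))
```

Run on the test.tsv data, this prints one tab-separated row per cluster member, in the same order as the requested tidy output.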

Obviously this is not critical, but I think it would be a useful feature.

Dec 05 '18 00:12 nlubock

Hi nlubock, thanks for your request.

Indeed it's an interesting feature that could be implemented without too much effort given the latest updates in the clustering algorithms.

We will consider it and get back to you soon.

Dec 05 '18 08:12 ezorita

Hey folks, I came here to ask for something similar, so I figured I'd just add a comment here. I am currently using starcode in bioinformatics pipelines to cluster to centroids, and then counting combinations of 4 sets of barcodes later. So I would be very interested in a mode that could output tables with tuples, something like:

[ input line number, clustered centroid sequence ]
[ unique input sequence, clustered centroid sequence ]

My current solution is to use AWK :grimacing: on the cluster-id-containing output file, like:

mawk '{ n = split($3, a, ","); for (i = 1; i <= n; i++) print a[i] "," $1 }'
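
The first option above (input line number, clustered centroid sequence) can also be approximated outside starcode by joining the --print-clusters output back onto the input file. Here is a rough Python sketch under some assumptions: the function name centroid_by_line is hypothetical, each sequence is assumed to belong to exactly one cluster, and the sequence is assumed to be the first whitespace-separated column of each input line.

```python
def centroid_by_line(input_lines, cluster_lines):
    """Return (input line number, centroid) tuples, 1-based.

    Hypothetical helper: assumes each sequence is listed in exactly one
    cluster and that the third cluster column is a comma-separated
    member list, as in the --print-clusters example earlier in the thread.
    """
    # Map every member sequence to the centroid of its cluster.
    member_to_centroid = {}
    for line in cluster_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        centroid, members = fields[0], fields[2]
        for member in members.split(","):
            member_to_centroid[member] = centroid

    # Walk the input in order so that line numbers are preserved.
    pairs = []
    for lineno, line in enumerate(input_lines, start=1):
        fields = line.split()
        if not fields:
            continue
        # None marks a sequence that is missing from the cluster output.
        pairs.append((lineno, member_to_centroid.get(fields[0])))
    return pairs


input_lines = ["GGGG\t50", "AGGG\t10", "AAGG\t5", "AAAG\t20", "TGGG\t20", "TTTT\t100"]
cluster_lines = ["TTTT\t100\tTTTT", "GGGG\t80\tGGGG,AGGG,TGGG", "AAAG\t25\tAAAG,AAGG"]

for lineno, centroid in centroid_by_line(input_lines, cluster_lines):
    print(f"{lineno}\t{centroid}")
```

Building the member-to-centroid dictionary first keeps the join to a single pass over the input, which matters for large barcode files.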

Do y'all think the first option (input line number, clustered centroid sequence) would be easy to implement? I took an intro to C course 10 years ago and mainly use shell/R/Python. Do you think it'd be helpful for me to sketch out a prototype? I assume the output-making code would be easy to find?

Jan 26 '21 18:01 darachm

Hey @nlubock, a similar feature is now ready for testing in the feature/tidy branch, as discussed on this pull request: https://github.com/gui11aume/starcode/pull/40#issuecomment-925377411 So if you're still working with this, maybe give it a spin? It's different from what you described, but should still work.

Sep 23 '21 15:09 darachm