Feature Request: Tidy Output
I was wondering if it would be possible to output clusters in a tidy format rather than the existing wide format. For example:
> cat test.tsv
GGGG 50
AGGG 10
AAGG 5
AAAG 20
TGGG 20
TTTT 100
> starcode -q -d1 --sphere --print-clusters -i test.tsv
TTTT 100 TTTT
GGGG 80 GGGG,AGGG,TGGG
AAAG 25 AAAG,AAGG
> starcode -q -d1 --sphere --print-tidy-clusters -i test.tsv
TTTT 100 TTTT 100
GGGG 80 GGGG 50
GGGG 80 AGGG 10
GGGG 80 TGGG 20
AAAG 25 AAAG 20
AAAG 25 AAGG 5
Obviously this is not critical, but I think it would be a useful feature.
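Until such an option exists, the tidy table can be derived from the existing wide output. Below is a minimal Python sketch, not part of starcode: the function name `tidy_clusters` is hypothetical, and it assumes the output columns are whitespace-separated as in the example above and that the per-sequence counts are read from the input file.

```python
def tidy_clusters(cluster_lines, counts):
    """Expand wide cluster lines into one row per member sequence.

    cluster_lines: lines like "GGGG 80 GGGG,AGGG,TGGG" (assumed layout of
                   the --print-clusters output shown above).
    counts:        dict mapping each input sequence to its own count.
    """
    rows = []
    for line in cluster_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        centroid, total = fields[0], int(fields[1])
        for seq in fields[2].split(","):
            rows.append((centroid, total, seq, counts[seq]))
    return rows


# Data taken from the example in this issue.
counts = {"GGGG": 50, "AGGG": 10, "AAGG": 5, "AAAG": 20, "TGGG": 20, "TTTT": 100}
wide = ["TTTT 100 TTTT", "GGGG 80 GGGG,AGGG,TGGG", "AAAG 25 AAAG,AAGG"]

for row in tidy_clusters(wide, counts):
    print(*row, sep="\t")
```

Each printed row has the centroid, the cluster total, the member sequence, and the member's own count, in the order the members appear in the wide output.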
Hi nlubock, thanks for your request.
Indeed it's an interesting feature that could be implemented without too much effort given the latest updates in the clustering algorithms.
We will consider it and get back to you soon.
Hey folks, I came here to ask for something similar, so I figured I'd just add a comment here. I currently use starcode in bioinformatics pipelines to cluster sequences to centroids, and later count combinations of 4 sets of barcodes. So I would be very interested in a mode that could output tables of tuples, something like:
- [ input line number, clustered centroid sequence ]
- [ unique input sequence, clustered centroid sequence ]
My current solution is to use AWK :grimacing: on the cluster-id-containing output file, like:
mawk '{ split($3,a,","); for (i in a){ print a[i] "," $1 } }'
Do y'all think the first option (input line number, clustered centroid sequence) would be easy to implement? I took an intro to C course 10 years ago and mainly use shell/R/python. Do you think it'd be helpful for me to sketch out a prototype? I assume the output-making code would be easy to find?
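For reference, the AWK one-liner above and the two requested tables can be sketched in Python. This is only an illustration outside of starcode: the helper names `centroid_map` and `centroids_by_line` are hypothetical, and the sketch assumes whitespace-separated columns, the sequence in the first column of the input file, and each sequence belonging to exactly one cluster.

```python
def centroid_map(cluster_lines):
    """Map each unique input sequence to its centroid.

    Mirrors the AWK one-liner: split column 3 on commas and pair every
    member with column 1 (the centroid).
    """
    mapping = {}
    for line in cluster_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        for seq in fields[2].split(","):
            mapping[seq] = fields[0]
    return mapping


def centroids_by_line(input_lines, mapping):
    """Return (1-based input line number, centroid) pairs.

    Sequences missing from the mapping get None as their centroid.
    """
    pairs = []
    for number, line in enumerate(input_lines, start=1):
        fields = line.split()
        if fields:
            pairs.append((number, mapping.get(fields[0])))
    return pairs


# Data taken from the example at the top of this issue.
clusters = ["TTTT 100 TTTT", "GGGG 80 GGGG,AGGG,TGGG", "AAAG 25 AAAG,AAGG"]
reads = ["GGGG 50", "AGGG 10", "AAGG 5", "AAAG 20", "TGGG 20", "TTTT 100"]

mapping = centroid_map(clusters)
for seq, centroid in mapping.items():
    print(f"{seq},{centroid}")  # same "sequence,centroid" layout as the AWK output
for number, centroid in centroids_by_line(reads, mapping):
    print(number, centroid, sep="\t")
```

Keeping the sequence-to-centroid dictionary separate from the line-number pass means the same mapping can be reused across several input files.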
Hey @nlubock, a similar feature is now ready for testing in the feature/tidy branch, as discussed in this pull request: https://github.com/gui11aume/starcode/pull/40#issuecomment-925377411 So if you're still working with this, maybe give it a spin? It's different from what you describe, but should still work.