seqkit
Feature Request: seqkit split option to control text appended to output files
Prerequisites
- [X] make sure you are using the latest version by seqkit version
- [X] read the usage
Describe your issue
Hi Shenwei, seqkit is an amazing tool, thank you for creating and maintaining it. I have a feature request that is hopefully within the scope of seqkit.
I use seqkit split to separate multi-FASTA files into individual files. split has an option to control the name of the output directory (-O, --out-dir string) but not the prepended name, which is the input file name. This means that after splitting I have to rename the files to remove the extra text.
For example:
cat multi.fasta
> gene1
.....
> gene2
.....
Current Output:
seqkit split multi.fasta -i -O example
....
[INFO] write 1 sequences to file: example/multi.id_gene2.fasta
[INFO] write 1 sequences to file: example/multi.id_gene1.fasta
If possible an option to control the prepended name of the individual files. Currently it is the input file name plus 'id', e.g. test.id_.
Expected output:
seqkit split multi.fasta -i -O example --out-filenames example
....
[INFO] write 1 sequences to file: example/example_gene2.fasta
[INFO] write 1 sequences to file: example/example_gene1.fasta
Also if possible it should accept nothing:
seqkit split multi.fasta -i -O example --out-filenames -
....
[INFO] write 1 sequences to file: example/gene2.fasta
[INFO] write 1 sequences to file: example/gene1.fasta
A possible issue with excluding the output file names is that non-unique headers will need to be handled. Either the files are overwritten (might need a warning for this behavior) or ideally they are appended.
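Whether a file actually contains repeated IDs can be checked before splitting. The following is a minimal sketch, not part of seqkit: duplicate_ids is a hypothetical helper, dup_demo.fasta is a made-up input, and the ID is assumed to be the first whitespace-delimited word after ">".

```shell
# Hypothetical helper: print every sequence ID that occurs more than once.
# Assumes the ID is the first whitespace-delimited word after ">".
duplicate_ids() {
    grep '^>' "$1" | sed 's/^>[[:space:]]*//' | awk '{print $1}' | sort | uniq -d
}

# Small demo input; the third record repeats the ID gene1.
printf '>gene1\nACGT\n>gene2\nGGCC\n>gene1 copy\nTTAA\n' > dup_demo.fasta

dups=$(duplicate_ids dup_demo.fasta)
if [ -n "$dups" ]; then
    echo "duplicate IDs: $dups"
fi
```

An empty result means every ID is unique, so no two records would map to the same output file name.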
Thanks,
Ammar
As I understand it, you want to split the input file by sequence ID, but you also mention "non-unique headers", which the option -i/--by-id handles without any issue.
I'd like to recommend seqkit split2 if you do not need -i/--by-id nor -r/--by-region.
If possible an option to control the prepended name of the individual files.
Sure.
Apart from XXX, which is the base name of the input file with the file extension removed, there are 3 possible formats for the 3 kinds of jobs.
XXX.part_NNN.eee.EEE
XXX.id_III.eee.EEE
XXX.region_RRR:RRR_SSS.eee.EEE
While in seqkit split2, there's only one:
XXX.part_NNN.eee.EEE
I'm not sure which part should be configurable/replaceable, maybe the XXX.part_, XXX.id_, XXX.region_?
Actually, it's easy and safe to batch rename the output files with the brename.
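For a one-off cleanup, a plain shell loop can also strip the prefix. This is a minimal sketch and not brename itself: the temporary directory and the empty files are stand-ins that mirror the multi.id_gene1.fasta names from the example above, and the loop refuses to overwrite an existing target.

```shell
# Demo setup: empty stand-ins for the files written by seqkit split -i.
demo=$(mktemp -d)
touch "$demo/multi.id_gene1.fasta" "$demo/multi.id_gene2.fasta"

# Strip the "multi.id_" prefix from each file name.
prefix="multi.id_"
for path in "$demo"/"$prefix"*.fasta; do
    [ -e "$path" ] || continue            # the glob matched nothing
    base=$(basename "$path")
    name=${base#"$prefix"}                # drop the leading prefix
    target="$demo/$name"
    if [ -e "$target" ]; then
        echo "skip $path: $target already exists" >&2
        continue
    fi
    mv -- "$path" "$target"
done

ls "$demo"
```

Skipping existing targets avoids silently losing a file when two inputs would end up with the same name.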
I agree the XXX.part_, XXX.id_, XXX.region_ would be ideal.
Actually, it's easy and safe to batch rename the output files with the brename.
Now that I think about it, and in hindsight of knowing brename exists, my request is a tad frivolous. I still think it would be handy.
Thanks again,
Ammar
Added.
  -i, --by-id                     split squences according to sequence ID
      --by-id-prefix string       file prefix for --by-id
  -p, --by-part int               split sequences into N parts
      --by-part-prefix string     file prefix for --by-part
  -r, --by-region string          split squences according to subsequence of given region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for more examples
      --by-region-prefix string   file prefix for --by-region
  -s, --by-size int               split sequences into multi parts with N sequences
      --by-size-prefix string     file prefix for --by-size
Available in v2.3.0: https://github.com/shenwei356/seqkit/releases/tag/v2.3.0
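With the new flag, the original request could presumably be tried as follows. This is a sketch under assumptions: it needs seqkit v2.3.0 or later on PATH, the input is a made-up two-record file, and the resulting file names (in particular whether an empty prefix leaves the bare ID) are not confirmed in this thread, so the script only lists what was written.

```shell
# Made-up two-record input for the demo.
printf '>gene1\nACGT\n>gene2\nGGCC\n' > multi.fasta

if command -v seqkit >/dev/null 2>&1; then
    # Replace the default prefix with "example_".
    seqkit split multi.fasta -i -O example --by-id-prefix "example_"
    # Try an empty prefix to see whether bare IDs become the file names.
    seqkit split multi.fasta -i -O example_bare --by-id-prefix ""
    # Inspect the names that were actually written.
    ls example example_bare
else
    echo "seqkit not found on PATH; skipping" >&2
fi
```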
Thank you!