Feature Request: seqkit split option to control text appended to output files

Open ammaraziz opened this issue 3 years ago • 3 comments

Prerequisites

  • [X] make sure you are using the latest version by running seqkit version
  • [X] read the usage

Describe your issue

Hi Shenwei, seqkit is an amazing tool; thank you for creating and maintaining it. I have a feature request that is hopefully within the scope of seqkit.

I use seqkit split to separate multi-FASTA files into individual sequences. split has an option to control the name of the output directory (-O, --out-dir string) but not the name prepended to each output file, which is the input file name. This means that whenever I split files, I have to rename them to remove the extra text.

For example:

cat multi.fasta
    > gene1
    .....
    > gene2
    .....

Current Output:

seqkit split multi.fasta -i -O example
....
[INFO] write 1 sequences to file: example/multi.id_gene2.fasta
[INFO] write 1 sequences to file: example/multi.id_gene1.fasta

If possible, I would like an option to control the name prepended to the individual files. Currently it is the input file name plus 'id', e.g. test.id_.

Expected output:

seqkit split multi.fasta -i -O example --out-filenames example
....
[INFO] write 1 sequences to file: example/example_gene2.fasta
[INFO] write 1 sequences to file: example/example_gene1.fasta

Also if possible it should accept nothing:

seqkit split multi.fasta -i -O example --out-filenames -
....
[INFO] write 1 sequences to file: example/gene2.fasta
[INFO] write 1 sequences to file: example/gene1.fasta

A possible issue with dropping the input file name from the output file names is that non-unique headers will need to be handled. Either the files are overwritten (this might need a warning) or, ideally, the sequences are appended.
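
To make the append behaviour concrete, here is a minimal Python sketch. It is not how seqkit works internally; the helper name split_by_id and the fixed .fasta extension are assumptions made for illustration. Each record goes to a file named after its ID alone, and a repeated ID is appended to the existing file instead of overwriting it.

```python
import os
import tempfile


def split_by_id(fasta_text, out_dir):
    """Write each FASTA record to <out_dir>/<id>.fasta, appending on repeated IDs.

    Hypothetical helper for illustration only; not part of seqkit.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()   # IDs already written during this run
    handle = None  # file for the record currently being written
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if handle is not None:
                handle.close()
            fields = line[1:].split()
            seq_id = fields[0] if fields else "unnamed"
            # First occurrence truncates any stale file; later ones append.
            mode = "a" if seq_id in seen else "w"
            seen.add(seq_id)
            handle = open(os.path.join(out_dir, seq_id + ".fasta"), mode)
        if handle is not None:
            handle.write(line + "\n")
    if handle is not None:
        handle.close()
    return sorted(seen)


if __name__ == "__main__":
    demo = ">gene1\nACGT\n>gene2\nTTTT\n>gene1\nGGGG\n"
    with tempfile.TemporaryDirectory() as tmp:
        print(split_by_id(demo, tmp))
```

IDs containing path separators or other characters that are unsafe in file names would need sanitising first; the sketch leaves that out.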

Thanks,

Ammar

ammaraziz avatar Mar 29 '22 00:03 ammaraziz

As I understand it, you want to split the input file by sequence ID, but you also mention "non-unique headers", which are not a problem for the option -i/--by-id.

I'd like to recommend seqkit split2 if you do not need -i/--by-id nor -r/--by-region.

If possible an option to control the prepended name of the individual files.

Sure.

shenwei356 avatar Mar 31 '22 07:03 shenwei356

Apart from XXX, which is the base name of the input file with the file extension removed, there are 3 possible formats for the 3 kinds of jobs.

XXX.part_NNN.eee.EEE
XXX.id_III.eee.EEE
XXX.region_RRR:RRR_SSS.eee.EEE

While in seqkit split2, there's only one:

XXX.part_NNN.eee.EEE

I'm not sure which part should be configurable/replaceable, maybe the XXX.part_, XXX.id_, XXX.region_?

Actually, it's easy and safe to batch rename the output files with brename.
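
For readers who do not have brename installed, the same kind of batch rename can be sketched in Python. This is only an illustration of the idea and says nothing about how brename itself works; the helper name strip_prefix is hypothetical. It defaults to a dry run and refuses to overwrite an existing file.

```python
import os
import tempfile


def strip_prefix(out_dir, prefix, dry_run=True):
    """Remove a leading prefix from file names in out_dir.

    Returns the planned (old, new) name pairs. With dry_run=True nothing
    is renamed. Hypothetical helper; not related to brename's internals.
    """
    planned = []
    for name in sorted(os.listdir(out_dir)):
        if not name.startswith(prefix):
            continue  # leave unrelated files alone
        new_name = name[len(prefix):]
        if not new_name:
            continue  # nothing would remain of the name
        target = os.path.join(out_dir, new_name)
        if os.path.exists(target):
            raise FileExistsError("refusing to overwrite " + target)
        planned.append((name, new_name))
    if not dry_run:
        for old, new in planned:
            os.rename(os.path.join(out_dir, old), os.path.join(out_dir, new))
    return planned


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for n in ("multi.id_gene1.fasta", "multi.id_gene2.fasta"):
            open(os.path.join(tmp, n), "w").close()
        print(strip_prefix(tmp, "multi.id_"))                 # dry run
        strip_prefix(tmp, "multi.id_", dry_run=False)          # rename
        print(sorted(os.listdir(tmp)))
```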

shenwei356 avatar Mar 31 '22 08:03 shenwei356

I agree the XXX.part_, XXX.id_, XXX.region_ would be ideal.

Actually, it's easy and safe to batch rename the output files with the brename.

Now that I think about it, and knowing in hindsight that brename exists, my request is a tad frivolous. I still think it would be handy.

Thanks again,

Ammar

ammaraziz avatar Apr 04 '22 07:04 ammaraziz

Added.

  -i, --by-id                     split squences according to sequence ID
      --by-id-prefix string       file prefix for --by-id
  -p, --by-part int               split sequences into N parts
      --by-part-prefix string     file prefix for --by-part
  -r, --by-region string          split squences according to subsequence of given region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for more examples
      --by-region-prefix string   file prefix for --by-region
  -s, --by-size int               split sequences into multi parts with N sequences
      --by-size-prefix string     file prefix for --by-size
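
As a rough model of what these prefix options change, the Python sketch below builds an output file name from the formats listed earlier in this thread. It assumes that a custom prefix replaces the whole default XXX.id_ style prefix, which is what the discussion above settled on but is not confirmed here; the function name output_name and its parameters are hypothetical, and seqkit split -h remains the authority on actual behaviour.

```python
def output_name(base, job, key, ext, prefix=None):
    """Model an output file name of the form <prefix><key>.<ext>.

    base   -- input file name without extension (XXX in the thread)
    job    -- "part", "id" or "region"
    key    -- part number, sequence ID or region label
    ext    -- file extension, e.g. "fasta"
    prefix -- assumed to replace the default "<base>.<job>_" when given
    """
    if job not in ("part", "id", "region"):
        raise ValueError("unknown job: " + job)
    if prefix is None:
        prefix = "{}.{}_".format(base, job)  # default, e.g. "multi.id_"
    return "{}{}.{}".format(prefix, key, ext)


if __name__ == "__main__":
    print(output_name("multi", "id", "gene1", "fasta"))
    print(output_name("multi", "id", "gene1", "fasta", prefix="example_"))
    print(output_name("multi", "id", "gene1", "fasta", prefix=""))
```

Under that assumption, an empty prefix gives the bare gene1.fasta style names from the original request.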

shenwei356 avatar Aug 12 '22 08:08 shenwei356

Available in v2.3.0: https://github.com/shenwei356/seqkit/releases/tag/v2.3.0

shenwei356 avatar Aug 12 '22 15:08 shenwei356

Thank you!

ammaraziz avatar Aug 15 '22 00:08 ammaraziz