seq_io icon indicating copy to clipboard operation
seq_io copied to clipboard

Fasta id should split after ANY whitespace

Open nh13 opened this issue 2 years ago • 1 comments

nh13 avatar Nov 29 '23 17:11 nh13

Thanks for the pull request. I apologize for not yet looking at this. The reason is that, even though I agree that this is a good idea, I would prefer to treat byte strings (&[u8]) and &str types in a consistent way. Right now, the implementation mostly recognizes spaces and tabs, but id_desc() recognizes additional ASCII and Unicode characters by its use of char::is_whitespace. This may also have performance implications.

It may be worth doing some investigation on how other parsers handle the problem. Biopython uses title.split(None, 1)[0], which is similar to u8::is_ascii_whitespace resp. char::is_whitespace according to my (superficial) research.

In addition, str.split in Biopython recognizes runs of consecutive whitespace, so there will not be any descriptions starting with whitespace. However, I'm inclined not to follow that behaviour, since it is computationally more intensive and users wanting that behaviour can still trim the description by themselves using str::trim [unfortunately, slice::trim_ascii is not stable yet...].

markschl avatar Feb 17 '24 11:02 markschl