dnaio
dnaio copied to clipboard
Add Sequence.id attribute
and also Sequence.comment
and perhaps Sequence.sep
. They could be lazily computed. Perhaps also add .header
as an alias for .name
? Should setting .id
and .comment
be supported? Convenient, but perhaps slower than setting the full header directly.
Hi, in my latest pr #12 I noticed that adding methods is not without cost. It makes the parsing slightly slower.
Also I noticed a weird pattern. First it turned out I was on some 0.4.3 ish branch due to my fork being outdated. Parsing 5 million illumina records happened in 1.9 seconds on my PC.
Then I updated to the latest main branch. 2.150 seconds. I added a qualities_bytes and sequence_bytes method. 2.250 seconds. Okay, I do 100% sure need the qualities_bytes, but the sequence_bytes was added just-in-case. Lets remove that one: 1.9 seconds again :astonished: .
Okay so what happens if I add a dummy_method that does nothing. So I added qualities_bytes
, sequence_bytes
, dummy_nonsense
. 2.0 seconds....
Very weird. Overall adding more methods seems to decrease parsing speeds. (Presumably because the Sequence object gets bigger). And there is this weird pattern, but that could be an interaction between these versions of GCC, glibc and python on my PC. I hope it is not reproducible on other pcs, because then it affects the wheel building as well.
My takeaway was not to add methods unless they are strictly needed.
That is very strange! Adding methods shouldn’t make the instances bigger, though.
To circle back on this. Adding methods makes the internal dict representing the type bigger, but otherwise I think it is a stochastic thing depending on how Cython generates the member functions with its very long names. The difference are usually minor. I wrote that 2021 comment back when I did not have as much knowledge about C and C and Python specifically.
We can maybe set an ID property or an ID method. The ID method will be faster as METH_NOARGS calling conventions are more optimized than property calls. Also an id()
method will prevent people from trying to change the id
member of the object. So I think a method is a better fit.
Also this would require creating a new object on each call for which memory needs to be allocated. This will be the majority of the cost. The strcspn
call will probably be dwarfed by the allocation overhead.
My initial reason for posting this was the realization that id and comment being stored together is just incidental to how the FASTQ format is designed, but that they actually should be separate attributes. So ignoring performance aspects, I think a more logically structured SequenceRecord would have these attributes:
- id
- comment
- sequence
- qualities
Interestingly, adding UBAM support (#109) would be an argument in favor of modeling it this way: The QNAME in a SAM/BAM record cannot contain spaces, so it would naturally be represented as SequenceRecord.id
, allowing for a comment to come from somewhere else (the CO:Z:
tag I’d assume).
So with the above structure, it would be more consistent to allow .id
to be written without parentheses, which implies that it would need to become a property instead of a function. Of course it could be a read-only property (at least initially).
I don’t think access needs to allocate a new string for each access. The implementation could compute the .id
and .comment
values on first access and then cache them. Subsequent accesses would then return the existing object and would only need to increase the reference count.
Ah, so we would make _id
and _comment
in the struct as well and use them to cache, with the name attribute being the actual storage. That would still work with backwards-compatibility. That is a very good idea!
For comments from ubam I suggest that we use the samtools fastq format. This will basically use the SAM tag format in the FASTQ header. It is a known format and it is nice not to come up with our own convention.
Question. What do we do if comment is empty. ""
or None
? I am in favour of ""
since the split() method will work correctly and always return an empty list. I guess split will be the most used method on comment
. Having an optional type is quite annoying in my opinion, but also quite pythonic.
Ah, so we would make _id and _comment in the struct as well and use them to cache, with the name attribute being the actual storage.
Yes, something like that. We just need to invalidate the cache when an assignment is made to .name
.
Perhaps .header
could also become an alias for .name
(but maybe this is misleading because users might expect the @
to be included).
For comments from ubam I suggest that we use the samtools fastq format.
What do you mean specifically? How does samtools fastq
handle comments?
What do we do if comment is empty.
""
orNone
?
I’d usually model a nonexisting comment as None
because modeling it as the empty string means that there’s no way to distinguish empty comments from nonexisting ones. It is also pythonic as you say ... On the other hand, maybe there’s no need for this distinction becaue I don’t see the FASTQ parser producing empty comments: This could only happen if we interpret @theid
(with a trailing whitespace) as having an empty comment, but I’d prefer if this was seen as a nonexisting comment. (That said, CO:Z:
would be valid in SAM/BAM in case we choose to intepret that as the comment field.)
Maybe it’s easy to say for me, but I’d argue it’s a good thing that the comment can be None
because it forces the user to explicitly consider that case. I mean, at some point you will have to handle that case. After you use split()
, maybe you don’t get the expected number of fields and you have to deal with the error then.
I’d usually model a nonexisting comment as
None
because modeling it as the empty string means that there’s no way to distinguish empty comments from nonexisting ones. It is also pythonic as you say ... On the other hand, maybe there’s no need for this distinction becaue I don’t see the FASTQ parser producing empty comments:
Okay, let's go with None
.
This could only happen if we interpret
@theid
(with a trailing whitespace) as having an empty comment, but I’d prefer if this was seen as a nonexisting comment.
That would be a weird corner case to tackle in the code. Regardless None
and ""
have the same truthy value so if seq.comment:
will lead to the same result. So let's not handle the corner case until there are real-world examples that indicate that trailing whitespace no-comment headers are a thing.
What do you mean specifically? How does samtools fastq handle comments?
Example @4def3027-c3be-41d9-b4c1-9571b308cbdf qs:i:24 du:f:3.2848 ns:i:16424 ts:i:10 mx:i:2 ch:i:974
. So first the id and then tab-separated the tags encoded in SAM format. Doing this indiscriminately might lead to very long name
attributes though. Only selecting the CO:Z:
tag would make the ubam parsing a lot easier. Then @header\tjust copy the CO:Z:tag
makes more sense, with on the flip side losing a lot of information.
An alternative is not to parse the tags at all and leave the comment empty (Very fast!). For nanopore uBAM this is not very nice as there is some important information in there about the channel, start time etc.
Yet another is to make tag parsing an optional switch that the user can turn on. In that case I think all or nothing is the best approach.
Maybe I could start with just setting comment None and then implement "comment from tags" or something later? That way the parser is more simple (tags are a pain to parse) and only work is done when people actually need it? Does cutadapt do something with the comment?
That would be a weird corner case to tackle in the code. Regardless None and "" have the same truthy value so if seq.comment: will lead to the same result. So let's not handle the corner case until there are real-world examples that indicate that trailing whitespace no-comment headers are a thing.
I am currently implementing it. In a C extension this would have been easier. However, with Cython I can only have objects that are None. For anything more sophisticated using PyObject * that are NULL Cython becomes hairy fast and a C-extension is preferable. Also using any C members requires a __cinit__
and that slows parsing a lot.
So internally _id
and _comment
are None when not cached. When _comment is empty, it will internally be a ""
. This is checked for with PyUnicode_GET_LENGTH() (fast). And if 0 it will return None.
So @theid
will have a seq.comment None
and a seq.id theid
and a seq.name theid
.