Implement __getitem__ in ReferenceSequence

Open jeromekelleher opened this issue 4 years ago • 2 comments

It would be nice to implement things like

# What is the reference base at position 12345?
ts.reference_sequence[12345]
# What are the bases between position a and b?
ts.reference_sequence[a:b]

etc. Assuming the slices are reasonably small, this can be done efficiently by implementing the __getitem__ method, either by accessing the data via numpy arrays or by implementing it directly in the Python C layer.
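To make the numpy option concrete, here is a minimal sketch of what such a __getitem__ might look like. It is not tskit's implementation: the class name ReferenceSequenceSketch and its _data attribute are hypothetical, and it assumes the reference is plain ASCII held in memory as a uint8 array.

```python
import operator

import numpy as np


class ReferenceSequenceSketch:
    """Hypothetical stand-in for illustration; not the tskit class."""

    def __init__(self, data: str):
        # Assumption: ASCII bases, stored as a read-only uint8 view.
        self._data = np.frombuffer(data.encode("ascii"), dtype=np.uint8)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing the array is a view; only the requested bases
            # are copied when converting back to a str.
            return self._data[key].tobytes().decode("ascii")
        # operator.index rejects floats and other non-integer keys.
        index = operator.index(key)
        return chr(int(self._data[index]))


ref = ReferenceSequenceSketch("ACGTACGT")
base = ref[3]        # single position
window = ref[2:5]    # half-open interval [2, 5)
```

Because only the slice is converted to a str, the cost scales with the size of the query rather than with the length of the whole sequence, which is the property the "reasonably small slices" assumption relies on.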

In the future, these queries could be backed by fetching the required data from URLs in the background.

jeromekelleher avatar Dec 02 '21 15:12 jeromekelleher

In general, I guess what we want is to present a str-like API, so that we generally regard the ts.reference_sequence instance as a unicode string, with some extra methods that let us figure out where it came from, etc.

One odd thing, though, would be that __str__ would not return the string itself, but a useful summary.

jeromekelleher avatar Dec 02 '21 20:12 jeromekelleher

It would actually be very helpful not to return the string. At the moment, when I have a whole-chromosome reference sequence, printing it to screen can stall my Python REPL / notebook cell. It would be preferable to show a truncated string.
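As a rough illustration of the truncated display, the sketch below shows the first and last few bases plus the total length. The class name TruncatedSequence, the max_shown parameter, and the output format are all made up for this example and are not part of tskit.

```python
class TruncatedSequence:
    """Hypothetical wrapper for illustration; not part of tskit."""

    def __init__(self, data: str, max_shown: int = 20):
        self._data = data
        self._max_shown = max_shown  # assumed display budget, in bases

    def __str__(self):
        n = len(self._data)
        if n <= self._max_shown:
            return self._data
        half = self._max_shown // 2
        # Show both ends and the total length instead of the full string.
        return f"{self._data[:half]}...{self._data[-half:]} ({n} bases)"

    def __repr__(self):
        # The REPL and notebooks echo repr(), so keep that short as well.
        return f"TruncatedSequence(length={len(self._data)})"


seq = TruncatedSequence("ACGT" * 1000, max_shown=8)
summary = str(seq)
```

Here print(seq) would emit only the summary, and the full sequence would still be reachable through explicit slicing if __getitem__ were implemented.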

hyanwong avatar Mar 22 '24 23:03 hyanwong