
Consider a clearer optional-functionality API

Open PeterJCLaw opened this issue 3 years ago • 15 comments

Currently ASTTokens serves a number of different roles depending on what arguments it's passed. With https://github.com/gristlabs/asttokens/pull/93 this is growing further, and the workarounds for type checking being added in that PR perhaps suggest that the limits of the current approach are being reached. https://github.com/gristlabs/asttokens/pull/95 will add another construction, though not another role. I understand the goal of these different roles is to avoid redundant parsing/tree-building work where it isn't needed, which is a reasonable goal. I'm not suggesting a removal of functionality here, rather that the functionality be rearranged for clarity and to aid type checking.

There are at least the following modes of use:

atok = ASTTokens(source, parse=True)
assert atok.tree, "Placate type check even though we know this is always valid"

atok = ASTTokens(source, tree=existing_tree)
assert atok.tree, "Placate type check even though we know this is always valid"

There is also currently ASTTokens(source, tree=None, parse=False), though it's not clear what use that has (https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252862533).

With #93 there's also this case:

atok = ASTTokens(source, init_tokens=False)
atok.get_token_from_offset(42)  # Fail at runtime, but no type-check warnings

These cases are somewhat "obvious"; however, if the construction of the ASTTokens instance is far from the point of use, they could easily be very non-obvious.

We can observe that ASTTokens contains broadly three classes of functionality:

  • tokens-only
  • tree-only
  • both together

With the goal of enabling efficient use of the related functionalities, I can see a couple of paths forward:

  • lazy evaluation of the relevant structured data from the source
  • failures when using functionality with missing data

For the former I would advocate moving all the public member types to being non-optional and using something like https://pypi.org/project/cached-property/ (or functools.cached_property in Python 3.8+) to hide the evaluation. This would still support users passing in pre-computed values if they want, but means that all usages will "just work":

atok = ASTTokens(source)
tree: ast.AST = atok.tree  # no error now that `.tree` is non-optional

atok = ASTTokens(source, init_tokens=False)  # request lazy tokens initialisation
atok.get_token_from_offset(42)  # works due to lazy token initialisation
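As a very rough sketch of what I mean (illustrative names only, not the real asttokens internals) -- a pre-computed value just seeds the cache, while the property keeps the member non-optional:

import ast
from functools import cached_property  # the functools version needs Python 3.8+
from typing import Optional

class LazyTreeSketch:
    def __init__(self, source_text: str, tree: Optional[ast.Module] = None):
        self._source_text = source_text
        if tree is not None:
            # A pre-computed tree just seeds the cache, skipping the parse later.
            self.__dict__['tree'] = tree

    @cached_property
    def tree(self) -> ast.Module:
        # Parsed on first access; never None as far as the type checker is concerned.
        return ast.parse(self._source_text)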

I can see that this approach may be undesirable in some cases, perhaps where there are reasons to prefer not doing the parsing at all.

This leads to the other approach (errors for missing functionality), for which I propose using separate classes:

# ASTSource is a thin wrapper around LineNumbers. (Very open to better names.)
# Does not take a tree or tokens at all, but works with un-marked AST nodes
asrc = ASTSource(source)
asrc.tree  # type check error: no such member
asrc.get_text(node)  # type check error: no such member
asrc.get_token_from_offset(42)  # type check error: no such member
asrc.get_text_unmarked(node)  # works fine

# TokensSource is a new type which contains the tokens/source utils currently in ASTTokens
# It wraps LineNumbers similarly to ASTSource, but is otherwise unrelated
# It does not have a tree
tsrc = TokensSource(source)  # parses tokens if not passed
tsrc.tree  # type check error: no such member
tsrc.get_text(node)  # type check error: no such member
tsrc.get_token_from_offset(42)  # works fine
tsrc.get_text_unmarked(node)  # type check error: no such member

# ASTTokens now always has a tree and tokens, creating them both if not passed in
# It probably inherits from TokensSource and drops the `get_text*_unmarked` methods as they no longer make sense here
atok = ASTTokens(source)
tree: ast.AST = atok.tree  # no error now that `.tree` is non-optional
atok.get_text(node)  # works fine
atok.get_token_from_offset(42)  # also works fine
atok.get_text_unmarked(node)  # type check error: no such member

Aside from the question of whether ASTTokens keeps its get_text*_unmarked methods, the usages here are essentially compatible with existing code. The constructor signature might change, though keeping the existing arguments for compatibility is probably possible; if so then ASTTokens(source, tree=None, parse=False) would now be at least a runtime error (with @overloads it could be a type error too).
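As a rough illustration of the @overload idea (these signatures are hypothetical, not the current constructor): with no tree supplied, parse may only be True, so the problematic combination fails type checking as well as at runtime.

import ast
from typing import Literal, Optional, overload

class ASTTokensSketch:
    @overload
    def __init__(self, source_text: str, parse: Literal[True] = True, tree: None = None) -> None: ...
    @overload
    def __init__(self, source_text: str, parse: bool = ..., tree: ast.Module = ...) -> None: ...

    def __init__(self, source_text: str, parse: bool = True, tree: Optional[ast.Module] = None):
        if tree is None and not parse:
            # Runtime guard matching the overloads above.
            raise ValueError("either pass a tree or allow parsing")
        self.tree: ast.Module = tree if tree is not None else ast.parse(source_text)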

Note also that this approach is compatible with another class, perhaps LazyASTTokens which looks like ASTTokens but has lazy evaluation. This could easily be added later (or not at all) depending on whether it ends up being useful.

The main gain here is that each of these specialised classes addresses a specific use-case with specific data requirements, which removes the need for conditional presence checks (or asserts) throughout both the library code and client code. This will make the code clearer, easier to maintain, and (thanks to type-checker support) more robust, as well as offering the desired performance gains.

PeterJCLaw avatar Oct 01 '22 12:10 PeterJCLaw

How does this solve the last problem mentioned in https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252844046? I need to be able to avoid using tokens when possible, but for ASTTokens to automatically start using them under the hood when needed by get_text.

alexmojaki avatar Oct 01 '22 12:10 alexmojaki

How does this solve the last problem mentioned in #94 (comment)? I need to be able to avoid using tokens when possible, but for ASTTokens to automatically start using them under the hood when needed by get_text.

This is what I meant by the lazy-evaluation approach: either making ASTTokens lazy as a whole or, if we go the separation route, having a LazyASTTokens class.

I'm a bit confused though -- you mentioned in https://github.com/gristlabs/asttokens/pull/93#issuecomment-1264303249 that you didn't want the late-parsing behaviour?

I feel I'm missing something -- why do you want both lazy-evaluation behaviour and error-on-missing behaviour in the same class at the same time?

PeterJCLaw avatar Oct 01 '22 12:10 PeterJCLaw

I want the get_text methods to lazily initialise tokens when needed. The role I care about is text positions in the AST, not tokens, but when tokens are needed for that role is non-obvious and a bit messy.

Whereas this:

atok = ASTTokens(source, init_tokens=False)
atok.get_token_from_offset(42)  # Fail at runtime, but no type-check warnings

doesn't worry me.

alexmojaki avatar Oct 01 '22 12:10 alexmojaki

Having some of the functionality of the class have lazy-evaluation while others error about the missing data feels like a recipe for a very confusing API and a lot of frustrated (developer) users, even if it's something that's opt-in as you're proposing.

Could you expand on why you feel that lazy evaluation in the other cases (e.g. get_token_from_offset) is a bad thing if it's being supported for other cases?

As an alternative approach, how important is it that this evaluation is fully lazy, rather than something that can be explicitly upgraded to?

Would something like this work:

asrc = ASTSource(source)

# use AST + source only stuff for a bit
asrc.get_text(node)  # type error: no member

# maybe even allowing passing in tokens here, though that might be overkill
atok: ASTTokens = asrc.to_asttokens()

atok.get_text(node)  # works
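A minimal sketch of what that upgrade might look like (ASTSource and to_asttokens are hypothetical names from this proposal, not existing asttokens API):

import ast
from asttokens import ASTTokens

class ASTSource:
    # Hypothetical AST-plus-source wrapper offering no token functionality itself.
    def __init__(self, source_text: str):
        self.source_text = source_text
        self.tree: ast.Module = ast.parse(source_text)

    def to_asttokens(self) -> ASTTokens:
        # Explicit upgrade: tokenise and mark the tree we already have.
        return ASTTokens(self.source_text, tree=self.tree)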

If a consumer really did want its own mix of error/lazy handling as you suggest then it wouldn't be that much effort for them to wrap this:

class Mixed:
    def __init__(self, asrc: ASTSource):
        self.instance = asrc

    def __getattr__(self, name):
        if name.startswith('get_text') and isinstance(self.instance, ASTSource):
            self.instance = self.instance.to_asttokens()

        # Handling for get_text*_unmarked omitted for brevity

        return getattr(self.instance, name)

Obviously all type checking goes out the window with this sort of thing; however, what you're proposing isn't really compatible with static analysis anyway, so I'm assuming that's not a concern.

PeterJCLaw avatar Oct 01 '22 12:10 PeterJCLaw

Could you expand on why you feel that lazy evaluation in the other cases (e.g. get_token_from_offset) is a bad thing if it's being supported for other cases?

Using a token-specific method after specifying init_tokens=False is generally a mistake that should be complained about. In my case I want the error because I want to know that my code is only calling get_text methods and thus init_tokens=False is actually giving the performance benefits I want.

Would something like this work:

No, because get_text is the essential method that's always needed.

BTW I should also point out that get_text_unmarked is already being removed.

alexmojaki avatar Oct 01 '22 13:10 alexmojaki

Using a token-specific method after specifying init_tokens=False is generally a mistake that should be complained about.

I'm not really sure I follow the logic here, given that get_text depends (via mark_tokens) on many of the token-specific methods; however, if you want to consider those uses as OK due to being internal then maybe that's not an issue?

It's worth considering that after get_text is called (and the tokens initialised), the other token-specific methods will work, adding to the potential confusion from this semi-lazy design.

In my case I want the error because I want to know that my code is only calling get_text methods and thus init_tokens=False is actually giving the performance benefits I want.

Ok, but what about other use-cases? It's not clear to me that one specific use-case justifies creating a complicated API for other cases.

Backing up a bit, assuming that the feature set desired for https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252844046 (from reading the related issues & packages) is:

  • has source
  • tokens not initialised upfront
  • can assume there's a tree passed in
  • needs get_text and get_text_range to work automatically
  • doesn't need anything else

then I'm really not clear why just letting everything be lazy is an issue -- if you're not using the other members why should you care whether they error?

For clarity -- by "let everything be lazy" I mean fully leaning into that. ASTTokens(source, tree=None, parse=False) becomes invalid (likely we deprecate the parse argument entirely) and the constructor now never does any tree construction or tokenisation. There's then no need for an init_tokens argument (since it's now always False), access to .tree or .tokens would initialise those respectively, and everyone gets the benefit of not doing work they don't need, not just the one specific use-case.

Probably mark_tokens remains available for users that just want to mark an AST rather than using the ASTTokens class, but otherwise it too is only called as needed. Alternatively or additionally, maybe there's a lazy: bool argument for users who want ASTTokens to do everything upfront for whatever reason.
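Roughly the shape I have in mind, as an illustrative sketch rather than a concrete implementation proposal:

import ast
import io
import tokenize
from functools import cached_property

class FullyLazySketch:
    def __init__(self, source_text: str, lazy: bool = True):
        self._source_text = source_text
        if not lazy:
            # Opt-in eager mode: force everything upfront, as ASTTokens does today
            # (a real implementation would also run mark_tokens here).
            _ = self.tree
            _ = self.tokens

    @cached_property
    def tree(self) -> ast.Module:
        return ast.parse(self._source_text)

    @cached_property
    def tokens(self):
        readline = io.StringIO(self._source_text).readline
        return list(tokenize.generate_tokens(readline))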

This also avoids the issue of using token-specific functionality after constructing an object with init_tokens=False.

Is there a drawback to this approach that I'm missing?

PeterJCLaw avatar Oct 01 '22 15:10 PeterJCLaw

I'm not really sure I follow the logic here, given that get_text depends (via mark_tokens) on many of the token-specific methods; however, if you want to consider those uses as OK due to being internal then maybe that's not an issue?

Because for the majority of cases get_text doesn't depend on the token-specific methods.

Ok, but what about other use-cases? It's not clear to me that one specific use-case justifies creating a complicated API for other cases.

IPython (https://github.com/ipython/ipython/issues/13731) pretty clearly gets high priority over other use cases. Some other projects which only use the get_text methods are Grist, icecream, birdseye, heartrate, snoop, friendly-traceback, varname, and python-devtools. I actually don't know any real cases of the token-specific methods being used and would be curious to see some.

alexmojaki avatar Oct 01 '22 16:10 alexmojaki

I appreciate that ipython and others are important consumers; I'm not at all saying that the lazy-evaluation approach shouldn't happen. Faster asttokens is great for everyone, ipython included :)

What I'm questioning is why the changes to enable that need to be made in a way that complicates things for other potential users when there's a fairly simple tweak that would simplify both the API for all consumers and the internal logic of this package, yet has no adverse effects on ipython & friends?

I actually don't know any real cases of the token-specific methods being used and would be curious to see some.

tuck uses next_token, prev_token, node.first_token and node.last_token pretty extensively and recently started using get_token too. The addition of end-positions in Python 3.8 is unlikely to impact tuck as it uses the tokens for quite a bit more than just the start/end range of an AST node. Because it works by editing the source, it has no use for get_text (aside perhaps from when debugging).

While tuck is unlikely to be adversely affected by the laziness changes (it will always want the tokens, so as long as there's a way to get them parsed early enough it'll be happy), it would be great if it didn't need to do assert asttokens.tree is not None just to placate the type checker.

PeterJCLaw avatar Oct 01 '22 16:10 PeterJCLaw

it would be great if it didn't need to do assert asttokens.tree is not None just to placate the type checker.

While I'm hesitant to remove/deprecate the tree=None, parse=False case which hypothetically someone might be relying on, or otherwise change the behaviour, I'd be fine with adding a cast to make the type of .tree more realistic.
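(Roughly this sort of thing -- a sketch with a hypothetical helper name, not the actual asttokens code.)

import ast
from typing import Optional, cast

def _as_tree(tree: Optional[ast.Module]) -> ast.Module:
    # The value can still be None at runtime in the tree=None, parse=False case,
    # but typing it as ast.Module means consumers no longer need
    # `assert atok.tree is not None` just for the type checker.
    return cast(ast.Module, tree)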

alexmojaki avatar Oct 01 '22 17:10 alexmojaki

For clarity -- by "let everything be lazy" I mean fully leaning into that. ASTTokens(source, tree=None, parse=False) becomes invalid (likely we deprecate the parse argument entirely) and the constructor now never does any tree construction or tokenisation. There's then no need for an init_tokens argument (since it's now always False), access to .tree or .tokens would initialise those respectively, and everyone gets the benefit of not doing work they don't need, not just the one specific use-case.

This would break code relying on node.first_token and .last_token.

alexmojaki avatar Oct 01 '22 17:10 alexmojaki

it would be great if it didn't need to do assert asttokens.tree is not None just to placate the type checker.

While I'm hesitant to remove/deprecate the tree=None, parse=False case which hypothetically someone might be relying on, or otherwise change the behaviour, I'd be fine with adding a cast to make the type of .tree more realistic.

Indeed, I'm not suggesting that this wouldn't need to be treated as breaking/have deprecation notices/etc. For now I think it's preferable that the type is correct even if slightly inconvenient. My API proposals here are deliberately somewhat blue-sky to see whether there's something useful & consistent that can be worked towards.


For clarity -- by "let everything be lazy" I mean fully leaning into that....

This would break code relying on node.first_token and .last_token.

Sorry, yes, obviously going full-lazy here isn't compatible with use-cases that want everything upfront (likely including much existing code) -- hence my suggestion of the lazy parameter:

Probably mark_tokens remains available for users that just want to mark an AST rather than using the ASTTokens class, but otherwise it too is only called as needed. Alternatively or additionally, maybe there's a lazy: bool argument for users who want ASTTokens to do everything upfront for whatever reason.

or previous suggestion of a separate LazyASTTokens class.

PeterJCLaw avatar Oct 01 '22 17:10 PeterJCLaw

Could you expand on why you feel that lazy evaluation in the other cases (e.g. get_token_from_offset) is a bad thing if it's being supported for other cases?

Here's a preview of using the new init_tokens=False: https://github.com/alexmojaki/stack_data/pull/41

While stack_data generally only needs node position info, it does have some code that accesses ASTTokens.tokens for the sake of optional token-specific functionality meant for https://github.com/cknd/stackprinter/pull/23. To my knowledge nothing actually uses this. There were also uses of first_token and last_token in the code which are removed in that PR. All this makes it tricky to just search the code for token and keep track of everything. So if something were to lazily initialise tokens just by calling .tokens or something it would be hard to notice. The lack of errors helps me to be confident that init_tokens=False is having the desired effect. I'm not 100% confident that get_text isn't messing things up, but that's OK.

I also think it would be weird to have some token-specific features which can work lazily but others (.first_token and .last_token) which can't.

It's worth considering that after get_text is called (and the tokens initialised), the other token-specific methods will work, adding to the potential confusion from this semi-lazy design.

Solving this would be a nice-to-have but I'm OK with leaving it for now.

alexmojaki avatar Oct 01 '22 18:10 alexmojaki

So if something were to lazily initialise tokens just by calling .tokens or something it would be hard to notice.

Well, yes. This is exactly my point. Calls to get_text will lazily initialise tokens (including .tokens) and it will be easy to miss this. Code which uses the token methods after such a call will happen to work, but code on a different codepath (perhaps even the same block of code called differently) wouldn't.

The lack of errors helps me to be confident that init_tokens=False is having the desired effect. I'm not 100% confident that get_text isn't messing things up, but that's OK.

For clarity: from what I've seen of #93 I think the code there is correct and is going to work; my concern is that it creates confusing behaviour while also adding complications to this package's source.

I also think it would be weird to have some token-specific features which can work lazily but others (.first_token and .last_token) which can't.

Well, yes. This is broadly in agreement with my original suggestion of reworking the API to better clarify what the capabilities of each case are.

PeterJCLaw avatar Oct 01 '22 18:10 PeterJCLaw

Well, yes. This is exactly my point. Calls to get_text will lazily initialise tokens (including .tokens) and it will be easy to miss this. Code which uses the token methods after such a call will happen to work, but code on a different codepath (perhaps even the same block of code called differently) wouldn't.

I acknowledged this in my last comment. This edge case isn't ideal but it's a small enough problem that I'm not currently motivated enough to deal with it. For 3.9+, the only node types that can lead to this situation are withitem and arguments. So get_text is still extremely unlikely to lazily initialise tokens, while token methods would be guaranteed to.

Besides, to me the solution to this would be to add extra strictness, essentially throwing an error even when tokens are present because of the initial init_tokens=False. It doesn't sound like adding that would satisfy all your desires or resolve all your concerns.

alexmojaki avatar Oct 01 '22 18:10 alexmojaki

I think we may need to agree to disagree here.

Fundamentally the change in #93 breaks static analysis of asttokens consumers (as well as internally) and has the potential to cause developer confusion due to its sometimes-lazy-sometimes-error behaviour. I disagree that this is an edge case and instead see this as something that's worth putting effort into avoiding (if this were my project the fact that it would simplify the code in the ASTTokens class would also be attractive).

Adding hard errors to places which aren't in the blessed path would help alleviate some of the issues it creates, though not all, and does nothing to aid static analysis (at least within the limitations of Python's current annotations system [^1]).

Having a separate class which offers a lazy wrapper around the limited functionality that is desired would preserve the majority of the static analysis capabilities [^2] and make it much clearer to consumers that they're getting a different set of behaviours than what the existing ASTTokens class provides. It also allows consumers flexibility, rather than pushing a particular use-case.

In case this less intrusive approach is of interest to the maintainers, I've put together a quick demo in https://github.com/gristlabs/asttokens/pull/98. That PR is based on #93, but removes the sometimes-lazy-sometimes-error behaviour from ASTTokens in favour of an always-lazy helper class which has much clearer limits on its functionality.

[^1]: There might be something you could do with __init__ overloads constructing to a generic parameter on the type which indicated that the members had different signatures, but I don't know of a way to use that to remove members and I suspect it would be pretty ugly to implement.

[^2]: Type checkers already can't see node.first_token and node.last_token, so there's no real loss to static analysis that they're added lazily.

PeterJCLaw avatar Oct 02 '22 17:10 PeterJCLaw