asttokens
ASTText class that doesn't require tokens
Following up on https://github.com/gristlabs/asttokens/pull/92.
Closes #97.
- Adds a new class `ASTText` that:
  - Allows using `get_text[_range]` while avoiding the performance cost of dealing with tokens for most node types and newer Python versions.
  - Supports nodes inside f-strings for newer Python versions.
  - Lazily initialises `tree` if one isn't passed, so `parse=True` isn't needed.
- Adds a new method `get_text_positions` to both `ASTText` and `ASTTokens` which returns start and end `(lineno, col_offset)` pairs, as an alternative to `node.first/last_token`.
- Adds a boolean argument `padded` to all 3 `get_text*` methods on both classes which matches the behaviour of the same argument in `ast.get_source_segment`, i.e. includes leading indentation for multiline statements. However, the default is `True` (instead of `False` as in `ast.get_source_segment`) in the existing methods `get_text[_range]` for backwards compatibility. For the new `get_text_positions` there is no default, to force the user to stop and think. It can't be `False` because that would be confusing compared to the existing methods, but `False` matches the behaviour of using `node.first/last_token`, which the method is meant to replace.
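To make the above concrete, here's a rough usage sketch based on the description in this PR (exact attribute names and signatures may have shifted during review):

```python
import asttokens

source = "if True:\n    x = f'{1 + 2}'\n"

# No tokenization up front, and no parse=True needed: the tree is built lazily.
atext = asttokens.ASTText(source)
assign = atext.tree.body[0].body[0]

# Same call as ASTTokens.get_text, but without paying for tokens first.
print(atext.get_text(assign))

# (lineno, col_offset) pairs instead of node.first/last_token; `padded` must be passed explicitly.
start, end = atext.get_text_positions(assign, padded=False)
print(start, end)
```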
Just a thought here -- would using something like https://pypi.org/project/cached-property/ make this easier, given the goal of lazy evaluation? That would sidestep the need for various places to check whether the tokens have been initialised (I'm imagining that consumers would just access `self.tokens` or `self.token_offsets`, which would facade out the lazy-parsing side of things) and would make the type checkers happier too.
Not sure how big an issue it would be to take the extra dependency for Python < 3.8 though? (`functools` gains `cached_property` in 3.8.)
If that approach does work, given the conversation at https://github.com/gristlabs/asttokens/issues/94#issuecomment-1252862533 would it make sense to make the AST similarly lazy (rather than potentially None)?
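For illustration, a minimal sketch of the cached_property idea (the class and attribute names here are hypothetical, not asttokens' actual internals):

```python
import ast
import functools
import io
import tokenize


class LazySource:
    """Hypothetical sketch: both the tree and the tokens are computed only on
    first access, so nothing else needs to check whether they've been set."""

    def __init__(self, source_text, filename='<unknown>'):
        self._source_text = source_text
        self._filename = filename

    @functools.cached_property  # Python 3.8+; the cached-property package backports it
    def tree(self):
        return ast.parse(self._source_text, self._filename)

    @functools.cached_property
    def tokens(self):
        readline = io.StringIO(self._source_text).readline
        return list(tokenize.generate_tokens(readline))
```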
The `tokens` property isn't lazily initialising `_tokens`; it asserts that it's already set, for the sake of type checkers. Using `.tokens` when the tokens haven't been initialised is meant to throw an error.
Ah, ok. I think I'd misunderstood from the discussion on that issue.
Given that goal I would repeat the encouragement I made on that issue -- of exploring having variations of ASTTokens for the no-tokens and no-tree cases so that the default case requires both. This would greatly simplify the type checking and in turn make consumer code clearer. I'd be happy to put together a sketch of that if you're open to that approach.
Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is `atok = ASTTokens(source, parse=True)` and I don't see much of a problem that needs improving.
> Can you show a sketch of what the consumer code would look like and what the difference/advantage would be? Right now I think the default case is `atok = ASTTokens(source, parse=True)` and I don't see much of a problem that needs improving.
Sure, though I'll take it to a separate issue to avoid taking this PR too far off topic.
https://github.com/gristlabs/asttokens/issues/97
@dsagal the names I chose initially are not great. What do you think of replacing `unmarked` and `init_tokens` with `use_tokens`? So the constructor and `get_text` methods would both have `use_tokens=True` by default, and the method `.init_tokens()` would be `.use_tokens()`. `supports_unmarked` could be `supports_no_tokens`.
Here's a comparison to just before phase 1 (#92) to show the whole picture which can be helpful: https://github.com/gristlabs/asttokens/compare/2.0.8...unmarked2
Found a problem while testing this in futurecoder. I had some code that relied on the fact that `.get_text(node)` returning something non-empty means that `node.first_token` exists. The new way of supporting f-strings means that this implication no longer holds, even without setting `init_tokens=False`. This led to an `AttributeError`. I don't see anything we can do about that here, so that's a small break in compatibility that probably means a new major version is needed.
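The broken pattern was roughly this (a reconstruction for illustration of the state of the PR at this point, not the actual futurecoder code):

```python
import asttokens

source = "x = f'{1 + 2}'"
atok = asttokens.ASTTokens(source, parse=True)

# An expression nested inside the f-string: the `1 + 2` BinOp.
inner = atok.tree.body[0].value.values[0].value

text = atok.get_text(inner)
if text:
    # Previously, non-empty text implied the node had first_token/last_token set.
    # With the new f-string support, inner nodes can return text without ever
    # being marked, so this line now raises AttributeError.
    start = inner.first_token.startpos
```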
Two more f-string problems:
- `.get_text(node)` suddenly working can just break things in other ways. Previously I was excluding nodes from a particular set when `get_text` didn't work, and f-string contents suddenly appearing in that set was a problem. In particular, I was using `asttokens.util.replace` to modify some source code, and it assumes non-overlapping ranges, but now the inner node ranges were overlapping with the full f-string range (illustrated in the sketch below). The fact that I was using an `asttokens.util` function here is essentially a coincidence; things could just as easily break in ways unrelated to asttokens. In theory I suppose there could be another boolean arg for compatibility to continue ignoring f-string nodes, but that seems a bit extreme.
- The direct children of f-strings (but not the regular expression nodes within) have broken positions and need further special handling.
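A rough illustration of the overlapping ranges with the PR's behaviour at this point (the exact offsets are indicative only):

```python
import asttokens

source = "x = f'{a + b}'"
atok = asttokens.ASTTokens(source, parse=True)

joined_str = atok.tree.body[0].value        # the whole f-string (JoinedStr)
inner = joined_str.values[0].value          # the `a + b` expression inside it

# With the new f-string support, the inner node's range is nested inside the
# f-string's range; code that assumed the reported ranges never overlap
# (e.g. when splicing replacements into the source) can now break.
print(atok.get_text_range(joined_str))      # e.g. (4, 14)
print(atok.get_text_range(inner))           # e.g. (7, 12)
```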
While rebasing #98 I hit a test failure which I think is present in this PR too -- the f-string handling in `atok.get_text` implicitly relies on `mark_tokens` having run before it's called. This seems to happen in the example source because of the function definition there; however, commenting that out (the `def foo` on lines 16--17 in `tests/test_unmarked.py`) causes `test_unmarked` to fail on the f-string handling.
The specific error comes from the assertion that `atok.get_text` for f-strings returns empty text: `self.assertEqual(atok_text, "", node)`.
I'm actually not sure why the test is asserting that that value is empty, given that the text which would otherwise be returned appears to be correct to me (it matches what Python's AST returns, which seems fine; removing the conditional block gets the tests passing for #98).
I've explained the reason for the conditional block in https://github.com/gristlabs/asttokens/pull/98#discussion_r985298862. However you're right about this:
> the f-string handling in `atok.get_text` implicitly relies on `mark_tokens` having run before it's called
Initially this was intentional and fine, because the only f-string handling was `_in_f_string` (now called `_dont_use_tokens`) saying that even though we have tokens, you still shouldn't use them.
However, it was a mistake to also add `_broken_positions` in the same place, because it means that marking tokens changes the behaviour of `get_text`. The tests actually do reflect this now, but I'd forgotten about this point.
So another solution is needed if we want the 'broken' inner nodes of f-strings to return empty text instead of the whole f-string. There are three types of these nodes:
- `FormattedValue`: easy, this type seems to always be broken.
- `Str`/`Constant`: quite easy if we just notice that the returned text starts with `f` or `rf` (or `F` etc.) instead of `"` or `'` (a rough sketch of this check follows below).
- `JoinedStr`: this is for the format spec. This is a problem, since the returned text is the full f-string, which is the correct type. So I think we either have to:
  - Iterate through the whole tree on initialisation to mark nodes. Not great, since until now we managed to avoid paying any similar upfront cost.
  - Parse the returned text and see if it matches the node we have. This makes each `get_text` call potentially significantly more expensive, but only for `JoinedStr` nodes.
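For the `Str`/`Constant` case, the check could look something like this (a hypothetical sketch, not the PR's actual code):

```python
import re

# A genuine string constant's reported text starts with a quote (possibly after
# r/b/u prefixes); a broken inner node of an f-string reports the enclosing
# f-string's text, whose prefix contains f or F.
_FSTRING_START = re.compile(r'''^(?:[rR]?[fF]|[fF][rR]?)['"]''')

def looks_like_fstring_text(text):
    return bool(_FSTRING_START.match(text))

assert looks_like_fstring_text('f"{x}"')
assert looks_like_fstring_text("rf'{x}'")
assert not looks_like_fstring_text('"just a plain string"')
```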
> @dsagal the names I chose initially are not great. What do you think of replacing `unmarked` and `init_tokens` with `use_tokens`? So the constructor and `get_text` methods would both have `use_tokens=True` by default, and the method `.init_tokens()` would be `.use_tokens()`. `supports_unmarked` could be `supports_no_tokens`.
Sure! I haven't followed the full discussion on this or the related issues. I'd be fine with different names for params/methods. Having separate classes for different semantics seems reasonable too, especially if that helps with typings.
@PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here, and a summary of the benefits?
@dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.
> @dsagal aside from the broader API questions, it'd be helpful if you could look at my last comment above about f-strings, including the comment I linked to showing the general problem, in case you have any good ideas for solutions, or a reason why it's not worth worrying about.
It seems to me like f-strings are an area similar to where the AST itself was when asttokens was created: the built-in functionality has bugs that make it insufficient, and to work around them, asttokens needs to do some manual parsing (searching for curly braces, `:`, `!`). Sorry, I don't have anything more helpful to suggest...
Well, since you noticed that arbitrary types of expression nodes inside f-strings can have wrong positions, I think that settles it: walking through the tree to mark various nodes is necessary.
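A minimal sketch of what that tree walk might look like (hypothetical names; the real implementation would live inside the class):

```python
import ast

def find_fstring_internals(tree):
    # Collect every node nested inside a JoinedStr, so get_text can special-case
    # them (e.g. distrust or double-check their reported positions) instead of
    # treating them like ordinary nodes.
    internals = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.JoinedStr):
            for child in ast.walk(node):
                if child is not node:
                    internals.add(child)
    return internals

tree = ast.parse("x = f'{1 + 2:{width}}'")
fstring_nodes = find_fstring_internals(tree)
```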
> @PeterJCLaw please can you write an example of what using the API in #98 would look like, compared with the API here,
I think the tests in #98 cover this? Essentially, construction is similar to that of plain `ASTTokens`; however, usage of the lazy class is limited to the functionality that doesn't "obviously" need tokens.
I'm not really expecting consumers to use the `LazyHelper.asttokens` property, though they could if they wanted (and would get what they expected in return).
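For illustration, a rough sketch of what consumer code might look like under the two approaches (the `LazyHelper` and `init_tokens` names follow this discussion; the other details are assumptions rather than the final APIs):

```python
import asttokens

source = "x = f'{1 + 2}'"

# This PR (#93): one class, with behaviour controlled by constructor/method flags.
atok = asttokens.ASTTokens(source, parse=True, init_tokens=False)
node = atok.tree.body[0].value
text = atok.get_text(node)        # works, but node.first_token may never exist

# #98: a separate lazy class whose interface simply never promises token attributes.
helper = asttokens.LazyHelper(source)
node = helper.tree.body[0].value
text = helper.get_text(node)      # same call; no first/last_token implied
full = helper.asttokens           # explicit opt-in point for full token support
```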
> and a summary of the benefits?
The main benefits as I see them are:
- `ASTTokens` and `LazyHelper` have useful static analysis (which the consequences of `ASTTokens(init_tokens=False)` do not), allowing both type checkers and editor completions to be essentially correct without additional effort (including any changes necessary due to opting in to the functionality in this PR #93)
- The semantics of what is possible with each class are consequently clearer, as they're static (indeed this was how the f-strings bug was highlighted), resulting in clearer behaviour for both consumers and maintainers of this package (and in turn hopefully easier maintenance)
- The semantics of _all_ uses of `LazyHelper` are much more obvious in terms of their impact on `node.{first,last}_token` -- specifically that consumers should not use those added attributes on the grounds that they might not be present (but that's what consumers are opting in to by using this class)
- (`LazyHelper` also happens to allow for the case of lazily evaluating the AST too, meaning other use-cases can choose to benefit from laziness without this functionality further complicating the `ASTTokens` initialiser)
- Consumers that really want to access the additional `ASTTokens` functionality can still do so (and the point at which the lazy loading happens will be obvious to them, as a result of their needing to access an indirect property to do so)
- Easier maintenance, as logically separate code is in separate places (including e.g. the `TextRangeBase` class), and in particular the optimising/special-casing being added for the lazy case is limited to one specific place in the code (the `LazyHelper` class)
The potential drawbacks I can see are:
- To opt in to the new behaviour consumers need to use a new class:
  - potentially needing more changes than just the initialiser argument (minor IMO, given that they'd be changing code anyway)
  - counterpoint: most additional changes they need to make will be highlighted to them by a type checker (if they're using one)
- Some mental overhead during maintenance from having the logic split between multiple classes
With @PeterJCLaw's changes:
- The `ASTTokens` class only uses tokens and doesn't use the new method at all.
  - This means that it once again can't handle f-strings at all.
  - The previously mentioned compatibility problems are gone.
- The special attribute `_dont_use_tokens` (from earlier in this PR) is no longer needed and has been removed.