tolerant-php-parser
Adopt new Token representation
As discussed in HowItWorks.md#notes, there are ways that we can significantly reduce memory usage by moving away from objects for Tokens. This issue tracks progress on that work.
Not sure I understand how you want to handle this on 32-bit. If you want to drop the Token object entirely (rather than just reduce the number of properties on it), that means anything storing a token will now become two properties, right? Given that tokens are directly stored inside node properties, that seems problematic.
@nikic - yep, the idea is to completely get rid of the object representation, so in 32-bit mode, it would turn into two properties. And yes, it will most certainly not be pretty when you look under the hood, but hopefully we can smooth over the API using __get and __set methods.
That said, while I've run some promising experiments, I haven't gotten around to fully prototyping the idea, so I could be overlooking something. In particular, I'd like to better understand the overhead of __get/__set.
Thoughts?
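To make the 64-bit versus 32-bit trade-off concrete, here is a minimal sketch of a token packed into a single integer. The field layout (8-bit kind, 24-bit start offset, 24-bit length) and the function names are assumptions made up for illustration; they are not the layout described in HowItWorks.md. It needs 64-bit PHP: on a 32-bit build the three fields do not fit into one integer, which is why a token would have to be split across two properties there.

```php
<?php
// Hypothetical layout (an assumption, not the project's actual design):
//   bits 48-55: token kind, bits 24-47: start offset, bits 0-23: length.
if (PHP_INT_SIZE < 8) {
    throw new RuntimeException('This sketch assumes 64-bit integers.');
}

function packToken(int $kind, int $start, int $length): int
{
    // Mask each field to its width so an oversized value cannot spill
    // into a neighbouring field.
    return (($kind & 0xFF) << 48) | (($start & 0xFFFFFF) << 24) | ($length & 0xFFFFFF);
}

function tokenKind(int $token): int
{
    return ($token >> 48) & 0xFF;
}

function tokenStart(int $token): int
{
    return ($token >> 24) & 0xFFFFFF;
}

function tokenLength(int $token): int
{
    return $token & 0xFFFFFF;
}

// A node would then store a plain int where it stores a Token object today.
$token = packToken(130, 1000, 7);
printf("kind=%d start=%d length=%d\n", tokenKind($token), tokenStart($token), tokenLength($token));
```

With 24-bit fields this sketch caps offsets and lengths at roughly 16 MB, so a real layout would need to pick its widths with more care.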
__get and __set are very slow. From a quick check on 7.1, __get is ~5x slower than a declared property lookup and ~2.5x slower than a normal method call. On 7.0 it's ~6x slower than a property lookup. (The reason is, in large part, that __get has to reenter the VM.)
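For anyone who wants to reproduce this kind of comparison, a rough micro-benchmark along these lines should do. It is only a sketch: the class names and iteration count are arbitrary choices, and the ratios will vary with PHP version, opcache settings, and hardware, so it will not necessarily match the figures quoted above.

```php
<?php
// Illustrative classes; the names are made up for this benchmark.
final class DeclaredToken
{
    public $kind = 1;

    public function getKind()
    {
        return $this->kind;
    }
}

final class MagicToken
{
    private $data = ['kind' => 1];

    public function __get($name)
    {
        return $this->data[$name];
    }
}

const ITERATIONS = 500000;

$declared = new DeclaredToken();
$magic = new MagicToken();

// Declared property lookup.
$start = microtime(true);
for ($i = 0; $i < ITERATIONS; $i++) {
    $value = $declared->kind;
}
$propertyTime = microtime(true) - $start;

// Normal method call.
$start = microtime(true);
for ($i = 0; $i < ITERATIONS; $i++) {
    $value = $declared->getKind();
}
$methodTime = microtime(true) - $start;

// Magic __get: "kind" is not accessible from outside, so __get is invoked.
$start = microtime(true);
for ($i = 0; $i < ITERATIONS; $i++) {
    $value = $magic->kind;
}
$magicTime = microtime(true) - $start;

printf("property: %.4fs\nmethod:   %.4fs\n__get:    %.4fs\n", $propertyTime, $methodTime, $magicTime);
```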
@nikic darn... yeah, that's what I was worried about... that's worse than calling into a native extension. And just so I can fully wrap my head around this - why is it that __get and __set introduce the same memory overhead as a property, unlike normal methods? That was another surprising finding.
Also, re: native extensions... how much practical complexity would that introduce if we chose to go that route? We want the parser to be able to run on a variety of machine configurations, so just how difficult might distribution be, assuming we want to ship on Linux + Mac + Windows and support PHP 7.X? Are there usually a lot of breaking changes between versions of PHP, or is it relatively stable?
> why is it that __get and __set introduce the same memory overhead as a property, unlike normal methods?
Magic getters have per-property recursion guards, which means that if you're in __get('x') then an access to $this->x will access the actual property on the object instead of recursing. Accessing $this->y would still call __get('y'). Similarly, $this->x = 42 would still call __set('x', 42), as this is gated on both the property name and the magic method type. In PHP <= 7.0 this was managed using a hashtable from property names to flags specifying whether that property is currently being __get'ed, etc. (These entries are not removed after the magic method call, so the memory is not reclaimed.) In PHP 7.1 an optimization was added to avoid the hashtable in the most common case where there are no recursive magic method calls. In both cases the additional state is stored in an extra property slot, which is only allocated if __get etc. are declared on the class.
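The guard behaviour described above can be observed with a small script. This is a sketch with a made-up class name; it records which names reach __get, then reads both the guarded name and a different name from inside __get('x').

```php
<?php
// Illustrative class; the name and the returned strings are invented.
final class GuardDemo
{
    // Declared property, so writing to it never triggers magic methods.
    public $calls = [];

    public function __get($name)
    {
        $this->calls[] = $name;

        if ($name === 'x') {
            // Guarded: we are already inside __get('x'), so this read goes to
            // the real (here nonexistent) property instead of recursing.
            // The ?? operator supplies a default without raising a notice.
            $own = $this->x ?? 'no real property';

            // Not guarded: a different name re-enters __get.
            $other = $this->y;

            return [$own, $other];
        }

        return "magic $name";
    }
}

$demo = new GuardDemo();
$result = $demo->x;

var_dump($result);
var_dump($demo->calls);
```

The call log should contain 'x' once and 'y' once: the inner read of $this->x does not show up, because the guard for 'x' was already set when it ran.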
As to native extensions: for this type of extension (no interaction with engine internals), the API should be relatively stable for 7.x. If there are breaking changes, they will be minor.
For me, extensions usually work cross-platform without additional work. There is a service that builds extension DLLs for Windows, as Windows users generally can't compile extensions themselves (setting up a Windows extension dev environment is something of a PITA).
For a native extension my concern wouldn't be with the development side of things, but rather with the end-user side. Having a non-core extension dependency is a pretty big hurdle to adoption, so if you go down this route, it would be best to make the extension optional.
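One way to keep such an extension optional is to hide token storage behind an interface and pick the implementation at runtime. The sketch below assumes a hypothetical extension named 'tolerant_php_tokens' that would provide a NativeTokenStore class implementing the same interface; neither exists, and the interface itself is invented for illustration. Only extension_loaded() and class_exists() are real PHP functions here.

```php
<?php
// Hypothetical interface; the method names are assumptions for this sketch.
interface TokenStore
{
    /** Stores a token and returns its index. */
    public function add(int $kind, int $start, int $length): int;

    public function kind(int $index): int;
}

// Pure-PHP fallback that keeps each field in a flat array instead of
// allocating one object per token.
final class PhpTokenStore implements TokenStore
{
    private $kinds = [];
    private $starts = [];
    private $lengths = [];

    public function add(int $kind, int $start, int $length): int
    {
        $this->kinds[] = $kind;
        $this->starts[] = $start;
        $this->lengths[] = $length;

        return count($this->kinds) - 1;
    }

    public function kind(int $index): int
    {
        return $this->kinds[$index];
    }
}

function createTokenStore(): TokenStore
{
    // 'tolerant_php_tokens' and NativeTokenStore are hypothetical names.
    if (extension_loaded('tolerant_php_tokens') && class_exists('NativeTokenStore')) {
        return new NativeTokenStore();
    }

    // Without the extension, everything still works in plain PHP.
    return new PhpTokenStore();
}

$store = createTokenStore();
$index = $store->add(5, 0, 3);
echo get_class($store), ' stored kind ', $store->kind($index), "\n";
```

Users without the extension get the slower path but no install hurdle, and callers never need to know which implementation they received.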