tapioca Stop generating more methods than necessary

Motivation

We used to treat methods that didn't have a source location the same as methods that explicitly were not defined in the current gem. That resulted in Tapioca creating more method definitions than necessary.

We would only skip method generation for a method if the constant it was on was an ignored type (i.e. a built-in type), so that we wouldn't keep redefining methods for built-in types. However, for all other types, especially types that come from other gems, we would just keep on generating all the methods regardless of if they were defined by this gem or not.

Moreover, the source location check was happening at the wrong location, before unwrapping the method signature. Thus, many methods with signatures would not be generated when the previous problem was fixed, since our code would see them as being defined in Sorbet runtime.

Implementation

The fix is to return a more fine-grained result from method_in_gem? which signals yes/no/don't-have-source-location. Based on that we can skip generating don't-have-source-location cases if they are for built-in types and totally ignore the methods that have a source location and are definitely not defined in the current gem.

Additionally, if we try to unwrap the method signature and we get an exception, that means the signature block raised an error. If we continue with the method as is, the source location checks would think the method definition does not belong in the gem (since the method is still wrapped), and we would thus skip the method generation. To avoid that, the signature_for method is now raising a custom exception to signal that exceptional case, so that we can at least continue generating "a" method definition.

Tests

Updated existing tests.

Apr 23 '21 19:04 paracycle

giphy

Aug 09 '24 16:08 mutecipher

Unfortunately, while the CI is green, this work is still not complete, since the implementation ends up removing too many methods from generated RBI files.

In general there are 2 ways in which "gem B" can add methods to a constant Foo from "gem A":

"gem B" reopens Foo and adds a method to it:
```
# gem A
class Foo
  def foo
  end
end

# gem B
class Foo
  def bar
  end
end
```
This PR correctly attributes Foo#foo to "gem A" and Foo#bar to "gem B", and they are generated in the respective gem's RBI file. So, this use-case works properly today against this PR.
"gem B" does not reopen the Foo constant from "gem A" but defines methods on Foo in respond to some mechanism triggered by the loading of constant Foo. For example:
```
# gem A
class Foo
  extend ModuleFromB

  def foo
  end

  add_methods_to_me :bar
end

# gem B
module ModuleFromB
  def add_methods_to_me(*names)
    names.each do |name|
      define_method(name) { 42 }
    end
  end
end
```
In this case, the method Foo#bar will exist on Foo at runtime, but its source location will point to the file in "gem B". Thus, this PR will filter that method from the RBI file of "gem A". But, the method will not be generated in "gem B"'s RBI file either, since the Foo constant is never visited in "gem B". Thus, Tapioca ends up missing the definition of Foo#bar altogether.

To do this properly, we need to attribute such dynamic method definitions to the file require that triggered them, just like how we did with mixin tracking.

Unlike mixin tracking, though, this is much harder to do, since we can't intercept def calls and analyze the backtrace. It won't be enough to intercept define_method calls (even though it will take us some of the way there), since methods can always be defined via class_eval/module_eval/eval using def.

Aug 09 '24 17:08 paracycle

I've taken a stab at fixing the problem above. I have a failing test demonstrating Ufuk's use case and some half-working code. Here's my WIP PR, I'll return to it next week.

Aug 09 '24 21:08 egiurleo

Unlike mixin tracking, though, this is much harder to do, since we can't intercept def calls and analyze the backtrace. It won't be enough to intercept define_method calls (even though it will take us some of the way there), since methods can always be defined via class_eval/module_eval/eval using def.

Does hooking method_added solve that? @paracycle

Aug 11 '24 22:08 amomchilov

Unlike mixin tracking, though, this is much harder to do, since we can't intercept def calls and analyze the backtrace. It won't be enough to intercept define_method calls (even though it will take us some of the way there), since methods can always be defined via class_eval/module_eval/eval using def.

Does hooking method_added solve that? @paracycle

That is exactly what @egiurleo's PR in the comment above is doing, but in general I am afraid it will be brittle since you need one constant to forget to call super from their method_added implementation and you never get your one called.

Aug 11 '24 22:08 paracycle

This is great. Two questions:

Thanks for the review

Did we test this in Core? Does it pass type checking afterwards?

I triggered the build pipeline for testing against Core now.

Do you know if there is a performance hit or memory increase for tracking the methods?

There will definitely be a performance and memory hit, but I have no idea how much. We do extra operations for every method definition and we store their locations, etc, so this is certainly not free.

Jan 29 '25 19:01 paracycle

Ok, tested this against Core. The build pipeline was giving a lot of errors. I debugged against Core and found the problem and pushed a fix on this branch.

I also did time bin/tapioca gem --all timing for both the version of Tapioca on Core and this branch:

The current Tapioca version on Core:

bin/tapioca gem --all  2803.99s user 89.11s system 380% cpu 12:40.88 total

This branch:

bin/tapioca gem --all  2826.87s user 93.45s system 392% cpu 12:23.26 total

So, it seems like we use about the same amount of time to generate all gem RBIs on Core.

I am not sure if the timing diff is real, but if this branch is indeed faster, then I assume that would be because we collect all method definitions once when loading all the gems, and then we always check against that lookup table for any method. That somehow might be faster than looking up the source location for each method as we come across it.

Jan 30 '25 20:01 paracycle