jruby icon indicating copy to clipboard operation
jruby copied to clipboard

Fast builtin validation experiment

Open headius opened this issue 3 weeks ago • 10 comments

Experimental implementation of fast built-in method invalidation bits, similar to CRuby's implementation.

See #9119.

headius avatar Dec 08 '25 16:12 headius

First run on my modified code of your builtin speed up experiment @headius.

I haven't updated Jruby on my workstation yet, and im also not running on a beefy setup.

I added to the benchmark run the overhead stack of running wsl Ubuntu on win11 pro, so it could be calculated on a native linux machine.

Screenshot 2025-12-08 194121.png

Screenshot 2025-12-08 194136.png

Core changes:

Builtins.java (new): Core flags, mappings, and check methods. Static init for class/method maps. Expanded with all CRuby ops + Range ext. Added grouped checks for Integer/Float/String/Array/Hash/Range/etc.

RubyRange.java (modified):

include_p/member?: If checkRangeInclude(), try fastIncludeCheck (pure arith/comp for int/float ranges, using checkIntegerCompare/checkFloatCompare). Fallback to original impl.

  • cover_p: Similar fast path with direct comps (strings too, via checkStringEquals).

  • op_eqq (===): Reuses fast include for case-when opts.

  • min/max: Fast return begin/end if checkRangeMin/Max (handles exclusive ints with checkIntegerMinus).

  • Integration.java (snippets for existing files):

  • Ruby.java: Add builtinBits field + getBuiltinBits()/invalidateBuiltin() (calls Builtins.invalidate).

  • ThreadContext.java: Add public final builtinBits ref (init from runtime).

-RubyModule.java: Call invalidateBuiltin in putMethod (post-profile).

  • RubyObject.java: Swap old isBuiltin() to new checks in fastNumEqualInternal.

  • Benchmark_test.rb (new): Full script for repro. Includes baselines (int arith/method calls/array access) for WSL/native calibration, various include? cases, monkey-patch test, and raw comp baseline.

Testing/Verification

  • Benchmark confirms opts engage (times drop) and disengage on monkey-patch (via alias/redefine).

  • Specs: Core suite passes (ran locally). Added flag shouldn't break anything—it's opt-in per-method.

  • Edge: Handles exclusive ranges, floats/strings (in cover?), non-numeric (fallback), redefs on deps (e.g., Integer#< redefined trips compare checks).

I can drop the new code in a new repo for you to clone so you can test and evaluate between machine and env differences if you're satisfied with these results.

CufeHaco avatar Dec 09 '25 01:12 CufeHaco

@CufeHaco Yeah put the code somewhere and I'll have a look. I'm sure there's lots of places that CRuby has added these built-in method checks over the years, where JRuby only has them in a few specific places. Any places where we need to make dynamic calls to standard core methods would candidates.

headius avatar Dec 09 '25 22:12 headius

I just dropped the code into a new repo. Give me just a min and I'll post the link.

I also started a pure jruby win32 and .net compatible layer implementation last night if you want to start a newnthresd for it. It should take the headache of dealing windows for jruby. It shouldn't effect the core internals at all, just jruby <->windows api.

CufeHaco avatar Dec 09 '25 22:12 CufeHaco

@headius heres the link

https://github.com/CufeHaco/Jruby10-builtin-test

The benchmark.rb is ready to go.

CufeHaco avatar Dec 09 '25 22:12 CufeHaco

OH! My bad, you may need to comment out the wsl benchmark, i dont remember if I added a condition handler to it. I dont think i did. Otherwise its good to go.

@headius I apologize for thst oversight. Dad life and all.

CufeHaco avatar Dec 09 '25 22:12 CufeHaco

I'll start looking in

RubyFixnum.java RubyArray.java RubyString.java RubyHash.java RubyObject.java

And start searching this evening and see if I can get a punch out list for a debug report.

CufeHaco avatar Dec 09 '25 22:12 CufeHaco

@CufeHaco The implementation looks about right, but is there a reason you haven't done it as a diff against JRuby directly? I believe you are using some LLM tooling for this (which is fine), but eventually it will need to be a PR against JRuby. Might as well start from a PR and we can refine it from there. This is a pretty benign change to introduce, since it's mostly new code and the existing checks can be migrated incrementally.

The implementation of Builtins does seem to have expanded like I expected, so that much is probably good to go.

A good way to find places we're using the existing CallSite-based check would be to look for calls to DynamicMethod.isBuiltin.

headius avatar Dec 10 '25 07:12 headius

@headius Just to clarify, the repo at https://github.com/CufeHaco/jrubytest-jruby-jep380-prototype/tree/CufeHaco-JEP-380-full-prototype is already a fork of jruby/jruby with the changes applied to the actual source files (DynamicMethod.java, InterpretedIRMethod.java, etc.).

I'm still learning Java and GitHub's PR workflow isn't intuitive for me yet, so I've been working in an isolated fork where you can review the complete diff. The code is structured to drop into JRuby with minimal changes.

If the implementation looks right, I can work on opening a formal PR against jruby/jruby main - just need some guidance on the GitHub mechanics to make sure I do it correctly.

CufeHaco avatar Dec 10 '25 18:12 CufeHaco

Same with the jruby10 repo. I write it then hook it into jruby for testing and benchmarks, then I drop a repo (because I know I can do that) for you to review. That way I can get feedback on what and where to improve code and skill wise. Im learning this stack as I go @headius

CufeHaco avatar Dec 10 '25 18:12 CufeHaco

@headius the past couple of days ive been coding a benchmark scraper of the all the core java files and builtins. So far I got around 969 total points for optimizing. Here are the ones that I wanted you to see:

  • dig_misc - 188K ops/sec <- biggest win potential
  • write - 270K ops/sec
  • op_and - 423K ops/sec
  • each - 475K ops/sec
  • to_s - 487K ops/sec

The scraper works just like rubian and the rubytk patch. It dynamically goes though the core files and runs benchmarks on everything and saves the data.

I did a couple just like the original to get some numbers, and we are matching and or exceeding in some cases, with arrays, hashes, and strings.

Instead of hard coding and having to write in at each optimization, we could do just 1 Java file that hooks in.

What are your ideas and how would you like to proceed on the matter? Im sorting though the data I have now so you can review and compare.

Screenshot 2025-12-11 144449.png

Screenshot 2025-12-11 144541.png

Screenshot 2025-12-11 144612.png

CufeHaco avatar Dec 11 '25 20:12 CufeHaco

@headius the scraper works great, but I’m running into some friction getting the Java syntax exactly right in the codegen phase.

Conceptually I’m using the same base I sent you before, but expanding it via Ruby’s file handling and string interpolation. Ruby is only being used to author Java source. The runtime still sees normal static code.

For example, generating builtin flags like:

File.write(path, %Q^
    public static final int #{const_name} = 1 << #{bit};
^, mode: "a")

The idea is to avoid hand-coding or relying on LLMs: let Ruby scrape JRuby core, then expand the builtin speedups mechanically wherever they apply. The only tricky part is having the right Java templates so the generated code lands in the correct places.

After that, the remaining work is just peppering:

Builtins.invalidate(Builtins.INTEGER_EQ)

where appropriate.

I may not be explaining it perfectly. I’ve been pretty deep in JRuby internals and a bit burned out, but the intent is straightforward: Ruby writes the Java based on the java template we provide and ruby adds in the real JRuby constant/method names from the scraper, and JRuby runs it as static code. We can make this surgical.

CufeHaco avatar Dec 13 '25 21:12 CufeHaco

@headius Last night I watched one of your older presentations for ruby 9000 to give me a break from looking at code. It was the one you were talking about JVM pressure points, and issues you've had, and then you talked about the byte arrays. Thats when it hit me.

We can directly manipulate the byte arrays dynamically, and cache the patterns. This should also help with boot time as well if im correct. This is where I'm definitely going to need your expertise on the matter. This morning I dug i to it more, and threw my idea between some llms to get a concept going. The code here is pure llm generated for a blueprint of what im thinking, I just need to know the best way to implement the idea with the current builtin experiment as well as future experiments. Its based on my own personal algorithms ive been building, so I understand the logic, I just need to know where to use it. Again, im using this code example to try to convey my idea, as well as a personal blueprint to understand Java and the JVM better.

If you like the idea, would you mind coding a skeleton script? My biggest hurdle is writing Java syntax itself. I can fill in the logic if i have a blueprint to go by. Because this is llm generated, I cant trust the alignment with this ongoing PR. Its just me trying to deliver an idea. RBM is just a short hand i use for recursive bit/binary mapping. Even if it doesnt fit the current builtin experiment, i think its something worth exploring down the road.

package org.jruby.util;

import org.jruby.util.ByteList;

import java.util.concurrent.ConcurrentHashMap;

/**
 * Experimental byte-array pattern engine — build once, match many.
 * Core idea: precompute pattern structures from ByteList bytes for fast reuse.
 * Designed for boot workloads where the same patterns are checked repeatedly.
 */
public class ByteArrayPatternEngine {

    public static final int MAX_PATTERN_LENGTH = 62; // For bitap include?; longer fallback

    // Cache built patterns (immutable key)
    private static final class PatternKey {
        final byte[] bytes;
        final int hash;

        PatternKey(ByteList bl) {
            int len = bl.getRealSize();
            this.bytes = new byte[len];
            System.arraycopy(bl.getUnsafeBytes(), bl.getBegin(), bytes, 0, len);
            this.hash = java.util.Arrays.hashCode(bytes);
        }

        @Override public int hashCode() { return hash; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof PatternKey)) return false;
            return java.util.Arrays.equals(bytes, ((PatternKey)o).bytes);
        }
    }

    // Built pattern structure
    private static class BuiltPattern {
        final byte[] bytes;
        final long[] bitapMask; // null if too long or not needed
        final int length;

        BuiltPattern(ByteList bl) {
            this.length = bl.getRealSize();
            this.bytes = new byte[length];
            System.arraycopy(bl.getUnsafeBytes(), bl.getBegin(), bytes, 0, length);

            if (length <= MAX_PATTERN_LENGTH && length > 0) {
                this.bitapMask = computeBitapMask(bytes);
            } else {
                this.bitapMask = null;
            }
        }
    }

    private static final ConcurrentHashMap<PatternKey, BuiltPattern> PATTERN_CACHE = new ConcurrentHashMap<>();

    // Build or retrieve precomputed pattern
    private static BuiltPattern getBuilt(ByteList pattern) {
        if (pattern.getRealSize() == 0) return null; // special case
        PatternKey key = new PatternKey(pattern);
        return PATTERN_CACHE.computeIfAbsent(key, k -> new BuiltPattern(pattern));
    }

    // --- start_with? using built byte array ---
    public static boolean startsWith(ByteList haystack, ByteList prefix) {
        BuiltPattern built = getBuilt(prefix);
        if (built == null) return true;
        if (built.length > haystack.getRealSize()) return false;

        byte[] hayBytes = haystack.getUnsafeBytes();
        int hayBegin = haystack.getBegin();

        for (int i = 0; i < built.length; i++) {
            if (hayBytes[hayBegin + i] != built.bytes[i]) return false;
        }
        return true;
    }

    // --- include? using built bitap mask (RBM core) ---
    public static boolean includes(ByteList haystack, ByteList needle) {
        BuiltPattern built = getBuilt(needle);
        if (built == null) return true;
        if (built.bitapMask == null) {
            // Fallback for long patterns
            return haystack.indexOf(needle) != -1;
        }

        long state = ~0L;
        long activeMask = (1L << built.length) - 1L;
        long deadState = activeMask;

        byte[] bytes = haystack.getUnsafeBytes();
        int begin = haystack.getBegin();
        int end = begin + haystack.getRealSize();

        for (int i = begin; i < end; i++) {
            int b = bytes[i] & 0xFF;
            state = ((state << 1) | built.bitapMask[b]) & activeMask;
            if ((state & 1L) == 0) return true;
            if (state == deadState) return false;
        }
        return false;
    }

    private static long[] computeBitapMask(byte[] pattern) {
        int len = pattern.length;
        long[] mask = new long[256];
        java.util.Arrays.fill(mask, ~0L);

        for (int i = 0; i < len; i++) {
            int b = pattern[i] & 0xFF;
            mask[b] &= ~(1L << i);
        }
        return mask;
    }

    public static void clearCache() { PATTERN_CACHE.clear(); }
    public static int cacheSize() { return PATTERN_CACHE.size(); }
}

CufeHaco avatar Dec 14 '25 22:12 CufeHaco