javacpp

Struct memory allocation is slow

Open zakgof opened this issue 5 years ago • 9 comments

I ran a simple benchmark calling the Windows API's GetSystemTime through JavaCPP's built-in Windows API wrappers. The code allocates a struct, calls the native API, and reads a field from the struct:

		SYSTEMTIME systemtime = new SYSTEMTIME();
		windows.GetSystemTime(systemtime);
		return systemtime.wSecond();

Profiling shows that the first line takes >90% of the overall execution time.

I believe there is room for optimization here. The same thing implemented with BridJ or JNR outperforms JNI+JavaCPP simply because of faster allocation; see the benchmark at https://github.com/zakgof/java-native-benchmark.

For example, with BridJ, allocation takes <50% of the overall time.

zakgof, May 13 '19 11:05

Memory allocation with MSVC is known to be slow, that's not really JavaCPP's fault. JNA and BridJ don't use C++ to allocate memory. We could allocate memory the same way for JavaCPP with, for example, Pointer.malloc(), cast it to SYSTEMTIME, and that should be faster. Could you give that a try?

saudet, May 13 '19 11:05

The code below is indeed much faster:

		Pointer raw = Pointer.malloc(systemTimeStructLength); // precalculated as systemTimeStructLength = new SYSTEMTIME().sizeof()
		SYSTEMTIME systemtime = new SYSTEMTIME(raw);
		windows.GetSystemTime(systemtime);
		return systemtime.wSecond();

Now the question is: why not generate a struct's default constructor with Pointer.malloc instead?

zakgof, May 13 '19 12:05

We could, but it wouldn't be C++ :) I think Win32 doesn't throw C++ exceptions though, so we can probably speed this up with a @NoException like here: https://github.com/bytedeco/javacpp-presets/blob/master/mkl/src/main/java/org/bytedeco/mkl/presets/mkl_rt.java#L56

saudet, May 13 '19 12:05

Ah, no, we already have @NoException there. One other thing to be careful about on Windows: memory deallocation is excruciatingly slow when a lot of memory is allocated, so make sure to deallocate as soon as possible. In this case, the following deallocates right before returning:

try (SYSTEMTIME systemtime = new SYSTEMTIME()) {
    windows.GetSystemTime(systemtime);
    return systemtime.wSecond();
}

saudet, May 13 '19 12:05

SYSTEMTIME is a C struct (with no constructor), and I believe that library users would prefer a faster implementation in C over a slower one in C++.

I'd suggest modifying the parser to

  • precalculate sizeof() for a struct at parsing time and have sizeof() return the precalculated constant (currently it checks the members map, which is definitely slower).
  • if a struct (or class) has no constructors defined and no parent classes, implement the constructor of its Java counterpart with malloc. Or, at least, provide a new static method:

class SYSTEMTIME extends Pointer {

    private static final long STRUCT_SIZE = 16; // Calculated at generation time

    public static long sizeof() {
        return STRUCT_SIZE;
    }

    public static SYSTEMTIME malloc() {
        return new SYSTEMTIME(Pointer.malloc(STRUCT_SIZE));
    }
}
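
To make the constant in the proposal concrete: SYSTEMTIME is documented as eight 16-bit WORD fields (wYear, wMonth, wDayOfWeek, wDay, wHour, wMinute, wSecond, wMilliseconds), so its size works out to 8 × 2 = 16 bytes with no padding. Below is a minimal, stdlib-only sketch of what a layout "calculated at generation time" could look like. `SystemTimeLayout` and its members are hypothetical names, not JavaCPP API, and a direct ByteBuffer only stands in for natively allocated memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only: not generated by JavaCPP.
final class SystemTimeLayout {
    static final int WORD_BYTES = 2;   // a Win32 WORD is 16 bits
    static final int FIELD_COUNT = 8;  // wYear .. wMilliseconds
    // Known at generation time, so no lookup is needed at run time.
    static final int STRUCT_SIZE = WORD_BYTES * FIELD_COUNT;
    // wSecond is the seventh field, i.e. index 6.
    static final int W_SECOND_OFFSET = 6 * WORD_BYTES;

    // Stand-in for a malloc of STRUCT_SIZE bytes; Windows is little-endian.
    static ByteBuffer allocate() {
        return ByteBuffer.allocateDirect(STRUCT_SIZE).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Reads the unsigned 16-bit wSecond field at its fixed offset.
    static int wSecond(ByteBuffer systemTime) {
        return systemTime.getShort(W_SECOND_OFFSET) & 0xFFFF;
    }
}
```

The point of the sketch is only that both the size and the field offsets are compile-time constants for a plain C struct like this one.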

zakgof, May 13 '19 13:05

Before we start modifying everything just because MSVC allocation is slow, let's check how the try-with-resources version performs. It should work well enough.

saudet, May 13 '19 22:05

Actually, no, C++ allocation isn't the bottleneck at all here. It's the deallocator registration which is slow. Pointer.malloc() doesn't register any deallocator, so that's why it's fast.

saudet, May 16 '19 00:05

More than half the time seems to be spent by the garbage collector browsing through the doubly-linked list of phantom references. If that's the case, there might not be much we can do about this other than simply not rely on the GC at all. The JDK itself uses doubly-linked lists for its own use of phantom references: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/Cleaner.java https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java BTW, JDK 11 seems to be a lot better at this than JDK 8. Make sure to upgrade your JDK!
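
As a rough illustration of what "registering a deallocator" involves, here is a stdlib-only sketch using the public java.lang.ref.Cleaner (JDK 9+). It is not JavaCPP's implementation: `CleanerSketch`, `NativeBlock`, and the `released` counter are hypothetical, and the counter stands in for freeing native memory. Each `register` call creates a phantom reference that the cleaner has to track, and closing explicitly runs the action right away instead of leaving it for the GC to discover.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: not how JavaCPP's Pointer is written.
final class CleanerSketch {
    // One shared cleaner; every registration adds a tracked phantom reference.
    private static final Cleaner CLEANER = Cleaner.create();
    // Counts how many "native blocks" have been released.
    static final AtomicInteger released = new AtomicInteger();

    static final class NativeBlock implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        NativeBlock() {
            // The action must not capture `this`, otherwise the object
            // could never become phantom reachable.
            this.cleanable = CLEANER.register(this, released::incrementAndGet);
        }

        @Override
        public void close() {
            // Unregisters and runs the action at most once, on this thread.
            cleanable.clean();
        }
    }

    // Allocates n blocks and releases each one deterministically.
    static int allocateAndClose(int n) {
        int before = released.get();
        for (int i = 0; i < n; i++) {
            try (NativeBlock block = new NativeBlock()) {
                // use the block here
            }
        }
        return released.get() - before;
    }
}
```

Calling `CleanerSketch.allocateAndClose(1000)` releases all 1000 blocks before it returns, without waiting for a garbage collection cycle.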

saudet, Jun 12 '19 09:06

FYI, starting with JavaCPP 1.5.6, we can now skip all that overhead and get very low latency by setting the "org.bytedeco.javacpp.nopointergc" system property to "true", see https://github.com/tensorflow/java/issues/313.
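
A minimal sketch of setting that property programmatically follows. The class name `NoPointerGcConfig` is hypothetical, and the sketch assumes the property is read when JavaCPP's Pointer class is first initialized, so it would need to run before any JavaCPP class is touched. Passing `-Dorg.bytedeco.javacpp.nopointergc=true` on the java command line sidesteps that ordering question.

```java
// Hypothetical helper: only the property name comes from the comment above.
final class NoPointerGcConfig {
    static final String KEY = "org.bytedeco.javacpp.nopointergc";

    // Sets the property and returns the value now visible to the JVM.
    // Assumption: call this before the first JavaCPP class is loaded.
    static String enable() {
        System.setProperty(KEY, "true");
        return System.getProperty(KEY);
    }
}
```

With GC-based deallocation switched off, native memory presumably has to be released explicitly, for example with the try-with-resources pattern shown earlier in this thread.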

saudet, Dec 03 '21 00:12