
Design an immutable/mutable interface for Java arrays

Open rahulmutt opened this issue 6 years ago • 66 comments

Java arrays are currently mutable by default, which makes them awkward to work with. As suggested by @GordonBGood, it makes sense to adopt an interface similar to the array package: http://hackage.haskell.org/package/array-0.5.3.0/docs/Data-Array-MArray.html

rahulmutt avatar Jan 23 '19 06:01 rahulmutt

@rahulmutt, yes, I'm glad you see this as the way forward.

Would you like to work on this yourself or would you like to "farm it out"?

If you would like someone else to work on it, where do you see the changes being made? One way would be to further patch the array package we get from Hackage to redirect all the primitive calls to the equivalent Java calls, just as is currently done inside the Java.Array module. That may be the best way, as the array package could then be used optionally, just as it can be now, but backing it with Java arrays as primitives should make it faster.

If we do that, do you think there will be any differences in performance using the standard ST monad instead of the Java monad? Looking at the code I've managed to find so far, the primitives would appear to do the same thing.

If this works out as to speed, I really don't see a need for the Java.Array module at all other than for backwards compatibility, but of course FFI is still necessary and one could still create a JArray of objects just as shown there. However, as one can only use pre-evaluated lists (and primitives) to transfer information out of such arrays, I don't see them as very useful (as discussed previously in issue #934).

GordonBGood avatar Jan 23 '19 07:01 GordonBGood

I have my hands full with direct interop, so I personally won't be able to work on this anytime soon, but I would be happy to assist you in implementing it.

I think patching the array package is the way to go. If anything, the ST monad should be a tad faster: it takes and returns a void value, which compiles down to nothing, while the Java monad shuffles around a target object value that we have yet to optimize at the bytecode level.

Java.Array will still have its use. Direct interop will assign a canonical type signature for a Java method and we need to be able to assign an Eta type to a Java array. Perhaps we can add conversion functions that can translate the types from Java.Array to those types in the array package.

rahulmutt avatar Jan 23 '19 07:01 rahulmutt

@rahulmutt, okay, we are agreed on what looks to be the best way forward and I would like to contribute if I can.

I'll take your word for it that Java.Array still has its uses (other than backward compatibility). As to conversion functions, direct conversion might be somewhat difficult, as it is hard to produce types out from under the Java monad without either some "under-the-covers" magic trickery or some monad machinery such as transformers, interpreters, etc., and the latter aren't very good for performance. The fallback conversion method is to use what is already there: converting a JArray to an Eta list (not so good, as it is pre-evaluated while being produced from the last element), from which we can then make an IArray (and from there an MArray) using the list-based constructors.

It depends what use you see for the "old style" JArray as to what might be worth doing. If you see these facilities being used often, it might be better to do some primitive manipulation, copying or cloning the underlying Java array to/from the underlying store used by the array package. One question would be where to do that, as neither the Haskell array package nor the Java.Array module is loaded unless necessary - which is what makes an always-available default structure such as a list attractive as the conversion vehicle.

So, to get started, I modify the array patch? Is there an easy way to make changes to the patch, test them, and iterate using just etlas?

GordonBGood avatar Jan 23 '19 08:01 GordonBGood

If you do make the change to use Java arrays inside of array, direct conversions can be made pretty efficient by using unsafeCoerce correctly because the underlying representation will be the same regardless of whether it's immutable or mutable. I can take care of that part.

Instructions on how to submit a patch are here:

https://github.com/typelead/eta-hackage#patching-a-library

rahulmutt avatar Jan 23 '19 10:01 rahulmutt

In order to get started on this, I will need a little help from you, I'm afraid.

Although I've made quite a few changes and PRs on GitHub - forking the repo, cloning it locally, creating a branch, committing the desired changes to that branch, pushing the commits to my fork, then opening a PR - it seems that this patch process for GHC packages may be different.

If I understand correctly, the steps to proceed are as follows:

  1. I need to clone the appropriate GHC array version tag locally (it looks like 0.5.2.0 is as high as we can go, as 0.5.3.0 explicitly says support for GHC < 8.0 has been removed) in order to get a copy of the array module source code.
  2. After making a backup of the cloned local copy of the array package source, make changes to one of my local copies to add the capabilities we require.
  3. Create a "patch" file based on the "diff" between the original array package and my modified one, in yet another directory; I have found tools that do this.
  4. Then it looks like I can either use etlas with the option "--patches-dir=DIR" to tell it to get the patch from that new directory when loading a new version, or back up and change the patch file(s) in my local clone of the eta-hackage repo at "%USER_HOME%\etlas\patches\patches".
  5. Once I'm happy with how it works per testing, I submit a PR for the changed patch file(s) to the eta-hackage repo.

Do I have it right?

A few questions:

  1. How do I clean old versions of the patched array package from my local caches when I want to try something else? I see quite a few things related to it in my "%USER_HOME%" directory, including at "%USER_HOME%\etlas\eta-0.8.6.4\array-0.5.2.0-Cz0W86mUCERJBKUYgo5Y3t", and there are also ".tar.gz" compressed package files for each version of eta on my machine, such as at "%USER_HOME%\etlas\binaries\cdnverify.eta-lang.org\eta-binaries\eta-0.8.6.4\packages"; these contain the ".hi" compiled interface files for the packages. I'm afraid to just delete these, as there may be some database reference to them on which etlas depends, and deleting them might break the database.
  2. Is there a way to use an etlas command to clean these in order to load anew?
  3. BTW, how does one cause etlas to remove old versions of eta from the local installation? I assume it may be something similar.

If you can give me just a few pointers to get started, the actual work including testing isn't likely more than a few days, as the whole array package isn't that large and we will be changing only a small part of it.

Of course, the very easiest thing might be to work at an even lower level and just change the GHC.Prim hook-ups related to primitive arrays, but I understand from GHC.Base that no real file exists; these are hook-ups that the compiler just "does", and changing them, for instance for the primitive byte-array ops, would require bigger changes. If this were possible, it might mean very few patches would be required at all. But maybe that wouldn't be so good, as the Haskell arrays only deal with byte arrays and handle different primitive sizes "primitively" by unsafe casting, so it wouldn't work so well. I'll start with the patch approach.

GordonBGood avatar Jan 23 '19 11:01 GordonBGood

Here are the steps:

etlas get array

will get the latest version of array supported by Eta. Etlas will initialize a Git repository committed at the state at which the package was downloaded from Hackage, then apply any existing patch and stage the changes for you automatically (just run cd array-0.5.2.0 && git status to observe it).

Now all you have to do is work on your changes, and once you're satisfied, just commit them and use the git format-patch command as stated in the steps on eta-hackage to create a patch file for your changes.

Now, since array doesn't have a test suite, you'll probably want to run your own programs to verify your patch works. You can do so by creating a new Etlas project, adding a dependency on array in your .cabal file, and creating a cabal.project file like so:

packages: . $PATH_TO_ARRAY

where you replace $PATH_TO_ARRAY with either an absolute or relative path to the folder where your modified array exists. When you run etlas build, it will automatically pick up the local version of the package and use it instead.

rahulmutt avatar Jan 23 '19 12:01 rahulmutt

Any time you make changes to your local array and run etlas build in your test project, it will automatically recompile the changes and use the up-to-date version.

Also, we did consider using byte[] directly when implementing primitives in GHC.Prim, and it didn't work out: the way the primitive APIs are designed does not allow us to. In the very early days we used ByteBuffer's as the underlying representation of Ptr and it led to bad memory leaks, which is why we worked on a sophisticated emulation layer instead.

rahulmutt avatar Jan 23 '19 12:01 rahulmutt

@rahulmutt, thanks for the steps; with the help of those and the instructions in the Readme file I am on my way.

Your etlas build system is actually quite amazing for a project in its early stages, and Eta's ability both to interface with Java libraries and to use Haskell ones almost directly is a very large wealth of resources; I can very much see why you chose to go the Haskell route. This is a huge conglomeration of resources!

I quite see how ByteBuffers wouldn't work after considering the details of how the Haskell code would see them and how they would have to support Haskell requirements. I can also see that, while no doubt better, this second evolution of using copy operations to marshal data between Java buffer areas and Haskell-expected data isn't going to perform all that well, just as experienced. I really think that this proposed solution should be the very best, in that the Java bytecode will deal with arrays just as it is most tuned to do, while the higher-order Eta/Haskell code still sees arrays just as expected. The only thing that still won't work very well is low-level Ptr operations, but those were never going to be very fast, and if we do support them in some form it won't be any worse than the current implementation.

Now to get to work and just get it done.

GordonBGood avatar Jan 23 '19 13:01 GordonBGood

@rahulmutt, I'm into it, and see what you've done, but need a brief discussion, although I don't want to waste your time.

I perused the current state of the code and made a preliminary plan, then slept on it as that produces the best results.

That produced the following results: we can wrap conventional Java arrays inside IArrays and MArrays with one major, likely deal-breaking limit - one can't do castSTUArray without a copy, and a copy would destroy the reason for using this unsafe routine: treating primitive data as some other kind of primitive data. One could perhaps overcome that, at a cost in performance, with some sort of copy management that would make unsafeFreeze no different from regular freeze, either in all cases or just in some (when the array to be frozen has been cast to a type different from the original). This method would still be an option, and would bypass all of your current primitive operations altogether. However, the copy operations (when required) wouldn't be very fast, as we would likely have to do them by manually deconstructing the primitives and rebuilding them for each possible conversion.

Let's hold that as a possibility and look into the alternatives:

  1. You have already pursued direct use of byte[] and found it not to be satisfactory, and I can see that wouldn't work any better and likely worse than the above plan. In fact, since we are completely bypassing primitives here, it could be made to work here where it didn't before, but there is no advantage over the new plan.
  2. You said you tried the use of ByteBuffer but had problems with memory leaks, without providing many details. One criticism I might make of your memory-managed ByteArray emulation class is that the direct memory it wraps is pinned, where I see no need for that: although the Haskell primitives do allow for a pinned ByteArray#, it is usually only used for the rare case of low-level C-like pointer operations, which we would never use in Java. Therefore, I don't see the need for the allocated memory to be pinned. Why did/do you think it needs to be? If you did it to fulfill the requirements of the Haskell pinned-ByteArray primitive, then your ByteArray emulation could be used only for that (which won't ever be used for Java), and something else used where a pinned ByteArray is not needed, as the Haskell primitives make an exact distinction between the two.
  3. If allocated memory doesn't need to be pinned, then why allocate it outside of the Java Heap at all? It would be much easier to just let the JVM handle memory with its own GC, at which it is very efficient.
  4. If we are no longer emulating memory management, then the ByteBuffer may become a very good option, as it has been tuned for performance in the JVM and offers the required facilities to get/put primitives of different byte sizes.
  5. If we can use ByteBuffer, then we don't need the inefficient copy operations at all.
  6. By patching the Data.Array.Base module to bypass your primitives, we don't actually have to change the primitives right away or ever, so this can easily be experimented with.
  7. It seems to me that using ByteBuffer by patching is likely a worthwhile experiment unless you see some reason it can't work.
  8. Relative performance between ByteBuffer and Java primitive arrays can be determined beforehand without doing any changes/patches to the modules by a set of benchmarks using each.
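The ByteBuffer mechanism in points 4-8 can be sketched in plain Java: one byte[]-backed buffer serves as the store, and a typed view gives primitive-sized access to the same bytes. This is only an illustration of the mechanism - the class and method names are invented, not part of any Eta API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

// Sketch: a heap-backed ByteBuffer as the store for an unboxed array,
// with a typed view for primitive-sized get/put (names are illustrative).
public class ViewDemo {
    public static long roundTrip(long value) {
        ByteBuffer store = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        LongBuffer longs = store.asLongBuffer(); // view shares the backing byte[]
        longs.put(0, value);                     // write through the long view
        return store.getLong(0);                 // same bits via the raw byte view
    }
}
```

Because the view shares storage with the raw buffer, changing the view type is what makes a castSTUArray-style reinterpretation cheap.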

So, barring any response from you that you know something different, my plan forward is as follows:

  1. Spend a day or so doing relative benchmarks between ByteBuffer and Java primitive arrays including the relative cost of copy operations between different types.
  2. Depending on which gives the best general performance, bypass all of your primitives by patching the array package to use whichever seems best.
  3. The code will almost certainly not use your ByteArray emulation, and will only do low-level pointer copy operations for the "handle" operations (which, again, will never be used, as they are meant for direct interfacing with C-type operations). Since your MemoryManager already deals with ByteBuffer, this should directly fulfill that contract (even though it will likely never be used).
  4. When this is finished, including testing and benchmarking, you might consider changing some of your primitives to support this directly. If that were done, the patches required would go back to the current very minimal set.

GordonBGood avatar Jan 23 '19 23:01 GordonBGood

The plan seems good. I agree that implementing them directly vs using the primitives is the way to go for performance.

To clarify, I said that using ByteBuffer as a direct representation of Addr# (the primitive underlying Ptr a) caused problems because, to preserve referential transparency in all the Addr# operations, we had to duplicate the ByteBuffer, which amounts to an allocation on every pointer addition, subtraction, or multiplication - clearly unacceptable.

To solve that problem, we changed the underlying representation of Addr# to Java's long, which represents a memory address, and worked on a MemoryManager that provides operations to retrieve values given just a long. The MemoryManager even provides an API to retrieve a ByteBuffer that represents the entire allocated region pointed to by the long. Pointer operations became super cheap, at the cost of pointer dereferences becoming slightly less efficient (think a couple of bitwise operations plus array accesses). Bulk operations on memory are still just as efficient.

A ByteBuffer is just an interface that allows reading and writing bytes in a structured way. You can wrap a byte[] into a ByteBuffer, or you can wrap native memory into one via ByteBuffer.allocateDirect. The MemoryManager supports both and has distinct address spaces for heap-managed memory and direct native memory, depending on whether unpinned or pinned memory was requested. It takes care of keeping track of free regions and responding to allocation requests accordingly, reusing free regions when it can. Just think of it as a malloc implementation on the JVM.
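The long-address scheme described above can be sketched hypothetically: the high bits of the long select a block, the low bits an offset within it, so pointer arithmetic is plain long arithmetic and a dereference is a couple of bitwise operations plus array accesses. Block size, layout, and all names here are invented for illustration and differ from the real Eta MemoryManager:

```java
// Hypothetical sketch of a long-as-address scheme (not the Eta implementation).
public class AddrSketch {
    static final int BLOCK_SHIFT = 12;                     // 4 KiB blocks (assumed)
    static final long OFFSET_MASK = (1L << BLOCK_SHIFT) - 1;
    static final byte[][] blocks = new byte[16][1 << BLOCK_SHIFT];

    // Dereference: two bitwise ops plus two array accesses.
    static byte load(long addr) {
        return blocks[(int) (addr >>> BLOCK_SHIFT)][(int) (addr & OFFSET_MASK)];
    }
    static void store(long addr, byte b) {
        blocks[(int) (addr >>> BLOCK_SHIFT)][(int) (addr & OFFSET_MASK)] = b;
    }
    // Pointer arithmetic is plain long arithmetic - no allocation at all.
    static long plusAddr(long addr, long off) { return addr + off; }
}
```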

rahulmutt avatar Jan 24 '19 04:01 rahulmutt

@rahulmutt, it looks like using ByteBuffer and primitive-sized views into it adds only very little overhead compared to using primitive Java arrays directly, and it makes it easy to support casting (via castSTUArray) to another primitive type.

It is somewhat complicated to sync writes to various STUArrays that have been cast to other primitive types and may each be modified at different times, if they were based on standard primitive Java arrays, even though this use is likely to be fairly rare in Java. If we have "dirty" flags and a backing store (likely just a ByteBuffer), checking and syncing these stores will add more overhead than we would have gained by not using ByteBuffer in the first place, and the code would be quite complex.

The best thing I can come up with, in order to use primitive Java arrays directly, is to define a wrapping data type replacing ByteArray#. It would be an optional type: one of the Java primitive arrays used directly unless the STUArray has been cast, at which point it changes to a view of a ByteBuffer of the same primitive type, with all subsequent casts or copies using the ByteBuffer option. In the common case the primitive arrays are used directly; in the less common case of being cast, there is the overhead of going through the shared embedded ByteBuffer reference until the array is frozen, at which point the ByteBuffer is copied back to a regular Java primitive array. This shouldn't have an execution cost if the array is never cast, as the compiler should recognize its unmodified state and apply the direct access methods as usual; there would only be a cost in the uncommon cast case. However, I'm worried about avoiding a run-time check on whether the array has been cast before every data operation, which would add overhead and cancel the benefit.

This wouldn't affect IArrays, which can't be cast and can therefore use standard Java arrays directly; the underlying ByteBuffer would already have been copied back to a standard Java primitive array on freeze if the creating STUArray had been cast.

If we went that way, we would have to add documentation advising against using castSTUArray without considering the performance cost. One of the most common reasons for its use is to improve processing speed - for instance, doing popcount 64 bits at a time instead of 8 - and this may or may not still be effective; another is to simulate simple byte-packed record operations without defining a new STUArray instance, and again this may or may not be effective.
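The 64-bits-at-a-time popcount case just mentioned can be illustrated in plain Java, with a ByteBuffer long view playing the role of the cast. A sketch under the assumption that the byte array's length is a multiple of 8 (names are invented):

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

// Counting set bits byte-wise vs. 64 at a time through a long view
// of the same bytes - the Java analogue of a castSTUArray use case.
public class PopCountDemo {
    static int popCountBytes(byte[] bits) {
        int n = 0;
        for (byte b : bits) n += Integer.bitCount(b & 0xFF);
        return n;
    }
    static int popCountLongs(byte[] bits) {   // assumes bits.length % 8 == 0
        LongBuffer view = ByteBuffer.wrap(bits).asLongBuffer();
        int n = 0;
        for (int i = 0; i < view.limit(); i++) n += Long.bitCount(view.get(i));
        return n;
    }
}
```

Both methods count the same bits; whether the long-view version is actually faster on the JVM is exactly the kind of question the planned benchmarks would answer.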

I would like to have a "raw" way of doing fast memory copy operations as a static "copymem" method, but although the DotNet CLI/CLR has an unsafe Marshal class that provides this, I don't know of anything equivalent in Java.

In the end, since use of ByteBuffer views works so well, I say "why go to all of this complexity?".

My current plan, once I finish a little more performance testing, is to implement the array patches using only ByteBuffer to see where we are as to speed, then, once that is working, to consider refactoring the common cases that don't need ByteBuffer to use direct Java primitive arrays, while retaining the ByteBuffer code for the "cast" cases that need it. If supporting both casting and direct Java array use has a performance cost (or, as I anticipate, little performance gain), we will have to evaluate which road to take then.

This has gone in a considerably different direction than I had thought when I started!

GordonBGood avatar Jan 24 '19 07:01 GordonBGood

Thanks for the progress update. At the end, we'll let the benchmarks decide which way to go.

In terms of efficient copying, System.arraycopy is what you're looking for. Unfortunately, ByteBuffer doesn't appear to have an equally efficient API.
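A minimal illustration of the System.arraycopy call mentioned above (the helper name is invented):

```java
// System.arraycopy is an intrinsic bulk copy between arrays whose
// component types match - here, copying a sub-range of a long[].
public class CopyDemo {
    static long[] copyRange(long[] src, int from, int len) {
        long[] dst = new long[len];
        System.arraycopy(src, from, dst, 0, len);
        return dst;
    }
}
```

Note the same-component-type restriction discussed below: it copies long[] to long[], but cannot reinterpret a long[] as a byte[].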

rahulmutt avatar Jan 25 '19 02:01 rahulmutt

@rahulmutt, yes on letting the benchmarks decide...

I had found System.arraycopy, but it can only copy between arrays of the same type, so a cast copy would require manual conversion. The DotNet CLI/CLR Marshal class has a very low-level byte copy from pointers (IntPtr), meant for interop with C code, which drops right down to the lowest level but is very fast.

I think ByteBuffer will work out quite well for the UArrays, and without all the complexities: with it there won't be much copying required, as one can just change the view type, and when copying is required it would be of the whole structure, for which there are quite a few tools.

To get this done, I've had to bypass all the primitives to do with Array#, ByteArray# and MutableByteArray#, which means replacing quite a lot of the functionality of GHC.Arr. In the interest of saving compile time, I've switched to developing the interface outside of the array-0.5.2.0 folder; I'll test it and then just patch it in using the procedure you recommend. Seeing how buried this is in GHC.Prim makes me realize how much easier it would be if the primops were changed to this new interface - retaining your emulation layer when working with Addr#, pinned arrays, etc., and using the new approach for the better-than-ten-times speed increase otherwise. Anyway, we'll see what you think when it's done.

GordonBGood avatar Jan 25 '19 02:01 GordonBGood

You can try sun.misc.Unsafe, which is probably the closest equivalent to Marshal. The Eta runtime already includes a utility to fetch an Unsafe instance, which you can probably re-use. It unfortunately does not have clear documentation because it's not supposed to be used (though many widely used Java libraries do use it), so you'll have to read the source or browse the web for blog posts.

You can actually define a subclass of ByteBuffer that keeps track of the underlying type via an enum or an int, and use the subclass throughout. If the type checks can be pushed to the Java side, it might be more efficient.
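One wrinkle: ByteBuffer's constructors are package-private, so user code outside java.nio can't subclass it directly; a small wrapper class carrying the element-type tag achieves the same effect of pushing the type check to the Java side. A sketch (all names invented):

```java
import java.nio.ByteBuffer;

// A type-tagged buffer: the tag records the element type the raw bytes
// are currently viewed as, so checks run on the Java side.
public class TaggedBuffer {
    public enum Elem { BYTE, INT, LONG, DOUBLE }

    public final ByteBuffer buf;
    public final Elem elem;

    public TaggedBuffer(ByteBuffer buf, Elem elem) {
        this.buf = buf;
        this.elem = elem;
    }

    // Element width in bytes, resolved by the tag.
    public int elemSize() {
        switch (elem) {
            case BYTE:   return 1;
            case INT:    return 4;
            case LONG:   return 8;
            case DOUBLE: return 8;
            default:     throw new AssertionError();
        }
    }
}
```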

Yes, I'm open to adding new primitives to the compiler - I just need to get an idea of the type signatures required, so let's see how that pans out first. In the long run it will be more efficient, since the compiler will treat it as an intrinsic and directly write out the bytecode, avoiding the minor overhead of a foreign call.

rahulmutt avatar Jan 25 '19 04:01 rahulmutt

@rahulmutt, if we were to use the sun.misc.Unsafe class for fast byte copies, isn't there some risk that it might not be available on some platforms, especially more specialized ones such as Dalvik or whatever Google eventually uses to replace it? Also, there is some risk that it will change across platforms, as it isn't part of the specification. It appears to be difficult to use and I would rather not go that way. So far, hoping the use of ByteBuffer works out, there hasn't been any need for copies that can't be done very quickly just by cloning the backing byte[] store, which the JVM itself optimizes and is thus very efficient.

As to adding subclasses of ByteBuffer tagged by element type: since I am using ByteBuffer as the backing store only for the "unboxed" array types, and since there are already instances for each UArray type, the Eta compiler already knows the type and the parameter types of the functions that work with it, so I don't see a need to add anything. Haskell/Eta have stronger type checking than Java, and since that checking is done at compile time, it won't affect performance (your suggested method could actually cost some performance in checking the buried type tag, as if it were a dynamic type). I'll leave this one alone for now, too.

I am already using your primitives when I can: I use the primitives to do with jobjectArray# for working with the Java Object[] array I use with boxed Arrays, and would like to have equivalent primitives to work with ByteBuffer more directly (there will be considerably more primitives to support ByteBuffer as it does so many more things!). I agree with you that this should make it even more efficient.

As to progress, I am coming close to completing work on the Boxed versions of the arrays, but the Unboxed versions may take a little longer since I'll have to invent more things as to the interface for ByteBuffer.

GordonBGood avatar Jan 25 '19 06:01 GordonBGood

@rahulmutt, as to primitives, the very best way would be to make the "standard" Haskell/Eta primitives just do what we are now doing on top of them, or in bypassing them; that way, the compiler would emit Java bytecode for the whole works and it should definitely be faster. It also has the advantage that more Hackage packages won't need to be patched at all, but will work straight out of the box.

GordonBGood avatar Jan 25 '19 06:01 GordonBGood

@rahulmutt, I am wondering if you can help me:

I've emulated primitives and implemented arrays using standard Java Object arrays, using your jobjectArrayNew#, jobjectArraySet#, jobjectArrayAt#, and alength# primitives, thinking that they access the low-level bytecodes and should be faster. Everything type checks and compiles except for one thing: when object arrays are first created, they are filled with arrEleBottom, which is purposely untyped so as to be usable with any type of array. With real primitives, type checking would be ignored for these and everything just works, but with my emulated primitives, written in Eta calling your primitives, everything must type check; these primitives expect definite object types, which leaves ambiguity and confusion over what type of object array to create.

There are a few solutions, though none ideal:

  1. Use FFI-called Java code to write these out as plain untyped raw Objects, which would be compatible with any subtype.
  2. Use the reflection interface through the Array static type to put this in place on initialization, but I'm worried the reflection might be slow.
  3. Use a Java interface of some sort to fill the Object array with nulls (actually, I believe that's automatic on creation) and then set the standard error message when a null is detected, but that requires exception processing through Java.
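Options 1 and 3 above can be approximated without reflection: a freshly allocated Object[] is already null-filled by the JVM, and java.util.Arrays.fill can install one shared untyped sentinel playing the role of arrEleBottom. A sketch (names invented, not the Eta implementation):

```java
import java.util.Arrays;

// A new Object[] starts out all-null; Arrays.fill installs one shared
// untyped sentinel object, compatible with any element subtype, with no
// reflection and no per-element type checks.
public class BottomFill {
    static final Object BOTTOM = new Object() {
        @Override public String toString() { return "undefined array element"; }
    };

    static Object[] newArray(int n) {
        Object[] arr = new Object[n];  // elements are null here
        Arrays.fill(arr, BOTTOM);      // one shared sentinel for every slot
        return arr;
    }
}
```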

None of these seriously bother the work if they aren't too slow, but they are a concern.

The other thought that just came to me is that although unboxed primitive arrays are slow because they use the memory-allocation emulation interface, there would be no need to emulate that interface for normal boxed Arrays, so they may be of normal speed. I haven't gotten around to testing this, but I'm sure you remember how they work and can advise me on it.

For now, I'm not going to fill the arrays and just leave them full of nulls by default and will worry about this later.

GordonBGood avatar Jan 25 '19 14:01 GordonBGood

@rahulmutt, I've done some more testing and believe your "primitives" for boxed arrays just use your Java.Array module, so they should be about the same speed as Java using Object arrays, which (of course) is slower than using unboxed primitive arrays - I measure about 6.5 times slower.

So I'll just let normal arrays default to the normal behavior. However, testing has revealed a problem in your Java.Array module: due to the way your Java monad is defined, it breaks the compiler's ability to do proper Tail Call Optimization (TCO), and the resulting stack of thunks causes a stack overflow (SO) for code that would normally run. I'm not sure I'm equipped to handle this as part of my project here, and you said your current project is improving the FFI, so perhaps you could look into it? Perhaps it is just a lack of rewrite rules for your monad as compared to the usual standard Haskell ones? Or are there differences in using runRW# and needing to run extra transformers when operating under the extra level of the ST monad? I am thinking the "primitives" shouldn't be dealing with other monads or runRW# at all, as they already track mutability through the ST monad system.

I think that the entire FFI system needs to be overhauled, but you know that and that is what you are doing.

To cause the SO, just change my primes benchmark to use STArray instead of STUArray, along with the change to use runSTArray instead of runSTUArray. The expected behavior is the same result as before, just slower, but I think you'll find an SO.

I'm going to set this part of the project aside and work only on the unboxed arrays.

GordonBGood avatar Jan 25 '19 22:01 GordonBGood

@GordonBGood If you can file a separate issue with a minimal reproduction example, I can look into the SOE. The runRW# is to prevent the compiler from lifting the computation out which caused weird bugs in the past (#171).

The FFI system is being overhauled right now as #647 is being implemented.

rahulmutt avatar Jan 26 '19 05:01 rahulmutt

@rahulmutt, re: the SOE occurring when calling arrays through your primitives that hook into the Java.Array module - I'm starting to think it may be a manifestation of the same problem I had before in issue #934, where the latest version 0.8.6b4 changed strictness behavior and the SOE occurred due to stacked evaluation thunks, and I am trying to prove it one way or the other. If it is something new, I'll open a new issue, but if it is the same problem, I'll re-open that one. At least you have been made aware that there is something...

Yes, I see the need for something like runRW# but wonder if there might be another way. I'll try to investigate...

For the record, I did test the use of the reflection interface Array.set/get and as for most uses of reflection, it is dreadfully slow and not useful for our application.

I've concluded that the best option for IArray's is the primitives you already have and will just work with you in testing to see what their limitations or problems are.

So this project is now just to improve the Unboxed versions of UArray/STUArray.

GordonBGood avatar Jan 26 '19 06:01 GordonBGood

@rahulmutt, as to the strictness problem I had before: I can't repeat it in a small few-line example, and it only occurs when I use it as a benchmark where the counting is backed by the construction of a data structure that may be lazy unless forced, so I have held off reporting an issue. In fact, the reason I closed the issue before was that the behavior is essentially the same as it would be in Haskell, and therefore not really an issue, just unexpected. I could open another issue mentioning it and posting the code that causes it if you would like, but I don't want to over-report issues that aren't really our concern.

As to this second evaluation problem, where using STArray is less strict than using STUArray: I just checked and it's the same in GHC Haskell, so we can't fault Eta there either. It does make it harder to deal with the STArray type unless we "cheat" and use the Strict extension, though.

So back to the main problem: while I was checking out the above, I ran a few more benchmarks, and it turns out that even though you may be using Java object arrays under the covers of the "primops", Eta's use of STArray is still slower than Java's use of object arrays by a factor of six or more, so it still needs some work. And I reconfirmed that Eta's use of STUArray is over ten times slower than Java's use of primitive arrays, which is even worse.
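For reference, the Java side of such a comparison is just a tight loop over a primitive array, which HotSpot compiles to plain indexed memory operations. This tiny sketch is not the actual benchmark code, only an illustration of the kind of baseline that STUArray is being measured against:

```java
public class PrimitiveBaseline {
    // Fill and sum a primitive long[]: no boxing, no indirection,
    // just straight-line indexed loads and stores after JIT compilation.
    static long fillAndSum(int n) {
        long[] a = new long[n];
        for (int i = 0; i < n; i++) a[i] = i;
        long sum = 0;
        for (int i = 0; i < n; i++) sum += a[i];
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(fillAndSum(1000)); // prints 499500 (0 + 1 + ... + 999)
    }
}
```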

I'm back to the original plan, which may take a little more time than expected...

GordonBGood avatar Jan 26 '19 09:01 GordonBGood

@rahulmutt, on the subject of your implementation of pinned Addr# using a wrapped version of ByteBuffer: why do you need to emulate a pinned ByteArray, or memory outside of the Java GC'ed heap, at all? The only reason I can see is JNI, for interfacing with C-like languages, which is very slow and seemingly meant for "bulk" compatibility where large increments of work or data cross between Java and those languages, as Java itself has no other provision for direct references to memory that I know of. Does Eta support JNI?

The reason I ask is that I am using this project as a test case to try to convince you to replace all the primitives for Array#, ByteArray#, MutableByteArray# (and likely the related SmallArray# and ArrayArray#) with my new, faster primitives, and the only relationship between Addr# and these is that one can obtain an Addr# with byteArrayContents# and then use it to access the byte array. The note says that the byte array should be pinned, but in this case the Addr# is just a byte index into the byte[] data store, so it wouldn't need to be.

As I said, one would need pinned addresses if one were passing them to JNI, plus a means to access "pinned" arrays passed in from JNI (if they weren't copied in), but that would be it, and it could be handled separately by having a couple of types of Addr# if necessary, even if that were not so efficient, since JNI isn't very efficient itself.
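To make the idea concrete, here is a minimal sketch of an "address" that is nothing but an offset into an ordinary GC-managed byte[], with no pinning required; the names (HeapAddr, readLong, writeLong) are mine for illustration, not Eta's API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeapAddr {
    private final ByteBuffer store; // wraps a plain byte[]; the GC may relocate it freely
    private final int base;         // the "address": just a byte offset into the store

    public HeapAddr(byte[] backing, int base) {
        this.store = ByteBuffer.wrap(backing).order(ByteOrder.LITTLE_ENDIAN);
        this.base = base;
    }

    public long readLong(int off)          { return store.getLong(base + off); }
    public void writeLong(int off, long v) { store.putLong(base + off, v); }

    public static void main(String[] args) {
        byte[] heap = new byte[64];
        HeapAddr p = new HeapAddr(heap, 16); // akin to byteArrayContents# yielding offset 16
        p.writeLong(0, 0xCAFEL);
        System.out.println(p.readLong(0)); // prints 51966
    }
}
```

Because the "address" is only an index, the GC is free to move the backing array, which is exactly why pinning becomes unnecessary outside of JNI.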

So, can Eta use JNI?

GordonBGood avatar Jan 27 '19 01:01 GordonBGood

Honestly, I just wanted to preserve semantics in the beginning; then the goals changed to doing what's right for Eta and deviating whenever it made sense. This may be one of those cases where it's useful to deviate. We do want to add support for JNI at some point in the future, but it's not a huge priority right now.

Try this experiment and see how it affects performance:

You can disable all uses of native memory by changing this line: https://github.com/typelead/eta/blob/master/rts/src/main/java/eta/runtime/io/MemoryManager.java#L59 to

long address = globalManagedHeap.allocateBuffer(n, false, Capability.getLocal());

and then running eta-build rts-clean so that the build system can recognize the new RTS. You need to build eta from source to be able to do this if you haven't already.
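As a standalone illustration of what that boolean toggles (this snippet is plain NIO, not the MemoryManager itself): a direct buffer lives outside the GC heap, while a heap buffer is backed by an ordinary byte[]:

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // off-heap, never moved by the GC
        ByteBuffer heap   = ByteBuffer.allocate(1024);       // backed by a byte[] on the GC heap

        System.out.println(direct.isDirect()); // prints true
        System.out.println(heap.isDirect());   // prints false
        System.out.println(heap.hasArray());   // prints true: an accessible byte[] backs it
    }
}
```

Direct buffers tend to win for bulk I/O handed to the OS, while heap buffers avoid the allocation cost and the JNI-style boundary on every access, which is the trade-off this experiment is probing.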

rahulmutt avatar Jan 27 '19 02:01 rahulmutt

@rahulmutt, I know how projects like this go, and how you BDFL(s) are so buried in it that sometimes things get implemented just because they happen. What I bring to projects is the ability to step back, look at the low-level implications, and envision ways to improve them. I'll likely never get to your level of expertise at the high level of type families, type classes, monadic code and all the other sub-variations, the implications of Semigroups, knowledge of all the available GHC extensions, etc., etc. :), but I have 50 years of experience at the low level with some (enough, maybe) knowledge of the higher abstractions.

No, I haven't built eta yet but I'll get around to it after this project. Or are you suggesting that building eta without direct memory support might fix all the performance woes and make this project unnecessary? I've cloned the Eta source files on my machine so I have a local copy already...

I'm amazed at what you have accomplished in, what, a year and a half or two? The build system is quite polished, and the documentation isn't bad, although it perhaps needs some updating (it also builds on the Haskell documentation, so you are lucky that way)...

Anyway, my main interest is performance so back to work!

In thinking more about the performance implications of turning direct memory off: although it will make SOME difference (which is confirmed by the Java docs), I don't think it is the major cost; the many nested function calls in the indirection are likely the bigger one.

My main question for today is whether your Array# primitives use the same emulation as the ByteArray# ones, or whether they call Java object arrays directly. You can likely answer that very quickly without me spending more time digging through your code.

If you don't use them directly, that probably explains the huge performance loss for Array# as well as ByteArray#. I'm building my mutable primitives inside the ST monad instead of the Java one to avoid the extra indirection, and it will be up to the programmer to avoid shared modification of unsafely thawed arrays by separate threads, just as in Haskell. If I'm right...

GordonBGood avatar Jan 27 '19 05:01 GordonBGood

We re-use existing stuff from the Haskell community when it doesn't make sense to re-invent the wheel, and we make tweaks to take the Java ecosystem into account. Etlas is just a fork of Cabal, which is why it's so smooth. Another contributing factor is that it is incredibly easy to debug any weird bugs since the JVM provides great introspection mechanisms. GHC, on the other hand, has to debug native binaries, which is a lot more time-consuming.

I meant: do a build from source and see whether turning off direct memory causes a sizeable difference. If it does, then we can disable it by default and add a runtime flag for those people who want it.

But yes, I agree that it's more than just the difference of native vs heap that's causing the perf difference.

There are a couple of ways to get a real sense for where the perf loss is coming from:

  • STG Code - Add this to your cabal file, indented under your executable section: eta-options: -ddump-stg -dsuppress-all -ddump-to-file. Then, once you run etlas build, run find dist -name "*.dump-stg" to find the generated STG dump files. You can view them with a normal text editor; they can be a bit hard to read, but STG is just a really simplified functional language with all the syntactic sugar cut out. You essentially have let statements, which allocate a new thunk, and case statements, which evaluate expressions - a case can correspond to a direct function call or a thunk evaluation depending on the structure of the statement. STG is the last IR before compiling to Java bytecode. Note that the names in STG are important because they are used to name the compiled class files.
  • Failure of Rule Firings - The array library uses a lot of RULES, so perhaps for whatever reason they're failing to fire. You can debug this by adding this to your cabal file: eta-options: -ddump-rule-firings -ddump-rule-rewrites -ddump-to-file
  • Decompiling to Java - Using tools like Bytecode Viewer you can directly decompile an Eta jar file and look at what the Java code would have looked like. You can find the jar again by using find dist -name "*.jar"
  • Analyzing JIT Compilation - You can have the JVM dump information about the inlining/compilation of hot methods and use tools like JITWatch to figure out if anything can be made better. You can also coax the JVM into giving you the assembly code if that helps.

And in general, you can just profile and use a mixture of all the methods above to figure out what's going on and where there is room for improvement.

rahulmutt avatar Jan 27 '19 07:01 rahulmutt

@rahulmutt, I had never used Cabal much, just the GHC compiler, so I didn't recognize it in Etlas. I see in the Eta repo that you have been able to reuse the majority of the GHC code. Yesterday and last night I went digging for where the performance losses come from and got lost in the primitives implementation behind the STG code; I kind of understand the process, but not enough to see how it emulates the array-access primitives. If this project works out as to increasing array performance, I would like to submit a PR changing the primitives for your consideration, but it seems I don't have enough knowledge to do that yet.

I believe that you have tried to be too faithful in emulating the relationship between GHC and the underlying C code, which doesn't cross over to the Java world very well as to performance, even though it works. I think it is something like your 0.0.9-era approach of using ByteBuffer directly, which you have since bypassed, and I expect the things introduced by my project here to improve things yet again. I don't propose entirely replacing your direct memory allocation tools or implementation, as I think that may be the only way to provide JNI capability, but rather to provide other, faster ways when that capability is not required.

I am somewhat familiar with viewing and analyzing STG code, as I once worked with SPJ himself on a couple of GHC issues and he walked me through it.

However, as I mentioned, I'm not really a Java guy; the help you have given in recommending Java/JVM tools has been invaluable, but I'm still learning just enough to get by. Anyway, I'm hoping the benchmarks will tell the tale, and that you will be able to help with any small tweaks the tools might reveal.

BTW, in your last paragraph you talk about making the JVM spit out the assembly with hsdis, a process that I had read about before but have never used as it seems quite involved; but you also mention using JITWatch, whose website says "To see how your source code is translated into bytecode and assembly.". Is that assembly different from the one produced by the long and involved hsdis process?

Yesterday, I spent quite a lot of time digging into the Eta repo, and also looking into why types defined to call FFI Java classes have "nominal" instead of "representational" roles. I didn't really want to change much in the Eta code base, but in order to bypass the current GHC primitives I am having to deal with GHC.IOArray and especially GHC.Arr, which would otherwise call them, and that means understanding the process at least a little. Currently, I'm just "hooking" these two modules to bypass the parts I need done by this alternate method.

Anyway, I'm back to the main project again now.

GordonBGood avatar Jan 28 '19 05:01 GordonBGood

@rahulmutt, I'm sure you can give me some quick advice: I'm testing the module, but it takes quite some time to run "etlas clean" and then "etlas run" when there are changes to packages other than the testing project (as you taught me to do), so I've tried to move the Java files and some core functionality outside the "array" package to test them separately, planning to add them back in once they are working.

I've successfully created a subfolder, put the Eta test source code into it, and run it just by adding the files to "extra-source-files:" in the project ".cabal" file; I've also created a "java" subfolder, put the Java code to be tested in there, and added a "java-sources:" entry under "executable" in the cabal file. I've tried to model all of this after the basic interop documentation.

First, I'm not sure what package name to put at the top; I've tried "eta.test" but Java complains that the package doesn't match the folder structure. If I put the file in an acceptable folder structure, how do I import it using FFI? I've tried to import a static method as "eta.test.Test.doit", where "Test" is the class in the Java file and "doit" is a static method in the class, but it never finds the class, no matter whether I use the package matching the folder structure or not. I've wasted quite a lot of time on this.

Can you give me a short example including folder structure, "cabal" file requirements, and "foreign import" requirement line in order to do this?

GordonBGood avatar Jan 28 '19 22:01 GordonBGood

@rahulmutt, I've been following up on something you said, roughly that "startup/warmup time for Eta is longer than for Java". I expected it to be slower, just as Scala is slower due to the time to load its quite extensive runtime library, which I assumed Eta's would also be. However, it is really, really slow. Once this slow array business is fixed, the slow startup time may be the thing that prevents people from switching from Java or Kotlin to Eta.

Kotlin startup time on my admittedly slow machine is still something under a second, with Java a little faster. Eta startup time approaches 10 seconds! This is running a stripped "uberjar" directly from Java.

Is there anything that can be done now or in the future to reduce this?

GordonBGood avatar Jan 29 '19 02:01 GordonBGood

I'm not understanding why Java complains about the folder structure.

Say you have: java/Utils.java

package eta.array.test;

public class Utils {
   public static void someMethod() {...}
}

You can put java-sources: java/Utils.java in your cabal file, and for importing you can do:

foreign import java unsafe "@static eta.array.test.Utils.someMethod" someMethod :: Java a ()

One way to reduce the startup time is to switch over to invokedynamic-based code generation to lazily generate classes for thunks and function closures, but invokedynamic has its own startup-time issues which we need to take a look at. Or maybe there's another factor affecting startup time that we need to examine deeply. Again, filing an isolated issue with an example that demonstrates this would be useful.

How did you create a stripped "uberjar"? Did you run Proguard yourself? By default, uberjars generated by the compiler or Etlas are not stripped.

rahulmutt avatar Jan 29 '19 04:01 rahulmutt

@rahulmutt, it may be just my editor's Java plug-in (VSCode), but it flags a statement such as package eta.array.test; as an error unless the "Utils.java" file is inside a java/eta/array/test/ folder structure. I have tried both ignoring the error and building the requested folder structure so the error flag goes away (changing the cabal "java-sources:" to match whatever the folder structure is), with the same results, so I think this may be just an unimportant side issue. The interesting thing is that when I view the "Util2.java" file inside the "array" package, the plug-in doesn't flag any error.

Then, in my "Main.hs" file I put the foreign import statement just as you recommend and call java someMethod in the main function; the program compiles successfully, but when run it reports an "undefined class error: can't find eta.array.test.Utils", or something to that effect, no matter what I try. It looks like I'll have to send you a zipped minimal project to see what you think.

As to startup time, I decided to try the "uberjar" and "stripped" options (by "stripped" I just mean the "--enable-executable-stripping" option) to see if distribution sizes are reduced, or what they are like compared to using Clojure, with which I have some experience. I didn't really worry about the "stripped" option, as the output ".jar" files aren't all that large, even as an "uberjar" (that means standalone, right?). The reason I tried them was to see if startup time would be reduced, which it wasn't. Ever since I started using Eta, startup time has been much longer than an incremental build, and almost as long as a full build cycle after a clean. I'll file an isolated issue, likely with the same sort of minimal project as for the FFI problem described earlier. I'll upload it here first, related to FFI, to see what you think, and then open other issues as you recommend.

The project compiles successfully but produces the following error when run, and it takes almost six seconds to start up on my machine even without the foreign import or the function that calls someMethod (i.e. as a simple hello-world program):

Exception in thread "main" java.lang.NoClassDefFoundError: eta/array/test/Utils
        at main.Main.$wa(Main.hs:10)
        at main.Main.main1(Main.hs)
        at main.Main$main1.applyV(Main.hs)
        at eta.runtime.exception.Exception.catch_(Exception.java:132)
        at main.Main.main3(Main.hs)
        at main.Main.DZCmain(Main.hs:9)
        at main.Main$DZCmain.applyV(Main.hs:9)
        at eta.runtime.stg.Closures$EvalLazyIO.enter(Closures.java:154)
        at eta.runtime.stg.Capability.schedule(Capability.java:254)
        at eta.runtime.stg.Capability.scheduleClosure(Capability.java:210)
        at eta.runtime.Runtime.evalLazyIO(Runtime.java:397)
        at eta.runtime.Runtime.main(Runtime.java:390)
        at eta.main.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: eta.array.test.Utils
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        ... 13 more

Minimal FFI project file zip attached: eta-ffi-helloworld.zip

GordonBGood avatar Jan 29 '19 05:01 GordonBGood