networkanalysis
networkanalysis copied to clipboard
Added 64-bits array capabilities
This PR adds 64-bits capabilities to the networkanalysis
package because natives arrays in Java are unfortunately limited to 32-bits.
There are four new classes added to the nl.cwts.util
package:
-
LargeBooleanArray
-
LargeDoubleArray
-
LargeLongArray
-
LargeIntArray
These classes enable 64-bits indices (i.e. long
s) which are translated to two int
s (using bitwise operators) that index an internal array of arrays. The first array index is called the segment
and the second array index is called the offset
. The value for a provided long index
is then located in values[getSegment(index)][getOffset(index)]
.
Additionally, these classes replace the old DynamicIntArray
and DynamicDoubleArray
, since these also need 64-bits support. Hence, all the 64-bits arrays are now also dynamic and can be resized (see ensureCapacity
, shrink
and resize
) and values can be added/removed dynamically (see append
, push
and pop
).
Additionally, these classes needed support for some other functionalities, which are also all implemented. There is one functionality (sorting) for which I thought it would be more straightforward to use an existing implementation (it.unimi.dsi.fastutil.BigArrays
). Essentially this uses a similar setup in the background (array of arrays). The newly created classes are essentially similar to the BigArrays
implementation in fastutil
but provide more convenience functions, so that it is easier to translate the rest of the package to be able to use these 64-bits capabilities. I added the fastutil
dependency to the build.gradle
file, and it is now automatically bundled in the resulting jar
file.
All necessary modifications to all other classes are also made. This essentially entails changing array[i]
notations into array.get(i)
or array.set(i, ...)
notations.
Sometimes there is a more convenient form of iteration possible that using a for-each idiom, for(double x : array)
which can be substantially easier to understand. Additionally, it is faster than iterating over long
indices that have to be separated into two separate int
s each time. This allows to change loops of the kind
for (k = network.firstNeighborIndices[i]; k < network.firstNeighborIndices[i + 1]; k++)
{
v = network.neighbor.get(k)
...
}
by the simpler and more elegant construct
for (int v : network.neighbor(i))
without a performance hit, and even a performance increase.
Unfortunately, in many places these constructions are not immediately useful, because we need to iterate over both the neighbors
and the edgeWeights
simultaneously, which is not possible in this idiom in Java (at least not in an elegant manner). Hence, except for one construct I now left all for loops as they were.
In addition, I also added some similar convenience iterators in the Network
class, allowing to iterate over neighbors
, edgeWeights
or incidentEdges
easily. There is the possibility to replace many loops of the kind of
for (k = network.firstNeighborIndices[i]; k < network.firstNeighborIndices[i + 1]; k++)
by
for (long k : network.incidentEdges(i))
which is substantially easier to understand. This will come at a slight performance hit probably, but I haven't tested this change explicitly. At any rate, I left these for loops as they were originally, and we might look into these possibilities at some later time.
Finally, I included unit tests for all these new classes, including tests for all methods. Additionally, I added a simply unit test for the LeidenAlgorithm
and a check to see if the aggregation and quality calculations are consistent. These are automatically run when calling ./gradlew build
. I have also verified the new classes with sizes of twice the maximum size of individual arrays (which is a total size of 2 147 483 648
, exceeding the maximum length of a 32-bits arrays by a few elements). This revealed one remaining problem in the tests themselves (i.e. not in the code, but some test was incorrect), which I am re-testing now, but I do not expect any problems anymore.
I tested this implementation against an example graph . The original implemetation took 40s, whereas the new implementation took 51s on my laptop. This performance decrease is of course a pity, but without investing substantially more time in rewriting aspects of the code, it is unlikely that we are able to improve the performance substantially. But suggestions are of course welcome.
Comments are welcome!
I went through all new classes to conform to the code style used throughout this package.
I rebased the PR against the master
branch so that there are no merge conflicts and force-pushed to this PR. Apparently the GitHub actions do not run in the case of merge conflicts.
See 15cf64289afd05ecc2da744cefcdf4bbb63f48f5, 94c72d274685c0d2a83446ff8b9bc52885fddaab, 472d10b274fc9124e812d1773bdbf945fc7588c5, and f83fe56df4359e16e9c26e5588b81ba389a4b513 for some bug fixes and enhancements.