Binary Transfer - Use Cases
Please describe how you hope binary transfer would work for your use cases by filling out this short survey, as many times as needed. Be specific: of course it would be nice if Py4J supported every use case, but if I see that a particular usage scenario is more important for the community, I will at least make sure that the final API makes it easy to optimize for those scenarios.
Simply copy-paste the survey and fill it in.
# Binary Transfer Survey
## Direction
Check the most common scenario (only 1).
- [ ] I want to transfer binary data from Python to Java
- [ ] I want to transfer binary data from Java to Python
- [ ] Both sides are important for my use cases
**Comments:**
## Remote
Check the most common scenario (only 1).
- [ ] Java and Python runtimes are usually on the same machine
- [ ] Java and Python runtimes may be on separate machines on the local network.
- [ ] Java and Python runtimes may be on separate machines on internet-like network (slower than LAN)
**Comments:**
## Datatype
Please describe the origin and destination datatype of your usage scenario. e.g., Python: bytearray to Java: byte[]. Mention as many pairs as needed.
- Pair 1:
- Pair 2:
- Pair 3:
**Comments:**
## IO Transfer type
Check the most common scenario
- [ ] Blocking IO - Transparent API.
  e.g., methods only return once all the bytes are transferred; something like:
  `bytes = getBytes()  # bytes is now a bytearray or byte[]`
- [ ] Blocking IO - Streaming API.
  e.g., you can transfer the bytes at your own pace once you have a reference to the data structure; something like:
  `bytes = getBytes()`
  `stream = stream(bytes)`
  `chunk = stream.receive(1024)  # loop to get all chunks`
- [ ] Async IO - Please give an example of hypothetical API usage
- [ ] Another strategy?
**Comments:**
## Size consideration
What is the typical number of bytes transferred between Java and Python?
- Minimum:
- Maximum:
- Average:
## Memory consideration
If Python and Java are executing on the same machine and you want to transfer 1 GB of data from Java to Python, the same gigabyte will now take twice the memory: 1 GB in Java and 1 GB in Python.
- [ ] That's fine.
- [ ] Please provide a way to discard the data on the sending side as it is sent to the other side.
**Comments:**
## General comments and description of your use case
# Binary Transfer Survey

## Direction

Check the most common scenario (only 1).

- [ ] I want to transfer binary data from Python to Java
- [ ] I want to transfer binary data from Java to Python
- [X] Both sides are important for my use cases

**Comments:**

Two different key use cases already in production with alternate technology:

1. Transferring binary data from Python to Java to use a Java visualization UI on Python-based objects (typically ndarray).
2. With a Java-based workflow engine, running one "step" of the workflow in Python. In this case, large object(s) are transferred from Java to Python, processed in Python, and returned to Java.
## Remote

Check the most common scenario (only 1).

- [X] Java and Python runtimes are usually on the same machine
- [ ] Java and Python runtimes may be on separate machines on the local network.
- [ ] Java and Python runtimes may be on separate machines on internet-like network (slower than LAN)

**Comments:**

The concrete use cases I have are all on one machine.

There is increased interest in compute-farm use; however, the general view at the moment is that we only plan to pass metadata around between machines in that scenario. For example, in the workflow case, there is a Java driver on one machine farming off jobs to other machines, and it would be inefficient for the driver machine to have all the data pass through it.
## Datatype

Please describe the origin and destination datatype of your usage scenario. e.g., Python: bytearray to Java: byte[]. Mention as many pairs as needed.

- Pair 1: Python NumPy ndarray <-> Eclipse January IDataset
- Pair 2:
- Pair 3:

**Comments:**

Note that Eclipse January is a new project. The source of the code is the Diamond Light Source IDataset implementation; we are currently extracting the common/general parts of that code base for wider use. https://github.com/eclipse/dawnsci/blob/master/org.eclipse.dawnsci.analysis.api/src/org/eclipse/dawnsci/analysis/api/dataset/IDataset.java
## IO Transfer type

Check the most common scenario

- [X] Blocking IO - Transparent API. e.g., methods only return once all the bytes are transferred; something like: `bytes = getBytes()  # bytes is now a bytearray or byte[]`
- [ ] Blocking IO - Streaming API. e.g., you can transfer the bytes at your own pace once you have a reference to the data structure; something like: `bytes = getBytes()`, `stream = stream(bytes)`, `chunk = stream.receive(1024)  # loop to get all chunks`
- [ ] Async IO - Please give an example of hypothetical API usage
- [ ] Another strategy?

**Comments:**

We have not fully explored the options with streaming APIs. What we have started looking at, which is quite similar, is lazy transfers. We already use the laziness concept to handle loading very large files in Java - https://github.com/eclipse/dawnsci/blob/master/org.eclipse.dawnsci.analysis.api/src/org/eclipse/dawnsci/analysis/api/dataset/ILazyDataset.java - and NumPy has similar support on the Python side. It is possible that transparently we would use blocking IO to transfer the data, but that the data itself would only be transferred on actual access.
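As a rough illustration of that lazy-transfer idea, here is a minimal sketch in Python. All names are hypothetical (`LazyRemoteArray`, `fetch_bytes`), and the actual transfer is faked with an in-process buffer; the point is only that metadata is available up front while the bytes cross the process boundary exactly once, on first access:

```python
import numpy as np

class LazyRemoteArray:
    """Hypothetical proxy for a remote array: shape and dtype are known
    from metadata, but the bytes are only transferred on first access."""

    def __init__(self, shape, dtype, fetch_bytes):
        self.shape = shape                   # known without any transfer
        self.dtype = np.dtype(dtype)
        self._fetch_bytes = fetch_bytes      # callable doing the blocking IO
        self._data = None

    def _materialize(self):
        if self._data is None:               # the transfer happens here, once
            raw = self._fetch_bytes()
            self._data = np.frombuffer(raw, dtype=self.dtype).reshape(self.shape)
        return self._data

    def __getitem__(self, idx):
        return self._materialize()[idx]

# The "remote" side is faked with a local buffer for this sketch.
payload = np.arange(6, dtype=np.float64).tobytes()
lazy = LazyRemoteArray((2, 3), np.float64, lambda: payload)
print(lazy[1, 2])  # first access triggers the transfer; prints 5.0
```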
## Memory consideration

If Python and Java are executing on the same machine and you want to transfer 1 GB of data from Java to Python, the same gigabyte will now take twice the memory: 1 GB in Java and 1 GB in Python.

- [X] That's fine.
- [ ] Please provide a way to discard the data on the sending side as it is sent to the other side.

**Comments:**

## General comments and description of your use case

In general, we see doing binary transfers transparently between processes as something that works up to a certain scale of data. The target upper end is 100 million data points, typically of double-precision data. Beyond that size, the data starts to need more application-specific management, and a set of metadata is needed to handle the transfers. Using pandas (and equivalents on the Java side) to describe more complicated collections of data is on the table going forward, but that has not been fleshed out properly yet.
@jonahkichwacoders how do you serialize ndarray to IDataset (and conversely)?
On Python we use numpy.save and load; see pyflatten.py#ndArrayHelper for an example of use.
On Java we have load/save implemented for the IDataset here: loader and saver.
The file format is described at numpy: npy-format.txt. It is a very simple format, just the raw data with some simple headers.
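To illustrate how simple the .npy container is, a round-trip through an in-memory buffer shows the magic string and plain-text header followed directly by the raw array bytes (this is a sketch of the format described above, not of any Py4J API):

```python
import io
import numpy as np

arr = np.arange(4, dtype=np.float64)

buf = io.BytesIO()
np.save(buf, arr)        # writes the .npy container: magic, header, raw bytes
raw = buf.getvalue()

print(raw[:6])           # b'\x93NUMPY' magic string
# The header that follows is plain text describing dtype, byte order,
# and shape; the array's raw bytes come immediately after it.

buf.seek(0)
restored = np.load(buf)  # any parser of this trivial format can do the same
assert np.array_equal(arr, restored)
```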
From what I see in the code, if Py4J gives you on the Java side (1) a ByteBuffer or a Channel, (2) the size of the payload, and (3) some facilities to easily register custom type converters, you could reuse this code easily, right?
Note: 100 million datapoints with double precision would be around 763 megabytes (100000000*8/1024/1024) which still fits in a byte array on the Java side.
From what I see in the code, if Py4J [...]
Yes, I think that is right. Interestingly, I don't know if (1) is even needed: when we did the initial work on this 5 years ago, we determined that writing to temp files (relying on the system cache to keep performance up) was much faster than any way we could stream the data. That is, however, an assumption that needs to be revisited, especially with MS optimizing loopback performance.
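For reference, the temp-file hand-off described above can be sketched like this (hypothetical helper names; the reader side is shown in Python here, but a Java reader of the same .npy file works identically, and the OS page cache typically keeps the bytes in memory):

```python
import os
import tempfile
import numpy as np

def send_via_tempfile(arr):
    """Writer side: dump the array to a temp .npy file and hand over
    only its path, so no bulk data crosses the RPC connection."""
    fd, path = tempfile.mkstemp(suffix=".npy")
    os.close(fd)
    np.save(path, arr)
    return path  # only this small string needs to be sent over Py4J

def receive_via_tempfile(path):
    """Reader side: load the array and discard the file once done."""
    arr = np.load(path)
    os.remove(path)
    return arr

sent = np.arange(10, dtype=np.float64)
received = receive_via_tempfile(send_via_tempfile(sent))
assert np.array_equal(sent, received)
```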
Note: 100 million datapoints with double precision would be around 763 megabytes (100000000*8/1024/1024) which still fits in a byte array on the Java side.
Yes. In our experience, across numerous different scenarios (only some of them RPC-related), dealing with datasets of less than 1 GB tends to just work without too much thought, but above that, more and more secondary issues come into play.
@JoshRosen @davies @kaytwo If you are still using Py4J and could comment on this issue, that would be great. Thanks!
I'll defer to the Spark committers, they know far more about how it is used than I do.
I have a binary transfer need: (1) transfer binary data from Java to Python; (2) Java and Python runtimes are usually on the same machine; (3) on the Java side, `new String(Base64.encodeToChar(byte[], false))`, and on the Python side, `base64.b64decode(self._read_bag.read(size))`. Each payload is about 4 megabytes.
Transmitting the data with base64 encoding works, but the performance is very poor. Is there a better way?
Extremely grateful!
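For scale: Base64 expands every 3 input bytes into 4 output characters (on top of the intermediate `String`/`bytes` copies built on both sides), so a 4 MB payload grows by about a third before it is even sent. A quick check of the size overhead:

```python
import base64

payload = bytes(4 * 1024 * 1024)   # a 4 MB binary payload
encoded = base64.b64encode(payload)

print(len(payload))                # 4194304
print(len(encoded))                # 5592408, roughly 33% larger on the wire
```

Writing the raw bytes to a plain socket or a temp file (as discussed earlier in this thread) avoids both the expansion and the encode/decode CPU cost entirely.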