Binary Transfer - Use Cases
Please describe how you hope binary transfer would work for your use cases by filling out this short survey, as many times as needed. Be specific: of course it would be nice if Py4J supported every use case, but if I see that a particular usage scenario is more important for the community, I will at least make sure that the final API makes it easy to optimize for those scenarios.
Simply copy-paste the survey and fill it in.
# Binary Transfer Survey
## Direction
Check the most common scenario (only 1).
- [ ] I want to transfer binary data from Python to Java
- [ ] I want to transfer binary data from Java to Python
- [ ] Both sides are important for my use cases
**Comments:**
## Remote
Check the most common scenario (only 1).
- [ ] Java and Python runtimes are usually on the same machine
- [ ] Java and Python runtimes may be on separate machines on the local network.
- [ ] Java and Python runtimes may be on separate machines on internet-like network (slower than LAN)
**Comments:**
## Datatype
Please describe the origin and destination datatype of your usage scenario. e.g., Python: bytearray to Java: byte[]. Mention as many pairs as needed.
- Pair 1:
- Pair 2:
- Pair 3:
**Comments:**
## IO Transfer type
Check the most common scenario
- [ ] Blocking IO - Transparent API.
  e.g., methods only return once all the bytes are transferred; something like:
  `bytes = getBytes()  # bytes is now a bytearray or byte[]`
- [ ] Blocking IO - Streaming API.
  e.g., you can transfer the bytes at your own pace once you have a reference to the data structure; something like:
  `bytes = getBytes()`
  `stream = stream(bytes)`
  `chunk = stream.receive(1024)  # loop to get all chunks`
- [ ] Async IO - Please give an example of hypothetical API usage
- [ ] Another strategy?
**Comments:**
## Size consideration
What is the typical number of bytes transferred between Java and Python?
- Minimum:
- Maximum:
- Average:
## Memory consideration
If Python and Java are executing on the same machine and you want to transfer 1 GB of data from Java to Python, the same gigabyte will now take twice the memory: 1 GB in Java and 1 GB in Python.
- [ ] That's fine.
- [ ] Please provide a way to discard the data on the sending side as it is sent to the other side.
**Comments:**
## General comments and description of your use case
# Binary Transfer Survey

## Direction

Check the most common scenario (only 1).

- [ ] I want to transfer binary data from Python to Java
- [ ] I want to transfer binary data from Java to Python
- [X] Both sides are important for my use cases

**Comments:**

Two different key use cases already in production with alternate technology:

1. Transferring binary data from Python to Java to use a Java visualization UI on Python-based objects (typically ndarray).
2. With a Java-based workflow engine, running one "step" of the workflow in Python. In this case, large object(s) are transferred from Java to Python, processed in Python, and returned to Java.
## Remote

Check the most common scenario (only 1).

- [X] Java and Python runtimes are usually on the same machine
- [ ] Java and Python runtimes may be on separate machines on the local network.
- [ ] Java and Python runtimes may be on separate machines on internet-like network (slower than LAN)

**Comments:**

The concrete use cases I have are all on one machine.

There is increased interest in compute-farm use; however, the general view at the moment is that we only plan to pass metadata around between machines in that scenario. For example, in the workflow case, there is a Java driver on one machine farming off jobs to other machines, and it would be inefficient for the driver machine to have all the data pass through it.
## Datatype

Please describe the origin and destination datatype of your usage scenario. e.g., Python: bytearray to Java: byte[]. Mention as many pairs as needed.

- Pair 1: Python NumPy ndarray <-> Eclipse January IDataset
- Pair 2:
- Pair 3:

**Comments:**

Note that Eclipse January is a new project. The source of the code is the Diamond Light Source IDataset implementation; we are currently extracting the common/general parts of that code base for wider use. https://github.com/eclipse/dawnsci/blob/master/org.eclipse.dawnsci.analysis.api/src/org/eclipse/dawnsci/analysis/api/dataset/IDataset.java
## IO Transfer type

Check the most common scenario

- [X] Blocking IO - Transparent API. e.g., methods only return once all the bytes are transferred; something like: `bytes = getBytes()  # bytes is now a bytearray or byte[]`
- [ ] Blocking IO - Streaming API. e.g., you can transfer the bytes at your own pace once you have a reference to the data structure; something like: `bytes = getBytes()`, `stream = stream(bytes)`, `chunk = stream.receive(1024)  # loop to get all chunks`
- [ ] Async IO - Please give an example of hypothetical API usage
- [ ] Another strategy?

**Comments:**

We have not fully explored the options with streaming APIs. What we have started looking at, which is quite similar, is lazy transfers. We already use the laziness concept to handle loading very large files in Java - https://github.com/eclipse/dawnsci/blob/master/org.eclipse.dawnsci.analysis.api/src/org/eclipse/dawnsci/analysis/api/dataset/ILazyDataset.java - and NumPy has similar support on the Python side. It is possible that transparently we would use blocking IO to transfer the data, but that the data itself would only be transferred on actual access.
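As a rough illustration of that lazy-transfer idea, here is a minimal sketch in Python. All names are hypothetical (`LazyRemoteArray`, `fetch_bytes`), and the actual transfer is faked with an in-process buffer; the point is only that metadata is available up front while the bytes cross the process boundary exactly once, on first access:

```python
import numpy as np

class LazyRemoteArray:
    """Hypothetical proxy for a remote array: shape and dtype are known
    from metadata, but the bytes are only transferred on first access."""

    def __init__(self, shape, dtype, fetch_bytes):
        self.shape = shape                   # known without any transfer
        self.dtype = np.dtype(dtype)
        self._fetch_bytes = fetch_bytes      # callable doing the blocking IO
        self._data = None

    def _materialize(self):
        if self._data is None:               # the transfer happens here, once
            raw = self._fetch_bytes()
            self._data = np.frombuffer(raw, dtype=self.dtype).reshape(self.shape)
        return self._data

    def __getitem__(self, idx):
        return self._materialize()[idx]

# The "remote" side is faked with a local buffer for this sketch.
payload = np.arange(6, dtype=np.float64).tobytes()
lazy = LazyRemoteArray((2, 3), np.float64, lambda: payload)
print(lazy[1, 2])  # first access triggers the transfer; prints 5.0
```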
## Memory consideration

If Python and Java are executing on the same machine and you want to transfer 1 GB of data from Java to Python, the same gigabyte will now take twice the memory: 1 GB in Java and 1 GB in Python.

- [X] That's fine.
- [ ] Please provide a way to discard the data on the sending side as it is sent to the other side.

**Comments:**

## General comments and description of your use case

In general, we see doing binary transfers transparently between processes as something that works up to a certain scale of data. The target upper end is 100 million data points, typically of double-precision data. Beyond that size, the data starts to need more application-specific management, and a set of metadata is needed to handle the transfers. Using pandas (and equivalents on the Java side) to describe more complicated collections of data is on the table going forward, but that has not been fleshed out properly yet.
@jonahkichwacoders how do you serialize ndarray to IDataset (and conversely)?
On Python we use numpy.save and load; see pyflatten.py#ndArrayHelper for an example of use.
On Java we have load/save implemented for the IDataset here: loader and saver.
The file format is described at numpy: npy-format.txt. It is a very simple format, just the raw data with some simple headers.
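To illustrate how simple the .npy container is, a round-trip through an in-memory buffer shows the magic string and plain-text header followed directly by the raw array bytes (this is a sketch of the format described above, not of any Py4J API):

```python
import io
import numpy as np

arr = np.arange(4, dtype=np.float64)

buf = io.BytesIO()
np.save(buf, arr)        # writes the .npy container: magic, header, raw bytes
raw = buf.getvalue()

print(raw[:6])           # b'\x93NUMPY' magic string
# The header that follows is plain text describing dtype, byte order,
# and shape; the array's raw bytes come immediately after it.

buf.seek(0)
restored = np.load(buf)  # any parser of this trivial format can do the same
assert np.array_equal(arr, restored)
```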
From what I see in the code, if Py4J gives you on the Java side (1) a ByteBuffer or a Channel, (2) the size of the payload, and (3) some facilities to easily register custom type converters, you could reuse this code easily, right?
Note: 100 million datapoints with double precision would be around 763 megabytes (100000000*8/1024/1024) which still fits in a byte array on the Java side.
From what I see in the code, if Py4J [...]
Yes, I think that is right. Interestingly, I don't know if (1) is even needed: when we did the initial work on this 5 years ago, we determined that writing to temp files (relying on the system cache to keep performance up) was much faster than any way we could stream the data. That is, however, an assumption that needs to be revisited, especially with MS optimizing loopback performance.
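For reference, the temp-file hand-off described above can be sketched like this (hypothetical helper names; the reader side is shown in Python here, but a Java reader of the same .npy file works identically, and the OS page cache typically keeps the bytes in memory):

```python
import os
import tempfile
import numpy as np

def send_via_tempfile(arr):
    """Writer side: dump the array to a temp .npy file and hand over
    only its path, so no bulk data crosses the RPC connection."""
    fd, path = tempfile.mkstemp(suffix=".npy")
    os.close(fd)
    np.save(path, arr)
    return path  # only this small string needs to be sent over Py4J

def receive_via_tempfile(path):
    """Reader side: load the array and discard the file once done."""
    arr = np.load(path)
    os.remove(path)
    return arr

sent = np.arange(10, dtype=np.float64)
received = receive_via_tempfile(send_via_tempfile(sent))
assert np.array_equal(sent, received)
```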
Note: 100 million datapoints with double precision would be around 763 megabytes (100000000*8/1024/1024) which still fits in a byte array on the Java side.
Yes. In our experience, across numerous different scenarios (only some of them RPC-related), dealing with datasets of less than 1 GB tends to just work without too much thought, but above that, more and more secondary issues come into play.
@JoshRosen @davies @kaytwo If you are still using Py4J and could comment on this issue, that would be great. Thanks!
I'll defer to the Spark committers, they know far more about how it is used than I do.
I have a binary transfer need: (1) transfer binary data from Java to Python; (2) Java and Python runtimes are usually on the same machine; (3) on the Java side, `new String(Base64.encodeToChar(byte[], false))`, and on the Python side, `base64.b64decode(self._read_bag.read(size))`. Each payload is about 4 megabytes.
Transmitting the data with base64 encoding works, but the performance is very poor. Is there a better way?
Extremely grateful!
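For scale: Base64 expands every 3 input bytes into 4 output characters (on top of the intermediate `String`/`bytes` copies built on both sides), so a 4 MB payload grows by about a third before it is even sent. A quick check of the size overhead:

```python
import base64

payload = bytes(4 * 1024 * 1024)   # a 4 MB binary payload
encoded = base64.b64encode(payload)

print(len(payload))                # 4194304
print(len(encoded))                # 5592408, roughly 33% larger on the wire
```

Writing the raw bytes to a plain socket or a temp file (as discussed earlier in this thread) avoids both the expansion and the encode/decode CPU cost entirely.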