Create Frame from generator
Would it be possible to allow the creation of a Frame from a generator? A lot of APIs output a generator, and it seems strange to have to materialize that data into a list just to then re-materialize that list into a Frame.
@amachanic creating a Frame would require materialisation from the generator either way, right?
I see a couple of issues with allowing generators in dt.Frame():
- there is no way to know the number of elements returned by a generator without exhausting it;
- there is no way to allow parallel access to generators;
- if we allow it, we would need to request Python objects from the generator and then convert them to C++ primitives anyway.
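To make the first point concrete, here is a minimal sketch using only plain Python (no datatable involved). The `squares` generator is a made-up example; the point is only that a generator has no length and can be traversed a single time.

```python
def squares():
    # Hypothetical example generator yielding five values.
    n = 0
    while n < 5:
        yield n * n
        n += 1

gen = squares()

# A generator does not implement __len__, so its size is unknown up front.
try:
    len(gen)
    has_len = True
except TypeError:
    has_len = False

# It can only be consumed once: a second pass yields nothing.
first_pass = list(gen)
second_pass = list(gen)
```

So any consumer has to size its buffers as it goes, and it gets exactly one pass over the data.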
@samukweku Yes, materialization would be required either way, of course. But it's different materialization. (i.e., if I pass in a list of dicts, the Frame internally doesn't keep the data as a list of dicts.)
@oleksiyskononenko I don't know how much the first point matters in terms of memory allocations you do internally. The second and third points are both kind of irrelevant, in my opinion, if you're making the user pre-materialize. A) One can't materialize from a generator to a list in parallel; and B) You'll have to convert the elements to C++ primitives at some level in either case.
I've noticed on some other threads that Pandas comparisons are used - so in case it helps, Pandas does support this.
The first point is not just about the memory allocation, but also about the validation of the input to dt.Frame(). Also, generators could be infinite. What should we do in this case? I imagine we can pass a number of elements along with the generator, but is this number of elements even known by the user at the time when the frame is constructed?
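As a rough illustration of the "number of elements along with the generator" idea, here is a sketch with a hypothetical `take` helper (not part of datatable) that caps how many items are pulled from a possibly infinite source:

```python
from itertools import count, islice

def take(source, max_rows):
    """Hypothetical helper: consume at most max_rows items from source."""
    return list(islice(source, max_rows))

# itertools.count() never terminates on its own; the cap makes it safe.
rows = take(count(), 5)

# A source shorter than the cap simply yields fewer rows.
short = take(iter([1, 2]), 5)
```

This only works if the caller can supply a sensible upper bound, which is exactly the open question above.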
I mentioned the second point because of the way we create columns in datatable. In some cases we can make columns virtual and do calculations on the fly, as in the case of the Python range(). With generators, not only can we not allow parallel access, we cannot even allow random access. It means that we would have to loop through the generator internally in a single thread and create a materialized column.
The third point is just to note that we're not gaining anything here in terms of creation time, but I agree this would consume less memory.
Yes, agreed with your final point. At the end of the day this is all about memory -- it's always at a premium in Python, which is why we (on my end at least) often exploit generators. But it does sound like there are some Frame internals that make this a bigger lift than I expected; thanks for sharing that insight.
You're welcome. I think this is doable, but it may come with some technical difficulties. I don't have experience with generators in pandas. What happens if the generator passed is infinite?
It did pretty much what I expected: It sat there for a long time until it used up all of the RAM, and then the OS had enough and took it down.

@amachanic I've measured the memory consumption of these two code blocks:
import pandas as pd
PD = pd.DataFrame([i for i in range(10**8)])
and
import pandas as pd
PD = pd.DataFrame((i for i in range(10**8)))
and found that it is almost identical. It means that even though pandas allows frame creation from generators, it materializes the generator into something like a list behind the scenes. So there is no memory gain when you use a generator as the source. Could you please double-check this on your side?
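For double-checking, one possible way to compare peak memory is the standard-library tracemalloc module. This is only a sketch: `peak_bytes` is a hypothetical helper, `list` is a stand-in for the real constructor (swap in pd.DataFrame to reproduce the comparison above), and N is kept small so it runs quickly.

```python
import tracemalloc

def peak_bytes(build):
    """Hypothetical helper: peak traced memory (bytes) while build() runs."""
    tracemalloc.start()
    try:
        build()
        return tracemalloc.get_traced_memory()[1]  # (current, peak) -> peak
    finally:
        tracemalloc.stop()

N = 10**5  # smaller than the 10**8 used above, to keep the check fast

# Stand-in constructor: replace `list` with pd.DataFrame for the real test.
from_list = peak_bytes(lambda: list([i for i in range(N)]))
from_gen = peak_bytes(lambda: list(i for i in range(N)))

print(from_list, from_gen)
```

tracemalloc only sees allocations that are reported to it, so the numbers are a lower bound for libraries that allocate outside the Python allocator.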
My feeling is that this could be implemented in a one-by-one fashion: request one element from the generator, convert it to a C++ primitive, and repeat. However, this may not be efficient from a performance point of view.
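The one-by-one idea can be sketched in pure Python, with the standard-library array module standing in for a typed C++ column. `column_from_iterable` is a hypothetical name, and this is not how datatable is implemented; it only shows the shape of the loop.

```python
from array import array

def column_from_iterable(values, typecode="q"):
    """Hypothetical sketch: pull items one at a time into a typed buffer.

    typecode "q" means signed 64-bit integers; each Python int is
    converted to a C primitive on append, and the buffer grows as needed.
    """
    col = array(typecode)
    for v in values:  # single-threaded, sequential, no random access
        col.append(v)
    return col

col = column_from_iterable(i * 2 for i in range(5))
```

Only the compact typed buffer is kept, never an intermediate list of Python objects, which is where the memory saving would come from. The per-element loop is also why it may be slow.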