datatable Create Frame from generator

Possible to allow the creation of a Frame from a generator? A lot of APIs will output a generator and it seems strange to have to materialize that data into a list just to then re-materialize that list into a Frame.

Mar 21 '22 18:03 amachanic

@amachanic creating a Frame would require materialisation from the generator either way, right?

Mar 22 '22 01:03 samukweku

I see a couple of issues with allowing generators in dt.Frame():

there is no way to know the number of elements a generator will return without exhausting it;
there is no way to allow parallel access to a generator;
if we allow it, we would still need to request Python objects from the generator and then convert them to C++ primitives anyway.
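
To illustrate the first point, here is a minimal sketch in plain Python (no datatable involved; `squares` is a made-up example generator). It shows that a generator has no length until it has been consumed, and that consuming it is destructive.

```python
def squares(n):
    """Hypothetical example generator: lazily yields n squares."""
    for i in range(n):
        yield i * i

gen = squares(5)

# A generator does not implement __len__, so its size cannot be queried up front.
try:
    len(gen)
    has_len = True
except TypeError:
    has_len = False

# The only way to count the elements is to exhaust the generator...
items = list(gen)
count = len(items)

# ...after which it yields nothing more.
leftover = list(gen)
```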

Mar 22 '22 17:03 oleksiyskononenko

@samukweku Yes, materialization would be required either way, of course. But it's different materialization. (i.e., if I pass in a list of dicts, the Frame internally doesn't keep the data as a list of dicts.)

@oleksiyskononenko I don't know how much the first point matters in terms of memory allocations you do internally. The second and third points are both kind of irrelevant, in my opinion, if you're making the user pre-materialize. A) One can't materialize from a generator to a list in parallel; and B) You'll have to convert the elements to C++ primitives at some level in either case.

I've noticed on some other threads that Pandas comparisons are used - so in case it helps, Pandas does support this.

Mar 22 '22 18:03 amachanic

The first point is not just about the memory allocation, but also about the validation of the input to dt.Frame(). Also, generators could be infinite. What should we do in this case? I imagine we can pass a number of elements along with the generator, but is this number of elements even known by the user at the time when the frame is constructed?
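
For what it's worth, the "number of elements along with the generator" idea might look roughly like this from the user's side. This is only a sketch using `itertools.islice` from the standard library; `take_rows` and `max_rows` are hypothetical names for illustration and not part of datatable.

```python
from itertools import count, islice

def take_rows(source, max_rows):
    """Materialize at most max_rows items from any iterable (hypothetical helper).

    islice stops after max_rows items, so an infinite generator
    cannot exhaust memory.
    """
    return list(islice(source, max_rows))

# itertools.count() never terminates on its own.
rows = take_rows(count(start=0, step=2), max_rows=5)

# A finite source shorter than the cap is simply returned in full.
short = take_rows((x for x in range(3)), max_rows=5)
```

The resulting list can then be passed to dt.Frame() as it is today. It does not answer the question above, though: the caller still has to know a sensible cap in advance.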

I mentioned the second point because of the way we create columns in datatable. In some cases we can make columns virtual and do calculations on the fly, as with Python's range(). With generators, not only can we not allow parallel access, we cannot even allow random access. That means we would have to loop through the generator internally in a single thread and create a materialized column.
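
The difference between range() and a generator can be seen in plain Python. The sketch below is only an illustration of the two access patterns, not of datatable internals.

```python
r = range(10**12)  # no elements are stored; values are computed on demand

# Length and any element are available immediately, in any order.
n = len(r)
middle = r[n // 2]
last = r[-1]

g = (i for i in range(10))

# A generator supports neither indexing nor rewinding.
try:
    g[3]
    indexable = True
except TypeError:
    indexable = False

# Reaching element 3 means consuming elements 0..2 first, in order.
first_four = [next(g) for _ in range(4)]
rest = list(g)  # continues from element 4; the earlier ones are gone
```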

The third point is just to note that we're not gaining anything here in terms of creation time, but I agree this would consume less memory.

Mar 22 '22 18:03 oleksiyskononenko

Yes, agreed with your final point. At the end of the day this is all about memory -- it's always at a premium in Python, which is why we (on my end at least) often exploit generators. But it does sound like there are some Frame internals that make this a bigger lift than I expected; thanks for sharing that insight.

Mar 22 '22 19:03 amachanic

You're welcome. I think this is doable, but it may involve some technical difficulties. I don't have experience with generators in pandas. What happens if the generator passed in is infinite?

Mar 22 '22 19:03 oleksiyskononenko

It did pretty much what I expected: It sat there for a long time until it used up all of the RAM, and then the OS had enough and took it down.

Mar 22 '22 19:03 amachanic

@amachanic I've measured the memory consumption of these two code blocks:

import pandas as pd
PD = pd.DataFrame([i for i in range(10**8)])

and

import pandas as pd
PD = pd.DataFrame((i for i in range(10**8)))

and found that it is almost identical. This means that even though pandas allows frame creation from generators, it materializes the generator into something like a list behind the scenes. So there is no memory gain when you use a generator as the source. Could you please double-check this on your side?
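
For anyone who wants to repeat this kind of comparison, one possible way to measure peak memory is the standard library's `tracemalloc`. The sketch below is not the measurement described above: `peak_bytes` is a hypothetical helper, and `array.array` is used as a stand-in constructor so the example runs without pandas. To compare pandas itself, one could pass `lambda: pd.DataFrame(...)` instead. Note that `tracemalloc` only counts allocations that go through Python's allocators or are explicitly reported to it, so the numbers are approximate.

```python
import tracemalloc
from array import array

def peak_bytes(build):
    """Return the peak traced memory (in bytes) while build() runs (hypothetical helper)."""
    tracemalloc.start()
    try:
        build()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

N = 10**5  # kept small so the sketch runs quickly

# Stand-in for a frame constructor: array("q", ...) stores 64-bit integers compactly.
peak_list = peak_bytes(lambda: array("q", [i for i in range(N)]))
peak_gen = peak_bytes(lambda: array("q", (i for i in range(N))))

print(f"list source:      {peak_list:>12,} bytes")
print(f"generator source: {peak_gen:>12,} bytes")
```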

My feeling is that this could be implemented in a one-by-one fashion: request one element from the generator, convert it to a C++ primitive, and repeat. But it may not be efficient from a performance point of view.
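
A rough Python analogue of that one-by-one approach is sketched below. It assumes `array.array` as a stand-in for a native typed column; `column_from_iterable` is a hypothetical name, and none of this reflects how datatable would actually implement it.

```python
from array import array

def column_from_iterable(source, typecode="q"):
    """Build a typed column one element at a time (hypothetical helper).

    array.array stands in for a native column: each Python object is
    converted to a C value on append, so no intermediate list is held.
    """
    column = array(typecode)
    for value in source:      # single-threaded, strictly sequential
        column.append(value)  # raises if the value does not fit the column type
    return column

col = column_from_iterable(i * i for i in range(5))
```

The loop touches every element through the Python interpreter, which is where the performance concern comes from; the memory benefit is that only the compact column grows.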

Mar 23 '22 22:03 oleksiyskononenko