
Usage of generators instead of list/tuples in executemany

Open meisn opened this issue 7 years ago • 9 comments

Answer the following questions:

  1. What is your version of Python? Is it 32-bit or 64-bit? Python 3.6.4/3.7 64bit

  2. What is your version of cx_Oracle? 6.4

  3. What is your version of the Oracle client (e.g. Instant Client)? How was it installed? Where is it installed? 11.2, packaged installation in my company

  4. What is your version of the Oracle Database? 11gR2

  5. What is your OS and version? Windows 7 64bit Enterprise

  6. What compiler version did you use? For example, with GCC, run gcc --version. None

  7. What environment variables did you set? How exactly did you set them?

  8. What exact command caused the problem (e.g. what command did you try to install with)? Who were you logged in as? Nothing actually; I intended to use executemany() to bulk-insert a large dataset that first has to be retrieved from a file (i.e. the parameter list generated beforehand). Logged in as a regular user, not SYS or any privileged account; my personal user has write privileges on the database tables, of course.

  9. What error(s) are you seeing? I have to insert many records from large (XML) files. I use a function for this, of which I also have a generator version. On small files (Python process below 300 MB) I can generate the list of parameters and call executemany(stmnt, params) on it without any issue. I also have some larger files of the same sort, around 90 MB each, which results in memory consumption far above that (other modules are involved in producing the parameter set). If I could use my generator instead of the function, i.e. if the whole list did not need to be produced up front, the memory consumption would be much lower and my process would not fail on these files. I experimented with the bundled sqlite3 module, where a generator is accepted by executemany() (https://docs.python.org/3/library/sqlite3.html), and compared the two approaches: with the function my Python process used close to 1 GB of memory but did not fail, while the generator version consumed only some 250 MB. It ran a bit longer, of course, but that is acceptable when the insert itself is what matters.

Can there be an enhancement like this in one of the future versions?

meisn avatar Jul 04 '18 19:07 meisn
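For reference, a minimal sketch of the sqlite3 behaviour being described: its executemany() accepts any iterable, including a generator, so the full parameter list never has to exist in memory. The table and column names below are made up purely for illustration.

```python
import sqlite3

def rows():
    # Generator yielding parameter tuples one at a time,
    # e.g. produced while parsing a large XML file.
    for i in range(1_000_000):
        yield (i, f"name-{i}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
# sqlite3 consumes the generator lazily; the whole list is never built.
conn.executemany("INSERT INTO items VALUES (?, ?)", rows())
conn.commit()
```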

That's an interesting request and it seems perfectly reasonable to request, too. :-) I'll consider this for a future release.

anthony-tuininga avatar Jul 04 '18 19:07 anthony-tuininga

Hi Anthony, I'm really impressed now, not just by how quickly you responded but also that you accept this as a request :-) Thank you, I would appreciate the help. Regards, Ronny

meisn avatar Jul 04 '18 19:07 meisn

@anthony-tuininga I guess you are thinking of the internal batching & inserting of data that was once discussed? Otherwise cx_Oracle would still need to instantiate the full data set in memory before calling OCI.

@meisn you would have seen this in Anthony's blog post:

For really huge data sets there may be external buffer or network limits to how many rows can be processed at one time. These limits are based on both the number of rows being processed as well as the "size" of each row that is being processed. The sweet spot can be found by tuning your application. Repeated calls to executemany() are still better than repeated calls to execute().

To work with the current functionality, try calling executemany() multiple times.

@anthony-tuininga I wonder whether direct path loading would be more interesting for large data sets?

cjbj avatar Jul 04 '18 23:07 cjbj

@cjbj, yes, I was thinking of batching the inserts and giving the caller the chance to specify how many rows would be processed in each batch, with a reasonable default. I do hope to take a look at direct path loading as well -- but in both cases batching will be needed for very large data sets.

anthony-tuininga avatar Jul 05 '18 00:07 anthony-tuininga

@cjbj Thanks for the link and the hint. In fact, you recommended there to open the "issue" here :-) I'm not sure about calling executemany() multiple times - I would need to split my XML file into reasonable chunks first, which is also not that easy. But I will investigate this route as well.

meisn avatar Jul 05 '18 12:07 meisn

@meisn, what I think @cjbj is suggesting is that you would create your own version of executemany() in Python code which would accept the iterator. That method would then consume the iterator up to a certain number of rows (100? 1000? 10000?) and then call the real executemany() and continue doing this until the iterator was exhausted. This would be fairly easy to do in Python code, I believe.

anthony-tuininga avatar Jul 05 '18 15:07 anthony-tuininga
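A minimal sketch of the kind of wrapper described above, assuming an open cx_Oracle cursor and an iterable (or generator) of bind sequences. The helper name executemany_batched, the batch size, and the table in the usage comment are illustrative, not part of the cx_Oracle API.

```python
import itertools

def executemany_batched(cursor, statement, rows, batch_size=1000):
    # Consume the iterator in fixed-size chunks so only one batch
    # of bind values is held in memory at a time.
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        cursor.executemany(statement, batch)

# Usage (hypothetical table and generator):
# with connection.cursor() as cursor:
#     executemany_batched(cursor,
#                         "insert into t (id, name) values (:1, :2)",
#                         generate_rows_from_xml(), batch_size=5000)
# connection.commit()
```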

@anthony-tuininga I can try that. I just realized that the majority of this issue is my own code. I create a list of dictionaries (to have my named parameters and to accommodate the different types of XML tags the files can contain). This works fine up to a certain file size but is terrible for the larger ones, so I will also need to think about alternatives. Thanks so far; I hope you still add this enhancement at some point, even if it is just a convenience.

meisn avatar Jul 05 '18 16:07 meisn

No problem. And I will certainly consider this enhancement request further when I get some time!

anthony-tuininga avatar Jul 05 '18 19:07 anthony-tuininga

@anthony-tuininga I have worked around my issue for now by creating smaller input files and hence smaller parameter sets. I use dictionaries as parameters, which, it turned out, was probably not my best choice. But with smaller parameter sets and more executemany() calls it works fine. Thanks for your suggestions, also to @cjbj.

meisn avatar Jul 11 '18 17:07 meisn