python-cx_Oracle icon indicating copy to clipboard operation
python-cx_Oracle copied to clipboard

plan to support Oracle direct-path api for faster bulk inserts?

Open gitpickle opened this issue 6 years ago • 20 comments

Oracle provides a well-known mechanism commonly refereed to as the "Direct Path API" which appends rows to the end of a table for faster bulk insertion. Additionally, Oracle offers (for free) the ability to compress blocks written in this manner based on repeating values. The current cx_oracle executemany code does not seem to leverage or offer a way to leverage the Direct Path API. The API is accessed commonly through the /*+ append */ hint. Are there plans for cx_oracle to natively support direct path insertion? Thanks for your help! Mike

gitpickle avatar Nov 14 '19 14:11 gitpickle

Data loading is something that SQL Loader already does well.

We have no immediate plans to investigate using the Oracle Call Interface Direct Path Load API for cx_Oracle, but I know it would be nice to have.

cjbj avatar Nov 14 '19 22:11 cjbj

Just to be clear: usign the /*+ append */ hint should work with cx_Oracle just as well as it does in SQL*Plus or any other client. The other method that Chris mentioned is a different API altogether which bypasses some of the other overhead associated with binding data.

anthony-tuininga avatar Nov 14 '19 22:11 anthony-tuininga

Hi, sorry for the confusion. I am referring to the latter - using the separate OCI API with reduced overhead. With Simple Compression turned on for the table (offered for free from Oracle) considerable space can be saved, 40-60%. This only kicks in for Direct Path loaded blocks. Additionally we have found that insert times can be significantly impacted, up to 50% faster load times.

To me this is more than a "nice to have" requirement. All production quality ETL tools support Oracle Direct Path insertion e.g. SAP Data Services, Oracle Integrator, and Informatica. We are trying to displace an ETL tool with Python data pipelines and we cannot accomplish this without Direct Path support for these reasons.

If this were to be implemented would the developer have to work through the C OCI calls? May I assume cx_oracle calls C which then calls OCI? I am thinking this is pretty difficult development do you have a sense for it? I may be willing to pitch in but I am concerned it may be very involved.

Thanks for your thoughts and advice. Appreciate your time! Mike.

On Thu, Nov 14, 2019 at 5:41 PM Anthony Tuininga [email protected] wrote:

Just to be clear: usign the /*+ append / hint should work with cx_Oracle just as well as it does in SQLPlus or any other client. The other method that Chris mentioned is a different API altogether which bypasses some of the other overhead associated with binding data.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369?email_source=notifications&email_token=ANYVZZPL45ORA6HICXYVQT3QTXHXTA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEDSCIA#issuecomment-554115360, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZILV23PUUPBOCUXZADQTXHXTANCNFSM4JNMNEOA .

gitpickle avatar Nov 14 '19 23:11 gitpickle

50% faster than executeMany()? Do you have Python benchmark data?

The Oracle Call Interface direct path load API is not small, so learning & implementing & testing would be a solid effort. It would need to be added to ODPI-C first, which @anthony-tuininga would have to do (we don't accept PR's for ODPI-C for various reasons). Then this would be wrapped in cx_Oracle calls. Typically Anthony would do the cx_Oracle API while doing the ODPI-C interface since it makes testing easier.

Can you post a brain-storm cx_Oracle API and show how you would use it?

cjbj avatar Nov 15 '19 01:11 cjbj

Up to 50 percent faster using an etl tool using direct path vs regular insert. Data Services has settings for this, the improvement is environment and data dependent obviously.

Yes I will propose an API I think generally the API would be very similar to executemany with an addition or two to indicate a desired direct path load. I will make a proposal. Thanks for your help. Mike

On Thu, Nov 14, 2019, 8:29 PM Christopher Jones [email protected] wrote:

50% faster than executeMany()? Do you have Python benchmark data?

The Oracle Call Interface direct path load API is not small, so learning & implementing & testing would be a solid effort. It would need to be added to ODPI-C first, which @anthony-tuininga https://github.com/anthony-tuininga would have to do (we don't accept PR's for ODPI-C for various reasons). Then this would be wrapped in cx_Oracle calls. Typically Anthony would do the cx_Oracle API while doing the ODPI-C interface since it makes testing easier.

Can you post a brain-storm cx_Oracle API and show how you would use it?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369?email_source=notifications&email_token=ANYVZZLB4N3INQNQYSW664LQTX3N5A5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEED62BI#issuecomment-554167557, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZKYHSEURXNVJWTNV5TQTX3N5ANCNFSM4JNMNEOA .

gitpickle avatar Nov 15 '19 02:11 gitpickle

Can you do a quick benchmark with executeMany() in Python and compare that with your ETL direct path? I am very curious.

cjbj avatar Nov 15 '19 02:11 cjbj

I will. We are loading millions of rows. I have copied the benefits below from a white paper from Oracle. https://docs.oracle.com/cd/B10501_01/server.920/a96652/ch09.htm

The paper lists the multitude of reasons why the direct path API is faster, and please don't forget about the compression benefit as well. Direct loaded blocks are compressed with Basic Compression turned on for the table.

The bottom line is that an Oracle direct path load is a totally different mechanism for loading data into a table. No inserts are executed internally by Oracle. Data is preformatted into blocks and appended to the end of the table offering many performance benefits. The internals are totally different.

The below information is from Oracle. Advantages of a Direct Path Load

A direct path load is faster than the conventional path for the following reasons:

  • Partial blocks are not used, so no reads are needed to find them, and fewer writes are performed.
  • SQL*Loader need not execute any SQL INSERT statements; therefore, the processing load on the Oracle database is reduced.
  • A direct path load calls on Oracle to lock tables and indexes at the start of the load and releases them when the load is finished. A conventional path load calls Oracle once for each array of rows to process a SQL INSERT statement.
  • A direct path load uses multiblock asynchronous I/O for writes to the database files.
  • During a direct path load, processes perform their own write I/O, instead of using Oracle's buffer cache. This minimizes contention with other Oracle users.
  • The sorted indexes option available during direct path loads allows you to presort data using high-performance sort routines that are native to your system or installation.
  • When a table to be loaded is empty, the presorting option eliminates the sort and merge phases of index-building. The index is filled in as data arrives.
  • Protection against instance failure does not require redo log file entries during direct path loads. Therefore, no time is required to log the load when:
    • Oracle is operating in NOARCHIVELOG mode
    • The UNRECOVERABLE parameter is set to Y
    • The object being loaded has the NOLOG attribute set

On Thu, Nov 14, 2019, 9:57 PM Christopher Jones [email protected] wrote:

Can you do a quick benchmark with executeMany() https://cx-oracle.readthedocs.io/en/latest/user_guide/batch_statement.html in Python and compare that with your ETL direct path? I am very curious.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369?email_source=notifications&email_token=ANYVZZKBHPAPQLSKHDJVJMLQTYFZHA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEED4HI#issuecomment-554188317, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZNPE4VAVAIGUQGJAU3QTYFZHANCNFSM4JNMNEOA .

gitpickle avatar Nov 15 '19 11:11 gitpickle

@gitpickle how did your benchmark go?

cjbj avatar Dec 04 '19 09:12 cjbj

Hi Chris! Sorry about taking so long to get back.

I have not been able to construct a benchmark due mainly to being busy but I wanted to make one more plea for this request. I don't know if I will be able to find time to construct a benchmark.

The functionality from Oracle exists for a reason. When using /*+ APPEND */, which uses the direct path loading mechanism, rows are inserted in a fundamentally different way from a traditional path load. The buffer cache is bypassed and blocks are appended above the high water mark in the table. For inserting large numbers of rows (e.g. millions, tens of millions, hundreds of millions, etc.) performance can be much improved. How much? It depends on environment specifics!

Contention on the redo logs can also be resolved if tables are being loaded in parallel. If the table is set to NOLOGGING, when bulk loading Oracle does not write data to the redo logs. Another performance benefit!

And finally, as I mentioned previously as well Oracle supports BASIC compression which compresses blocks written with the direct path API. This type of compression is free and saves space and further improves performance by writing less data to disk because it is compressed (based on repeated values - they are only written once per block). Obviously its effectiveness is dependent on the characteristics of the actual data. We have saved up to 50% space by using Basic compression. This only kicks in during a direct-path insert.

I absolutely don't mean to be argumentative at all and I greatly appreciate all the work of you and your team!!! The differences in how Oracle loads tables with a direct path load is pretty well documented. Given these architectural differences I am not really seeing the reason for running benchmarks. I could easily contrive something that works the way I want it to for instance based on my environment, which may be different than most other environments, who knows. We may have a slow SAN, whereas others may have a fast one where they have other bottlenecks.

What I am simply saying is that Oracle documents how the direct-path insert mechanism is fundamentally different. With our SAP Data Services tool, we have had to use the bulk loading mechanism (APPEND mode in DS) extensively at Fox Chase Cancer Center to save space and to speed up our "Initial" loads. Our tool will not make a round trip for every row regardless of whether or not we use the bulk loader. Our performance improvements are due to the increased efficiency of the insert mechanism Oracle employs in a direct path load, rather than reducing round trips. I am aware that executemany avoids the round trips. What I am saying is needed to compete with ETL tools is support for the direct path API (append based insertion).

I am hoping this makes sense and I 100% understand if you simply don't have time to implement the feature. I get it. But at the same time I also feel it is important.

One other thing I will mention, is that SQLAlchemy should turn around and leverage the feature if you implement it. We are using SQLAlchemy, which in turn can obviously use cx_oracle.

Hope this helps, Mike

On Wed, Dec 4, 2019 at 4:01 AM Christopher Jones [email protected] wrote:

@gitpickle https://github.com/gitpickle how did your benchmark go?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369?email_source=notifications&email_token=ANYVZZKML5IVQYM64Z2ZDBLQW5WXJA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEF4HPTI#issuecomment-561543117, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZIWN474QPN4FJOAD6TQW5WXJANCNFSM4JNMNEOA .

gitpickle avatar Dec 11 '19 21:12 gitpickle

The decision factors are standard ones (i) implementation complexity and maintenance costs (ii) performance benefits (iii) usability benefits (iv) whether leaving the feature to existing Oracle tools that do it is a wiser allocation of responsibilities.

cjbj avatar Dec 11 '19 22:12 cjbj

gotcha, makes perfect sense Chris thank you. Mike

On Wed, Dec 11, 2019 at 5:37 PM Christopher Jones [email protected] wrote:

The decision factors are standard ones (i) implementation complexity and maintenance costs (ii) performance benefits (iii) usability benefits (iv) whether leaving the feature to existing Oracle tools that do it is a wiser allocation of responsibilities.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369?email_source=notifications&email_token=ANYVZZLUTBZCTW6PC4YRNK3QYFTRZA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGU2B6Y#issuecomment-564764923, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZNZYNXMPBP3U5H6WHTQYFTRZANCNFSM4JNMNEOA .

gitpickle avatar Dec 11 '19 22:12 gitpickle

@gitpickle since you have data, you could help with (ii).

cjbj avatar Dec 11 '19 22:12 cjbj

I will try to get numbers

On Wed, Dec 11, 2019, 5:41 PM Christopher Jones [email protected] wrote:

@gitpickle https://github.com/gitpickle since you have data, you could help with (ii).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369?email_source=notifications&email_token=ANYVZZKSTKYUCSLIY3NV3MTQYFUALA5CNFSM4JNMNEOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGU2LWI#issuecomment-564766169, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZK5JO25CJAG2MFNE7DQYFUALANCNFSM4JNMNEOA .

gitpickle avatar Dec 12 '19 00:12 gitpickle

I also hope to have the function of direct path loading without giving up cx_Oracler

qianxuanyon avatar May 08 '21 00:05 qianxuanyon

yes it would be very nice to support this important Oracle functionality

On Fri, May 7, 2021 at 8:54 PM 李道然 @.***> wrote:

I also hope to have the function of direct path loading without giving up cx_Oracler

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oracle/python-cx_Oracle/issues/369#issuecomment-834914961, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANYVZZP5TUBDN4EV7C3GMEDTMSDSXANCNFSM4JNMNEOA .

gitpickle avatar May 08 '21 00:05 gitpickle

I think so too. It's on my long wish-list, but it's a big project. API suggestions, testcases etc welcome

cjbj avatar May 08 '21 02:05 cjbj

Will there be compressed data when cx oracle reads data transmission and writes to a remote server to increase the transmission speed?

qianxuanyon avatar Jul 01 '21 16:07 qianxuanyon

@qianxuanyon the feature hasn't been investigated yet.

cjbj avatar Jul 01 '21 21:07 cjbj

I currently have a scenario where a csv.gz file needs to be written to a remote database Two methods of operation are used

  1. Directly read from cx_Oracle and then executemany to insert into the database
  2. Upload the file to a server near the database and then use cx_Oracle

The second method plus the data upload time is much faster than the first method

So is it possible for us to compress data during the transmission of cx_Oracle to increase the transmission speed?

qianxuanyon avatar Jul 02 '21 02:07 qianxuanyon

@qianxuanyon that is a different topic to this enhancement request. Can you start a new issue?

cjbj avatar Jul 02 '21 02:07 cjbj