embulk-input-jdbc icon indicating copy to clipboard operation
embulk-input-jdbc copied to clipboard

Limit number of records fetch from database for preview command

Open joe-td opened this issue 1 year ago • 27 comments

Currently, the embulk preview command fetch all the records and it uses the same logic as embulk run ... command, so it will take a lot of time to run it on the table with big volume size. This PR will add the logic to limit the records fetch from database when executing the embulk preview ... command.

joe-td avatar Jun 20 '23 04:06 joe-td

  1. Why 50? Do you have any specific reason why 50 is good there? Or, shouldn't it be user-configurable?

Actually, I don't know which number is best, so just pick random 50 records for preview is enough. It's better to move it to configurable, but I don't think we need it because it's just for display some sample record in our console UI.

joe-td avatar Jun 21 '23 06:06 joe-td

I don't think we need it because it's just for display some sample record in our console UI.

Who are "our" ? Please be aware that this embulk-input-jdbc is a public open source project. It's not "owned" by you, nor by your company. It is owned by the Embulk project community.

The Embulk team is unaware of your context. Please make a commit that is reasonable for all users of embulk-input-jdbc in the world, not only for you nor for your company. The same for the pull request subject and description.

dmikurube avatar Jun 21 '23 07:06 dmikurube

Got your point and noted. Let discuss about the number of limit record. In my opinion hard code number is enough. If the connector needs to get it from config, we must pass config to class XXXInputConnection.

joe-td avatar Jun 21 '23 09:06 joe-td

Hello, @joe-td

I have one concern. Does this extension support incremental loading and raw query? If not, does this PR skip limitation in those cases? https://github.com/embulk/embulk-input-jdbc/tree/master/embulk-input-jdbc#use-incremental-loading-with-raw-query

hiroyuki-sato avatar Jun 21 '23 10:06 hiroyuki-sato

Hi @hiroyuki-sato This patch is only impact on embulk preview ... command, so there is no issue with incremental loading. About raw query, I think it can work. It's just rebuild the query by append limit/top to original query.

joe-td avatar Jun 21 '23 11:06 joe-td

Hello, @joe-td

A user can write any query like this. Does this PR append limit/top statement if a user use query attribute? Or does this PR append limit/top statement if user uses table attribute only?

in:
  type: postgresql
  user: user
  host: localhost
  database: embulk_test
#  table: sample
  query: select * from sample limit 1

If a user uses limit statement in the query attribute, it may cause like a following error.

select * from sample limit 1 limit 1;
ERROR:  syntax error at or near "limit"
LINE 1: select * from sample limit 1 limit 1;

Update at (2023-06-21)

In my optinion, If preview limit supports table attribute only, new attribute ex) preview_limit may be useful. If we also support the query attribute, I think It is necessary to parser SQL statements.

in:
  type: postgresql
  user: user
  host: localhost
  database: embulk_test
  table: sample
  incremental_columns:
    - id
  incremental: true
  last_record: [0]
  #
  # new preview option.
  # 
  preview_limit: 50 # null (or 0) by default means unlimited

hiroyuki-sato avatar Jun 21 '23 11:06 hiroyuki-sato

Hi @hiroyuki-sato Yes, you're right. If user input raw query and the query contain top/limit, the current logic still append it and cause a problem. It's much more complicated to handle raw query, i.e., there are a lot of nested query level so in that case we don't know how to append top/limit into the query. Maybe we can introduce a new preview_query config for user input it, but from user experience, I think it's not a good way.

joe-td avatar Jun 22 '23 02:06 joe-td

Hello, @joe-td

I think preview_query is a little complicated if the user use table attribute only.

Just for one idea.

  • preview_limit: If a configuration contains a table attribute, it appends limit <preview_limit> at the end of the query in the preview mode.
  • preview_query: A configuration containing a query attribute uses preview_query in the preview mode. If the configuration doesn't have preview_query, It uses the query attribute as is.

Plan B. Introduce limit place mark ie.<limit_stmt> .

in:
  type: postgresql
  user: user
  host: localhost
  database: embulk_test
#  table: sample
  query: select * from sample <limit_stmt>

hiroyuki-sato avatar Jun 22 '23 06:06 hiroyuki-sato

Thanks @hiroyuki-sato for your suggestion. Actually I don't like to add more configuration for user, it can put user to spend more time on correct the config and easier to make error. One more thing about the raw query is that we don't know the best place to put top/limit into the original query. I checked the JDBC driver and found out that they provide the method setMaxRows() https://docs.oracle.com/javase/8/docs/api/java/sql/Statement.html#setMaxRows-int-, I think we can use this without introducing configuration or manipulate the query.

joe-td avatar Jun 22 '23 07:06 joe-td

Hello, @joe-td.

If setMaxRows() works as you expected, the implementation will be more straightforward. 👍

hiroyuki-sato avatar Jun 22 '23 12:06 hiroyuki-sato

Hello, @joe-td. Thank you for improving the PR.

The Embulk project is carefully keeping the current behavior as far as possible. So, I'm worrying about changing the default preview limit suddenly.

Keeping current behavior and changing behavior using configuration may be helpful. Fortunately, Embulk 0.11.0 introduce EEP-4 System-wide Configuration by Embulk System Properties and Embulk Home. If Saas provider sets preview value globally using this setting, a user doesn't worry about preview size.

This is just my opinion. I'll support @dmikurbe and @hito4t opinions.

hiroyuki-sato avatar Jun 23 '23 02:06 hiroyuki-sato

The configuration would be similar to preview_sample_buffer_bytes for File Input Plugins, and I found preview_sample_rows in the core.

https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L32-L58

I do not know actual example cases that Input Plugins use this configuration explicitly, but it might help somehow, or it would provide some hints.

dmikurube avatar Jun 23 '23 02:06 dmikurube

I think the constant MAX_PREVIEW_RECORDS is just limit number of record for embulk preview ... and it avoid to scan all records from table to make query a bit faster for preview and it does not impact to embulk run .... After the records is fetch from database, it will use the preview processor to display preview the data. In my opinion the purpose of embulk preview is display some sample data, so I don't see any risk at the moment.

joe-td avatar Jun 23 '23 03:06 joe-td

I understand your point, but for example, if there is a user who has used embulk preview to check 100+ records, the user will not have an option to keep usint it. Compatibility is not always mandatory, but we need an appropriate communication when breaking the compatibility, and we should avoid breaking the compatibility unless it's very difficult. In addition, even if we change the default behavior, we should leave an option to achieve compatible behavior if possible. In this case, it's limited to 100 and there is no way to avoid. It's not good.

Unfortunately, there seem no ways to pass the core's preview_sample_rows to an Input Plugin. (It could be a good point to improve the core side, but it's an independent topic.)

My suggestion is:

  • Add a new configuration preview_sample_rows for these plugins. There would be no limit if it's null. The default would be null, but nice to print a warning log message that the preview query has no limit.
  • You could have OptionalInt limit, instead of boolean isPreview, as the new parameter of newSelectCursor. The code changes could be easier, and more visible.

dmikurube avatar Jun 23 '23 06:06 dmikurube

Thanks for your feedback and suggestion. Totally agree with you about the make it configurable. But currently behavior is the plugin print only 127 records for the samples even if we does not limit the records fetch from DB. There are 2 thing that we need to make it clear:

  • Limit number of row fetch from DB for preview flow (currently there is no limit, it will scan all rows but fetch only 10000 records from cursor via fetch_rows config attribute, and I don't know why it prints only 127 lines)
  • Limit number of line print for preview ( current it config to 15 lines)

joe-td avatar Jun 23 '23 06:06 joe-td

Okay, then can you investigate why it prints only 127 lines? Depending on the investigation, we may agree that limiting by default in preview would be reasonable.

dmikurube avatar Jun 23 '23 06:06 dmikurube

Jfyi, filed this for the core. https://github.com/embulk/embulk/issues/1623

dmikurube avatar Jun 23 '23 07:06 dmikurube

I think the embulk internally handle page by page with limit page size is 32768 bytes. So, the number of preview lines will be very and depend on size of record and maximum size of preview line limit to embulk page size.

  • https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PooledBufferAllocator.java#L46
  • https://github.com/embulk/embulk-input-jdbc/blob/master/embulk-input-jdbc/src/main/java/org/embulk/input/jdbc/AbstractJdbcInputPlugin.java#L491
  • https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L175

joe-td avatar Jun 23 '23 07:06 joe-td

Yeah, the internal Page/Buffer is limited to 32768, but it's the same in run, not just preview. Can you elaborate why it's consequently limited only in preview?

dmikurube avatar Jun 23 '23 07:06 dmikurube

In the preview it just handle first page only: https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L175 and skip the remaining page.

joe-td avatar Jun 23 '23 07:06 joe-td

That is just checking the number of records (rows), not pages?

dmikurube avatar Jun 23 '23 10:06 dmikurube

As my understand, it consumed first page (32678 bytes) and check number of rows from first page with default sample row config ?

joe-td avatar Jun 23 '23 10:06 joe-td

PageOutput is used repeatedly for multiple Pages. recordCount can be accumulated among multiple Pages.

It looks to me that SamplingPageOutput would accept the second Page if the first Page's record count (PageReader#getRecordCount(Page)) is less than sampleRows.

If you're showing it's limited to a single Page, you'll have to confirm SamplingPageOutput#add(Page) is called only once.

dmikurube avatar Jun 23 '23 13:06 dmikurube

Yeah! the SamplingPageOutput can accept more than one page and after processing the page, the recordCount will accumulate and check with sample row config (15 lines). Back to my example, the first page has more than 15 records already, so it skip to subsequence page.

joe-td avatar Jun 24 '23 02:06 joe-td

So, the limit comes from sampleRow, which comes from the global preview_sample_rows.

Then let's get back. If there is a user who has used embulk preview to check 100+ records, the user would set the global preview_sample_rows to 100+.

But, your change will not work for this case. Even if the user sets 100+ for the global preview_sample_rows, the preview result would always be limited to 100. The user would have no any option to avoid it.

dmikurube avatar Jun 24 '23 06:06 dmikurube

Thanks @dmikurube. It makes sense now. Let me add preview_sample_rows config attribute for this case. But I think it's better to transfer preview_sample_rows from core to plugin.

joe-td avatar Jun 24 '23 10:06 joe-td

But I think it's better to transfer preview_sample_rows from core to plugin.

Agreed, but it'd be a long journey as it needs to change the contract between the core and plugins. If you need to limit this sooner, this is the only way.

dmikurube avatar Jun 25 '23 03:06 dmikurube

Thanks @hito4t for your comments. You're right, the performance will depend on the query time. So, I think will close this PR.

joe-td avatar Aug 13 '24 08:08 joe-td