embulk-input-jdbc
embulk-input-jdbc copied to clipboard
Limit number of records fetch from database for preview command
Currently, the embulk preview
command fetch all the records and it uses the same logic as embulk run ...
command, so it will take a lot of time to run it on the table with big volume size. This PR will add the logic to limit the records fetch from database when executing the embulk preview ...
command.
- Why 50? Do you have any specific reason why 50 is good there? Or, shouldn't it be user-configurable?
Actually, I don't know which number is best, so just pick random 50 records for preview is enough. It's better to move it to configurable, but I don't think we need it because it's just for display some sample record in our console UI.
I don't think we need it because it's just for display some sample record in our console UI.
Who are "our" ? Please be aware that this embulk-input-jdbc
is a public open source project. It's not "owned" by you, nor by your company. It is owned by the Embulk project community.
The Embulk team is unaware of your context. Please make a commit that is reasonable for all users of embulk-input-jdbc
in the world, not only for you nor for your company. The same for the pull request subject and description.
Got your point and noted. Let discuss about the number of limit record. In my opinion hard code number is enough. If the connector needs to get it from config, we must pass config to class XXXInputConnection
.
Hello, @joe-td
I have one concern. Does this extension support incremental loading and raw query? If not, does this PR skip limitation in those cases? https://github.com/embulk/embulk-input-jdbc/tree/master/embulk-input-jdbc#use-incremental-loading-with-raw-query
Hi @hiroyuki-sato
This patch is only impact on embulk preview ...
command, so there is no issue with incremental loading. About raw query, I think it can work. It's just rebuild the query by append limit/top to original query.
Hello, @joe-td
A user can write any query like this. Does this PR append limit/top
statement if a user use query
attribute?
Or does this PR append limit/top
statement if user uses table
attribute only?
in:
type: postgresql
user: user
host: localhost
database: embulk_test
# table: sample
query: select * from sample limit 1
If a user uses limit
statement in the query
attribute, it may cause like a following error.
select * from sample limit 1 limit 1;
ERROR: syntax error at or near "limit"
LINE 1: select * from sample limit 1 limit 1;
Update at (2023-06-21)
In my optinion, If preview limit supports table
attribute only, new attribute ex) preview_limit
may be useful.
If we also support the query
attribute, I think It is necessary to parser SQL statements.
in:
type: postgresql
user: user
host: localhost
database: embulk_test
table: sample
incremental_columns:
- id
incremental: true
last_record: [0]
#
# new preview option.
#
preview_limit: 50 # null (or 0) by default means unlimited
Hi @hiroyuki-sato
Yes, you're right. If user input raw query and the query contain top/limit, the current logic still append it and cause a problem. It's much more complicated to handle raw query, i.e., there are a lot of nested query level so in that case we don't know how to append top/limit into the query. Maybe we can introduce a new preview_query
config for user input it, but from user experience, I think it's not a good way.
Hello, @joe-td
I think preview_query
is a little complicated if the user use table
attribute only.
Just for one idea.
-
preview_limit
: If a configuration contains atable
attribute, it appendslimit <preview_limit>
at the end of the query in the preview mode. -
preview_query
: A configuration containing aquery
attribute usespreview_query
in the preview mode. If the configuration doesn't havepreview_query
, It uses thequery
attribute as is.
Plan B. Introduce limit place mark ie.<limit_stmt>
.
in:
type: postgresql
user: user
host: localhost
database: embulk_test
# table: sample
query: select * from sample <limit_stmt>
Thanks @hiroyuki-sato for your suggestion. Actually I don't like to add more configuration for user, it can put user to spend more time on correct the config and easier to make error. One more thing about the raw query is that we don't know the best place to put top/limit into the original query. I checked the JDBC driver and found out that they provide the method setMaxRows()
https://docs.oracle.com/javase/8/docs/api/java/sql/Statement.html#setMaxRows-int-, I think we can use this without introducing configuration or manipulate the query.
Hello, @joe-td.
If setMaxRows()
works as you expected, the implementation will be more straightforward. 👍
Hello, @joe-td. Thank you for improving the PR.
The Embulk project is carefully keeping the current behavior as far as possible. So, I'm worrying about changing the default preview limit suddenly.
Keeping current behavior and changing behavior using configuration may be helpful. Fortunately, Embulk 0.11.0 introduce EEP-4 System-wide Configuration by Embulk System Properties and Embulk Home. If Saas provider sets preview value globally using this setting, a user doesn't worry about preview size.
This is just my opinion. I'll support @dmikurbe and @hito4t opinions.
The configuration would be similar to preview_sample_buffer_bytes
for File Input Plugins, and I found preview_sample_rows
in the core.
https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L32-L58
I do not know actual example cases that Input Plugins use this configuration explicitly, but it might help somehow, or it would provide some hints.
I think the constant MAX_PREVIEW_RECORDS
is just limit number of record for embulk preview ...
and it avoid to scan all records from table to make query a bit faster for preview and it does not impact to embulk run ...
. After the records is fetch from database, it will use the preview processor to display preview the data. In my opinion the purpose of embulk preview
is display some sample data, so I don't see any risk at the moment.
I understand your point, but for example, if there is a user who has used embulk preview
to check 100+ records, the user will not have an option to keep usint it. Compatibility is not always mandatory, but we need an appropriate communication when breaking the compatibility, and we should avoid breaking the compatibility unless it's very difficult. In addition, even if we change the default behavior, we should leave an option to achieve compatible behavior if possible. In this case, it's limited to 100
and there is no way to avoid. It's not good.
Unfortunately, there seem no ways to pass the core's preview_sample_rows
to an Input Plugin. (It could be a good point to improve the core side, but it's an independent topic.)
My suggestion is:
- Add a new configuration
preview_sample_rows
for these plugins. There would be no limit if it'snull
. The default would benull
, but nice to print a warning log message that the preview query has no limit. - You could have
OptionalInt limit
, instead ofboolean isPreview
, as the new parameter ofnewSelectCursor
. The code changes could be easier, and more visible.
Thanks for your feedback and suggestion. Totally agree with you about the make it configurable. But currently behavior is the plugin print only 127 records for the samples even if we does not limit the records fetch from DB. There are 2 thing that we need to make it clear:
- Limit number of row fetch from DB for preview flow (currently there is no limit, it will scan all rows but fetch only 10000 records from cursor via
fetch_rows
config attribute, and I don't know why it prints only 127 lines) - Limit number of line print for preview ( current it config to 15 lines)
Okay, then can you investigate why it prints only 127 lines? Depending on the investigation, we may agree that limiting by default in preview would be reasonable.
Jfyi, filed this for the core. https://github.com/embulk/embulk/issues/1623
I think the embulk internally handle page by page with limit page size is 32768 bytes. So, the number of preview lines will be very and depend on size of record and maximum size of preview line limit to embulk page size.
- https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PooledBufferAllocator.java#L46
- https://github.com/embulk/embulk-input-jdbc/blob/master/embulk-input-jdbc/src/main/java/org/embulk/input/jdbc/AbstractJdbcInputPlugin.java#L491
- https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L175
Yeah, the internal Page
/Buffer
is limited to 32768, but it's the same in run
, not just preview
. Can you elaborate why it's consequently limited only in preview?
In the preview it just handle first page only: https://github.com/embulk/embulk/blob/v0.11.0/embulk-core/src/main/java/org/embulk/exec/PreviewExecutor.java#L175 and skip the remaining page.
That is just checking the number of records (rows), not pages?
As my understand, it consumed first page (32678 bytes) and check number of rows from first page with default sample row config ?
PageOutput
is used repeatedly for multiple Page
s. recordCount
can be accumulated among multiple Page
s.
It looks to me that SamplingPageOutput
would accept the second Page
if the first Page
's record count (PageReader#getRecordCount(Page)
) is less than sampleRows
.
If you're showing it's limited to a single Page
, you'll have to confirm SamplingPageOutput#add(Page)
is called only once.
Yeah! the SamplingPageOutput
can accept more than one page and after processing the page, the recordCount
will accumulate and check with sample row config (15 lines). Back to my example, the first page has more than 15 records already, so it skip to subsequence page.
So, the limit comes from sampleRow
, which comes from the global preview_sample_rows
.
Then let's get back. If there is a user who has used embulk preview to check 100+ records, the user would set the global preview_sample_rows
to 100+.
But, your change will not work for this case. Even if the user sets 100+ for the global preview_sample_rows
, the preview result would always be limited to 100. The user would have no any option to avoid it.
Thanks @dmikurube. It makes sense now. Let me add preview_sample_rows
config attribute for this case. But I think it's better to transfer preview_sample_rows
from core to plugin.
But I think it's better to transfer
preview_sample_rows
from core to plugin.
Agreed, but it'd be a long journey as it needs to change the contract between the core and plugins. If you need to limit this sooner, this is the only way.
Thanks @hito4t for your comments. You're right, the performance will depend on the query time. So, I think will close this PR.