drill icon indicating copy to clipboard operation
drill copied to clipboard

DRILL-7978: Fixed Width Format Plugin

Open MFoss19 opened this issue 4 years ago • 21 comments

DRILL-7978: Fixed Width Format Plugin

Description

Developing format plugin to parse fixed width files.

Fixed Width Text File Definition: https://www.oracle.com/webfolder/technetwork/data-quality/edqhelp/Content/introduction/getting_started/configuring_fixed_width_text_file_formats.htm

Documentation

Users can now create a format configuration to parse fixed width files.

Testing

Unit tests added. More to come.

MFoss19 avatar Jul 29 '21 18:07 MFoss19

This pull request introduces 2 alerts when merging 2d17f1baa1595206b02eff9ddbf0723dbbb5e575 into 0c9451e6720e5028e1187067cc6d1957ff998bef - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Oct 18 '21 19:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging 18380eaff486389417f6349164450ed54c35e342 into f4ea90ce3e70065c5db364c5f06452c079c151ac - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Oct 20 '21 21:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging a91be4c633f8fbbecd64200b5ead07c5244e1a2b into f4ea90ce3e70065c5db364c5f06452c079c151ac - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Oct 21 '21 17:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging dc60d28e52d997c272105b8eef96e0d9b06af73e into b6da35ece5a278a5ca72ecc33bafc98ba8f861f6 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Oct 26 '21 22:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging 05ae3f1dcadfa067a569fd0634f487e5e282d83e into 58ced604ae460a72b73c85dd97bb3a7df756dcbf - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Oct 28 '21 21:10 lgtm-com[bot]

This pull request introduces 2 alerts when merging 9d66f9114ab27429a8366d1d71b15f638574dc45 into 52838ef26e5e3e6b4461c2c656ffada0c64c9e88 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 05 '21 14:11 lgtm-com[bot]

This pull request introduces 2 alerts when merging 9b95c45f7f5ddbb7543c89a1efac99bda0ebff8c into 52838ef26e5e3e6b4461c2c656ffada0c64c9e88 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 05 '21 16:11 lgtm-com[bot]

@MFoss19 @estherbuchwalter following some recent chat with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.

"fixedwidth": {
  "type": "fixedwidth",
  "extensions": [
    "fwf"
  ],
  "extractHeader": true,
  "trimStrings": true,
  "columnOffsets": [1, 11, 21, 31],
  "columnWidths": [10, 10, 10, 10]
}

Column names and types can already come from a provided schema or aliasing after calls to CAST(). Incidentally, the settings above can be overriden per query using a provided schema too.

There's also a part of that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.

jnturton avatar Nov 08 '21 08:11 jnturton

@MFoss19 @estherbuchwalter following some recent chat with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.

"fixedwidth": {
  "type": "fixedwidth",
  "extensions": [
    "fwf"
  ],
  "extractHeader": true,
  "trimStrings": true,
  "columnOffsets": [1, 11, 21, 31],
  "columnWidths": [10, 10, 10, 10]
}

Column names and types can already come from a provided schema or aliasing after calls to CAST(). Incidentally, the settings above can be overriden per query using a provided schema too.

There's also a part of that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.

@dzamo In this case, I'd respectfully disagree here. In effect, the configuration is providing a schema to the user, similar to the way the logRegex reader works. In this case, the user will get the best data possible if we can include datatypes and field names in the schema, so that they can just do a SELECT * and not have to worry about casting etc.

Let's consider a real world use case: some fixed width log generated by a database. Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields. You could use however the logRegex reader to do this. That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces such that if a file was in the mysql_logs folder, it would use the mysql log config, and if it was in the postgres it would use another.

My opinion here is that the goal should be to get the cleanest data to the user as possible without the user having to rely on CASTs and other complicating factors.

cgivre avatar Nov 08 '21 13:11 cgivre

Let's consider a real world use case: some fixed width log generated by a database. Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields. You could use however the logRegex reader to do this. That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces such that if a file was in the mysql_logs folder, it would use the mysql log config, and if it was in the postgres it would use another.

@cgivre This use case would still work after two CREATE SCHEMA statements to set the names and data types, wouldn't it? The schemas would be applied every subsequent query.

My opinion here is that the goal should be to get the cleanest data to the user as possible without the user having to rely on CASTs and other complicating factors.

Let's drop the CASTs, those aren't fun. So we're left with different ways a user can specify column names and types.

  1. With a CREATE SCHEMA against a directory.
  2. With an inline schema to a table function.
  3. With some plugin-specific format config that works for this plugin but generally not for others.

Any one requires some effort, any one gets you to select * returning nice results (disclaimer: is this claim I'm making actually true?) which is super valuable. So shouldn't we avoid the quirky 3 and commit to 1 and 2 consistently wherever we can?

jnturton avatar Nov 08 '21 13:11 jnturton

This pull request introduces 2 alerts when merging f9e96fec406d2ce48810fb1f25291c7d54dcdcfb into 42e7b77b302805156c48952e9e6a7b5b5383dfdc - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 17 '21 22:11 lgtm-com[bot]

This pull request introduces 2 alerts when merging 428a512ec35309c90b254c71735f88b97768f7ca into 14d96d1b6a847f3c07a453f6641993da21a4167c - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 18 '21 20:11 lgtm-com[bot]

This pull request introduces 2 alerts when merging 428a2dd2b6823cd33775746a2933a7bf6dcf702b into 17f3654919b8429b12e321fb4ee837b7a52e06f1 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 23 '21 22:11 lgtm-com[bot]

This pull request introduces 2 alerts when merging 881d4658e62c106c13364bf1b4b10e66d3f013f7 into 38d0c1d9586b7adc8adfb89f574085cdc26595fe - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 24 '21 17:11 lgtm-com[bot]

This pull request introduces 2 alerts when merging 56d8f6e79d5e33470c120d3071495f631d6f3d1b into 38d0c1d9586b7adc8adfb89f574085cdc26595fe - view on LGTM.com

new alerts:

  • 2 for Unused format argument

lgtm-com[bot] avatar Nov 24 '21 23:11 lgtm-com[bot]

Dear PR author and reviewers.

This is a generic message to say that we would like to merge this PR in time for the 1.20 release. Currently we're targeting a master branch freeze date of 2021-12-10 (10 Dec). Please strive to complete development and review by this time, or indicate that the PR will need more time (and how much).

Thank you.

jnturton avatar Dec 01 '21 13:12 jnturton

Hi @paul-rogers. We're in the throes of trying to convert this plugin to use EVF v2 / scan.v3. This will be the first instance of this kind in the Drill code base, apart from a very simple mock plugin which supports unit tests (CompliantTextBatchReader remains based on EVF v1, from what I can see).

Something that's confusing me is that the EasyFormatPlugin base class is coded against the ManagedReader interface from EVF v1. So I cannot see that we can both derive from EasyFormatPlugin, and also implement ManagedReader from EVF v2. Am I missing something here?

Thanks, James

jnturton avatar Dec 13 '21 13:12 jnturton

@jnturton , thanks for pushing the EVF V2 stuff forward! The EasyFormatPlugin should contain "shims" for the original format, for EVF1 and for EVF2. Given that you said you can't find it, and that the CSV reader is still based on V1, I wonder if there is some branch that never got pushed a PR? I'll do some research to determine what's what.

paul-rogers avatar Dec 14 '21 19:12 paul-rogers

@jnturton, turns out the required changes are sitting in a branch in my private repo, csv, that somehow never not converted to a PR. I'll see if I can merge that stuff into a PR.

paul-rogers avatar Dec 16 '21 16:12 paul-rogers

PR 2419 has the EVF V2 adapter for the Easy Format Plugin. I suggest that you use that code, and follow the example there, to add EVF V2 support here.

paul-rogers avatar Jan 03 '22 02:01 paul-rogers

This pull request introduces 1 alert when merging bf6a16c3b0e5ccb41b4356e284135e8358baa25f into 4e97f5c7d9875dfda5785c6f01bc137973183d2b - view on LGTM.com

new alerts:

  • 1 for Unused format argument

lgtm-com[bot] avatar Mar 22 '22 20:03 lgtm-com[bot]