airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Auto ML assets

Open MaksYermak opened this issue 3 years ago • 4 comments

I have created links and updated system tests for Auto ML operators.

Co-authored-by: Wojciech Januszek [email protected] Co-authored-by: Lukasz Wyszomirski [email protected] Co-authored-by: Maksim Yermakou [email protected]


^ Add meaningful description above

Read the Pull Request Guidelines for more information. In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed. In case of a new dependency, check compliance with the ASF 3rd Party License Policy. In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

MaksYermak avatar Aug 02 '22 09:08 MaksYermak

Errors :(

potiuk avatar Aug 02 '22 19:08 potiuk

Rebased to acount for Flask 2.2 errors fixed yesterday.

potiuk avatar Aug 04 '22 15:08 potiuk

there is a csv file tests/system/providers/google/cloud/automl/resources/bank-marketing.csv

of 45K char , is that normal ?

raphaelauv avatar Aug 09 '22 22:08 raphaelauv

Yeah 45K lines of .csv file is NOT something we want. Few options:

  1. what happens when you zip the file ? how big it is going to get
  2. Do we REALLY need as big of a file?
  3. We could easily place it it in our Amazon S3 bucket to download it for the test when needed, we could make it publicly available

potiuk avatar Aug 10 '22 11:08 potiuk

Yeah 45K lines of .csv file is NOT something we want. Few options:

  1. what happens when you zip the file ? how big it is going to get
  2. Do we REALLY need as big of a file?
  3. We could easily place it it in our Amazon S3 bucket to download it for the test when needed, we could make it publicly available

This .csv is needed for training an AutoML model, in order to start the training .csv should consist more then 1000 rows. For our test I can reduce the file to 2100 rows. @potiuk what do you think about reducing the file size?

MaksYermak avatar Aug 11 '22 13:08 MaksYermak

Yeah 45K lines of .csv file is NOT something we want. Few options:

  1. what happens when you zip the file ? how big it is going to get
  2. Do we REALLY need as big of a file?
  3. We could easily place it it in our Amazon S3 bucket to download it for the test when needed, we could make it publicly available

This .csv is needed for training an AutoML model, in order to start the training .csv should consist more then 1000 rows. For our test I can reduce the file to 2100 rows. @potiuk what do you think about reducing the file size?

@potiuk Catching attention :) I think 2100 is okayish (not the best but certainly better than 50k). Please comment if you still think it should be stored in the external storage.

bhirsz avatar Aug 25 '22 10:08 bhirsz

Can we compress it (and dynamically decompress during test?). Just zipping it is 20K instead of 160K. This file is unlikely to ever change and it is cimpletely uninteresting to see what's in when you review the cod, so there is no particular reason to keep text file in Git.

It's not only the size that matters in this case. Keeping it plain text has this really nasty effect that it when you search something in the source code in your IDE, you will find some matching words here likely, so keeping the file uncompressed make it very prone to falling search&replace victim,

potiuk avatar Aug 25 '22 12:08 potiuk

Can we compress it (and dynamically decompress during test?). Just zipping it is 20K instead of 160K. This file is unlikely to ever change and it is cimpletely uninteresting to see what's in when you review the cod, so there is no particular reason to keep text file in Git.

It's not only the size that matters in this case. Keeping it plain text has this really nasty effect that it when you search something in the source code in your IDE, you will find some matching words here likely, so keeping the file uncompressed make it very prone to falling search&replace victim,

@potiuk I have done it

MaksYermak avatar Sep 01 '22 07:09 MaksYermak

Sorry for delay - been a bit busy.

No, It's not compressed - it's just bundled in .tar now not .zipped (.tar-ing single file kinda make no sense) . Stil takes 170 instead of 20K (and this PR needs rebase anyway).

potiuk avatar Sep 09 '22 02:09 potiuk

conflicts need to be resolved after string normalisation

potiuk avatar Oct 24 '22 22:10 potiuk

Rebased to rebuild.

potiuk avatar Oct 31 '22 04:10 potiuk

Tests failing.

potiuk avatar Dec 04 '22 23:12 potiuk

static check failures.

potiuk avatar Jan 17 '23 12:01 potiuk

REbased - static checks fixed in main (mysql python connector release breaking mypy)

potiuk avatar Jan 18 '23 09:01 potiuk

@potiuk I think that PR can be merged. I can't do that because I am not the author of PR and I don't have write access

MrGeorgeOwl avatar Jan 23 '23 10:01 MrGeorgeOwl