airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Add directory transfer support for SFTPOperator

Open Dawnpool opened this issue 1 year ago • 8 comments

This PR implements directory transfer for SFTPOperator, related to the issue I have raised before. Currently, the SFTPOperator only accepts file paths, and you have to specify every filename in a folder by list when transferring an entire folder. By adding some directory handling logic, the operator now can accept directory paths as well, allowing users easily transfer entire folders.

Dawnpool avatar Nov 18 '24 07:11 Dawnpool

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst) Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits. Apache Airflow is a community-driven project and together we are making it better 🚀. In case of doubts contact the developers at: Mailing List: [email protected] Slack: https://s.apache.org/airflow-slack

boring-cyborg[bot] avatar Nov 18 '24 07:11 boring-cyborg[bot]

@Dawnpool You seem to have a whole test non-db test suite failing. Can you check it on your local breeze and see if you can fix it?

kunaljubce avatar Nov 18 '24 17:11 kunaljubce

@Dawnpool You seem to have a whole test non-db test suite failing. Can you check it on your local breeze and see if you can fix it?

Hi, I guess there was an issue with the test code at the time I forked the repository. It might have been resolved by merging the main branch. There is no problem on my local breeze environment for now.

Dawnpool avatar Nov 24 '24 16:11 Dawnpool

@Dawnpool You seem to have a whole test non-db test suite failing. Can you check it on your local breeze and see if you can fix it?

Hi, I guess there was an issue with the test code at the time I forked the repository. It might have been resolved by merging the main branch. There is no problem on my local breeze environment for now.

Let's wait for one of the maintainers to approve the remaining workflows. That will give you some clarity if rebasing against main fixed it. Feel free to drop a gentle reminder on the Slack channel - #contributors.

kunaljubce avatar Nov 26 '24 06:11 kunaljubce

Let's wait for one of the maintainers to approve the remaining workflows. That will give you some clarity if rebasing against main fixed it. Feel free to drop a gentle reminder on the Slack channel - #contributors.

Approved workflows.

potiuk avatar Nov 28 '24 00:11 potiuk

errrors, errors everywhere :D

potiuk avatar Nov 28 '24 03:11 potiuk

@Dawnpool Try reproducing your CI tests locally and figuring out what's going wrong. Seems this is a good place to start - https://github.com/apache/airflow/blob/main/dev/breeze/doc/ci/08_running_ci_locally.md

kunaljubce avatar Nov 28 '24 16:11 kunaljubce

PR has conflicts that needs to be resolved

eladkal avatar Dec 19 '24 09:12 eladkal

@potiuk The non-db tests keep failing and it doesn't seem related to the changes I made based on the error logs. I noticed your comment(https://github.com/apache/airflow/pull/44625#issuecomment-2522299932) on a recent approved PR with the same error, where you mentioned that it was an unrelated intermittent error. Is there an ongoing issue with the non-db tests?

@eladkal I have resolved the conflicts. Please feel free to check!

Dawnpool avatar Dec 23 '24 08:12 Dawnpool

I think it's a side effect of bad configuration - likely some initialization issue in some of the skipped provider tests that are missing somewhere. Likely it is mitigated when full tests are run as some other tests are intitializing SQL Alchemy (where we completely should not need it for DB Tests). I will take a look at it later today and try to find out what it is (And it would be great to not merge it till then as we might be eble to see if the problem is fixed when I find it.

potiuk avatar Dec 23 '24 12:12 potiuk

Can you please rebase/resolve conflicts and ping me if it fails again.

potiuk avatar Dec 27 '24 10:12 potiuk

@potiuk it failed again😥 same error with the non-db tests

Dawnpool avatar Dec 27 '24 12:12 Dawnpool

:scream:

potiuk avatar Dec 27 '24 13:12 potiuk

I think this should fix it: https://github.com/apache/airflow/pull/45244

potiuk avatar Dec 27 '24 13:12 potiuk

Let's see - I added the fix on top and we should see if the problem is fixed.

potiuk avatar Dec 27 '24 13:12 potiuk

Nope - there is another issue

potiuk avatar Dec 27 '24 15:12 potiuk

All right. I think I found it https://github.com/apache/airflow/pull/45249 - very interesting issue (and your change accidentally revealed it because it selectively run non-db microsoft.azure test collection as the first group of tests to run - and it turned out that those tests relied on a side-effect of other tests being run before.

potiuk avatar Dec 27 '24 17:12 potiuk

OK. The fix is merged. You can rebase again and I :crossed_fingers: that it should work now.

potiuk avatar Dec 27 '24 20:12 potiuk

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

boring-cyborg[bot] avatar Dec 28 '24 01:12 boring-cyborg[bot]

Finally!

potiuk avatar Dec 28 '24 01:12 potiuk

Finally! Thank you all😁

Dawnpool avatar Dec 28 '24 01:12 Dawnpool

Thank you for your contribution @Dawnpool!

In this implementation, files are downloaded sequentially, right? If so - when downloading folders containing a large number of small files, this approach will be quite inefficient. I think we should consider expanding the method, or adding another one that downloads files concurrently.

Possibly related to SFTPHookAsync.

Dev-iL avatar Dec 28 '24 15:12 Dev-iL

Hi @Dev-iL! Yes, just as you said, files are downloaded sequentially since it simply uses multiple calls to the retrieve_file function in the order file names are collected. I totally agree with your point this approach would be inefficient in the case you said. It would be awesome if we could download multiple files concurrently, but I'm not sure what kind of approach to take (probably multi-threading?) and we should definitely do performance checks to compare after implementing the method.

Dawnpool avatar Dec 31 '24 04:12 Dawnpool