
[FLINK-37247][FileSystems][Tests] Implement common Hadoop file system integration tests for GCS.

Open: cnauroth opened this pull request 10 months ago • 6 comments

What is the purpose of the change

Increase test coverage for GCS integration by providing integration tests derived from the common flink-hadoop-fs test suites. These optional tests run only if the environment is configured for access to GCS.

Brief change log

  • Create a JUnit extension, RequireGCSConfiguration, that checks for the environment variables GOOGLE_APPLICATION_CREDENTIALS and GCS_BASE_PATH; if either is not defined, the tests are skipped (see the sketch after this list).
  • Create GCS subclasses of the suites defined in flink-hadoop-fs.
  • Override and skip two tests in the subclasses. One isn't meaningful for GCS and triggers a false failure. The other would require more intrusive changes to work on GCS, because it assumes state is held in exactly one file, which isn't true for the GCS implementation.
  • Allow RecoverableWriter tests to return null from getLocalTmpDir() to indicate that no allocation or release of local storage is required.
  • Add a new ArchUnit exception, because these integration tests can't meaningfully use the mini-cluster extensions. This matches how other file systems are handled.
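For illustration, here is a minimal sketch of what a conditional extension like RequireGCSConfiguration could look like, assuming JUnit 5's ExecutionCondition SPI. Everything beyond the class name and the two environment variables is an assumption for illustration, not the PR's actual code.

```java
import org.junit.jupiter.api.extension.ConditionEvaluationResult;
import org.junit.jupiter.api.extension.ExecutionCondition;
import org.junit.jupiter.api.extension.ExtensionContext;

/** Hypothetical sketch: disables tests unless the GCS environment is configured. */
public class RequireGCSConfiguration implements ExecutionCondition {

    @Override
    public ConditionEvaluationResult evaluateExecutionCondition(ExtensionContext context) {
        // Both variables must be set for the GCS integration tests to run.
        if (System.getenv("GOOGLE_APPLICATION_CREDENTIALS") == null
                || System.getenv("GCS_BASE_PATH") == null) {
            return ConditionEvaluationResult.disabled(
                    "GOOGLE_APPLICATION_CREDENTIALS and GCS_BASE_PATH are not both set");
        }
        return ConditionEvaluationResult.enabled("GCS configuration detected");
    }
}
```

A test class would then opt in with @ExtendWith(RequireGCSConfiguration.class) or an equivalent composed annotation.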

Verifying this change

This change added tests and can be verified as follows:

New integration tests are skipped when run without setting the configuration, so there is no impact on existing developer workflows:

mvn -Pfast -pl flink-filesystems/flink-gs-fs-hadoop clean verify
...
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
...

With the configuration in place, the tests pass (except for the two that are intentionally skipped):

export GOOGLE_APPLICATION_CREDENTIALS=/tmp/credentials.json
export GCS_BASE_PATH=gs://<BUCKET>/flink-tests
mvn -Pfast -pl flink-filesystems/flink-gs-fs-hadoop clean verify
...
[WARNING] Tests run: 35, Failures: 0, Errors: 0, Skipped: 2
...

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

cnauroth, Feb 03, 2025

CI report:

  • f1f2bcec087cf6babd27f171c27f34f92416ac69 Azure: SUCCESS
Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot, Feb 03, 2025

I pushed a change after running mvn spotless:apply to fix the style violations.

The test failures appear to be unrelated.

cnauroth, Feb 04, 2025

Hello @MartijnVisser. Are you available for a code review?

I'm planning to send another pull request later for a GCS connector version upgrade. It would be nice to have these new tests in place in a known working state first.

Thanks!

cnauroth, Feb 11, 2025

@MartijnVisser , gentle reminder requesting code review here on #26102 adding integration test coverage for the GCS file system path. Then, I'm also aiming for a GCS dependency upgrade: #26160 .

I'll also try @dannycranmer and @xintongsong in case Martijn isn't available.

Thanks everyone!

cnauroth, Feb 25, 2025

This PR is being marked as stale since it has not had any activity in the last 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the community, contact details can be found here: https://flink.apache.org/what-is-flink/community/

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot], May 27, 2025

I would still like to contribute this PR. I'll reach out to [email protected] to ask for help with review.

cnauroth, May 27, 2025

@cnauroth thanks a lot for this PR! Do you see any option to get those tests integrated into Flink's CI system? Do you know if the ASF has some GCloud keys? I'm worried that these tests will break without anybody noticing.

rmetzger, Jul 07, 2025

@rmetzger , thanks for taking a look!

I'm not certain about our options for GCS auth on ASF CI, or whether Flink CI already has this set up for any of the other cloud object stores like S3. FWIW, Hadoop has long operated in a state where CI doesn't run the integration tests for S3, Azure, etc., and instead relies on knowledgeable committers to make sure those tests pass before committing relevant code changes. I agree full automation would be better.

At a high level, we either need to (1) run tests on a GCE VM, where service account credentials are served from a metadata server, or (2) export a JSON file with a service account key (and make very sure the file is protected). A sketch of how both options resolve on the client side follows.
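For context, both options surface through the same client-side call in google-auth-library-java. This is a hedged sketch of that resolution (the class and method are real; assuming the GCS connector resolves credentials via Application Default Credentials is my reading, not something stated in this thread):

```java
import com.google.auth.oauth2.GoogleCredentials;
import java.io.IOException;

public class AdcResolutionSketch {
    public static void main(String[] args) throws IOException {
        // Application Default Credentials check option 2 first (a JSON key file
        // named by GOOGLE_APPLICATION_CREDENTIALS) and fall back to option 1
        // (the GCE metadata server) when running on a GCP VM.
        GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
        System.out.println("Resolved credentials: " + credentials);
    }
}
```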

I can investigate more after 7/20.

cnauroth, Jul 10, 2025

For the S3 connector, Ververica used to (and maybe still does) provide AWS S3 credentials for this kind of testing (only for CI runs on master). Fully agree with you that this is a pretty risky approach, as it requires PRs to be carefully reviewed to ensure they don't steal the credentials when merged.

I'm fine with merging this PR so that companies or individual committers and contributors can run the tests manually.

rmetzger, Jul 15, 2025

@rmetzger , sorry for not giving a status update sooner, but I had started a similar conversation in Hadoop on 7/23:

https://lists.apache.org/thread/gy30bn6dm4qc3qwfdzdh2318m38yyvyn

I then filed INFRA-27071 to request a GCS bucket and credentials. The latest status in that bug is that infra has no presence on GCP and doesn't plan to use it, so it appears this is a non-starter at the moment.

I agree that this PR is still useful in its current state, especially since people have the option to run the tests manually before upgrades like FLINK-37328. Can we resume review and work toward an eventual merge?

> Fully agree with you that this is a pretty risky approach, as it requires PRs to be carefully reviewed to ensure they don't steal the credentials when merged.

As a side note, if we had gone ahead with this, I would have advocated for making the run committer-triggered (e.g., via a comment on the PR) only after careful review. Ideally the automation would also exchange the service account credentials for a short-lived token with a tightly downscoped access boundary, so even if something sneaks past us, at least it's not a powerful long-term credential. A sketch of that exchange follows.
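To make the token exchange concrete, here is a hedged sketch using google-auth-library-java's DownscopedCredentials. The bucket placeholder and the chosen role are illustrative assumptions, and whether this exact flow would fit Flink's CI is an open question:

```java
import com.google.auth.oauth2.AccessToken;
import com.google.auth.oauth2.CredentialAccessBoundary;
import com.google.auth.oauth2.DownscopedCredentials;
import com.google.auth.oauth2.GoogleCredentials;
import java.io.IOException;

public class DownscopedTokenSketch {
    public static void main(String[] args) throws IOException {
        // Start from the long-lived service account credentials.
        GoogleCredentials source = GoogleCredentials.getApplicationDefault();

        // Restrict the exchanged token to object access in a single test bucket.
        // <BUCKET> is a placeholder, and roles/storage.objectAdmin is an assumed role.
        CredentialAccessBoundary.AccessBoundaryRule rule =
                CredentialAccessBoundary.AccessBoundaryRule.newBuilder()
                        .setAvailableResource(
                                "//storage.googleapis.com/projects/_/buckets/<BUCKET>")
                        .addAvailablePermission("inRole:roles/storage.objectAdmin")
                        .build();

        DownscopedCredentials downscoped =
                DownscopedCredentials.newBuilder()
                        .setSourceCredential(source)
                        .setCredentialAccessBoundary(
                                CredentialAccessBoundary.newBuilder().addRule(rule).build())
                        .build();

        // The short-lived, tightly scoped token is what CI would hand to the tests.
        AccessToken token = downscoped.refreshAccessToken();
        System.out.println("Token expires at: " + token.getExpirationTime());
    }
}
```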

cnauroth, Aug 23, 2025

This PR is being marked as stale since it has not had any activity in the last 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the community, contact details can be found here: https://flink.apache.org/what-is-flink/community/

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot], Nov 21, 2025

Unstale

rmetzger, Nov 21, 2025

@cnauroth can you do a final rebase to see if CI is passing? I'll then merge it.

rmetzger, Nov 25, 2025