
feat: Enhance GCS connector docs for distcp with HNS

Open · cjac opened this issue 10 months ago · 1 comment

This PR improves the Cloud Storage connector documentation to better support users running distcp against buckets with Hierarchical Namespace (HNS) enabled in self-managed Hadoop environments.

This change directly addresses customer issues observed in Salesforce case [56459963] and Buganizer report [389061732], where users experienced intermittent distcp failures, often manifesting as DEADLINE_EXCEEDED errors or as the generic message "SSH operator error: exit status = 25".

Key changes include:

  • gcs/CONFIGURATION.md:
    • Clarified guidance on fs.gs.http.read-timeout and fs.gs.hierarchical.namespace.folders.enable to address DEADLINE_EXCEEDED errors and ensure proper HNS interaction.
    • Added troubleshooting tips for generic exit codes and recommendations for using shaded JARs to resolve dependency conflicts.
  • gcs/INSTALL.md:
    • Expanded the "Troubleshooting the installation" section with more detailed advice on diagnosing dependency conflicts and enabling verbose logging, noting that verbose logging is particularly useful when diagnosing DEADLINE_EXCEEDED errors.
  • gcs/README.md:
    • Updated the "Configuring the connector" section to prominently guide users facing distcp and HNS issues, including DEADLINE_EXCEEDED errors, to the more detailed CONFIGURATION.md.
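
As an illustration of the two properties named above, the following is a minimal sketch of passing them to a single distcp job as Hadoop generic `-D` options. It is not taken from the documentation changes in this PR. The helper `build_distcp_command`, the source and destination paths, the bucket name, and the timeout value `60s` are all hypothetical; in particular, the accepted format of the timeout value depends on the connector version, so check CONFIGURATION.md for the release in use.

```python
import shlex


def build_distcp_command(src, dst, read_timeout, hns_folders_enabled=True):
    """Assemble a `hadoop distcp` argument list with per-job GCS connector overrides.

    Hypothetical helper for illustration; the property names come from the PR
    description, the values are placeholders.
    """
    props = {
        # Longer read timeout, intended to reduce DEADLINE_EXCEEDED failures.
        "fs.gs.http.read-timeout": read_timeout,
        # Ask the connector to use folder operations on HNS-enabled buckets.
        "fs.gs.hierarchical.namespace.folders.enable": str(hns_folders_enabled).lower(),
    }
    argv = ["hadoop", "distcp"]
    # Generic -D options must come before the source and destination arguments.
    for key, value in props.items():
        argv.append(f"-D{key}={value}")
    argv += [src, dst]
    return argv


# Placeholder paths and bucket name; replace with real locations.
cmd = build_distcp_command(
    "hdfs:///data/events",
    "gs://example-hns-bucket/events",
    read_timeout="60s",
)
print(shlex.join(cmd))
```

Building the argument list rather than a single string keeps quoting explicit if the command is later handed to `subprocess.run`. The same two properties could instead be set cluster-wide in `core-site.xml`.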

These updates aim to provide clearer instructions and troubleshooting steps, reducing the need for support engagement for these common problems in non-Dataproc Hadoop deployments.

Self link: go/ghgcd/hadoop-connectors/pull/1374
Related CL: cl/767194879

Addresses support issue:
go/sf/55915396 (case)
go/sf/56459963 (consult)

Addresses GitHub issue #1375

Addresses bug: b/389061732

cjac, Jun 04 '25 17:06

Solves issue #1375

cjac, Jun 04 '25 20:06