Timeout waiting for connection from pool for 1000 genomes vcf on AWS
val x = sc.loadGenotypes("s3a://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
This generates the error "Unable to execute HTTP request: Timeout waiting for connection from pool" with net.fnothaft:jsr203-s3a:0.0.2.
The error was reproduced with Hadoop-BAM 7.9.2 and 7.9.1.
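For context, this message is typically raised when the S3 client's HTTP connection pool is exhausted. One thing that might be worth trying is raising the S3A pool size through the Hadoop property `fs.s3a.connection.maximum`, which Spark picks up when it is passed with the `spark.hadoop.` prefix. The sketch below only builds the flags; the object name `S3aPoolConf`, its helpers, and the value 100 are illustrative assumptions, and it is untested whether jsr203-s3a honors this setting.

```scala
// Sketch: build spark-submit flags that raise the S3A HTTP connection pool size.
// S3aPoolConf and its methods are hypothetical helpers, not part of ADAM or Spark.
object S3aPoolConf {
  // fs.s3a.connection.maximum is the Hadoop S3A pool-size property.
  // Spark copies any "spark.hadoop."-prefixed setting into the Hadoop configuration.
  def poolSettings(maxConnections: Int): Map[String, String] = {
    require(maxConnections > 0, "maxConnections must be positive")
    Map("spark.hadoop.fs.s3a.connection.maximum" -> maxConnections.toString)
  }

  // Render the settings as "--conf key=value" command-line flags.
  def toConfFlags(settings: Map[String, String]): Seq[String] =
    settings.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

The resulting flags, e.g. from `S3aPoolConf.toConfFlags(S3aPoolConf.poolSettings(100))`, would be appended to the command line that launches the Spark job or shell.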
Sigh, I am seeing this too...
@fnothaft how are you running? Are you on EMR or running through toil on standard AWS instances? Apparently EMR dropped support for s3a. However, I can still loadAlignments from s3a, but not VCFs. Fortunately, s3 works just fine for VCFs (but is slow).
> Apparently EMR dropped support for s3a.
When did that happen? Was it at a specific version of EMR?
> Fortunately, s3 works just fine for VCFs (but is slow)
Practically, conductor is still a good solution for s3 → HDFS, and is faster than s3-dist-cp. Conductor can't upload directories of Parquet+Avro from HDFS → s3 though, so you'd need to fall back to s3-dist-cp for that.
I'm not sure which version s3a was dropped from. @delagoya may know more, as they were my informant.
Are you able to use s3n?
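If it helps while the s3a question is open, the fallback discussed above (keeping the same bucket and key but switching the scheme to s3 or s3n) can be sketched as a small path rewrite. `S3Scheme` and `withScheme` are hypothetical names for illustration, not part of ADAM.

```scala
// Sketch: swap the scheme of an S3 path (s3, s3a, s3n) and leave other paths alone.
// S3Scheme.withScheme is a hypothetical helper, not an ADAM API.
object S3Scheme {
  private val s3Schemes = Set("s3", "s3a", "s3n")

  def withScheme(path: String, scheme: String): String = {
    val sep = path.indexOf("://")
    // Only rewrite when the existing scheme is one of the S3 variants.
    if (sep > 0 && s3Schemes.contains(path.substring(0, sep).toLowerCase))
      scheme + path.substring(sep)
    else
      path
  }
}
```

In the shell this would be used as `sc.loadGenotypes(S3Scheme.withScheme(path, "s3"))`, where `path` is the s3a URL from the original report.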
I am checking with the EMR team about which URL schemes are supported.
Was just passing through. Hopefully everyone has seen this page but linking just in case: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
Interesting that s3:// on EMR is slower than s3a://, considering EMRFS (EMR's proprietary S3 implementation) is one of its selling points. You might be able to use s3a URLs consistently by setting the following parameters:
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>
<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
  <description>The implementation class of the S3A AbstractFileSystem.</description>
</property>
Link: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
This is all untested but I might give this a whirl when I get a moment and see if I can get this working and post results here.
@dstockstad Thanks for the note! Where do those properties need to be specified?
You're going to want to do it using the instructions here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
The settings go into core-site. So something like this:
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A"
    }
  }
]
Keep in mind that I still have not actually verified this, so I can't say for sure whether it will work, and it might also need additional configuration.
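For trying the two properties in a single session before changing the cluster's core-site, a minimal sketch could apply them through whatever setter the running context exposes. `S3aImplSettings` and `applyTo` are hypothetical names, and this is equally unverified on EMR.

```scala
// Sketch: apply the two S3A implementation properties via a caller-supplied setter.
// S3aImplSettings is a hypothetical helper; the property names and values come from
// the core-site snippet above.
object S3aImplSettings {
  val properties: Seq[(String, String)] = Seq(
    "fs.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.AbstractFileSystem.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3A"
  )

  // The setter is passed in so this compiles without Spark or Hadoop on the classpath.
  def applyTo(set: (String, String) => Unit): Unit =
    properties.foreach { case (key, value) => set(key, value) }
}
```

In a Spark shell this would be called as `S3aImplSettings.applyTo((k, v) => sc.hadoopConfiguration.set(k, v))` before the first s3a:// load.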