
Miscellaneous/random `BufferUnderflowException`

Open metasim opened this issue 6 years ago • 13 comments

Context:

  • It typically takes about an hour into an analysis before this happens
  • It doesn't happen every run, though
  • I get this exception in probably about half of my runs
Caused by: java.nio.BufferUnderflowException
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
    at java.nio.ByteBuffer.get(ByteBuffer.java:715)
    at geotrellis.util.StreamingByteReader.getBytes(StreamingByteReader.scala:120)
    at geotrellis.raster.io.geotiff.LazySegmentBytes.getBytes(LazySegmentBytes.scala:120)
    at geotrellis.raster.io.geotiff.LazySegmentBytes$$anonfun$readChunk$1.apply(LazySegmentBytes.scala:99)
    at geotrellis.raster.io.geotiff.LazySegmentBytes$$anonfun$readChunk$1.apply(LazySegmentBytes.scala:97)
    at scala.collection.immutable.List.map(List.scala:273)
    at geotrellis.raster.io.geotiff.LazySegmentBytes.readChunk(LazySegmentBytes.scala:97)
    at geotrellis.raster.io.geotiff.LazySegmentBytes$$anonfun$getSegments$1.apply(LazySegmentBytes.scala:115)
    at geotrellis.raster.io.geotiff.LazySegmentBytes$$anonfun$getSegments$1.apply(LazySegmentBytes.scala:115)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.read(GeoTiffRasterSource.scala:58)
    at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.read(RasterSource.scala:237)
    at org.locationtech.rasterframes.ref.RasterRef.realizedTile$lzycompute(RasterRef.scala:56)
    at org.locationtech.rasterframes.ref.RasterRef.realizedTile(RasterRef.scala:54)
    at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.delegate(RasterRef.scala:71)
    at geotrellis.raster.DelegatingTile$class.get(DelegatingTile.scala:54)
    at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.get(RasterRef.scala:63)
    at geotrellis.raster.resample.NearestNeighborResample.resampleValid(NearestNeighborResample.scala:36)
    at geotrellis.raster.resample.Resample.resample(Resample.scala:65)
    at geotrellis.raster.reproject.SinglebandRasterReprojectMethods$class.reproject(SinglebandRasterReprojectMethods.scala:86)
    at geotrellis.raster.package$withSinglebandRasterMethods.reproject(package.scala:100)
    at geotrellis.raster.reproject.RasterReprojectMethods$class.reproject(RasterReprojectMethods.scala:40)
    at geotrellis.raster.package$withSinglebandRasterMethods.reproject(package.scala:100)
    at geotrellis.raster.reproject.SinglebandTileReprojectMethods$class.reproject(SinglebandTileReprojectMethods.scala:30)
    at geotrellis.raster.package$withTileMethods.reproject(package.scala:55)
    at org.locationtech.rasterframes.extensions.RasterJoin$$anonfun$1$$anonfun$apply$2.apply(RasterJoin.scala:56)
    at org.locationtech.rasterframes.extensions.RasterJoin$$anonfun$1$$anonfun$apply$2.apply(RasterJoin.scala:54)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.locationtech.rasterframes.extensions.RasterJoin$$anonfun$1.apply(RasterJoin.scala:54)
    at org.locationtech.rasterframes.extensions.RasterJoin$$anonfun$1.apply(RasterJoin.scala:39)

metasim avatar Jun 03 '19 16:06 metasim

I am trying this now and seeing roughly 287 Spark tasks fail while 795 complete.

The entire stage then fails because a single task has failed 4 times.

The data set I am trying to read in this case is much larger than the one originally reported.

vpipkt avatar Jun 07 '19 15:06 vpipkt

Same error?

metasim avatar Jun 07 '19 17:06 metasim

Yes, same error, but now I am also seeing a similar problem with read timeouts.

Job aborted due to stage failure: Task 741 in stage 6.0 failed 4 times, most recent failure: Lost task 741.3 in stage 6.0 (TID 2564, ip-172-31-26-162.ec2.internal, executor 294): java.net.SocketTimeoutException: Read timed out
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1944)
	at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1939)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1938)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1508)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
	at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:367)
	at scalaj.http.HttpRequest.exec(Http.scala:343)
	at scalaj.http.HttpRequest.execute(Http.scala:331)
	at scalaj.http.HttpRequest.asBytes(Http.scala:489)
	at geotrellis.spark.io.http.util.HttpRangeReader.readClippedRange(HttpRangeReader.scala:70)
	at geotrellis.util.RangeReader$class.readRange(RangeReader.scala:36)
	at geotrellis.spark.io.http.util.HttpRangeReader.readRange(HttpRangeReader.scala:34)
	at geotrellis.util.StreamingByteReader.readChunk(StreamingByteReader.scala:99)
	at geotrellis.util.StreamingByteReader.ensureChunk(StreamingByteReader.scala:110)
	at geotrellis.util.StreamingByteReader.get(StreamingByteReader.scala:126)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readGeoTiffInfo(GeoTiffReader.scala:339)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:219)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:206)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff$lzycompute(GeoTiffRasterSource.scala:35)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff(GeoTiffRasterSource.scala:34)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent$lzycompute(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.RasterSource$class.cols(RasterSource.scala:76)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.cols(GeoTiffRasterSource.scala:28)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info$lzycompute(RasterSource.scala:206)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info(RasterSource.scala:204)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.crs(RasterSource.scala:220)
	at org.locationtech.rasterframes.ref.RasterRef.crs(RasterRef.scala:43)
	at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.<init>(RasterRef.scala:65)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:131)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:124)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at astraea.spark.datasource.expressions.ReadBands.eval(ReadBands.scala:124)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:94)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:91)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
	at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
	at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
	at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:365)
	... 50 more

Driver stacktrace:

vpipkt avatar Jun 07 '19 18:06 vpipkt

It is very inconsistent, though. I'm running again and so far have 9 completed tasks, 60 tasks failing with java.nio.BufferUnderflowException, and one with java.net.SocketTimeoutException.

vpipkt avatar Jun 07 '19 18:06 vpipkt

@metasim I ran the workflow that was frequently producing this error several times yesterday and never hit it. I'm not sure whether time of day is related, but I seem to get this error much less in the evenings and on Sundays.

courtney-layman avatar Jun 10 '19 14:06 courtney-layman

For tracking: this is all happening on an internal snapshot release, 0.8.0-astraea.6b15a5b1.

vpipkt avatar Jun 17 '19 13:06 vpipkt

Wondering if this info is relevant:

Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.

Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. These data lake applications achieve single-instance transfer rates that maximize the network interface use for their Amazon EC2 instance, which can be up to 100 Gb/s on a single instance. These applications then aggregate throughput across multiple instances to get multiple terabits per second.

Roughly, a single L8 scene discretized into 256x256 tiles will take at least

ceil(7681/256) * ceil(7801/256) + 2 = 963

requests. Seems likely we could easily hit a theoretical 5,500 req/sec threshold?
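
For reference, here is a minimal sketch of that estimate in Scala (the scene dimensions and tile size are the figures above; the `+ 2` is a rough allowance for the GeoTiff header/IFD reads):

```scala
// Back-of-the-envelope count of ranged GET requests to read one tiled
// Landsat 8 band: one request per 256x256 segment, plus ~2 for metadata.
object RequestEstimate extends App {
  val (sceneCols, sceneRows) = (7681, 7801) // typical L8 band dimensions
  val tileSize = 256

  def requestsPerBand(cols: Int, rows: Int, tile: Int): Int =
    math.ceil(cols.toDouble / tile).toInt *
      math.ceil(rows.toDouble / tile).toInt + 2

  println(requestsPerBand(sceneCols, sceneRows, tileSize)) // 31 * 31 + 2 = 963
}
```

With hundreds of partitions reading concurrently, it wouldn't take many scenes in flight to approach that per-prefix limit.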

metasim avatar Jun 17 '19 18:06 metasim

I am trying again against the latest https://github.com/locationtech/rasterframes/commit/3538bda126d885e3daa8b3913a82b3890a78da23

I am now trying a query of 211 Landsat scenes in 800 partitions.

Getting about 11.5% of task attempts failing, with the following three distinct errors (stack traces below):

  • 13x geotrellis.raster.io.geotiff.reader.MalformedGeoTiffException: incorrect byte order
  • 21x java.lang.IllegalArgumentException: requirement failed: Server doesn't support ranged byte reads
  • 70x java.net.SocketTimeoutException: Read timed out

SocketTimeoutException

java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:666)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
	at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:367)
	at scalaj.http.HttpRequest.exec(Http.scala:343)
	at scalaj.http.HttpRequest.execute(Http.scala:331)
	at scalaj.http.HttpRequest.asBytes(Http.scala:489)
	at geotrellis.spark.io.http.util.HttpRangeReader.readClippedRange(HttpRangeReader.scala:70)
	at geotrellis.util.RangeReader$class.readRange(RangeReader.scala:36)
	at geotrellis.spark.io.http.util.HttpRangeReader.readRange(HttpRangeReader.scala:34)
	at geotrellis.util.StreamingByteReader.readChunk(StreamingByteReader.scala:99)
	at geotrellis.util.StreamingByteReader.ensureChunk(StreamingByteReader.scala:110)
	at geotrellis.util.StreamingByteReader.get(StreamingByteReader.scala:126)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readGeoTiffInfo(GeoTiffReader.scala:339)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:219)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:206)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff$lzycompute(GeoTiffRasterSource.scala:35)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff(GeoTiffRasterSource.scala:34)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent$lzycompute(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.RasterSource$class.cols(RasterSource.scala:76)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.cols(GeoTiffRasterSource.scala:28)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource$$anonfun$info$1.apply(RasterSource.scala:224)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource$$anonfun$info$1.apply(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.retryableRead(RasterSource.scala:201)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info$lzycompute(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.crs(RasterSource.scala:238)
	at org.locationtech.rasterframes.ref.RasterRef.crs(RasterRef.scala:43)
	at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.<init>(RasterRef.scala:65)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:131)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:124)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at astraea.spark.datasource.expressions.ReadBands.eval(ReadBands.scala:124)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:94)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:91)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

IllegalArgumentException

java.lang.IllegalArgumentException: requirement failed: Server doesn't support ranged byte reads
	at scala.Predef$.require(Predef.scala:224)
	at geotrellis.spark.io.http.util.HttpRangeReader.<init>(HttpRangeReader.scala:57)
	at geotrellis.contrib.vlm.package$.getByteReader(package.scala:52)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff$lzycompute(GeoTiffRasterSource.scala:35)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff(GeoTiffRasterSource.scala:34)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent$lzycompute(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.RasterSource$class.cols(RasterSource.scala:76)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.cols(GeoTiffRasterSource.scala:28)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource$$anonfun$info$1.apply(RasterSource.scala:224)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource$$anonfun$info$1.apply(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.retryableRead(RasterSource.scala:201)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info$lzycompute(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.crs(RasterSource.scala:238)
	at org.locationtech.rasterframes.ref.RasterRef.crs(RasterRef.scala:43)
	at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.<init>(RasterRef.scala:65)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:131)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:124)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at astraea.spark.datasource.expressions.ReadBands.eval(ReadBands.scala:124)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:94)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:91)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

MalformedGeoTiffException

geotrellis.raster.io.geotiff.reader.MalformedGeoTiffException: incorrect byte order
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readGeoTiffInfo(GeoTiffReader.scala:344)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:219)
	at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readMultiband(GeoTiffReader.scala:206)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff$lzycompute(GeoTiffRasterSource.scala:35)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.tiff(GeoTiffRasterSource.scala:34)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent$lzycompute(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.rasterExtent(GeoTiffRasterSource.scala:37)
	at geotrellis.contrib.vlm.RasterSource$class.cols(RasterSource.scala:76)
	at geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource.cols(GeoTiffRasterSource.scala:28)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource$$anonfun$info$1.apply(RasterSource.scala:224)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource$$anonfun$info$1.apply(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.retryableRead(RasterSource.scala:201)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info$lzycompute(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.info(RasterSource.scala:222)
	at org.locationtech.rasterframes.ref.RasterSource$DelegatingRasterSource.crs(RasterSource.scala:238)
	at org.locationtech.rasterframes.ref.RasterRef.crs(RasterRef.scala:43)
	at org.locationtech.rasterframes.ref.RasterRef$RasterRefTile.<init>(RasterRef.scala:65)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:131)
	at astraea.spark.datasource.expressions.ReadBands$$anonfun$9.apply(ReadBands.scala:124)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at astraea.spark.datasource.expressions.ReadBands.eval(ReadBands.scala:124)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:94)
	at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:91)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

vpipkt avatar Jun 18 '19 00:06 vpipkt

@metasim :

requests. Seems likely we could easily hit a theoretical 5,500 req/sec threshold?

Perhaps so; I'm not sure how to test that possibility.

vpipkt avatar Jun 18 '19 00:06 vpipkt

And thinking about it more:

5,500 GET/HEAD requests per second per prefix in a bucket

We have Landsat scenes laid out in prefixes such that we will be requesting 8 bands from each prefix (7 reflectance bands plus QA).

So that would be roughly 1,000 requests per band times 8 bands = ~8,000 requests, making it seem even more likely that we hit that 5,500/second threshold.

This might also jibe with my hypothesis that increasing the number of partitions made things worse, presumably because there are more concurrent reads per prefix.

So maybe a short-term solution is to implement some kind of back-off? I would also be curious to see how many reads are actually issued for a single scene in our jobs, but I'm not sure how to capture that.
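
To illustrate the kind of back-off meant here, a minimal sketch with exponential delay and jitter; `readBytes` is a placeholder for whatever ranged-read call ultimately hits S3, and a real fix would likely have to live inside GeoTrellis's read path rather than at this level:

```scala
import java.net.SocketTimeoutException
import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

// Minimal exponential back-off with jitter around a flaky remote read.
// `readBytes` stands in for whatever ranged-read call ultimately hits S3;
// a production version would also treat HTTP 429/503 responses as retryable.
object BackoffRead {
  @tailrec
  def withBackoff[T](attempt: Int = 0, maxAttempts: Int = 5, baseDelayMs: Long = 100)
                    (readBytes: => T): T =
    Try(readBytes) match {
      case Success(result) => result
      case Failure(_: SocketTimeoutException) if attempt < maxAttempts - 1 =>
        // Sleep 2^attempt * base, plus random jitter so workers de-synchronize.
        val delayMs = (baseDelayMs << attempt) + Random.nextInt(baseDelayMs.toInt)
        Thread.sleep(delayMs)
        withBackoff(attempt + 1, maxAttempts, baseDelayMs)(readBytes)
      case Failure(e) => throw e
    }
}
```

Usage would be along the lines of `BackoffRead.withBackoff() { rangeReader.readRange(offset, length) }`.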

vpipkt avatar Jun 18 '19 10:06 vpipkt

@mteldridge @mobsy74 Any reflections you might have on this are welcome.

Currently the testing path is arduous, so we may need to winnow this down at some point.

metasim avatar Jun 20 '19 13:06 metasim

@pomadchin Do you have any gut reactions to any of this? Our suspicion is that we're getting HTTP 429 codes from S3.

The call path goes through geotrellis.contrib.vlm.geotiff.GeoTiffRasterSource, which eventually hits the ByteReader interface from geotrellis.raster.io.geotiff.reader.GeoTiffReader.readGeoTiffInfo and geotrellis.raster.io.geotiff.LazySegmentBytes.getBytes. If the underlying ByteReader is a geotrellis.spark.io.http.util.HttpRangeReader and a 429 (or some other network error) is returned, would you expect the HttpRangeReader to recover?

cc: @philvarner

metasim avatar Jun 20 '19 13:06 metasim

@courtney-whalen & @vpipkt :

So maybe a short term solution is to implement some kind of back-off? I would also be curious to see how many reads are actually done for a single scene in our jobs, but not sure how to capture that.

If so, it needs to happen further down in the call stack, in GeoTrellis. Certainly an option, but we need to prove this is the case first.

The other thing we need in these jobs is to turn off lazy tile reading and load full scenes all at once. If your RasterFrames are really wide, that's going to require a lot more memory, possibly causing other problems (more nodes and more partitions help), but it will drastically change the network dynamics.
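
For illustration, a sketch of what eager reading might look like via the raster DataSource; note the `lazy_tiles` option name is an assumption on my part and may not exist in this snapshot build:

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.rasterframes._

object EagerReadSketch extends App {
  implicit val spark = SparkSession.builder()
    .master("local[*]")
    .appName("eager-read")
    .getOrCreate()
    .withRasterFrames

  val df = spark.read
    .format("raster")
    // ASSUMPTION: an option to disable lazy tile reads; if unsupported,
    // the equivalent would be realizing all tiles immediately after load.
    .option("lazy_tiles", false)
    .load("https://example.com/LC08_scene_B4.TIF") // placeholder URI

  df.printSchema()
}
```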

metasim avatar Jun 20 '19 13:06 metasim