Random OSError while running inference
Unsure what caused it, will track here. Looks like maybe a file got corrupted? Should probably regenerate imagery at this tile if this happens.
Traceback (most recent call last):
File "run_entire_process.py", line 68, in <module>
run_inference.run_classification(args.classification_checkpoint, args.segmentation_checkpoint, BATCHES_BETWEEN_DELETE)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/run_inference.py", line 113, in run_classification
image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 201, in stitch_image_at_coordinate
images.append(get_image_for_coordinate((column, row),))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 189, in get_image_for_coordinate
image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 166, in gather_and_persist_imagery_at_coordinate
slices_per_side=grid_size)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 85, in slice_image
out = double_image_size(out)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 100, in double_image_size
return image.resize((image.size[0] * 2, image.size[0] * 2), filter)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/Image.py", line 1804, in resize
self.load()
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/ImageFile.py", line 238, in load
len(b))
OSError: image file is truncated (82 bytes not processed)
There's an SO answer suggesting that it is a corrupted image (albeit to a small degree) due to HTTP protocols having a max transfer size, and recommending setting PIL.ImageFile.LOAD_TRUNCATED_IMAGES = True. It looks like this might still leave a small gray or white band at the bottom of the image, though, so I'm not sure it's the best solution (it was just the first Google result that popped up).
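For reference, that workaround is just a module-level flag in Pillow, something like:

```python
# Workaround from the SO answer: tell Pillow to pad out truncated files
# instead of raising OSError. The missing data is what would show up as
# that gray/white band at the bottom of the tile.
from PIL import ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True
```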
I'm unsure if that would be the best thing to do, just because then we don't know the extent of the corruption, its effect on classification, and how often it's happening, although it would make the error go away. I don't think it's happening due to a max HTTP transfer size, or else it would have happened more than once: I'm currently at ~20k queries to Mapbox for my API key, and they've all been the exact same query, just at different locations.
It seems like it's either a bug with the Mapbox API returning a corrupted image, or something within PIL. It fails right when the first operation is performed on the new image queried from Mapbox. Also, I checked the inference_timestamp in the database, and the last one before the failure was around 2:45 am in my time zone, which is very close to when #13 was happening. So maybe when the servers are reset nightly it might close the connection early and truncate the image?
> I'm unsure if that would be the best thing to do, just because then we don't know the extent of the corruption, its effect on classification, and how often it's happening.
Yeah, agreed.
> So maybe when the servers are reset nightly it might close the connection early and truncate the image?
This sounds promising! I would assume that the relevant metadata for the failed image is in the database, and if/when you re-query any failed images and they work, I would consider this an issue that we can work around with timeouts and/or some other optimizations.
Yep, your assumptions are correct, if the error happens like this again, no corrupt imagery should be saved and no metadata updated. I think we can maybe just catch this case in the same place we catch connection errors, although catching every OSError seems a little broad. Is doing text matching on a specific throwable considered bad practice?
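Something like this is what I have in mind, just as a sketch (the helper name and retry policy here are made up for illustration, not what's actually in imagery.py):

```python
import time


def load_image_with_retry(fetch_image, retries=3, delay=5):
    """Re-query a tile if PIL reports it as truncated (hypothetical helper)."""
    for _ in range(retries):
        try:
            image = fetch_image()  # e.g. re-request the tile from Mapbox
            image.load()           # force a full decode so truncation surfaces here
            return image
        except OSError as error:
            # Only swallow the specific truncation case; anything else re-raises.
            if "image file is truncated" not in str(error):
                raise
            time.sleep(delay)
    raise OSError("image still truncated after {} retries".format(retries))
```

That would keep the broad OSError catch limited to the one failure mode we've actually seen.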
I'm actually not sure. It's certainly better than catching it and doing no text matching. Are there other alternatives? I suppose the temporary workaround you suggested in https://github.com/typicalTYLER/SolarPanelDataWrangler/issues/13 could work? I think the text matching is a better option than that, though.
Tentative fix is in the world file branch (7b55636a47356ca08463bfd8ca983be937f16f56), as I got a different connection error outside of the existing handling too. Will continue to run classification overnight to see if it works.
I've also got an OSError identical to this one. Unfortunately, I didn't find any solution that resolves it. PS: I installed SolarPanelDataWrangler via Docker (without GPU); however, I had to install some libraries manually inside the container. Maybe that helps somehow.
@SylwiaOliwia2 could you post your stack trace if possible? I think it might be a separate issue, as the one I was getting only happened once at night, after running classification for a long time (and OSError is very broad). Were you able to run any classification, or did it just fail out immediately?
Also, if you had any trouble with the docker container and had to manually install anything to get it to work, feel free to create an issue and we can take a look (if you have permission; GitHub permissions are still a mystery to me).
@SylwiaOliwia2 what all did you have to install?
@typicalTYLER I tried several times; it always started showing X tiles/s | X avg tiles/s and then crashed:
Running classification on every tile in the search polygon that hasn't had inference ran yet.
2019-04-30 17:57:45.101474: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194895000 Hz
2019-04-30 17:57:45.124621: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5607dec270a0 executing computations on platform Host. Devices:
2019-04-30 17:57:45.124784: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Starting extraneous imagery cleanup/deletion
Calculation for expanded positive coords for Blaubeuren, Germany completed
Deletion finished
2.18 tiles/s | 2.18 avg tiles/s
2.49 tiles/s | 2.34 avg tiles/s
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/opt/conda/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/opt/conda/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/opt/conda/lib/python3.6/http/client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/opt/conda/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 302, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
and below were the other errors, as mentioned in this link.
@sallamander I can't recall all the libraries, but they definitely included:
- sqlalchemy
- geojsonio
- geopandas
- mapbox
- overpy
- tensorflow (though this one I'm not 100% sure about)
@SylwiaOliwia2 Got it, thanks for clarifying. Did you have the spdw conda environment active when you had to install these? The container should come with a prebuilt conda environment that once activated (e.g. conda activate spdw) should have all of those dependencies.
@sallamander I didn't, as I thought it was required only for the Manual Setup, not the Docker Setup.
@SylwiaOliwia2 ah, okay. While you don't have to install the environment, you still do need to activate it to use it, even in the Docker container.
Thanks, @sallamander. I went through the docker installation again using the conda spdw environment. I got the same error, so if it's working for you, the problem is on my side.
@SylwiaOliwia2, did you type conda spdw, or conda activate spdw (or source activate spdw)? The former shouldn't work, but I think the latter should. If it doesn't, then it sounds like maybe the docker installation isn't working as expected.