torchgeo
CDL: cannot redownload additional years
Description
Data should be downloading, but over an hour in, nothing has happened.
Steps to reproduce
from torchgeo.datasets import CDL
dataset = CDL(years=[2022], download=True, )
Version
0.6.0.dev0
Nothing is wrong with my connection; I can download the data manually.
I'm unable to reproduce this issue. I tried both 2017 and 2022 and both downloaded fine on my system. What version of torchvision are you using? Can you try upgrading to the newest version?
I have torchvision==0.17.1+cu121
I upgraded, and the cell now executes immediately, but no data is downloaded (2022).
Is it possible that you already have some CDL data somewhere in that folder recursively?
I don't see anything:
⚡ ~ find data -type f
data/2017_30m_cdls.aux
data/2017_30m_cdls.tfw
data/Metadata_Cropland-Data-Layer.htm
data/2017_30m_cdls.zip
data/2017_30m_cdls.tif
data/2017_30m_cdls.tif.ovr
Also, even the manually downloaded dataset doesn't look correct. Shouldn't this work?
Your screenshot doesn't contain the full stack trace, and I also can't copy-n-paste error messages from screenshots...
'0.6.0.dev0'
CDL Dataset
type: GeoDataset
bbox: BoundingBox(minx=-127.88721217969017, maxx=-65.34561975376272, miny=22.94022503977174, maxy=51.60512156832182, mint=1483228800.0, maxt=1514764799.999999)
size: 1
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[8], line 4
1 sampler = RandomGeoSampler(dataset, size=224, length=3)
2 dataloader = DataLoader(dataset, sampler=sampler, collate_fn=stack_samples)
----> 4 for batch in dataloader:
5 sample = unbind_samples(batch)[0]
6 dataset.plot(sample)
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
628 if self._sampler_iter is None:
629 # TODO(https://github.com/pytorch/pytorch/issues/76750)
630 self._reset() # type: ignore[call-arg]
--> 631 data = self._next_data()
632 self._num_yielded += 1
633 if self._dataset_kind == _DatasetKind.Iterable and \
634 self._IterableDataset_len_called is not None and \
635 self._num_yielded > self._IterableDataset_len_called:
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
673 def _next_data(self):
--> 674 index = self._next_index() # may raise StopIteration
675 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
676 if self._pin_memory:
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
620 def _next_index(self):
--> 621 return next(self._sampler_iter)
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/sampler.py:287, in BatchSampler.__iter__(self)
285 batch = [0] * self.batch_size
286 idx_in_batch = 0
--> 287 for idx in self.sampler:
288 batch[idx_in_batch] = idx
289 idx_in_batch += 1
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/samplers/single.py:140, in RandomGeoSampler.__iter__(self)
133 """Return the index of a dataset.
134
135 Returns:
136 (minx, maxx, miny, maxy, mint, maxt) coordinates to index a dataset
137 """
138 for _ in range(len(self)):
139 # Choose a random tile, weighted by area
--> 140 idx = torch.multinomial(self.areas, 1)
141 hit = self.hits[idx]
142 bounds = BoundingBox(*hit.bounds)
RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement
Never seen this error before, interesting...
We still need to figure out how to reproduce this. Are you able to reproduce this in Google Colab or some other shared computing resource I can access? That will make it easier to debug.
If you create an account on https://lightning.ai/ I can grant you access!
I can't reproduce this locally with the main branch.
One thing I am noticing is that the bounds shown in the output of your print(dataset) seem to be in lat/lon while mine are not:
bbox: BoundingBox(minx=-2356095.0, maxx=2258235.0, miny=276915.0, maxy=3172605.0, mint=1483228800.0, maxt=1514764799.999999)
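The difference between the two sets of bounds can be checked mechanically. The following is only a rough heuristic sketch, not part of torchgeo; the helper name `looks_like_lonlat` is hypothetical, and it merely tests whether the x/y bounds fit inside valid longitude/latitude ranges.

```python
def looks_like_lonlat(minx: float, maxx: float, miny: float, maxy: float) -> bool:
    """Heuristic: True if the bounds fit inside valid longitude/latitude ranges."""
    return -180.0 <= minx <= maxx <= 180.0 and -90.0 <= miny <= maxy <= 90.0


# Bounds from the reporter's print(dataset) output earlier in the thread.
print(looks_like_lonlat(-127.88721217969017, -65.34561975376272,
                        22.94022503977174, 51.60512156832182))  # True
# Bounds from the local run shown just above.
print(looks_like_lonlat(-2356095.0, 2258235.0, 276915.0, 3172605.0))  # False
```

A `True` result is consistent with a geographic CRS but does not prove one; small projected extents could also pass this test.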
Is there anything else in the data/ directory?
I cannot reproduce the issue either. The dataset can be downloaded immediately. I did find, however, that additional years can't be downloaded once some years have already been downloaded. For example:
from torchgeo.datasets import CDL
dataset = CDL(years=[2022], download=True, )
This can download the corresponding year without issues. But if I restart the terminal and run
from torchgeo.datasets import CDL
dataset = CDL(years=[2023], download=True, )
It won't download anything. It seems that the download function only works the first time, when the data directory doesn't contain any downloaded CDL files. This issue is not tied to particular years; I tried different combinations of years.
One bug here is that if I do:
dataset = CDL(paths="data/", years=[2017], download=True)
and the data/ directory is empty, then the 2017 layer is downloaded as expected. However, if I then do:
dataset = CDL(paths="data/", years=[2023], download=True)
the second download of the 2023 layer does not happen.
Edit: It seems @yichiac and I discovered this at the same time 🙂
In ._verify(self), the following code should take into account the layers currently requested:
pathname = os.path.join(
self.paths, self.zipfile_glob.replace("*", str(year))
)
I can confirm (for my own sanity) that I only see this bug on lightning.ai. I will ask them.
The problem is actually higher up:
# Check if the extracted files already exist
if self.files:
return
If any CDL files are found, the method exits, even if the specific years you requested aren't there. This broke in #1442. The fix would be to check for the specific years requested. However, this is difficult if you can't know whether paths is a directory or a list of files. Anyone want to take a stab at fixing this?
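One possible shape for such a per-year check is sketched below. This is not torchgeo's actual fix: it assumes paths is a single directory and that extracted rasters follow the `<year>_30m_cdls.tif` naming seen in the `find data -type f` listing earlier in this thread. The helper name `missing_years` is hypothetical.

```python
import os
import tempfile


def missing_years(root: str, years: list[int]) -> list[int]:
    """Return the requested years that have no extracted raster under root.

    Assumes root is a single directory and that files are named
    "<year>_30m_cdls.tif" (an assumption based on the listing in this thread).
    """
    missing = []
    for year in years:
        tif_path = os.path.join(root, f"{year}_30m_cdls.tif")
        if not os.path.exists(tif_path):
            missing.append(year)
    return missing


# Demo: a directory that already holds 2017 still reports 2023 as missing.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "2017_30m_cdls.tif"), "w").close()
    print(missing_years(root, [2017, 2023]))  # [2023]
```

A download step could then fetch only the years this returns, instead of exiting as soon as any file is present. The case where paths is a list of files would still need separate handling.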
"The problem is actually higher up": yes, I just discovered this as well.
I found that if I run the command in a terminal (rather than Jupyter), I get a warning. I pointed it to a fresh directory (data2):
>>> from torchgeo.datasets import CDL
>>> dataset = CDL(paths='/teamspace/studios/this_studio/data2/', years=[2010], download=True, checksum=False, crs="EPSG:4326")
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/geo.py:313: UserWarning: Could not find any relevant files for provided path '/teamspace/studios/this_studio/data2/'. Path was ignored.
warnings.warn(
It appears to be ignoring the path and hanging. If I interrupt and rerun the command, I do not get the warning. On keyboard interrupt I get the following:
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/geo.py:313: UserWarning: Could not find any relevant files for provided path 'data'. Path was ignored.
warnings.warn(
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[2], line 2
1 # dataset = CDL(years=[2017], download=False, checksum=False, crs="EPSG:4326") # manually downloaded
----> 2 dataset = CDL(years=[2020], download=True, checksum=False, crs="EPSG:4326") #
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/cdl.py:263, in CDL.__init__(self, paths, crs, res, years, classes, transforms, cache, download, checksum)
260 self.ordinal_map = torch.zeros(max(self.cmap.keys()) + 1, dtype=self.dtype)
261 self.ordinal_cmap = torch.zeros((len(self.classes), 4), dtype=torch.uint8)
--> 263 self._verify()
265 super().__init__(paths, crs, res, transforms=transforms, cache=cache)
267 # Map chosen classes to ordinal numbers, all others mapped to background class
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/cdl.py:315, in CDL._verify(self)
312 raise DatasetNotFoundError(self)
314 # Download the dataset
--> 315 self._download()
316 self._extract()
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/cdl.py:321, in CDL._download(self)
319 """Download the dataset."""
320 for year in self.years:
--> 321 download_url(
322 self.url.format(year),
323 self.paths,
324 md5=self.md5s[year] if self.checksum else None,
325 )
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/datasets/utils.py:130, in download_url(url, root, filename, md5, max_redirect_hops)
127 _download_file_from_remote_location(fpath, url)
128 else:
129 # expand redirect chain if needed
--> 130 url = _get_redirect_url(url, max_hops=max_redirect_hops)
132 # check if file is located on Google Drive
133 file_id = _get_google_drive_file_id(url)
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/datasets/utils.py:78, in _get_redirect_url(url, max_hops)
75 headers = {"Method": "HEAD", "User-Agent": USER_AGENT}
77 for _ in range(max_hops + 1):
---> 78 with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
79 if response.url == url or response.url is None:
80 return url
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
214 else:
215 opener = _opener
--> 216 return opener.open(url, data, timeout)
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:519, in OpenerDirector.open(self, fullurl, data, timeout)
516 req = meth(req)
518 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 519 response = self._open(req, data)
521 # post-process response
522 meth_name = protocol+"_response"
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:536, in OpenerDirector._open(self, req, data)
533 return result
535 protocol = req.type
--> 536 result = self._call_chain(self.handle_open, protocol, protocol +
537 '_open', req)
538 if result:
539 return result
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
494 for handler in handlers:
495 func = getattr(handler, meth_name)
--> 496 result = func(*args)
497 if result is not None:
498 return result
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:1391, in HTTPSHandler.https_open(self, req)
1390 def https_open(self, req):
-> 1391 return self.do_open(http.client.HTTPSConnection, req,
1392 context=self._context, check_hostname=self._check_hostname)
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:1352, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
1350 except OSError as err: # timeout error
1351 raise URLError(err)
-> 1352 r = h.getresponse()
1353 except:
1354 h.close()
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/http/client.py:1374, in HTTPConnection.getresponse(self)
1372 try:
1373 try:
-> 1374 response.begin()
1375 except ConnectionError:
1376 self.close()
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/http/client.py:318, in HTTPResponse.begin(self)
316 # read until we get a non-100 response
317 while True:
--> 318 version, status, reason = self._read_status()
319 if status != CONTINUE:
320 break
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/http/client.py:279, in HTTPResponse._read_status(self)
278 def _read_status(self):
--> 279 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
280 if len(line) > _MAXLINE:
281 raise LineTooLong("status line")
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/socket.py:705, in SocketIO.readinto(self, b)
703 while True:
704 try:
--> 705 return self._sock.recv_into(b)
706 except timeout:
707 self._timeout_occurred = True
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/ssl.py:1274, in SSLSocket.recv_into(self, buffer, nbytes, flags)
1270 if flags != 0:
1271 raise ValueError(
1272 "non-zero flags not allowed in calls to recv_into() on %s" %
1273 self.__class__)
-> 1274 return self.read(nbytes, buffer)
1275 else:
1276 return super().recv_into(buffer, nbytes, flags)
File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/ssl.py:1130, in SSLSocket.read(self, len, buffer)
1128 try:
1129 if buffer is not None:
-> 1130 return self._sslobj.read(len, buffer)
1131 else:
1132 return self._sslobj.read(len)
KeyboardInterrupt:
Hey,
I can reproduce the same issue in a Studio on Lightning.ai. The hang seems to come from torchvision.
Here is a minimal repro:
import urllib
import urllib.error
import urllib.request

USER_AGENT = "pytorch/vision"

def _get_redirect_url(url: str, max_hops: int = 3) -> str:
    initial_url = url
    headers = {"Method": "HEAD", "User-Agent": USER_AGENT}
    for _ in range(max_hops + 1):
        with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
            if response.url == url or response.url is None:
                return url
            url = response.url
    else:
        raise RecursionError(
            f"Request to {initial_url} exceeded {max_hops} redirects. The last redirect points to {url}."
        )

url = "https://www.nass.usda.gov/Research_and_Science/Cropland/Release/datasets/2022_30m_cdls.zip"
url = _get_redirect_url(url)
assert url == url
print(url)
Interestingly enough, it works if I remove the "User-Agent": USER_AGENT from the headers.
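To make a hang like this fail fast while debugging, one option is to pass a timeout to `urllib.request.urlopen` and compare the two header sets. The sketch below is not torchvision code: `build_request` and `probe` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import urllib.request

USER_AGENT = "pytorch/vision"


def build_request(url: str, send_user_agent: bool) -> urllib.request.Request:
    """Build the request with or without the explicit User-Agent header."""
    headers = {"Method": "HEAD"}
    if send_user_agent:
        headers["User-Agent"] = USER_AGENT
    return urllib.request.Request(url, headers=headers)


def probe(url: str, send_user_agent: bool, timeout: float = 10.0) -> str:
    """Return the final URL, or a short error description instead of hanging."""
    request = build_request(url, send_user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.url
    except OSError as err:
        # URLError, socket timeouts, and connection errors all derive from OSError.
        return f"failed: {err!r}"
```

Calling `probe(url, True)` and `probe(url, False)` with the CDL URL from the repro should show whether only the User-Agent variant stalls on a given machine.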
A temporary workaround on lightning.ai, thanks to @tchaton:
from torchgeo.datasets import CDL

# Apply patch to pop User-Agent until we figure out why it hangs
from torchvision.datasets.utils import urllib

original_request = urllib.request.Request

def Request(*args, headers, **kwargs):
    if "User-Agent" in headers:
        headers.pop("User-Agent")
    return original_request(*args, headers=headers, **kwargs)

urllib.request.Request = Request

dataset = CDL(years=[2022], download=True, paths="./data")
print(dataset)
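Because `torchvision.datasets.utils.urllib` is the standard-library `urllib` package, the patch above replaces `urllib.request.Request` for the whole process. A scoped variant might look like the sketch below; the context manager name `without_user_agent` is hypothetical, and the sketch only strips the header when `headers` is passed as a keyword argument, which is how the torchvision call in the traceback passes it.

```python
import contextlib
import urllib.request


@contextlib.contextmanager
def without_user_agent():
    """Temporarily make urllib.request.Request drop any User-Agent header."""
    original_request = urllib.request.Request

    def request_without_user_agent(*args, headers=None, **kwargs):
        # Copy instead of mutating the caller's dict, then drop the header.
        cleaned = {
            key: value
            for key, value in (headers or {}).items()
            if key.lower() != "user-agent"
        }
        return original_request(*args, headers=cleaned, **kwargs)

    urllib.request.Request = request_without_user_agent
    try:
        yield
    finally:
        # Always restore the original class, even if the body raises.
        urllib.request.Request = original_request
```

The dataset construction would then go inside a `with without_user_agent():` block. Note that when no User-Agent is set, urllib sends its own default `Python-urllib/<version>` value, so the request is not header-free.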
However, when I go to plot a sample, I get the error:
RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement
I suspect this error is due to setting a crs that differs from the dataset's native crs, since there is no error when I don't set it.