[BUG] Chapter 2: Section "Download the data", buggy implementation for load_housing_data() function
The implementation of load_housing_data() is as follows:
def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))
With this implementation, if datasets/housing.tgz already exists, the function skips both the download and the extraction and goes straight to reading datasets/housing/housing.csv. But datasets/housing.tgz may exist while datasets/housing/housing.csv doesn't, in which case the code raises a FileNotFoundError. The correct implementation should look like this:
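from pathlib import Path
import tarfile

import pandas as pd
import requests

def load_housing_data():
    tarfile_path = Path("datasets/housing.tgz")
    if not tarfile_path.is_file():
        # Download the tarball only if it is not already on disk
        Path("datasets").mkdir(parents=True, exist_ok=True)
        response = requests.get("https://github.com/ageron/data/raw/main/housing.tgz")
        response.raise_for_status()  # don't save an HTTP error page as the tarball
        with open(tarfile_path, "wb") as f:
            f.write(response.content)
    # Always extract, so the CSV exists even if the housing folder was deleted
    with tarfile.open(tarfile_path) as housing_tarball:
        housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))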
If datasets/housing.tgz exists, extract it and then read the CSV. If it doesn't, download it, extract it, and then read the CSV.
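One downside of always extracting is that the tarball gets unpacked on every call. A possible variant checks for the CSV itself and only downloads or extracts what is missing. This is just a sketch, not the book's code: the helper name ensure_housing_csv and its data_dir and url parameters are made up for illustration, and it uses only the standard library so the pandas call stays outside.

```python
from pathlib import Path
import tarfile
import urllib.request

HOUSING_URL = "https://github.com/ageron/data/raw/main/housing.tgz"

def ensure_housing_csv(data_dir="datasets", url=HOUSING_URL):
    """Return the path to housing.csv, downloading/extracting only if needed.

    Hypothetical helper: the name and parameters are assumptions.
    """
    data_dir = Path(data_dir)
    tarball_path = data_dir / "housing.tgz"
    csv_path = data_dir / "housing" / "housing.csv"

    if not csv_path.is_file():
        # The CSV is missing: make sure we have a tarball to extract it from
        if not tarball_path.is_file():
            data_dir.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(url, tarball_path)
        # Covers the edge case where the tarball exists but the folder doesn't
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path=data_dir)
    return csv_path
```

With pandas imported as pd, pd.read_csv(ensure_housing_csv()) would then play the role of load_housing_data().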
I ran into the same thing. If you delete the housing folder, the code throws an error at the read_csv call.
For production code, you're totally right, but to avoid increasing the size of the book, and to keep things focused on the main points, I try to focus only on the "happy path", meaning that I generally don't handle edge cases (in particular, I rarely use try blocks). In the happy path, the tar file is either there and the uncompressed folder as well, or it's not there and neither is the folder. The case where one is there and not the other is considered an edge case. I hope this makes sense.