[BUG] Chapter 2: Section "Download the data", buggy implementation for load_housing_data() function

Open alimoameri opened this issue 1 year ago • 1 comments

The implementation of load_housing_data() is as follows:

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

With this implementation, if the file datasets/housing.tgz exists, the function skips the download and extraction and goes straight to reading datasets/housing/housing.csv. It is possible for datasets/housing.tgz to exist while datasets/housing/housing.csv doesn't (for example, if the extracted folder was deleted). In that case, pd.read_csv() raises a FileNotFoundError. The correct implementation should look like this:

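import tarfile
from pathlib import Path

import pandas as pd
import requests

def load_housing_data():
    tarfile_path = Path("datasets/housing.tgz")

    # Download the tarball only if it is not already on disk.
    if not tarfile_path.is_file():
        tarfile_path.parent.mkdir(parents=True, exist_ok=True)
        response = requests.get("https://github.com/ageron/data/raw/main/housing.tgz")
        response.raise_for_status()  # fail early instead of saving an error page
        with open(tarfile_path, "wb") as f:
            f.write(response.content)

    # Always extract, so a missing datasets/housing folder is recreated.
    with tarfile.open(tarfile_path) as housing_tarball:
        housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))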

If datasets/housing.tgz exists, extract it and then read the CSV. If it doesn't, download it, extract it, and then read the CSV.
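
A possible variant, offered only as a sketch: skip the extraction when the CSV is already on disk, so repeated calls do not unpack the archive again. The helper name ensure_housing_csv and its data_dir and url parameters are hypothetical, and the sketch assumes the archive unpacks to housing/housing.csv, as the paths in this issue suggest. It downloads with urllib.request, as the book's version does, and returns the CSV path rather than a DataFrame, so the caller would pass the result to pd.read_csv().

```python
import tarfile
import urllib.request
from pathlib import Path

# Same URL as in the book's function.
HOUSING_URL = "https://github.com/ageron/data/raw/main/housing.tgz"


def ensure_housing_csv(data_dir="datasets", url=HOUSING_URL):
    """Return the path to housing.csv, downloading and extracting only if needed.

    Hypothetical helper for illustration; not part of the book's code.
    """
    data_dir = Path(data_dir)
    tarball_path = data_dir / "housing.tgz"
    csv_path = data_dir / "housing" / "housing.csv"

    # Happy path: the CSV is already extracted, nothing to do.
    if csv_path.is_file():
        return csv_path

    # Download the tarball only when it is missing.
    if not tarball_path.is_file():
        data_dir.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, tarball_path)

    # Covers the edge case: tarball present, extracted folder absent.
    with tarfile.open(tarball_path) as housing_tarball:
        housing_tarball.extractall(path=data_dir)

    return csv_path
```

With this helper, the loader would reduce to something like pd.read_csv(ensure_housing_csv()).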

alimoameri avatar Aug 22 '24 23:08 alimoameri

I ran into the same problem. If you delete the housing folder but keep the tarball, the code throws an error at the read_csv call.
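
For anyone who wants to reproduce this without touching their real datasets folder, here is a rough offline sketch. It only mimics the control flow of the book's function: book_style_load and reproduce are hypothetical names, an empty placeholder file stands in for the tarball, and open() stands in for pd.read_csv(), which raises the same FileNotFoundError for a missing file.

```python
import tempfile
from pathlib import Path


def book_style_load(data_dir):
    """Mimic the control flow of the book's load_housing_data().

    open() stands in for pd.read_csv(); both raise FileNotFoundError
    when the file does not exist.
    """
    tarball_path = Path(data_dir) / "housing.tgz"
    if not tarball_path.is_file():
        # The book downloads and extracts here; not simulated in this sketch.
        raise RuntimeError("download branch not simulated")
    with open(Path(data_dir) / "housing" / "housing.csv") as f:
        return f.read()


def reproduce():
    """Set up the edge case and report which error, if any, is raised."""
    with tempfile.TemporaryDirectory() as tmp:
        # Tarball present, extracted housing/ folder absent.
        (Path(tmp) / "housing.tgz").touch()
        try:
            book_style_load(tmp)
        except FileNotFoundError:
            return "FileNotFoundError"
    return "no error"


print(reproduce())  # prints "FileNotFoundError"
```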

Naseef03 avatar Sep 18 '24 22:09 Naseef03

For production code, you're totally right, but to avoid increasing the size of the book, and to keep things focused on the main points, I try to focus only on the "happy path", meaning that I generally don't handle edge cases (in particular, I rarely use try blocks). In the happy path, the tar file is either there and the uncompressed folder as well, or it's not there and neither is the folder. The case where one is there and not the other is considered an edge case. I hope this makes sense.

ageron avatar Oct 14 '25 01:10 ageron