
Issue with dataset and codebook for clustering-visualizing-word-embeddings

Open charlottejmc opened this issue 1 year ago • 8 comments

I've noticed a discrepancy between the code given inside the lesson under Load the Data, and the counterpart given in the associated codebook.

In the codebook, we see:

# Name of the file
fn = 'ph-tutorial-data-cleaned.parquet'

# See if the data has already been downloaded, and
# if not, download it from the web site. We save a
# copy locally so that you can run this tutorial
# offline and also spare the host the bandwidth costs
if os.path.exists(os.path.join('data',fn)):
    df = pd.read_parquet(os.path.join('data',fn))
else:
    # We will look for/create a 'data' directory
    if not os.path.exists('data'):
        os.makedirs('data')

    # Download and save
    df = pd.read_parquet(f'http://orca.casa.ucl.ac.uk/~jreades/data/{fn}')
    df.to_parquet(os.path.join('data',fn))

This code downloads the dataset directly from http://orca.casa.ucl.ac.uk/~jreades/data/, where it pulls the file with the filename fn = 'ph-tutorial-data-cleaned.parquet'.

However, we now host the dataset directly on our Zenodo repository through a live DOI https://doi.org/10.46430/phen0112...

The code in the lesson reflects this change, as seen under # Download and save:

# Name of the file
fn = 'ph-tutorial-data-cleaned.parquet'

# See if the data has already been downloaded, and
# if not, download it from the website
if os.path.exists(os.path.join('data',fn)):
    df = pd.read_parquet(os.path.join('data',fn))
else:
    # We will look for/create a 'data' directory
    if not os.path.exists('data'):
        os.makedirs('data')
   
    # Download and save
    df = pd.read_parquet(f'https://doi.org/10.46430/phen0112{fn}')
    df.to_parquet(os.path.join('data',fn))

I attempted to edit the code in the codebook to use the same DOI. However, this leads to an error when I actually try to run the code on Google Colab.

As far as I can tell, this is because the whole code block is configured to work with the initial URL, http://orca.casa.ucl.ac.uk/~jreades/data/ (it looks for the data directory and the filename, which it doesn't find through the DOI).

As such, I believe we might need to update the code block both in the lesson and in the codebook to be able to run it using the DOI itself, or, alternatively, the .zip file hosted on Zenodo.
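To illustrate one half of the problem at the string level (this is a sketch, not code from the lesson): the f-string glues the filename straight onto the DOI suffix, producing an identifier that doesn't exist, and in any case a DOI resolves via HTTP redirect to the Zenodo landing page rather than to a downloadable file.

```python
fn = 'ph-tutorial-data-cleaned.parquet'

# The lesson's f-string concatenates the filename directly onto the
# DOI suffix, yielding '...phen0112ph-tutorial-data-cleaned.parquet',
# which is not a registered DOI.
url = f'https://doi.org/10.46430/phen0112{fn}'
print(url)
# Even a well-formed DOI redirects to the record's landing page,
# not to a file, so it cannot be passed to pd.read_parquet() directly.
```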

charlottejmc, Jan 31 '24 15:01

The author Jon Reades has replied to me outlining his thoughts:

"The underlying issue is that the Zenodo data ref is actually a Zip file, so you would need Python to download the Zip file, unzip it, and then read in the Parquet file. This wasn’t part of the original workflow but the data was bundled up during the publication process and I didn’t twig to this consequence until now.

I think there are two options:

  1. Update the Zenodo resource so that the DOI points to only one file: replace the Zip file with the parquet.
  2. Update the notebook so that it performs the operation I’ve just outlined above automatically.

Option 1 is a bit tidier, but Option 2 might make more sense if trying to update the DOI is going to be a massive headache."

charlottejmc, Jan 31 '24 16:01

Hello @jreades,

I thought I would tag you in this issue to allow us to collaborate further on updating the 'Load the Data' code block.

Although it requires more edits to the code, we believe it would be better to choose Option 2: update the code so that it accesses the Zenodo resource through its DOI (https://doi.org/10.46430/phen0112), downloads and opens the .zip file from there, and reads in the Parquet file contained within.

I'd like to ask whether you would be able to provide the bit of code which would perform this operation cleanly within the Google Colab notebook. Hopefully this isn't too technically complex?

If this is something you feel able to help with, I believe the most efficient way to collaborate would be for you to post the new code below in reply to this comment, and allow me to paste it in both to the lesson and the codebook on our end.

Do let me know what you think!

Thank you very much for your support with this issue, and apologies again for allowing this to slip through before publication...

Best,

Charlotte ✨

charlottejmc, Feb 01 '24 15:02

Here is some code that should work without the need for additional libraries:

# Adapted from https://stackoverflow.com/a/72503304
import os
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

# Where is the Zipfile stored on Zenodo?
zipfile = 'clustering-visualizing-word-embeddings.zip'
zipurl  = f'https://zenodo.org/records/7948908/files/{zipfile}?download=1'

# Open the remote Zipfile and read it directly into Python
with urlopen(zipurl) as zipresp:
    with ZipFile(BytesIO(zipresp.read())) as zf:
        for zfile in zf.namelist():
            if not zfile.startswith('__'): # Don't unpack hidden MacOSX junk
                print(f"Extracting {zfile}") # Update the user
                zf.extract(zfile,'.')

This will save the two data files to a folder called clustering-visualizing-word-embeddings, which might necessitate a tweak to the first notebook when we tell it where to read the data from.
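For reference, if the folder were left with its original name, that tweak would presumably be as small as pointing the load line at the unzipped folder (the folder and file names below are taken from the comments above; the final `read_parquet` line is shown as a comment since it needs the data present):

```python
import os

# Names from the Zenodo zip: the folder it unpacks to,
# and the Parquet file inside it
dn = 'clustering-visualizing-word-embeddings'
fn = 'ph-tutorial-data-cleaned.parquet'

path = os.path.join(dn, fn)
print(path)
# The notebook's load line would then become:
#   df = pd.read_parquet(path)
```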

jreades, Feb 08 '24 15:02

@jreades, thank you very much for your help! This ran perfectly when I tried it in the Google Colab notebook.

Could you just clarify where you anticipate the additional 'tweak' might be needed?

Once we've ironed out these final small changes, I will update the notebook and the code block in the lesson.

charlottejmc, Feb 14 '24 17:02

Ah, right. OK, it just means that the file we want is no longer in data. I was going to update the notebook to look in the clustering-visualizing-... directory, but I realised that there are actually a fair few reads/writes from that folder. So here's my suggested addition to the above code:

# And rename the unzipped directory to 'data' -- 
# IMPORTANT: Note that if 'data' already exists
# it will (probably) be silently overwritten.
os.rename('clustering-visualizing-word-embeddings','data')

This can just go on the end of the changes above and should ensure that everything after this runs.
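One caveat worth noting: on most platforms os.rename() actually raises an OSError when the target is an existing non-empty directory rather than silently overwriting it. A more defensive variant of the rename step (this is a sketch, not the code from the comment above) removes any stale 'data' folder first:

```python
import os
import shutil

def rename_to_data(src='clustering-visualizing-word-embeddings', dst='data'):
    """Move the unzipped folder to 'data', replacing any stale copy.

    os.rename() raises OSError if dst already exists as a non-empty
    directory, so we remove dst first before renaming.
    """
    if os.path.isdir(dst):
        shutil.rmtree(dst)
    os.rename(src, dst)
```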

Jon

jreades, Feb 20 '24 19:02

@jreades @anisa-hawes here's a screenshot of the error I'm getting in the Colab notebook now: [screenshot: Screenshot 2024-02-23 120419]

hawc2, Feb 23 '24 17:02

There is some missing code: there should be something along the lines of df = pd.read_parquet(… that is run before you can do the 'list columns' line. It might be a question of moving that line down, or a line may have been accidentally deleted. I will try to look at it in the next few days. Can someone confirm the URL for the notebook? (Writing this from my phone.)

Jon

jreades, Feb 23 '24 17:02

Sorry, I got there eventually. This should work and replace the similar block of code already in the notebook. I've added a check so that the file is only downloaded if it's not found where it was expected.

# Adapted from https://stackoverflow.com/a/72503304
import os
import pandas as pd

dn = 'data'
fn = 'ph-tutorial-data-cleaned.parquet'

if not os.path.exists(os.path.join(dn,fn)): 
    print(f"Couldn't find {os.path.join(dn,fn)}, downloading...")
    from io import BytesIO
    from urllib.request import urlopen
    from zipfile import ZipFile

    # Where is the Zipfile stored on Zenodo?
    zipfile = 'clustering-visualizing-word-embeddings.zip'
    zipurl  = f'https://zenodo.org/records/7948908/files/{zipfile}?download=1'

    # Open the remote Zipfile and read it directly into Python
    with urlopen(zipurl) as zipresp:
        with ZipFile(BytesIO(zipresp.read())) as zf:
            for zfile in zf.namelist():
                if not zfile.startswith('__'): # Don't unpack hidden MacOSX junk
                    print(f"Extracting {zfile}") # Update the user
                    zf.extract(zfile,'.')
    print("  Downloaded.")
    # And rename the unzipped directory to 'data' --
    # IMPORTANT: Note that if 'data' already exists it will (probably) be silently overwritten.
    os.rename('clustering-visualizing-word-embeddings',dn)

print(f"Loading {fn}")
df = pd.read_parquet(os.path.join(dn,fn))
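Once that block has run, a quick sanity check before any later 'list columns' cell would confirm the load succeeded. The DataFrame below is a hypothetical stand-in with made-up column names, purely to illustrate the check; in the notebook, df comes from pd.read_parquet in the block above.

```python
import pandas as pd

# Hypothetical stand-in for the loaded DataFrame; the real
# columns come from the Parquet file, not from here
df = pd.DataFrame({'title': ['a'], 'abstract': ['b']})

# Confirm the load produced something sensible before
# any 'list columns' cell runs
print(df.shape)
print(list(df.columns))
```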

jreades, Mar 01 '24 10:03

Thank you very much @jreades. This is a great help. We'll coordinate the update!

anisa-hawes, Mar 01 '24 10:03