Failed to download and load `the300w_lp` dataset through the current Google Drive URL

Open Inokinoki opened this issue 1 year ago • 1 comment

Short description

Dataset the300w_lp cannot be loaded due to Google Drive changes.

Environment information

  • Operating System: macOS

  • Python version: 3.11.9

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets==4.9.5

  • tensorflow/tf-nightly version: tensorflow==2.15.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Yes

Reproduction instructions

import tensorflow_datasets as tfds
tfds.load("the300w_lp", with_info=True)

Link to logs

https://gist.github.com/Inokinoki/36ee1c47cf4ee2b0bef4754900189335

Expected behavior

Load the dataset correctly.

Additional context

I investigated the issue. Instead of returning the file, Google Drive now responds with a warning page for files that are too large for it to scan for viruses:

curl -L "https://drive.google.com/uc?export=download&id=0B7OEHD3T4eCkVGs0TkhUWFN6N1k"         
<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="Cnthv5s43ZEpklfe8-kwQA">.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel="icon" href="//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0B7OEHD3T4eCkVGs0TkhUWFN6N1k">300W-LP.zip</a> (2.6G)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="download-form" action="https://drive.usercontent.google.com/download" method="get"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/><input type="hidden" name="id" value="0B7OEHD3T4eCkVGs0TkhUWFN6N1k"><input type="hidden" name="export" value="download"><input type="hidden" name="confirm" value="t"><input type="hidden" name="uuid" value="4fcfdc71-ca23-4264-8c6a-1322c7b1c73e"></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>%

Requesting the file from the new endpoint in that form (https://drive.usercontent.google.com/download) with confirm=t resolves this issue.
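A minimal sketch of that rewrite, assuming the old URL always carries the file id in its `id` query parameter and that the `uuid` field from the form can be omitted (untested here against Drive). `confirmed_drive_url` is a hypothetical helper, not part of tfds:

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint copied from the form action in the warning page above.
USERCONTENT_ENDPOINT = "https://drive.usercontent.google.com/download"


def confirmed_drive_url(old_url: str) -> str:
    """Rewrite a drive.google.com/uc download URL to the confirm=t form."""
    query = parse_qs(urlparse(old_url).query)
    file_id = query["id"][0]  # raises KeyError if the URL has no id parameter
    # Same hidden fields as the warning form, minus uuid (assumed optional).
    params = {"id": file_id, "export": "download", "confirm": "t"}
    return f"{USERCONTENT_ENDPOINT}?{urlencode(params)}"


old = "https://drive.google.com/uc?export=download&id=0B7OEHD3T4eCkVGs0TkhUWFN6N1k"
print(confirmed_drive_url(old))
```

The rewritten URL can then be passed to curl -L in place of the original one to check that the archive is returned rather than the HTML page.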

Inokinoki, Jul 17 '24 17:07

It seems that some other datasets have similar issues as well.

e.g., gov_report: https://drive.google.com/uc?export=download&id=1ik8uUVeIU-ky63vlnvxtfN2ZN-TUeov2
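
To find out which other dataset URLs are affected, one option is to check whether a response body is the warning page rather than the file itself. A rough sketch, assuming the page keeps the "Virus scan warning" title seen in the curl output for the300w_lp; `looks_like_scan_warning` is a hypothetical helper, not part of tfds:

```python
import re


def looks_like_scan_warning(body: str) -> bool:
    """Return True if a response body looks like Drive's virus-scan warning page."""
    # Assumption: the warning page is identified by its <title> text.
    match = re.search(r"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    return bool(match) and "Virus scan warning" in match.group(1)


sample = (
    "<!DOCTYPE html><html><head>"
    "<title>Google Drive - Virus scan warning</title>"
    "</head><body></body></html>"
)
print(looks_like_scan_warning(sample))
```

Feeding it the first few kilobytes of each download response would be enough, since the title appears at the very start of the page.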

Inokinoki, Jul 17 '24 17:07