Downloading returns a Google Drive virus scan warning page instead of the data files
Short description
I am trying to load/download some datasets, mainly summarization ones. But instead of the *.tgz files, I got an HTML warning page telling me that the file is too large for Google Drive to scan for viruses and asking whether to download it anyway.
Environment information
- Operating System: Ubuntu Linux 20.04
- Python version: 3.10.4
- tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.5.2
- tensorflow/tf-nightly version: tensorflow 2.8.0
- Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)? N/A
Reproduction instructions
CoLab link here https://colab.research.google.com/drive/1F5jHy8o0_va6aIvuaB6H-EqfiWUrC9Ld#scrollTo=k3k-fYTuxw54
Python code below
import tensorflow_datasets as tfds
tfds.load('cnn_dailymail', split='test')
Link to logs
I got the error message:
NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ, downloaded to /root/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfTHk4NFg2SndKG8BdJPpt2iRo6Dpzz23CByJuAePEilB-pxbcBCHaWDs.tmp.cfa00e128d6c4efab209b7a281915239/uc, has wrong checksum. This might indicate:
- The website may be down (e.g. returned a 503 status code). Please check the url.
- For Google Drive URLs, try again later as Drive sometimes rejects downloads when too many people access the same URL. See https://github.com/tensorflow/datasets/issues/1482
- The original datasets files may have been updated. In this case the TFDS dataset builder should be updated to use the new files and checksums. Sorry about that. Please open an issue or send us a PR with a fix.
- If you're adding a new dataset, don't forget to register the checksums as explained in: https://www.tensorflow.org/datasets/add_dataset#2_run_download_and_prepare_locally
Below is the downloaded file under my ~/tensorflow_datasets/downloads/<a random hash>:
<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="sj0n0DT6hrEpyOgybFb8Iw">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ">cnn_stories.tgz</a> (151M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="downloadForm" action="https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&confirm=t" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
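A quick way to confirm that a download is the warning page rather than the archive is to look at its first bytes. The following is a minimal sketch, not part of TFDS: `classify_download` and `classify_file` are hypothetical helper names, and the check assumes the expected file is a gzip-compressed tarball (*.tgz), whose first two bytes are always 0x1f 0x8b.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream, including .tgz


def classify_download(head: bytes) -> str:
    """Classify the leading bytes of a downloaded file.

    Returns "gzip", "html", or "unknown".
    """
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    # Ignore leading whitespace and letter case before looking for HTML markers.
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"


def classify_file(path: str) -> str:
    """Read the first 512 bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return classify_download(f.read(512))
```

Calling `classify_file` on the file in the downloads directory should report "html" for the page shown above and "gzip" for a real cnn_stories.tgz.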
Expected behavior
I was expecting to see download progress bars for a while and then a return value like this:
<PrefetchDataset element_spec={'abstract': TensorSpec(shape=(), dtype=tf.string, name=None), 'description': TensorSpec(shape=(), dtype=tf.string, name=None)}>
Additional context
This error repeats for many large datasets but not small ones. For example, I had no problem with cifar10. But I had the same issue with big_patent.
@forrestbao I'm facing the same problem when directly using wget to download large datasets from Google Drive on Linux. Did you find any solution for this error?
Use this: https://github.com/Rushikesh-Malave-175/GD-Resume. I know I'm super late, lol. It downloads only one file and only works on Windows, though.
You can work around the "Google Drive can't scan this file for viruses." message if you know the download URL.
Just add the parameter confirm=t to the URL and it should work.
Example using OP's file:
wget 'https://drive.usercontent.google.com/download?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&export=download&authuser=1&confirm=t' -O cnn_stories.tgz
@vinismarques that doesn't seem to work. When I inspect the URL behind the "Download anyway" button, I see a uuid parameter and an "at" parameter, which I cannot decrypt.
@rome-legacy the at parameter you see when inspecting the form HTML might not be necessary.
What worked for me was to add confirm=t to the URL directly. Take the URL of the page where you see the "Download anyway" button and add the confirm param. It will look something like this:
.../download?id=ID_HERE&export=download&confirm=t
It might also have an authuser param in the URL; you can keep it.
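For anyone scripting this, the URL edit can be done with Python's standard library. This is only a sketch of the workaround described in this thread: `with_confirm` is a hypothetical helper name, and whether Google Drive honors confirm=t for a given file is something reported here, not guaranteed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_confirm(url: str, value: str = "t") -> str:
    """Return `url` with its query string set to include confirm=<value>.

    Existing parameters (id, export, authuser, ...) are kept in order;
    any existing confirm parameter is replaced rather than duplicated.
    """
    parts = urlsplit(url)
    query = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != "confirm"
    ]
    query.append(("confirm", value))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # URL taken from the error message in the original report.
    print(with_confirm(
        "https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ"
    ))
```

The resulting URL can then be passed to wget or any other downloader, as in the wget example earlier in the thread.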