Downloading returns a Google Drive virus scan warning page rather than the data files

Open forrestbao opened this issue 2 years ago • 7 comments

Short description

I am trying to load/download some datasets, mainly summarization ones. But instead of the *.tgz files, I get an HTML warning page telling me that the file is too large for Google Drive to scan for viruses and asking whether I want to download it anyway.

Environment information

  • Operating System: Ubuntu Linux 20.04

  • Python version: 3.10.4

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.5.2

  • tensorflow/tf-nightly version: tensorflow 2.8.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

N/A

Reproduction instructions

CoLab link here https://colab.research.google.com/drive/1F5jHy8o0_va6aIvuaB6H-EqfiWUrC9Ld#scrollTo=k3k-fYTuxw54

Python code below

import tensorflow_datasets as tfds
tfds.load('cnn_dailymail', split='test')

Link to logs

I got the error message:

NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ, downloaded to /root/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfTHk4NFg2SndKG8BdJPpt2iRo6Dpzz23CByJuAePEilB-pxbcBCHaWDs.tmp.cfa00e128d6c4efab209b7a281915239/uc, has wrong checksum. This might indicate:

  • The website may be down (e.g. returned a 503 status code). Please check the url.
  • For Google Drive URLs, try again later as Drive sometimes rejects downloads when too many people access the same URL. See https://github.com/tensorflow/datasets/issues/1482
  • The original datasets files may have been updated. In this case the TFDS dataset builder should be updated to use the new files and checksums. Sorry about that. Please open an issue or send us a PR with a fix.
  • If you're adding a new dataset, don't forget to register the checksums as explained in: https://www.tensorflow.org/datasets/add_dataset#2_run_download_and_prepare_locally

Below is the file that was downloaded to ~/tensorflow_datasets/downloads/<a random hash>:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="sj0n0DT6hrEpyOgybFb8Iw">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ">cnn_stories.tgz</a> (151M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="downloadForm" action="https://drive.google.com/uc?export=download&amp;id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&amp;confirm=t" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
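
A quick way to tell this failure apart from a genuinely changed archive is to look at the first bytes of the downloaded file. The sketch below is only an illustration, not part of TFDS: the helper names `classify_download` and `classify_file` are hypothetical. It relies on the fact that a .tgz file is gzip-compressed and so starts with the bytes 0x1f 0x8b, while the warning page starts with an HTML doctype.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream, including .tgz


def classify_download(head: bytes) -> str:
    """Classify a download from its first bytes (hypothetical helper).

    Returns "gzip" for a gzip archive, "html" for an HTML page such as the
    Drive warning above, and "unknown" for anything else.
    """
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "html"
    return "unknown"


def classify_file(path: str) -> str:
    """Read the first bytes of a file on disk and classify them."""
    with open(path, "rb") as f:
        return classify_download(f.read(512))
```

Running `classify_file` on the file in the downloads directory should report "html" when Drive returned the warning page instead of the archive.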

Expected behavior

I was expecting to see download progress bars for a while and then a return value like this:

 <PrefetchDataset element_spec={'abstract': TensorSpec(shape=(), dtype=tf.string, name=None), 'description': TensorSpec(shape=(), dtype=tf.string, name=None)}>

Additional context

This error repeats for many large datasets but not small ones. For example, I had no problem with cifar10. But I had the same issue with big_patent.

forrestbao • May 15 '22 22:05

@forrestbao I'm facing the same problem when directly using wget to download large datasets from Google Drive on Linux. Did you find any solution for this error?

parshinsh • Aug 23 '22 20:08

Use this: https://github.com/Rushikesh-Malave-175/GD-Resume. I know I'm super late, lol. It downloads only one file and only works on Windows, though.

Rushikesh-Malave-175 • Jun 07 '23 10:06

You can work around the "Google Drive can't scan this file for viruses." message if you know the download URL.

Just add the parameter confirm=t to the URL and it should work.

Example using OP's file:

wget 'https://drive.usercontent.google.com/download?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&export=download&authuser=1&confirm=t' -O cnn_stories.tgz
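
For anyone scripting this in Python rather than with wget, a rough equivalent might look like the sketch below. It assumes the same endpoint and query parameters as the command above (without the optional authuser); the function names `build_drive_url` and `download_drive_file` are made up for illustration, and whether Drive accepts confirm=t for a given file is not guaranteed.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Endpoint taken from the wget example above.
DRIVE_DOWNLOAD_ENDPOINT = "https://drive.usercontent.google.com/download"


def build_drive_url(file_id: str) -> str:
    """Build a direct-download URL that includes confirm=t (hypothetical helper)."""
    query = urlencode({"id": file_id, "export": "download", "confirm": "t"})
    return f"{DRIVE_DOWNLOAD_ENDPOINT}?{query}"


def download_drive_file(file_id: str, dest: str) -> str:
    """Download the file to dest and return dest. Needs network access."""
    urlretrieve(build_drive_url(file_id), dest)
    return dest
```

With OP's file this would be called as `download_drive_file("0BwmD_VLjROrfTHk4NFg2SndKcjQ", "cnn_stories.tgz")`.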

vinismarques • Apr 09 '24 16:04

@vinismarques That doesn't seem to work. When I inspect the URL behind the "Download anyway" button, I see a uuid parameter and an "at" parameter, which I cannot decrypt.

rome-legacy • Apr 18 '24 19:04

@rome-legacy the at parameter you see when inspecting the form HTML might not be necessary.

What worked for me was to add confirm=t to the URL directly. Take the URL of the page where you see the "Download anyway" button and add the confirm param. It will look something like this:

.../download?id=ID_HERE&export=download&confirm=t

The URL might also contain an authuser param; you can keep it.
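
If you already have the warning page's URL as a string, the same edit can be done programmatically. This is a minimal sketch using only the standard library; `add_confirm_param` is a hypothetical name, and it simply keeps every existing parameter (including authuser, if present) and sets confirm=t.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_confirm_param(url: str) -> str:
    """Return url with confirm=t set, keeping all other query parameters."""
    parts = urlsplit(url)
    # Drop any existing confirm value so the result has exactly one.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "confirm"
    ]
    params.append(("confirm", "t"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Calling it twice on the same URL gives the same result as calling it once, so it is safe to apply to URLs that may already carry the parameter.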

vinismarques • Apr 19 '24 18:04