
can't download the dataset

Open bagusyusuf opened this issue 5 years ago • 27 comments

hi author,

my name is Bagus. Regarding your disclaimer: "In case an error such as Permission denied: https://drive.google.com/uc?id=<ID>, Maybe you need to change permission over 'Anyone with the link'? occurs, please check your internet connection and run again the script."

I have tried to re-download a few times and it still reports the same error. I checked my internet connection and it seems OK. When I try to access your link in my browser, the Google Drive page reports that too many users have accessed the file and that there is a user limit. Is this normal?

I kindly need your help: I can't download the dataset. The only files that downloaded successfully are the annotations.

best regards, bagus

bagusyusuf avatar Oct 22 '18 18:10 bagusyusuf

Same issue here. I used Chrome to open the URL and got the following message:

[screenshot of the Google Drive error message]

Jiangfeng-Xiong avatar Oct 23 '18 02:10 Jiangfeng-Xiong

Dear @BagusYusuf @Jiangfeng-Xiong ,

Thank you for raising this issue.

The data are currently hosted on Google Drive, which appears to enforce a download limit of 10 TB per day. Our script is smart enough not to download the same data twice, but this is obviously an issue since the complete dataset weighs >1 TB.

It actually works for now, so you can go ahead and try to download it again. We are currently investigating a better solution for sharing the data and will update you as soon as possible.
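For reference, the resume behaviour boils down to a skip-if-present check before each chunk is fetched. A minimal sketch (the name download_if_missing is illustrative; download() is the function from downloader.py discussed further below):

import os

def download_if_missing(url, output):
    # Skip chunks that already exist locally, so re-running the script
    # after hitting the daily quota only fetches the missing files.
    if os.path.exists(output):
        print("Skipping %s (already downloaded)" % output)
        return
    download(url, output, quiet=False)  # download() as defined in downloader.py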

Best,

SilvioGiancola avatar Oct 23 '18 07:10 SilvioGiancola

Any update? Google Drive is not accessible in Mainland China. Would you consider Baidu Drive? https://login.bce.baidu.com/?lang=en

shenh10 avatar Jun 15 '19 05:06 shenh10

@shenh10 Have you downloaded the whole dataset, and could you share it with me? I also can't download it in Beijing.

shuida avatar Jul 03 '19 13:07 shuida

I have run into the same problem. Does anyone have the dataset on Baidu Drive? I have tried many times without success.

ANULLL avatar Jul 23 '19 10:07 ANULLL

We are currently not supporting Baidu Drive. If anyone has a copy of the dataset on Baidu Drive, we would be more than happy to credit them for their contribution and update the README with instructions on how to download TrackingNet in Mainland China.

SilvioGiancola avatar Jul 24 '19 06:07 SilvioGiancola

@shuida Nope... Failed

shenh10 avatar Jul 25 '19 05:07 shenh10

Can anybody help by uploading the dataset to Baidu Yunpan?

RogerYu123 avatar Aug 07 '19 08:08 RogerYu123

Is anyone still able to download the data using this devkit? I have been consistently hitting the "Maybe you need to change permission over..." error for the last 10 days and was only able to download fewer than 5 zips per day.

Any solution, or perhaps a third-party download? Thanks in advance.

rambleramble avatar Feb 06 '20 23:02 rambleramble

I guess Google is limiting downloads per client. I was downloading the annotations and got the same error at 41%; I tried again using a VPN to connect to my office network and it worked for another ~40%. Maybe you can add this tip to your README, @SilvioGiancola. I also tried renewing my WAN IP from my router, but that didn't work.

ghost avatar Feb 18 '20 16:02 ghost

I found a solution that works fine on my machine, though I don't really know why it works. It's not about Google's limits on download requests. Apparently you should send your Google account login information along with your HTTPS requests:

import requests

# Copy both strings from your web browser after logging in to your
# Google account ("some string" is a placeholder).
HEADERS = {"User-Agent": "some string"}
GOOGLE_LOGIN_COOKIE_STR = "some string"

# This function converts a raw 'key=value; key=value' cookie string into a
# requests.cookies.RequestsCookieJar object.
def cookie_str2jar(cookie_str):
    cookie_dict = {}
    for item in cookie_str.split(';'):
        item = item.strip()
        idx_eq = item.find('=')
        key = item[:idx_eq]
        value = item[idx_eq + 1:]
        cookie_dict[key] = value
    cookie_jar = requests.cookies.merge_cookies(requests.cookies.RequestsCookieJar(), cookie_dict)
    return cookie_jar

GOOGLE_LOGIN_COOKIE_JAR = cookie_str2jar(GOOGLE_LOGIN_COOKIE_STR)

You can find the above two strings in your web browser when you go to the Google main page and log in to your Google account. Python's requests package has its own default "User-Agent" string; I'm not sure whether that one works, so I used the one copied from my web browser, which looks like this:

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE"}
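To sanity-check the two strings before patching the devkit, you can fire a single signed-in request at Google Drive (a minimal sketch; a valid cookie string should avoid the permission error):

res = requests.get("https://drive.google.com/", headers=HEADERS, cookies=GOOGLE_LOGIN_COOKIE_JAR)
print(res.status_code)  # expect 200 with a valid login cookie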

Then, you have to change the first few lines of the download() function in downloader.py:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def download(url, output, quiet):
    url_origin = url
    sess = requests.session()
    retry = Retry(total=100)    # Better to have a large number of retries.
    adapter = HTTPAdapter(max_retries=retry)
    sess.mount('http://', adapter)
    sess.mount('https://', adapter)
    sess.keep_alive = False    # I'm not sure if this line is necessary; it works fine for me.

    is_gdrive = is_google_drive_url(url)  # helper defined elsewhere in downloader.py

    count = 0
    while True:
        count += 1
        # The 'if' guard below is required.
        # Apparently you have to send multiple requests for one zip file.
        # On my machine, it takes at most 3 requests to get the proper download response.
        # It works fine on my machine, but I really don't know why.
        if count <= 10:
            res = sess.get(url, headers=HEADERS, cookies=GOOGLE_LOGIN_COOKIE_JAR, stream=True)

        # The rest of this function is the same as the original version.
        if 'Content-Disposition' in res.headers:
            # This header means the response is the file itself.
            break
        if not is_gdrive:
            break
        ......

Also, I disabled IPv6 on my VPN host, but I'm not sure whether it matters. For those of you in China: in my experience, the VPN location has a big influence on download speed. I used a Vultr host located in New York, set up with V2Ray and Google BBR, and it downloaded the dataset at nearly the full speed of my internet service.

qdLMF avatar Feb 24 '20 09:02 qdLMF

To anyone who still has issues downloading TrackingNet: we are currently trying to find more reliable solutions. For now, we have created backup links to download full chunks of the training set (and the testing chunk). The data is still hosted on Google Drive, but this will make it easier to spread around the community using alternative sharing platforms (e.g. Baidu, Dropbox, good old HDDs, ...).

Here are two backup links: [link1] [link2]

It appears that Google Drive limits downloads when you are not signed in with your Gmail account. If you have any issue downloading, please make sure you are signed in to Google Drive with your Gmail account. We will track the situation in the coming days.

SilvioGiancola avatar Feb 25 '20 15:02 SilvioGiancola

@SilvioGiancola have you considered http://academictorrents.com/?

1e100 avatar Apr 16 '20 11:04 1e100

@1e100 academictorrents only provides a tracker for academic-related torrents; it does not host data, nor does it seed torrents. And torrenting has its limitations too:

  • First, universities usually block torrents (arguably solvable with a VPN), and TrackingNet is meant for research only, mostly at universities.
  • Second, it requires everyone in the community to seed for everyone else. Again, universities won't allow torrents, and companies usually won't care about seeding.

The real question here is: would you consider seeding TrackingNet? I don't mind creating a torrent (anyone can publish TrackingNet as a torrent), but it is only a solution if everyone seeds proportionally.

SilvioGiancola avatar Apr 19 '20 20:04 SilvioGiancola

I hear you @SilvioGiancola, but right now your dataset is basically impossible to download. I tried to download it right after midnight PST, and Google already says “bandwidth exceeded”. There’s got to be a solution of some sort for this.

1e100 avatar Apr 19 '20 20:04 1e100

I am pushing a version of TrackingNet to academictorrents. Please seed as much as possible, as my upload rate is a fraction of what Google Drive can provide. Any feedback is appreciated.

SilvioGiancola avatar Apr 19 '20 21:04 SilvioGiancola

Thank you, @SilvioGiancola. On my end I'll download and seed. I don't know how long I'll be able to keep it up, but hopefully enough for the interested parties to download and propagate further. I'll seed for at least a few days. Hopefully others will join in and carry the torch in a more permanent fashion.

1e100 avatar Apr 19 '20 21:04 1e100

@SilvioGiancola so far, no download progress. Are you sure you're forwarding the right port range? For aria2c, for example, it's 6881-6999, which is more ports than some other BitTorrent implementations use.

1e100 avatar Apr 19 '20 21:04 1e100

@1e100 I'm using the Transmission Qt GUI, and it's still verifying the local data: 180 GB / 1.14 TB done in the last 30 min, so I guess it will still take ~4 h.

SilvioGiancola avatar Apr 19 '20 21:04 SilvioGiancola

OK, I'll report back in 4-5 hours again.

1e100 avatar Apr 19 '20 21:04 1e100

Still nothing.

1e100 avatar Apr 20 '20 03:04 1e100

It will take me more time to figure this out; I'll keep you posted. In the meantime, can you reach out on Slack for further debugging?

SilvioGiancola avatar Apr 20 '20 08:04 SilvioGiancola

We are currently experimenting with a BitTorrent solution to share TrackingNet among the tracking community. The torrent is available at https://academictorrents.com/details/1faf1b53cc0099d2206f02be42b5688952c3c6b3.

It may be very slow at the beginning, but it will improve as more people request a copy. Here are some guidelines:

  • I hope everyone will play fair and seed their torrent, as we are currently doing.
  • Feel free to share the torrent among colleagues, seedboxes, or mirroring servers.
  • The Google Drive backup links [link1] [link2] are still available, but capped in daily bandwidth.
  • If you have already downloaded a chunk from the backup links, put it in the output folder of your torrent client: the torrent won't download it again, which will save you precious download time.

SilvioGiancola avatar Apr 22 '20 03:04 SilvioGiancola

Hi there, I am currently also experiencing download problems, as I cannot access Google Drive in Mainland China. If I use a backup link (like the ownCloud one added in May), does that mean I need to download the dataset manually instead of using download_TrackingNet.py, or do I just need to change the URL in that file? Also, I have tried to manually download TRAIN0.zip from ownCloud and I am still experiencing a very low download speed.

HAoYifei996 avatar May 12 '20 17:05 HAoYifei996

Yes, you should download the dataset manually if you are using any alternative solution (ownCloud, torrent, GDrive backup).

May I ask what speed you are reaching with the ownCloud solution? If it is not fast enough for you, I would recommend the torrent collection on academictorrents: https://academictorrents.com/collection/trackingnet.

SilvioGiancola avatar May 13 '20 07:05 SilvioGiancola

@SilvioGiancola Thank you for your reply. I was only reaching a speed of around 10 KB/s, which makes it impossible for me to download the whole dataset :( I don't know if this is caused by my location; my internet connection looks fine to me. I will try the torrent solution and see if it works. Thanks again for your help!

HAoYifei996 avatar May 13 '20 15:05 HAoYifei996

I guess the problem originates from your location; collaborators in Europe were able to reach 15 MB/s download speeds. Are you using your university connection? The ownCloud storage is part of a Globus network (https://www.globus.org/), which optimizes data transfers across universities and research institutions around the world.

Alternatively, you should try the academictorrents collection https://academictorrents.com/collection/trackingnet. Try the 13 torrents for the 12 training chunks and the test chunk; parallelizing the download across multiple torrents will give you higher overall bandwidth.
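For example, a minimal sketch of launching all 13 torrents in parallel with aria2c (the .torrent file names below are illustrative; use the ones you downloaded from the collection):

import subprocess

# One aria2c invocation can handle several torrents at once; it listens on
# its default BitTorrent port range 6881-6999 and seeds while downloading.
torrents = ["TRAIN_%d.torrent" % i for i in range(12)] + ["TEST.torrent"]
subprocess.run(["aria2c", "--seed-ratio=1.0", "--dir=TrackingNet"] + torrents)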

SilvioGiancola avatar May 14 '20 07:05 SilvioGiancola