Google Research Datasets
tydiqa
TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the trainin...
uibert
It includes two datasets that are used in the downstream tasks for evaluating UIBert: App Similar Element Retrieval data and Visual Item Selection (VIS) data. Both datasets are stored as TFRecords.
uninum
A database of number names for 186 languages, locales, and scripts.
Video-Timeline-Tags-ViTT
A collection of videos annotated with timelines, where each video is divided into segments and each segment is labelled with a short free-text description.
WebRED
WebRED is a large and diverse manually annotated dataset for extracting relationships from a variety of text found on the World Wide Web.
wiki-reading
This repository contains the three WikiReading datasets as used and described in "WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia", Hewlett et al., ACL 2016 (the English Wiki...
swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask promptin...
AIS
AIS is an evaluation framework for assessing whether the output of natural language models contains only information about the external world that is verifiable in source documents, or "Attributable t...