Google Research Datasets

70 repositories owned by Google Research Datasets

wit

Stars: 961, Forks: 39

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Objectron

Stars: 2.2k, Forks: 264

Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds, and planes. In each video, the camera...

conceptual-12m

Stars: 327, Forks: 16

Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

coarse-discourse

Stars: 237, Forks: 33

A large corpus of discourse annotations and relations on ~10K forum threads.

conceptual-captions

Stars: 482, Forks: 25

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.

cvss

Stars: 167, Forks: 10

CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

dakshina

Stars: 178, Forks: 17

The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text,...

wiki-split

Stars: 121, Forks: 5

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.

query-wellformedness

Stars: 83, Forks: 12

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.