WIT (Wikipedia-based Image Text), hosted on GitHub at google-research-datasets/wit, is a large multilingual dataset of 37M+ image-text examples with 11M+ unique images across 100+ languages. It gives users image and caption data drawn from Wikipedia pages in many languages.

Main benefits to the user include:

Access to a large dataset, the ability to use data from multiple languages, and the ability to search the data quickly and easily.

Possible use cases include natural language processing research, machine learning projects, and training text generation models such as GPT.
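
As a rough illustration of the multilingual use case, the sketch below filters image-caption pairs by language. It assumes the data is read as tab-separated text with columns named language, image_url and caption_reference_description; the sample rows, the example.org URLs and the helper name captions_for_language are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample rows standing in for one tab-separated shard.
# Column names are assumed; the URLs and captions are invented.
SAMPLE_TSV = (
    "language\timage_url\tcaption_reference_description\n"
    "en\thttps://example.org/a.jpg\tA red bridge at dusk\n"
    "fr\thttps://example.org/b.jpg\tUn pont rouge au crépuscule\n"
    "en\thttps://example.org/c.jpg\t\n"
)


def captions_for_language(tsv_file, language):
    """Return (image_url, caption) pairs for one language code.

    Rows whose caption is empty are skipped.
    """
    # QUOTE_NONE keeps quote characters inside captions as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        caption = (row.get("caption_reference_description") or "").strip()
        if row.get("language") == language and caption:
            pairs.append((row["image_url"], caption))
    return pairs


english_pairs = captions_for_language(io.StringIO(SAMPLE_TSV), "en")
print(english_pairs)
```

For a compressed file on disk, the same function could be given a handle opened with gzip.open(path, "rt", encoding="utf-8", newline=""), where path is whatever local file you downloaded.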
