compsci.science

Publicly available datasets

Large real-world datasets pose greater challenges to the methods under study and are more likely to cover a variety of distributions and patterns. While the most interesting datasets are not available to the public because of commercial interests and ethical considerations, the following list compiles some large publicly available datasets that are appropriate for research purposes.

Tabular Data

source | dataset | columns | rows
Gaia spacecraft, European Space Agency | map of stars (and more) | tens | billions
OpenStreetMap.org | crowdsourced GPS coordinates | 3 | 2.7 billion
Columbia University & Facebook Connectivity Lab | high-resolution population grids based on satellite imagery | millions | millions
University of California, Irvine | simulated particle detector events related to Higgs bosons | tens | millions
ETH Zurich | point cloud data | 3 | millions
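
Tables of this size rarely fit in memory, so their shape (the columns and rows figures listed above) is usually measured by streaming over the file. The following is a minimal sketch, assuming the dataset has been exported to a delimited text file with a header row. The function name `table_shape` and the in-memory sample are hypothetical and not part of any listed dataset.

```python
import csv
import io


def table_shape(lines, delimiter=","):
    """Return (columns, rows) of a delimited table by streaming over it.

    `lines` is any iterable of text lines, e.g. an open file object.
    Assumes the first record is a header row, which is not counted as data.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        # Empty input: no columns, no rows.
        return 0, 0
    # Count non-empty records one at a time; nothing is kept in memory.
    rows = sum(1 for record in reader if record)
    return len(header), rows


# Hypothetical three-column sample standing in for a point cloud export.
sample = io.StringIO("x,y,z\n0.1,0.2,0.3\n1.0,2.0,3.0\n")
shape = table_shape(sample)
print(shape)
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="")`, where `path` points to a local copy of the data.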

Network Data

source | dataset | nodes | edges
arnetminer.org | DBLP citation network data | millions | millions
KAIST university, South Korea | Twitter network data | millions | millions
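
The node and edge counts of a network dataset can be checked with a single pass over its edge list. This is a minimal sketch, assuming one whitespace-separated `source target` pair per line and `#` for comment lines. That layout, the function name `graph_size`, and the sample edges are assumptions for illustration; the datasets listed above may ship in a different format.

```python
def graph_size(edge_lines):
    """Return (nodes, edges) for an edge list streamed line by line.

    Each non-empty, non-comment line is assumed to start with two
    whitespace-separated node identifiers: source, then target.
    Repeated edges are counted each time they appear.
    """
    nodes = set()
    edges = 0
    for line in edge_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split()[:2]
        nodes.update((source, target))
        edges += 1
    return len(nodes), edges


# Hypothetical sample: three nodes connected by four directed edges.
sample = ["# source target", "a b", "b c", "a c", "c a"]
size = graph_size(sample)
print(size)
```

Only the set of node identifiers is held in memory, so the approach stays practical for graphs with millions of nodes even when the edge list itself is much larger.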

Relational Data

source | data
TPC | TPC-H benchmark (synthetic data)
TPC | TPC-DS benchmark (synthetic data)
paper | Join Order Benchmark
IMDb | movie/series metadata

Repositories

provider | repository
University of California, Irvine | UC Irvine Machine Learning Repository