compsci.science

Antagonistic Experiments

Most researchers intuitively begin experiments with the presumption that their devised approach is good and that the burden of proof lies on the data to show otherwise. Even if the results then speak ill of the proposed approach, the researcher might be tempted to blame the dataset and silently replace it with a more favourable one.
"Unless it seeks to support a hypothesis, tinkering is not science. It is not science to assemble parts to 'see what happens.'" — Former ACM President Peter J. Denning [source]
In contrast, the scientific approach is more antagonistic in nature: instead of acting as an ally of our own creation, we act as its worst enemy. If the new approach survives and does well even after being exposed to the harshest conditions, then clearly our initial position was wrong, and this finding should be shared with others.

Practically, this means not treating a paper as an advertisement for the proposed approach, but as a proving ground where the most challenging problem instances are employed. The goal is to show that existing approaches are not always the best choice and that the proposed approach is solid and extends our capabilities. We should then expect to see some weaknesses of the new approach, and should perhaps even be suspicious if none are shown.

One challenging aspect not covered here is how to handle trade-offs between multiple criteria: for example, one algorithm can be faster but require more space, and different algorithms might perform well on different classes of problem instances. A set of large publicly available datasets can be found below.
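To make the idea concrete, here is a minimal, hypothetical harness in Python. It is only a sketch of the antagonistic mindset, not a real benchmark suite: two textbook sorting algorithms stand in for a "proposed" and a "baseline" approach, operation counts stand in for a cost measure, and all function names are illustrative.

```python
def insertion_sort_steps(data):
    """Cost of insertion sort, measured as the number of element shifts."""
    a = list(data)
    steps = 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            steps += 1
        a[j + 1] = key
    return steps

def merge_sort_steps(data):
    """Cost of merge sort, measured as the number of comparisons."""
    steps = 0
    def sort(a):
        nonlocal steps
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = sort(a[:mid]), sort(a[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            steps += 1
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged
    sort(list(data))
    return steps

def evaluate(methods, instance_sets):
    """Report each method's total cost on every class of instances."""
    return {name: {cls: sum(m(x) for x in instances)
                   for cls, instances in instance_sets.items()}
            for name, m in methods.items()}

n = 200
instance_sets = {
    "friendly": [list(range(n))],           # already sorted: flatters insertion sort
    "adversarial": [list(range(n, 0, -1))], # reversed: insertion sort's worst case
}
results = evaluate({"insertion": insertion_sort_steps,
                    "merge": merge_sort_steps}, instance_sets)
print(results)
```

On the friendly instances the "proposed" insertion sort looks unbeatable (zero shifts); only the adversarial instances expose its quadratic worst case. That is exactly the kind of weakness an antagonistic evaluation surfaces and an honest paper reports rather than hides.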

Honest Presentation

            | Pop Culture / Advertisement                         | Academia
Phenomenon  | Excessive photoshopping of celebrity/model pictures | Sensationalist reporting and superficial literature research
Consequence | Unrealistic body image                              | Unrealistic expectations about novelty in top-level publications

The manual editing of celebrity pictures (photoshopping) has gone so far as to create unrealistic expectations about body shapes and skin conditions. If we criticise such behaviour, it would be hypocritical to engage ourselves in cosmetic distortions of our own work, because overhyped and disingenuously written papers also generate unrealistic expectations. One might come to believe that every single paper in a leading venue has to be an exciting, groundbreaking result. As a consequence, solid work might be overlooked in favour of dubious results that are believed to have greater potential. Admittedly, some of the novelty perceived by the authors themselves can be attributed to poor literature research, or to overcomplicating their own approach to the point where its connection to well-established ideas gets lost. Such dynamics do not make the field look good in the long run and establish the wrong incentive structures for good research. Excitement should be generated not by poorly supported sensationalist claims, but through narratives that make the proposed methods easier to understand and appreciate.

Publicly available datasets

Large real-world datasets are useful because they pose bigger challenges to the studied methods and are more likely to cover a variety of distributions and patterns. While the most interesting datasets are often not available to the public due to commercial interests and ethical considerations, the following list aims to compile some large publicly available datasets that are appropriate for research purposes.

Tabular Data

source                                          | dataset                                                      | columns  | rows
Gaia spacecraft, European Space Agency          | map of stars (and more)                                      | tens     | billions
OpenStreetMap.org                               | crowdsourced GPS coordinates                                 | 3        | 2.7 billion
Columbia University & Facebook Connectivity Lab | high-resolution population grids based on satellite imagery  | millions | millions
University of California, Irvine                | simulated particle detector events related to Higgs bosons   | tens     | millions
ETH Zurich                                      | point cloud data                                             | 3        | millions

Network Data

source                         | dataset                     | nodes    | edges
arnetminer.org                 | DBLP citation network data  | millions | millions
KAIST University, South Korea  | Twitter network data        | millions | millions

Relational Data

source | data
TPC    | TPC-H benchmark (synthetic data)
TPC    | TPC-DS benchmark (synthetic data)
paper  | Join Order Benchmark
IMDb   | movie/series metadata

Repositories

provider                         | repository
University of California, Irvine | UC Irvine Machine Learning Repository