compsci.science

Antagonistic Experiments

Most researchers intuitively begin experiments with the presumption that their devised approach is good and that the burden of proof lies on the data to show otherwise. Even if the results then speak ill of the proposed approach, the researcher might be tempted to blame the dataset and silently replace it with a more favourable one.
"Unless it seeks to support a hypothesis, tinkering is not science. It is not science to assemble parts to 'see what happens.'" — Former ACM President Peter J. Denning [source]
In contrast, the scientific approach is more antagonistic in nature: instead of acting as an ally of our own creation, we act as its worst enemy. If the new approach survives and does well even after being exposed to the harshest conditions, then clearly our initial position was wrong, and this finding should be shared with others.

Practically, this means not treating a paper as an advertisement for the proposed approach, but as a proving ground where the most challenging problem instances are employed. The goal is to show that existing approaches are not always the best choice and that the proposed approach is solid and extends our capabilities. We should then expect to see some weaknesses of the new approach, and should perhaps even be suspicious if none are shown.

One challenging aspect not covered here is how to handle trade-offs between multiple criteria: for example, one algorithm can be faster but require more space, and different algorithms might perform well on different classes of problem instances. A set of large publicly available datasets can be found below.
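To make the idea concrete, here is a minimal, hypothetical harness in Python. It is only a sketch of the antagonistic mindset, not a real benchmark suite: two textbook sorting algorithms stand in for a "proposed" and a "baseline" approach, operation counts stand in for a cost measure, and all function names are illustrative.

```python
def insertion_sort_steps(data):
    """Cost of insertion sort, measured as the number of element shifts."""
    a = list(data)
    steps = 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            steps += 1
        a[j + 1] = key
    return steps

def merge_sort_steps(data):
    """Cost of merge sort, measured as the number of comparisons."""
    steps = 0
    def sort(a):
        nonlocal steps
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = sort(a[:mid]), sort(a[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            steps += 1
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged
    sort(list(data))
    return steps

def evaluate(methods, instance_sets):
    """Report each method's total cost on every class of instances."""
    return {name: {cls: sum(m(x) for x in instances)
                   for cls, instances in instance_sets.items()}
            for name, m in methods.items()}

n = 200
instance_sets = {
    "friendly": [list(range(n))],           # already sorted: flatters insertion sort
    "adversarial": [list(range(n, 0, -1))], # reversed: insertion sort's worst case
}
results = evaluate({"insertion": insertion_sort_steps,
                    "merge": merge_sort_steps}, instance_sets)
print(results)
```

On the friendly instances the "proposed" insertion sort looks unbeatable (zero shifts); only the adversarial instances expose its quadratic worst case. That is exactly the kind of weakness an antagonistic evaluation surfaces and an honest paper reports rather than hides.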

Honest Presentation

            | Pop Culture / Advertisement                         | Academia
Phenomenon  | Excessive photoshopping of celebrity/model pictures | Sensationalist reporting and superficial literature research
Consequence | Unrealistic body image                              | Unrealistic expectations about novelty in top-level publications

The manual editing of celebrity pictures (photoshopping) has gone so far as to create unrealistic expectations about body shapes and skin conditions. If we criticise such behaviour, it would be hypocritical to engage ourselves in cosmetic distortions of our own work, because overhyped and disingenuously written papers also generate unrealistic expectations. One might come to believe that every single paper in a leading venue has to be an exciting, groundbreaking result. As a consequence, solid work might be overlooked in favour of dubious results that are believed to have greater potential. Admittedly, some of the novelty perceived by the authors themselves can be attributed to poor literature research, or to overcomplicating their own approach to the point where its connection to well-established ideas gets lost. Such dynamics do not make the field look good in the long run and establish the wrong incentive structures for good research. Excitement should be generated not by poorly supported sensationalist claims, but through narratives that make the proposed methods easier to understand and appreciate.

Publicly available datasets

Large real-world datasets are useful because they pose bigger challenges to the studied methods and are more likely to cover a variety of distributions and patterns. While the most interesting datasets are often not available to the public due to commercial interests and ethical considerations, the following list aims to compile some large publicly available datasets that are appropriate for research purposes.

Tabular Data

source                                          | dataset                                                      | columns  | rows
Gaia spacecraft, European Space Agency          | map of stars (and more)                                      | tens     | billions
OpenStreetMap.org                               | crowdsourced GPS coordinates                                 | 3        | 2.7 billion
Columbia University & Facebook Connectivity Lab | high-resolution population grids based on satellite imagery  | millions | millions
University of California, Irvine                | simulated particle detector events related to Higgs bosons   | tens     | millions
ETH Zurich                                      | point cloud data                                             | 3        | millions

Network Data

source                         | dataset                     | nodes    | edges
arnetminer.org                 | DBLP citation network data  | millions | millions
KAIST University, South Korea  | Twitter network data        | millions | millions

Relational Data

source | data
TPC    | TPC-H benchmark (synthetic data)
TPC    | TPC-DS benchmark (synthetic data)
paper  | Join Order Benchmark
IMDb   | movie/series metadata

Repositories

provider                         | repository
University of California, Irvine | UC Irvine Machine Learning Repository