Towards a scientific exploration of computational realities.

compsci.science — computer "science" predictive theory experiments peer review blog about

Computer "science"

Computer science deals with machines, employs mathematics and has perfectly reproducible experiments. Science! Right? What if the mathematics do not predict real-world behaviour? What if the experimental design does not allow to generalise beyond a handful problem instances? What are we missing here? Consider the following paradigms:

A	Reproduceable experiments and objective manner to analyse results
B	Formal computation model and asymptotic complexity results
C	Targeted observations that allow to distinguish between possible realities

While the engineering paradigm A and mathematical paradigm B may achieve a scientific appearance due to plots and formulas, only the science paradigm C specifically targets observations (e.g., empirical and theoretical results) that build conclusive knowledge about computational realities (more details at the bottom of this page).

A hopeful quote from over 15 years ago:

The science paradigm has not been part of the mainstream perception of computer science. But soon it will be. — Former ACM President Peter J. Denning [source]

Are we there yet?

It could be argued that to this day computer science leaves itself open to the following criticisms:

Asymptotic computational complexity: Even average-case/smoothed complexities can be a very poor predictor for real-world behaviour, because for finite problem instances asymptotic formulas are just course approximations.
Handpicked experiments: A handful cherrypicked problem instances are typically not approriate for generalisations.
Disingenous/sensationalist presentation: Colourful narratives are very exciting, but without clear and rigorous evidence it misleads readers and creates unrealistic expectations.

Asymptotic computational complexity has been a staple of computer science for many decades and while it is a very useful heuristic, it is perhaps not surprising that simple formulas can only provide some overly simplistic view of computational realities which are often more complex [more details]. There have been some efforts to address cherrypicking and replication problems:

Established Benchmarks: Not available for most problems. Counteracts cherrypicked experiments, but invites cherrypicked approaches aimed to only perform well for the reference benchmark.
Reproducibility efforts [SIGMOD] [ML/NLP]: Counteracts misreported experiments. Publicly available code facilitates follow-up studies, but does not affect any incentive structures (e.g., peer-review).

How can we get there?

While previous efforts have been a step in the right direction, a general culture change is needed, specifically:

Predictive theory [more details]:
- Theory that is either guaranteed or demonstrated to bound or predict real-world behaviour.
- Worst-case/average-case/smoothed complexities are supplemented by some study of hidden constants, or even non-asymptotic formulas.
- When approriate, theoretical models are devised for sets of problem instances
Antagonistic experimental design [more details]: Alongside representative problem instances, picking some challenging problem instances that are likely harder than most problem instances.
Honest presentation [more details]: Clearly presenting novelties, but also critically discussing limitations and exploring both strengths and weaknesses in the experiments.

Clearly, a lot of it comes down to peer-review of publication manuscripts and applications for grants or positions. If reviewers set unrealistic expectations that all new approaches should be better in every regard, then authors are forced to run a tailored benchmark that shows the new approach in the best light. Furthermore, if old approaches only need to be better in one way, it would almost guarantee getting stuck with bad approaches. If reviewers only consider experimental or purely theoretical results, then there is no incentive for predictive theory although it has larger scientific value.

How do we compare to sciences studying human subjects?

One way to reflect on computer science, is to compare it to other sciences, specifically ones where repeatability of experiments is quite challenging. While social/medical sciences investigate a human population, computer science investigates a population of problem instances:

	Computer Science	Medical/Social Sciences
Population	Problem instances	Humans
Sample	Problem instances used in experiments	Humans participating in experiments
Sample Size	Typically, handful of problem instances (N < 10)	Typically, hundreds of participants (N > 100)
Sample Type	Typically a non-random, handpicked sample	Typically a randomised convenience sample
Sample publicly available	Typically yes	Typically no
Theory	e.g., asymptotic complexity	predictive/descriptive theory
Independent Variable	old approaches vs new approach	control group vs intervention
Dependent Variable	performance measures	various measures
Study Design	within-subject/repeated-measures experiment (without order effects)	various
Reproducibility	Limited to "sample"	See replication crisis.
Analysis	Description of effect sizes	Statistical/effect size analysis
Generalisability	Completely subjective and often overstated	Critically discussed and investigated

In medical/social sciences most of the statistical analysis is aimed to rule out that the observed effects are merely sampling artifacts (and hence expected results for the null hypothesis). For non-random samples as used in computer science this is not possible and potentially all reported improvements generalise very poorly to other problem instances. It is therefore important to interpret such results very carefully and try to avoid overreaching generalisations. Furthermore, antagonistic sampling can help a bit (that we can think of as the opposite of cherrypicking), i.e., looking for problem instances that are likely harder than a random sample would be.

How would we apply the science paradigm to something like algorithms?

It boils down to the following questions:

Scenarios and hypotheses	Which hypothetical scenarios must be considered? How can they be grouped into hypotheses?
Predictions	Which observations are expected in each scenario?
Methodology	Which observations can reliably distinguish the considered scenarios?
Replicability	How can similar observations be replicated?
Testing	What are the observations and is it possible to repeat them?
Analysis	Which considered scenarios can be ruled out?

Which hypothetical scenarios must be considered? How can they be grouped into hypotheses?

For algorithms there are quite a few scenarios to consider. One may for instance consider a table, where the rows are important classes of problem instances, the columns are different priorisations between performance metrics and each table entry rates how well the algorithm fares for a particular class and metric. Each scenario is then a different way to fill out the table. Due to the many scenarios it is often useful to group all scenarios into the ones that reflect a scientifically interesting result (research hypothesis) and the rest (null hypothesis).

Which observations are expected in each scenario?

The expectations for each scenario formalise how one can assess if a scenario table is an apt description of the algorithm.

Which observations can reliably distinguish the considered scenarios?

The study design is about figuring out which empirical and theoretical results need to be collected and derived to rule out most of the scenarios. The goal is to be as conclusive as possible.

How can similar observations be replicated?

How to obtain similar empirical results for other representative problem instances is rarely discussed. Despite that replicability is key to any generalisation claims. While a particular theoretical result is replicable due to the proofs, the broader claims may require to obtain similar results for slightly deviating assumptions and models. Otherwise, the theoretical results can be an artifact of the chosen computation model and presumed performance metrics, which is still of interest, but may not support any broader claims.

What are the observations and is it possible to repeat them?

In computer science it is typically expected that experiments are repeatable. Repeatability is often required to keep researchers honest and aids transparency.

Which considered scenarios can be ruled out?

A discussion of the results then should narrow down which scenarios are plausible given the observations. While it is difficult to publish in computer science when admitting to inconclusive results, it is the typical reality of most distribution-dependent algorithms.

Why science

As a concluding remark some obvious reasons why computer science should aim to live up to its name:

Predictive power for computer system models is crucial for the progression of technology in an increasingly computerised world.
Cognitive biases even amongst elites are proven to cloud people's judgements and scientific methods aim to minimise their impact.
Carefully designed experiments aim to disprove previous explanations for phenomena beyond reasonable doubt, establishing reliable knowledge.
Objective truths sought after by scientific inquiry are a universal destination for people from any part of the world, culture or school of thought.