
Computer "science"

Computer science deals with machines, employs mathematics and has perfectly reproducible experiments. Science! Right? But what if the mathematics does not predict real-world behaviour? What if the experimental design does not allow generalisation beyond a handful of problem instances? What are we missing here? Consider the following paradigms:

A. Reproducible experiments and an objective way to analyse results
B. Formal computation model and asymptotic complexity results
C. Targeted observations that allow us to distinguish between possible realities

While the engineering paradigm A and the mathematical paradigm B may achieve a scientific appearance through plots and formulas, only the science paradigm C specifically targets observations (empirical and theoretical results alike) that build conclusive knowledge about computational realities (more details at the bottom of this page).

A hopeful quote from over 15 years ago:

The science paradigm has not been part of the mainstream perception of computer science. But soon it will be. — Former ACM President Peter J. Denning [source]

Are we there yet?

It could be argued that to this day computer science leaves itself open to criticism: asymptotic computational complexity has been a staple of the field for many decades, and while it is a very useful heuristic, it is perhaps not surprising that simple formulas can only provide an overly simplistic view of computational realities, which are often far more complex [more details]. There have also been some efforts to address cherry-picking and replication problems.
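The gap between a simple asymptotic formula and observed behaviour is easy to demonstrate. The following minimal sketch (not from the text above; the setup is illustrative) counts the key comparisons insertion sort performs: the textbook O(n²) worst-case bound is exact on adversarial input, but overestimates already-sorted input by three orders of magnitude, so the formula alone is a poor predictor for any particular class of problem instances.

```python
import random

def insertion_sort_comparisons(a):
    """Sort a copy of `a` and return the number of key comparisons made."""
    a = list(a)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1          # one comparison of key against a[j]
            if a[j] > key:
                a[j + 1] = a[j]       # shift larger element to the right
                j -= 1
            else:
                break
        a[j + 1] = key
    return comparisons

n = 1000
worst_case_bound = n * (n - 1) // 2   # the textbook O(n^2) bound: 499500

random.seed(0)
print(insertion_sort_comparisons(list(range(n))))        # sorted input: 999
print(insertion_sort_comparisons(list(range(n, 0, -1)))) # reversed: 499500
print(insertion_sort_comparisons(random.sample(range(n), n)))  # in between
```

The same bound thus describes one instance class exactly and another very badly, which is precisely the kind of distinction a scenario table per instance class would capture.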

How can we get there?

While previous efforts have been a step in the right direction, a general culture change is needed. Much of it comes down to peer review of publication manuscripts and of applications for grants or positions. If reviewers set the unrealistic expectation that every new approach must be better in every regard, then authors are forced to run a tailored benchmark that shows the new approach in the best light. Conversely, since established approaches only ever needed to be better in one regard, such asymmetric expectations almost guarantee getting stuck with bad approaches. And if reviewers only consider experimental or purely theoretical results, then there is no incentive to develop predictive theory, even though it has greater scientific value.

How do we compare to sciences studying human subjects?

One way to reflect on computer science is to compare it to other sciences, specifically ones where repeatability of experiments is quite challenging. While the social and medical sciences investigate a human population, computer science investigates a population of problem instances:

                          | Computer Science                          | Medical/Social Sciences
Population                | Problem instances                         | Humans
Sample                    | Problem instances used in experiments     | Humans participating in experiments
Sample size               | Typically a handful of problem instances (N < 10) | Typically hundreds of participants (N > 100)
Sample type               | Typically a non-random, handpicked sample | Typically a randomised convenience sample
Sample publicly available | Typically yes                             | Typically no
Theory                    | E.g., asymptotic complexity               | Predictive/descriptive theory
Independent variable      | Old approaches vs. new approach           | Control group vs. intervention
Dependent variable        | Performance measures                      | Various measures
Study design              | Within-subject/repeated-measures experiment (without order effects) | Various
Reproducibility           | Limited to the "sample"                   | See the replication crisis
Analysis                  | Description of effect sizes               | Statistical/effect size analysis
Generalisability          | Completely subjective and often overstated | Critically discussed and investigated

In the medical/social sciences, most of the statistical analysis aims to rule out that the observed effects are merely sampling artifacts (and hence expected under the null hypothesis). For non-random samples, as used in computer science, this is not possible, and potentially all reported improvements generalise very poorly to other problem instances. It is therefore important to interpret such results very carefully and to avoid overreaching generalisations. Furthermore, antagonistic sampling, which we can think of as the opposite of cherry-picking, can help somewhat: deliberately looking for problem instances that are likely harder than a random sample would be.
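To make the sampling-artifact concern concrete, here is a sketch of an exact paired permutation test in plain Python (the runtimes are invented purely for illustration). Under the null hypothesis the labels "old" and "new" are exchangeable per instance, so each paired difference could equally well have the opposite sign; with only five instances, even a uniform improvement on every single instance cannot reach a two-sided p-value below 0.05.

```python
from itertools import product

def paired_permutation_test(old_times, new_times):
    """Exact sign-flip permutation test on paired differences.

    Returns the fraction of sign assignments whose absolute mean
    difference is at least as extreme as the observed one
    (a two-sided p-value).
    """
    diffs = [o - n for o, n in zip(old_times, new_times)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total

# Hypothetical runtimes (seconds) on a handful of benchmark instances;
# the new approach is faster on every instance.
old = [12.1, 9.8, 15.3, 11.0, 13.7]
new = [10.9, 9.1, 14.0, 10.2, 12.5]
print(paired_permutation_test(old, new))  # 2/32 = 0.0625, still above 0.05
```

With N = 5 only 2 of the 2^5 sign assignments are as extreme as the observed one, which is exactly the small-sample limitation of the "handful of problem instances" row in the table above.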

How would we apply the science paradigm to something like algorithms?

It boils down to the following questions:

Scenarios and hypotheses: Which hypothetical scenarios must be considered? How can they be grouped into hypotheses?
Predictions: Which observations are expected in each scenario?
Methodology: Which observations can reliably distinguish the considered scenarios?
Replicability: How can similar observations be replicated?
Testing: What are the observations and is it possible to repeat them?
Analysis: Which considered scenarios can be ruled out?

Which hypothetical scenarios must be considered? How can they be grouped into hypotheses?
For algorithms there are quite a few scenarios to consider. One may, for instance, consider a table where the rows are important classes of problem instances, the columns are different prioritisations between performance metrics, and each table entry rates how well the algorithm fares for a particular class and metric. Each scenario is then a different way of filling out the table. Because there are so many scenarios, it is often useful to group them into those that reflect a scientifically interesting result (the research hypothesis) and the rest (the null hypothesis).
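As a toy illustration (the instance classes, metrics and rating scale are all made up), such a scenario table can be enumerated explicitly, which also shows why grouping scenarios into hypotheses is necessary: even a tiny 3-by-3 table with three rating levels already yields thousands of distinct scenarios.

```python
from itertools import product

# Hypothetical setup: instance classes (rows), metrics (columns), ratings.
instance_classes = ["sparse graphs", "dense graphs", "near-planar graphs"]
metrics = ["runtime", "memory", "solution quality"]
ratings = ["poor", "okay", "good"]

# A scenario is one way of filling every (class, metric) cell with a rating.
cells = list(product(instance_classes, metrics))
all_scenarios = [dict(zip(cells, r))
                 for r in product(ratings, repeat=len(cells))]

# Group scenarios into a research hypothesis (here: the algorithm is good
# at runtime on sparse graphs) and the null hypothesis (everything else).
research = [s for s in all_scenarios
            if s[("sparse graphs", "runtime")] == "good"]
null = [s for s in all_scenarios
        if s[("sparse graphs", "runtime")] != "good"]

print(len(all_scenarios))  # 3^9 = 19683 scenarios even in this tiny setup
```

The observations then have to be chosen so that they rule out as many of these scenarios as possible, which is the subject of the methodology question below.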
Which observations are expected in each scenario?
The expectations for each scenario formalise how one can assess whether a scenario table is an apt description of the algorithm.
Which observations can reliably distinguish the considered scenarios?
The study design is about figuring out which empirical and theoretical results need to be collected and derived in order to rule out most of the scenarios. The goal is to be as conclusive as possible.
How can similar observations be replicated?
How to obtain similar empirical results for other representative problem instances is rarely discussed, even though replicability is key to any generalisation claims. While a particular theoretical result is replicable due to its proofs, broader claims may require obtaining similar results under slightly deviating assumptions and models. Otherwise, the theoretical results can be an artifact of the chosen computation model and the presumed performance metrics, which is still of interest but may not support any broader claims.
What are the observations and is it possible to repeat them?
In computer science it is typically expected that experiments are repeatable. Repeatability is often required because it keeps researchers honest and aids transparency.
Which considered scenarios can be ruled out?
A discussion of the results should then narrow down which scenarios remain plausible given the observations. While it is difficult to publish in computer science when admitting to inconclusive results, inconclusiveness is the typical reality for most distribution-dependent algorithms.

Why science

As a concluding remark, here are some obvious reasons why computer science should aim to live up to its name: