Discussion questions for David Donoho’s article.
Donoho claims that “statisticians have already been pursuing daily” the tasks of “collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of … applications.”1
Consider the following two paraphrases2:
A data scientist is a professional who uses scientific methods to liberate and create meaning from raw data.
An applied statistician is a professional who uses methodology to make inferences from data.
Consider Broman’s quotation3:
When physicists do mathematics, they don’t say they’re doing number science. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics. … You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term ‘statistics’.
Consider these quotations4:
In those less-hyped times, the skills being touted today were unnecessary. Instead, scientists developed skills to solve the problem they were really interested in, using elegant mathematics and powerful quantitative programming environments modeled on that math.
The new skills attracting so much media attention are not skills for better solving the real problem of inference from data; they are coping skills for dealing with organizational artifacts of large-scale cluster computing.
The new skills cope with severe new constraints on algorithms posed by the multiprocessor/networked world. In this highly constrained world, the range of easily constructible algorithms shrinks dramatically compared to the single-processor model, so one inevitably tends to adopt inferential approaches which would have been considered rudimentary or even inappropriate in olden times.
The databases, software, and workflow management taught in a given Data Science Masters program are unlikely to be the same as those used by one specific employer.5
Tukey wrote that7:
Four major influences act on data analysis today: 1. The formal theories of statistics. 2. Accelerating developments in computers and display devices. 3. The challenge, in many fields, of more and ever larger bodies of data. 4. The emphasis on quantification in an ever wider variety of disciplines.
Consider this discussion of the Common Task Framework (CTF)8:
This combination leads directly to a total focus on optimization of empirical performance, which as Mark Liberman has pointed out, allows large numbers of researchers to compete at any given common task challenge, and allows for efficient and unemotional judging of challenge winners. It also leads immediately to applications in a real-world application.
Consider this anecdote about Irizarry9:
Rafael Irizarry gave a convincing example of exploratory data analysis of GWAS data, studying how the data row mean varied with the date on which each row was collected, [to] convince the field of gene expression analysis to face up to some data problems that were crippling their studies.
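To make the anecdote concrete for discussion, here is a minimal sketch of the kind of check it describes: average each data row, then compare those row means across collection dates. The function name `row_means_by_date` and the toy numbers are assumptions made up for illustration; they are not Irizarry's data or code.

```python
from collections import defaultdict
from statistics import mean


def row_means_by_date(rows, dates):
    """Average each row, then average those row means within each collection date."""
    by_date = defaultdict(list)
    for values, date in zip(rows, dates):
        by_date[date].append(mean(values))
    # Sort by date so that any drift over time is easy to read off.
    return {date: mean(row_means) for date, row_means in sorted(by_date.items())}


# Toy, made-up measurements (hypothetical; not real GWAS or expression data).
rows = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 2.0],
    [4.0, 5.0, 6.0],
    [5.0, 5.0, 5.0],
]
dates = ["2001-01-05", "2001-01-05", "2001-03-17", "2001-03-17"]

summary = row_means_by_date(rows, dates)
for date, value in summary.items():
    print(date, value)
```

If the row mean shifts with the collection date, as it does in this toy example, the date is acting as a batch variable, and that is worth investigating before any downstream inference.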
Donoho says of greater data science research (GDS)10:
it is not traditional research in the sense of mathematical statistics or even machine learning; it has proven to be very impactful on practicing data scientists;
This effort may have more impact on today’s practice of data analysis than many highly-regarded theoretical statistics papers.
Consider these statements about “Science about Data Science”:
This meta study demonstrates that by both accessing all previous data from a group of studies and trying all previous modeling approaches on all datasets one can obtain both a better result and a fuller understanding of the problems and shortcomings of actual data analyses.11
Instead of deriving optimal procedures under idealized assumptions within mathematical models, we will rigorously measure performance by empirical methods, based on the entire scientific literature or relevant subsets of it.12
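As a discussion aid, the idea of trying every modeling approach on every dataset and measuring performance empirically can be sketched as a small methods-by-datasets grid. Everything here is hypothetical: the two toy rules (`majority_rule`, `threshold_rule`), the dataset names, and the numbers are invented for illustration and do not come from the studies Donoho cites.

```python
from statistics import mean


def majority_rule(train):
    """Always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(sorted(set(labels)), key=labels.count)
    return lambda x: guess


def threshold_rule(train):
    """Predict 1 on the side of the class-mean midpoint where class 1 lies."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    cut = (m0 + m1) / 2
    sign = 1 if m1 >= m0 else -1
    return lambda x: 1 if sign * (x - cut) > 0 else 0


def accuracy(predict, test):
    """Fraction of held-out points classified correctly."""
    return mean(1.0 if predict(x) == y else 0.0 for x, y in test)


# Made-up datasets: name -> (training pairs, test pairs), each pair is (x, label).
datasets = {
    "A": ([(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)],
          [(0.5, 0), (4.5, 1), (5.5, 1), (1.5, 0)]),
    "B": ([(1.0, 0), (2.0, 0), (3.0, 0), (2.5, 1)],
          [(1.5, 0), (2.2, 0), (2.8, 0), (2.1, 1)]),
}
methods = {"majority": majority_rule, "threshold": threshold_rule}

# Evaluate every method on every dataset.
results = {
    (name, method): accuracy(fit(train), test)
    for name, (train, test) in datasets.items()
    for method, fit in methods.items()
}
for key, score in sorted(results.items()):
    print(key, score)
```

In this toy grid neither rule wins on both datasets, which is the sort of pattern a cross-study comparison is meant to expose.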
Donoho derides “fancy” methods through a few empirical comparisons:
The implicit point is again that effort devoted to fancy-seeming methods is misplaced compared to other, more important issues.13
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".