love, play & inquiry (trochee) wrote,

Greater data science, part 2: data science for scientists

This is part of an open-ended series of marginalia to Donoho’s 2015 paper, 50 Years of Data Science.

Many aspects of Donoho’s 2015 “greater data science” can support scientists of other stripes, and not just because “data scientist is like food cook”: if data science is a thing after all, then it has specific expertise that applies to shared problems across domains. I have been thinking a lot about how the outsider-ish nature of “data science” lets it provide supporting analysis in a specific domain-tied (“wet-lab”) science.

This is not to dismiss the data science that’s already happening in the wet-lab — but to acknowledge that the expertise of the data scientist is often complementary to the domain expertise of her wet-lab colleague.

Here I lay out three classes of skills that I’ve seen in “data scientists” (more rarely, but still sometimes, in software engineers, or in target-domain experts: these people might be called the “accidental data scientists”, if it’s not circular).

“Direct” data science

Donoho 2015 includes six divisions of “greater data science”:

The activities of Greater Data Science are classified into 6 divisions:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science

These divisions are all opportunities to help out “other” sciences:

  • methodological review on data collection and transformation
  • representational review ensuring that — where possible — the best standards for data representation are available; this is a sort of future-proofing and also feeds into cross-methodological analyses (below)
  • statistical methods review on core and peripheral models and analyses
  • visualization and presentation design and review, to support exploration of input data and post-analysis data
  • cross-methodological analyses are much easier to adapt when data representations and transformations conform to agreed-upon standards

Coping with “big” data

  • adaptation of methods for large-scale data cross-cuts most of the above — understanding how to adapt analytic methods to “embarrassingly parallel” architectures
  • refusing to adapt methods for large-scale data when, for example, the data really aren’t as large as all that. Remember, many analyses can be run on a single machine with a few thousand dollars’ worth of RAM and disk, rather than requiring a compute cluster at orders of magnitude more expense. (Of course, projects like Apache Beam aim to bake in the ability to scale down, but this is by no means mature.)
  • pipeline audit capacity — visualization and other insight into data at intermediate stages of processing is more important the larger the scale of the data
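The pipeline-audit point can be sketched in a few lines. This is only an illustration under my own assumptions, not a recommendation of any particular tool: the stage names, the `value` field, and the helpers `summarize` and `run_pipeline` are all hypothetical. The idea is simply that every intermediate stage leaves behind a small summary that a human can look at.

```python
from statistics import mean


def summarize(stage, rows):
    """Return a small audit record for the rows leaving one pipeline stage."""
    values = [r["value"] for r in rows if r["value"] is not None]
    return {
        "stage": stage,
        "rows": len(rows),
        "missing": len(rows) - len(values),
        "mean": mean(values) if values else None,
    }


def run_pipeline(rows, stages):
    """Apply each (name, function) stage in order, auditing after every step."""
    audit = [summarize("input", rows)]
    for name, fn in stages:
        rows = fn(rows)
        audit.append(summarize(name, rows))
    return rows, audit


# Hypothetical stages, standing in for real cleaning steps.
def drop_missing(rows):
    return [r for r in rows if r["value"] is not None]


def clip_negative(rows):
    return [{**r, "value": max(r["value"], 0.0)} for r in rows]


raw = [
    {"id": 1, "value": 2.0},
    {"id": 2, "value": None},
    {"id": 3, "value": -4.0},
]
cleaned, audit = run_pipeline(
    raw, [("drop_missing", drop_missing), ("clip_negative", clip_negative)]
)
for record in audit:
    print(record)
```

At larger scale the same summaries would probably be computed per shard and merged, but the habit is the same: if a stage silently drops half the rows or shifts the mean, the audit trail shows where.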

Scientific honesty and client relationships

Data scientists are uniquely well positioned to improve the human quality of the “wet-lab” research they support. By focusing on the data science in particular, they can:

  • identify publication bias, or other temptations like p-hacking, even if inadvertent (these may also be part of the statistical methods review above)
  • support good-faith re-analysis when mistakes are discovered in the upstream data, the pipelines or supporting packages: if you’re doing all the software work above, re-running should be easy
  • act as a “subjects’ ombuds[wo]man” by considering (e.g.) the privacy and reward trade-offs in the analytics workflow and the risks of data leakage
  • facilitate communication within and between labs
  • find ways to automate the boring and mechanical parts of the data pipeline process
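One concrete guard against inadvertent p-hacking is to correct for multiple comparisons whenever a lab tests many hypotheses against the same data. Below is a minimal sketch of the Benjamini-Hochberg procedure, which is my choice of example and not something the list above prescribes; the p-values are made up for illustration and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value is <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Made-up p-values from five hypothetical tests on one dataset.
p_values = [0.001, 0.04, 0.03, 0.2, 0.5]
naive = [p <= 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values)
print("naive:    ", naive)
print("corrected:", corrected)
```

In this toy case three results clear the naive 0.05 threshold but only one survives the correction, which is the kind of gap a statistical methods review is there to surface.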

Mirrored from Trochaisms.
