Thoughts on Teaching Statistical Computing

We need courses on development

There’s a lot of talk about statistics vs. machine learning, and a lot of people on Team Statistics grumble about “how machine learning is just statistics with better marketing.” Here’s evidence from a Reddit post:

[Image: stat-ml-meme]

It’s easy to dismiss machine learning as some sort of glorified/bastardized statistics, or an invasive species encroaching on statistics territory. But I think it’s time we took it seriously and thought about what makes machine learners look so effective and efficient at data science. Is it really just good marketing?

Development Tools

Statistics and machine learning, over the years, have become so vast and complex that we needed better tools. First, computers were real humans (mostly women back in the day). Then came Microsoft Excel. Now, everyone in the field is required to know computer programming to some extent. In all honesty, there are so many study groups and meetups that knowing how to use R or Python no longer counts as being an expert in data science.

There are two obvious ways the field can expand:

  1. Teaching the theory of statistical algorithms.
  2. Teaching how to make the tools used to analyze data.

By “expand” I mean putting together a series of courses that would prepare students to build expertise in the subject. I think a lot of statistics departments go the first route, but rarely the second. To go the second route, there need to be courses on developing simple packages (in R or Python, but preferably with a lower-level language like C++), on collaboration tools like Git, and on advanced programming skills like C++ templating or software design. Of course, building a sophisticated tool that tackles a complex problem requires a combination of skills that rarely live in a single human being, which is why commitment to open source and contributing to the open-source community benefits everyone. (Look at the people who led, contributed to, and supported the development of RStudio.) But that’s also a critical reason we need to be exposed to these things early on in college education.

As part of hedging, I acknowledge that I have zero experience putting together these courses and recruiting qualified people to lead the course offerings. So maybe there are reasons unknown to me that make this sort of thing difficult. Also, different universities have different foci, so this may not be their top priority. I simply hope to bring attention to the development side of things and wish the statistics community, especially in higher education, would invest more in it. After all, machine learners, being the computer scientists that they are, make better tools and that puts them at an advantage in this race.

Daeyoung Lim
Statistics PhD Candidate

My research interests include Bayesian statistics, biostatistics, and computational statistics. I’m an English grammar fiend and a staunch proponent of plain language.