Data science is roughly split up into four areas:

  1. Programming
  2. Statistics
  3. Communication
  4. Domain Expertise

This document details how to get started with the first three. This is a very R-centric guide because I live an R-centric life. Free resources are strongly favored because I’m a graduate student.

Programming

Programming allows you to articulate yourself in the dimension of computational action. Serious programmers are taken seriously, so you should aim to become one.

Statistics

Statistics will help you quantify uncertainty. 90% of what you need to understand falls into the following four categories:

  1. Probability and Combinatorics
  2. The Law of Large Numbers and the Central Limit Theorem
  3. Regression
  4. The Bootstrap
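To make the last item concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean. It is written in Python with only the standard library rather than R. The function name `bootstrap_ci`, its parameters, and the data values are all made up for illustration. The idea is to resample the data with replacement many times, recompute the statistic each time, and read the interval off the sorted results.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`.

    Hypothetical helper for illustration; not taken from any library.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled statistics.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up measurements, purely for demonstration.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
low, high = bootstrap_ci(data)
print(f"sample mean = {statistics.mean(data):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The same recipe works for a median or a regression coefficient by swapping in a different `stat` function, which is a large part of why the bootstrap is worth learning early.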

Communication

People

The most valuable and important part of the data science profession is the people who make its community empathetic, conscientious, and delightful. Many great data science conversations take place on Twitter, and there’s a friendly and active discussion around the #rstats hashtag. The following are short descriptions of some awesome people in the community, plus links to their Twitter profiles, which are usually good portals to their personal websites, portfolios, and other examples of their work.

Tools

Once you’ve learned R or Python and a little bit about how to use the command line, you will be able to use some powerful tools for communicating your ideas.

  • R Markdown allows you to create documents, reports, websites, and web applications. This site was created with R Markdown.
  • Jupyter allows you to develop a data analysis interactively, much like R Markdown. It is especially popular in the Python community.
  • Git and GitHub allow you to share code and publish websites for free. Getting started with both can be a little difficult, but there are many fine tutorials available, including one that I wrote as a part of The Unix Workbench.

Online Courses

I spent two years working in the Johns Hopkins Data Science Lab developing courses in data science. There are two programs that I endorse: