Data science is roughly split up into four areas:
- Programming
- Statistics
- Communication
- Domain Expertise
This document details how to get started with the first three. This is a very R-centric guide because I live an R-centric life. Free resources are strongly favored because I’m a graduate student.
Programming
Programming allows you to articulate yourself in the dimension of computational action. Serious programmers are taken seriously, so you should aim to become one.
- Learn Python the Hard Way by Zed Shaw. Thousands of people have used this book to learn how to code for the first time. It teaches many fundamental programming concepts, and Python is an essential language for anyone who writes code.
- The Unix Workbench by me. A book I wrote for beginners about how to use the command line, another baseline skill for a data scientist. Read to the end of the third chapter before starting Learn Python the Hard Way, then come back if you feel like you want more command line skills.
- R Programming for Data Science by Roger Peng. Learn R deeply to really start your journey towards specializing as a data scientist.
- R for Data Science by Hadley Wickham and Garrett Grolemund. Modern data science paradigms applied in R, based on the philosophy of Tidy Data.
Statistics
Statistics will help you quantify uncertainty. 90% of what you need to understand falls into the following four categories:
- Probability and Combinatorics
- The Law of Large Numbers and the Central Limit Theorem
- Regression
- The Bootstrap
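To make the last item concrete, here is a minimal sketch of a percentile bootstrap for the mean. It is written in Python (the first language recommended above) using only the standard library; the same idea carries over to R directly. The function name `bootstrap_mean_ci`, its parameters, and the simulated sample are all hypothetical choices for illustration, not something taken from the resources listed here.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `data`.

    Hypothetical helper for illustration: resample the data with
    replacement many times, compute the mean of each resample, and read
    the interval off the sorted resampled means.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # sample n values with replacement
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Assumed example data: 200 draws from a skewed (exponential) distribution.
data_rng = random.Random(0)
sample = [data_rng.expovariate(1.0) for _ in range(200)]

sample_mean = statistics.fmean(sample)
low, high = bootstrap_mean_ci(sample)
print(f"sample mean: {sample_mean:.3f}")
print(f"95% bootstrap interval: ({low:.3f}, {high:.3f})")
```

The width of the interval is one way to quantify uncertainty about the mean without assuming the data are normally distributed. Rerunning the sketch with a larger sample should give a narrower interval, which is the Law of Large Numbers showing up in practice.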
Communication
People
The most valuable and important part of the data science profession is the folks who make the community empathetic, conscientious, and delightful. Many great data science conversations take place on Twitter, and there’s a friendly and active discussion around the #rstats hashtag. The following are short descriptions of some awesome people in the community, plus links to their Twitter profiles, which are usually good portals to their personal websites, portfolios, and other examples of their work.
- Renee Teate is a pioneer in documenting her transition into a data science career and in helping others who want to enter the field. She is the creator of Data Sci Guide and the host of the Becoming a Data Scientist Podcast.
- Julia Silge is a data scientist at Stack Overflow. Julia transitioned into data science after working in educational technology and earning her PhD in Astronomy. She and David Robinson are the authors of Text Mining with R.
- Hadley Wickham is the chief scientist at RStudio. Hadley creates some of the most widely used tools for data scientists. He is also an advocate for newcomers to R.
- Jesse Maegan is a data scientist and the creator of the “R for Data Science” online learning community.
- Alice Daish is a data scientist at The British Museum, which sounds like the absolute coolest job in the world. She’s one of the leaders of R-Ladies Global.
- Mara Averick is a data scientist and a one-stop shop for the latest technologies and trends that actually affect folks’ data science workflows. Her writings and tweets could be made into a history of how data science tools have evolved.
- Roger Peng is a professor of biostatistics, author of a myriad of data science books and courses, co-founder of the Johns Hopkins Data Science Lab, and co-host of the podcasts Not So Standard Deviations and The Effort Report.
- Hilary Parker is a data scientist at Stitch Fix and the co-host (with Roger) of Not So Standard Deviations. Hilary is one of the field’s leading thinkers in the development of data analyses.
- David Robinson is a data scientist at Stack Overflow, an author, and an accidental presidential historian.
- Amelia McNamara is a professor of statistical and data sciences at Smith College. Amelia has an inspiring vision for the future of statistical programming. She also builds incredible explorable explanations.
- Stefanie Butland is a bioinformagician, expert people-connector, and community manager of rOpenSci, an organization which develops software to enable data sharing and open science.
Online Courses
I spent two years working in the Johns Hopkins Data Science Lab developing courses in data science. There are two programs that I endorse:
- The Data Science Specialization. Nine courses in data science followed by a capstone project with an industry partner, all designed by the team at Johns Hopkins.
- DataCamp. Interactive R tutorials in your web browser. Designed by friends of the lab.