What’s this course about?

Getting from raw data to useful models.

Administrative

Computer setup: Use whatever programming language you wish. Recommended: Python, R. You are responsible for setting up your programming environment. It may be useful to use AWS, especially if you wish to experiment with a distributed computation environment (e.g., Spark) or GPUs.

Format: Course time will be a mix of lecture, discussion (usually in small groups), and hands-on time. Students should be prepared to work at a computer and to interact. Remote students should make sure they have a working microphone and camera.

Homework: Homework will be mostly projects. You will be expected to download data, ready it, and build models on it. You will turn in a report and, for one project, give a short presentation in class. Some projects may be more development-focused. Additionally, you must give feedback on the first few lectures: two items, which can be positive, negative, or one of each. What works for you? What did you find interesting? What was confusing? What was awkward? What was unclear?

Textbook: We will not follow a textbook closely, but there will be some readings from the following:

Abbreviation	Title	Authors	Link
ESL	Elements of Statistical Learning, 2nd ed	Hastie, Tibshirani, Friedman	here
MMD	Mining of Massive Datasets, 2nd ed	Leskovec, Rajaraman, Ullman	here

Modeling and software packages: There are many types of statistical models and many software packages to fit them. This course is not based around any one of them (though linear models, decision trees, and neural networks will often be used for discussion). I recommend working with whatever interests you and trying more than one.

About you

Please fill out a questionnaire

Data science principles

Always start simple, establish baselines.
Work iteratively: diagnose a weakness; come up with an improvement; implement it; test it; repeat.
Organize and automate the entire data science pipeline. Avoid one-off attempts to improve a model by, e.g., tweaking a hyperparameter.
More technology != better solution.
Big data != useful data.
Be careful with overfitting.
Don’t fool yourself: confirmation bias, statistical significance.
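
A minimal sketch of the first few principles, assuming Python and only the standard library. It is one possible illustration, not a prescribed course workflow. The data is synthetic, and every name here (make_data, fit_mean, fit_line, run_pipeline) is hypothetical. The whole pipeline is a single function, so it can be rerun after every change, and each model is scored on held-out data against a trivial baseline.

```python
import random
from statistics import mean


def make_data(n, seed=0):
    """Synthetic data (an assumption for illustration): y = 2x + 1 + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """Baseline: ignore x and always predict the training mean of y."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """One-variable least-squares line, fit in closed form."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def run_pipeline(n=200, seed=0):
    """Whole pipeline in one call: data, split, fit, held-out evaluation."""
    xs, ys = make_data(n, seed)
    split = int(0.8 * n)  # first 80% for training, the rest held out
    train = (xs[:split], ys[:split])
    test = (xs[split:], ys[split:])
    results = {}
    for name, fit in [("baseline_mean", fit_mean), ("linear", fit_line)]:
        model = fit(*train)
        results[name] = mse(model, *test)  # score on unseen data only
    return results


if __name__ == "__main__":
    for name, err in run_pipeline().items():
        print(f"{name}: held-out MSE = {err:.3f}")
```

Any more complex model can be added to the list inside run_pipeline and judged the same way: it earns its place only if it beats the baseline on the held-out data.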