ABOUT THIS COURSE
This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes:
- linear and polynomial regression, logistic regression and linear discriminant analysis;
- cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso);
- nonlinear models, splines and generalized additive models;
- tree-based methods, random forests and boosting;
- support-vector machines.
Some unsupervised learning methods are discussed:
- principal components and clustering (k-means and hierarchical).
This is not a math-heavy class, so we try to describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, giving tutorials from the ground up, and progressing with more detailed sessions that implement the techniques in each chapter.
The lectures cover all the material in An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013). The PDF of this book is available for free on the book website.
Registration (free): link
Statistical Learning versus Machine Learning
• Machine learning arose as a subfield of Artificial Intelligence.
• Statistical learning arose as a subfield of Statistics.
• There is much overlap — both fields focus on supervised and unsupervised problems:
- Machine learning has a greater emphasis on large scale applications and prediction accuracy.
- Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
• But the distinction has become more and more blurred, and there is a great deal of “cross-fertilization”.
• Machine learning has the upper hand in Marketing!
[Figure: comparison of methods in machine learning]
Unsupervised vs Supervised Learning
Supervised learning covers methods such as regression and classification. In that setting we observe both a set of features X1, X2, . . . , Xp for each object, as well as a response or outcome variable Y. The goal is then to predict Y using X1, X2, . . . , Xp.
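To make the supervised setup concrete, here is a minimal sketch of regression on simulated data. The course itself computes in R; the use of Python and scikit-learn here is an assumption for illustration only, not the course's own code.

```python
# Supervised learning: learn to predict a response Y from features X1..Xp.
# Sketch in Python/scikit-learn (an assumption; the course computes in R).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated data: 100 objects, p = 3 features, one response Y
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # fit f so that Y ~ f(X)
predictions = model.predict(X_test)               # predict Y for new objects
print(model.score(X_test, y_test))                # R^2 on held-out data
```

Holding out a test set, as above, previews the model-assessment ideas (cross-validation, the bootstrap) covered later in the syllabus.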
In unsupervised learning, we observe only the features X1, X2, . . . , Xp. We are not interested in prediction, because we do not have an associated response variable Y. The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? We discuss two methods: principal components analysis and clustering.
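The two unsupervised methods named above can be sketched on simulated data with no response Y. Again, Python and scikit-learn are assumptions made here for illustration; the course uses R.

```python
# Unsupervised learning: only features X, no response Y.
# PCA for visualization and k-means to discover subgroups,
# sketched with scikit-learn (an assumption; the course uses R).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two well-separated subgroups in 4 dimensions, 50 observations each
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])

# PCA: project onto the directions of greatest variance, for plotting
scores = PCA(n_components=2).fit_transform(X)

# k-means: partition the observations into k = 2 subgroups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(scores.shape, np.bincount(labels))
```

Plotting the first two principal component scores, colored by cluster label, is a common first look at whether the data contain subgroups.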