The Elements of Statistical Learning
Welcome to another edition of KDnuggets' The Free eBook. Over the past few weeks we have spotlighted a different freely available publication in the world of data science, machine learning, statistics, etc. As long as readers continue to enjoy them, we will continue to showcase them on a weekly basis.
This week we bring you The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The first edition of this seminal work in the field of statistical (and machine) learning was originally published nearly 20 years ago, and quickly cemented itself as one of the leading texts in the field. The elements of statistical learning have not remained static over the intervening years, however, and so a second edition of the book was published in 2009. It is this second edition we discuss today, specifically its 12th printing from 2017.
First off, why "statistical learning"? If you aren't aware of the term, or perhaps have only previously heard it used in this book's title, fret not. This is not some distinct domain of study far removed from what you are currently learning or interested in. The quote below from the book's website can help put the term in perspective (emphasis added):
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics.
This book is about learning from data. In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature measurements for a set of objects (such as people). Using this data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects. A good learner is one that accurately predicts such an outcome.
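To make that scenario concrete, here is a minimal sketch in Python with scikit-learn (the book itself does not prescribe any particular software); the synthetic data, the feature interpretation, and the choice of a logistic-regression learner are illustrative assumptions of mine, not examples from the text.

```python
# A training set of (features, outcome) pairs; fit a learner; predict for unseen objects.
# Everything below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # features, e.g. diet and clinical measurements
signal = X @ np.array([1.5, -2.0, 0.5])
y = (signal + rng.normal(size=200) > 0).astype(int)   # outcome: heart attack / no heart attack

X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.25, random_state=0)

learner = LogisticRegression().fit(X_train, y_train)  # build the prediction model ("learner")
print(f"accuracy on unseen objects: {learner.score(X_new, y_new):.2f}")
```

A good learner, in the book's sense, is simply one whose predictions for the held-out, unseen objects turn out to be accurate.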
The Elements of Statistical Learning is, quite literally, about the application of new statistical tools to the process of learning from data and building good learning models. If you are reading this article, or any article on KDnuggets, this is likely right up your alley.
Each chapter focuses on an important aspect of statistical learning. Model Assessment and Selection, for example, is deemed important enough a concept to be given its own chapter, which is both apt and refreshing. That this chapter appears early on, after only a few chapters on modeling techniques, is also worth noting: relegating it until after a long series of classification techniques might mean it is never reached by readers who feel they have already gotten everything they need out of the book once they have learned the algorithms, and that would be a real mistake.
All this is to say that the authors, who are also researchers and instructors, have a deliberate approach to conveying their expertise: a logical ordering of what readers should learn, and when. Individual chapters stand on their own as well, however, so picking up the book and heading straight to the chapter on model inference, for example, works perfectly well, as long as you already understand the material that precedes it.
To build up my theoretical background in these fields, I started reading The Elements of Statistical Learning (ESL), which I think is known as the bible of statistical learning. I find its contents manageable.
Well, the answer to that question is probably a matter of preference: it depends on whether you want to specialize in a specific subfield (e.g. reinforcement learning) or you're aiming at a more in-depth (but not limited to one subfield) view of machine learning.
The go-to bible for this data scientist and many others is The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Each of the authors is an expert in machine learning / prediction, and in some cases invented the techniques we turn to today to make sense of big data: ensemble learning methods, penalized regression, additive models and nonparametric smoothing, and much, much more.
In 2009, the second edition of the book added new chapters on random forests, ensemble learning, undirected graphical models, and high-dimensional problems. And now, thanks to an agreement between the authors and the publisher, a PDF version of the second edition is available for free download.
The aim of the module is to introduce key statistical techniques for learning from data, mostly within the framework of Bayesian statistics. The module will cover linear models for regression and classification as well as more advanced approaches including kernel methods, graphical models and approximate inference.
This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).
This is not a math-heavy class, so we try to describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, starting with tutorials from the ground up and progressing to more detailed sessions that implement the techniques in each chapter.
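As a taste of one item on that syllabus, here is a hedged sketch of model selection with the lasso via cross-validation. The course itself works in R; the same idea in Python with scikit-learn, on made-up data, looks roughly like this:

```python
# Choose the lasso penalty by 5-fold cross-validation on synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))                  # 20 candidate features
beta = np.zeros(20)
beta[:3] = [2.0, -1.0, 0.5]                     # only three features actually matter
y = X @ beta + rng.normal(scale=0.5, size=150)

model = LassoCV(cv=5).fit(X, y)                 # cross-validation picks the regularization strength
print("chosen penalty:", model.alpha_)
print("nonzero coefficients:", int(np.count_nonzero(model.coef_)))
```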
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.
The paper reviews and highlights distinctions between function-approximation (FA) and VC theory and methodology, mainly within the setting of regression problems and a squared-error loss function, and illustrates empirically the differences between the two when data are sparse and/or the input distribution is non-uniform. In FA theory, the goal is to estimate an unknown true dependency (or 'target' function) in regression problems, or the posterior probability P(y|x) in classification problems. In VC theory, the goal is to 'imitate' the unknown target function, in the sense of minimizing prediction risk or achieving good 'generalization'. That is, the result of VC learning depends on the (unknown) input distribution, while that of FA does not. This distinction is important because regularization theory, originally introduced under a clearly stated FA setting [Tikhonov, A. N. (1963). On solving ill-posed problem and method of regularization. Doklady Akademii Nauk USSR, 153, 501-504; Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of ill-posed problems. Washington, DC: W. H. Winston], has later been used under the risk-minimization or VC setting. More recently, several authors [Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1-50; Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. Springer; Poggio, T., & Smale, S. (2003). The mathematics of learning: Dealing with data. Notices of the AMS, 50 (5), 537-544] have applied constructive methodology based on the regularization framework to learning dependencies from data (under the VC-theoretical setting). However, such regularization-based learning is usually presented as a purely constructive methodology (with no clearly stated problem setting). This paper compares the FA/regularization and VC/risk-minimization methodologies in terms of their underlying theoretical assumptions. The control of model complexity, using regularization and using the concept of margin in SVMs, is contrasted in the FA and VC formulations.
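To ground the terminology, the Tikhonov-style regularization the abstract refers to amounts, in its simplest form, to adding a penalty on the size of the solution to the squared-error fit, with ridge regression as the standard example. The sketch below is my own illustration on synthetic data, not material from the paper: as the penalty lam grows, the fitted coefficients shrink, which is the "control of model complexity" being contrasted with the margin-based control used in SVMs.

```python
# Ridge regression as the simplest Tikhonov-type regularizer:
#   minimize  sum_i (y_i - x_i . w)^2 + lam * ||w||^2
# Closed-form solution: w = (X'X + lam * I)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ w_true + rng.normal(scale=1.0, size=50)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    w = ridge(X, y, lam)
    print(f"lam = {lam:6.1f}   ||w|| = {np.linalg.norm(w):.3f}")  # larger lam -> smaller, 'simpler' solution
```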