Caleb

Stats Ph.D.

Class note: Stat 950 machine learning-theory and application (keep updating)

Caleb Jin / 2020-01-27


This is my class note of STAT 950 I took in spring 2020: machine learning - theory and application, taught by Dr. Weixing Song.

Lecture 1: Introduction

Machine Learning

Statistical Learning

Statistical Learning is a discipline of building mathematical, probability and statistical models based on data, using computational tools provided by computer science, to study data structures and make predictions.

In general, statistical (machine) learning consists of

  • Supervised learning: Developing predictive model based on both input and output data \((x_1,y_i)\)

    • Typical Examples: Regression, Classification, etc.
  • Unsupervised learning: Analyzing data structures (groups) based on input data \(x_i\)’s only. No output data are available.

    • Typical Examples: Clustering, Dimension Reduction, etc.
  • Reinforcement learning: The reinforcement learning method aims at using observations gathered from the interaction with the environment to take actions that would maximize the reward or minimize the risk. Reinforcement learning algorithm (called the agent) continuously learns from the environment in an iterative fashion. In the process, the agent learns from its experiences of the environment until it explores the full range of possible states. (Reinforcement Markov Process)

    • Typical Examples: Computer played board games (Chess, Go), robotic hands, and self-driving cars.

Three Basic Elements in Statistical Learning

All statistical learning methods consists of three components: 1. Model; 2. Strategy; 3. Algorithm.

Model

The first question in statistical learning is to decide which model we want to learn from data. We have to decide what hypothesis space to consider.

  • Decision space (Nonprobability Model):

\[\mathscr{F}=\{f: y = f(x)\}, \text{ or } \mathscr{F} = \{f: y = f_{\theta}(x), \theta\in \Theta\}.\]

  • Conditional probability space (Probability Model):

\[\mathscr{F} = \{P:P(y|x)\}, \text{ or } \mathscr{F} = \{P:P_{\theta}(y|x), \theta\in \Theta\}.\]

Strategy

The second question in statistical learning is to find a criterion to learn model.

  1. Loss/Cost Function: A nonnegative function measuring the loss or cost of wrong prediction.
  • 0 - 1 loss function:

\[\begin{equation} L(y,f(x))=\begin{cases} 1, & y \neq f(x) \\ 0, & y = f(x) \end{cases} \end{equation}\]

  • Square loss function:

\[L(y,f(x))=(y-f(x))^2.\]

  • Absolute loss function:

\[L(y,f(x))=|y-f(x)|.\]

  • Logarithmic loss function/log-likelihood loss function:

\[L(y,P(y|x)) = - \log P(y|x).\]

  1. Risk Function \(R_{exp}\) (Expected Loss)

Let \(P(x,y)\) be joint distribution of \((x,y)\). The risk function is defined as

\[R_{exp}(f)=E_P[L(y,f(x))]=\int_{\mathscr{X\times Y}}L(y,f(x))P(x,y)dxdy.\] e.g.  \[\begin{eqnarray*} E(y-f(x))^2 &=& \int\int(y-f(x))^2g(x,y)dxdy\\ &=& \int\int(y-f(x))^2g(x|y)dyh(x)dx\\ &=& \int\int[y-E(y|x)+E(y|x)-f(x)]^2g(x|y)dyh(x)dx\\ \end{eqnarray*}\] If \(f(x) = E(y|x)\), then \(R_{exp}\) is minimized.

  1. Empirical Risk \(R_{emp}\) (Empirical Loss) used in application

Let \(\mathscr{T}=\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}\) be the data set.

The Empirical risk is defined as

\[\begin{eqnarray} R_{emp}(f) = \frac{1}{n}\sum_{i=1}^{n}L(y_i,f(x_i)). \tag{1} \end{eqnarray}\]

  1. Learning Strategy

    • Empirical Risk Minization (ERM): choose \(f\in\mathscr{F}\) to minimize Eq. (1)

    • Structureal Rist Minization (SRM): choose \(f\in\mathscr{F}\) to minimize \[R_{emp}(f) = \frac{1}{n}\sum_{i=1}^{n}L(y_i,f(x_i)) + \lambda J(f),\]

    where
    \(J(f)\): complexity measurement of the model \(f\). More complexity of \(f\) means larger value of \(J(f)\).
    \(\lambda\): Tuning parameter balancing the trade-off between the empirical risk and the model complexity.

Note: More complexity of \(f\), smaller value of loss function.

Algorithm

Algorithm: How to solve the optimization question (including the analytic solution and the numerical solution as well as the computing platform, R or Python?)

Confusion Matrix

  • Accuracy: \[\frac{TP+TN}{TP+FP+FN+TN}\]

  • Precision:

\[P=\frac{TP}{TP+FP}\]

  • Recall (Sensitivity): \[R=\frac{TP}{TP+FN}\]

  • \(F_1\) score: \[\text{hamonic average: }F_1 = \frac{2}{1/P+1/R} \]

Generalization Error Bound

The generalization ability of a learning model measures its predictability of future data.

Generalization Error:

For the learned mode \(\hat f\), the generalization error is defined as

\[R_{exp}(\hat f) = E_PL(y,\hat f(x))=\int_{\mathscr{X\times Y}}L(y,\hat f(x))P(x,y)dxdy,\] where \(x\) represents the new data.

Generalization Error Bound for Binary Classification

Theorem 1 (Generalization Error Bound for Binary Classification) Consider a binary classification question,

  • Training data set: \(\mathscr{T}=\{(x_1,y_1),\ldots,(x_n,y_n)\}\).
  • \(x\in\mathbb{R}^n, y\in \{-1,1\}\) and \((x,y)\sim P(x,y)\).
  • Hypothesis space \(\mathscr{F}=\{f_1,f_2,\dots,f_d\}\).
  • Loss function: 0-1 loss.

For any \(f\in\mathscr{F}\), with at least probability \(1-\delta,0<\delta<1,\) the following inequality holds

\[R(f)\leq \hat R(f)+\varepsilon(d,n,\delta),\] where \(R(f)=EL(y,f(x))\),

\[\hat R(f)=\frac{1}{n}\sum_{i=1}^n[Y_i-f(x_i)]^2,\] \[\varepsilon(d,n,\delta)=\sqrt{\frac{1}{2n}\left(\log d+\log \frac{1}{\delta}\right)}\]

Proof. To prove Theorem 1 is equivalent to prove \(P[\forall f\in \mathscr{F}, R(f)\leq \hat R(f)+\varepsilon(d,n,\delta)] \geq 1-\delta\).

For a given \(f\in\mathscr{F}, \hat R(f) = \frac{1}{n}\sum_{i=1}^{n} L(y_i;f(x_i), R(f)=E[\hat R(f)]\), by Hoeffding Inequality in Theorem 2: \[\begin{eqnarray} P[R(f) - \hat R(f)\geq \varepsilon]\leq \exp\left(-\frac{2n^2\epsilon^2}{n}\right) = \exp(-2n\varepsilon^2) \tag{2} \end{eqnarray}\] (Note: the denominator is \(n\) is due to \(X_i = L(y_i;f(x_i))\in\{0,1\}\), \(0\leq a_i\leq X_i\leq b_i=1\)).

From Eq. (2), we have \[\begin{eqnarray} P[\exists f\in \mathscr{F}: R(f) - \hat R(f)\geq \varepsilon] = P[\cup_{f\in \mathscr{F}} \{R(f) - \hat R(f)\geq \varepsilon\}]\leq \sum_{f\in \mathscr{F}}P[R(f) - \hat R(f)\geq \varepsilon]\leq d\exp(-2n\varepsilon^2). \end{eqnarray}\] Thus,

\[\begin{eqnarray} &&1-P[\exists f\in \mathscr{F}: R(f) - \hat R(f)\geq \varepsilon] \\ &=& P[\forall f\in \mathscr{F}: R(f) - \hat R(f)\geq \varepsilon]\\ &\geq& 1-d\exp(-2n\varepsilon^2) = 1 - \delta, \end{eqnarray}\] where \(\delta = d\exp(-2n\varepsilon^2)\). So \(\varepsilon = \sqrt{\frac{1}{2n}(\log d + \log \frac{1}{\delta})}\).
Theorem 2 (Hoeffding Inequality) Let \(X_1,\ldots,X_n\) be independent random variables with \(X_i\in[a_i,b_i], i = 1,2,\ldots,n\). \(\bar X\) is the sample mean. Then for any \(t>0\), we have \[\begin{eqnarray} P(\bar X - E(\bar X)\geq t) \leq \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right),\\ P(E(\bar X) - \bar X \geq t) \leq \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right) \end{eqnarray}\]