
[CS229] Lecture 6 Notes - Support Vector Machines I

  • Introduce Support Vector Machines (SVM)
  • Created on 02/27/2019
  • Updated on 03/04/2019
  • Updated on 03/05/2019

Support Vector Machine (SVM) learning algorithms are among the best “off-the-shelf” supervised learning algorithms.

1. Margins: Intuition

(03/04/2019)

The intuition is to find a decision boundary that lets us make correct and confident predictions (meaning predictions far from the decision boundary) on all the training examples.

2. Change of Notation (from logistic regression)

  1. In SVM discussion, we use $y \in \{-1, 1\}$ instead of $\{0, 1\}$ to denote the class labels.
  2. Rather than parameterizing the linear classifier with $\theta$, we use $w, b$ as parameters, and our classifier is written as:

$$h_{w,b}(x) = g(w^Tx + b),$$

where $g(z) = 1$ if $z \ge 0$, and $g(z) = -1$ otherwise.
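To make the change of notation concrete, here is a minimal sketch (not from the original notes) of this classifier in NumPy; the names `g` and `h` mirror the symbols above, and the example weights are an arbitrary illustration:

```python
import numpy as np

def g(z):
    """Sign-style activation: +1 for z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

def h(w, b, x):
    """Linear classifier h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

# Toy example: w = [1, -1], b = 0 classifies by the sign of x1 - x2.
w = np.array([1.0, -1.0])
b = 0.0
print(h(w, b, np.array([2.0, 1.0])))   # +1
print(h(w, b, np.array([0.0, 3.0])))   # -1
```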

3. Functional Margins / Geometric Margins

3.1. Functional Margins

Definition of the functional margin of $(w,b)$ w.r.t. a training example $(x^{(i)}, y^{(i)})$:

$$\hat{\gamma}^{(i)} = y^{(i)}(w^Tx^{(i)} + b)$$

  • where $y^{(i)} \in \{-1, 1\}$;
  • if $y^{(i)}(w^Tx+b) > 0$, then our prediction on this example is correct!

Given a training set $S = \{(x^{(i)}, y^{(i)}); i = 1, \ldots, m\}$, we also define the functional margin of $(w,b)$ w.r.t. $S$ as the smallest of the functional margins of the individual training examples. Denoted by $\hat{\gamma}$:

$$\hat{\gamma} = \min_{i=1,\ldots,m} \hat{\gamma}^{(i)}$$
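The definitions above can be sketched numerically (the toy dataset and parameter values below are assumptions for illustration, not from the notes):

```python
import numpy as np

def functional_margin(w, b, X, y):
    """Per-example functional margins hat{gamma}^(i) = y^(i) (w^T x^(i) + b)."""
    return y * (X @ w + b)

# Toy set: positive margins mean every example is classified correctly.
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, -1])
w, b = np.array([1.0, -1.0]), 0.0

margins = functional_margin(w, b, X, y)
print(margins.min())   # the functional margin of (w, b) w.r.t. the whole set
```

Note that rescaling $(w, b)$ to $(cw, cb)$ for $c > 0$ rescales every functional margin by $c$ without moving the decision boundary, which is why the scale-invariant geometric margin is needed.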

3.2. Geometric Margins

(03/04/2019)

Consider the figure below:

Figure 1. Geometric Margin.

The distance from a training sample $x^{(i)}$ to the decision boundary ($w^Tx+b=0$) is denoted by $\gamma^{(i)}$; its value is given by the length of segment “AB” in Figure 1.

Derivation of geometric margins:

  • Point “B” is given by $x^{(i)} - \gamma^{(i)} \cdot w / \Vert w \Vert$, since $w / \Vert w \Vert$ is the unit vector normal to the decision boundary.
  • Point “B” is on the decision boundary $w^Tx+b=0$. Hence,

$$w^T\left(x^{(i)} - \gamma^{(i)} \frac{w}{\Vert w \Vert}\right) + b = 0$$

  • Solving for $\gamma^{(i)}$, we have:

$$\gamma^{(i)} = \frac{w^Tx^{(i)} + b}{\Vert w \Vert} = \left(\frac{w}{\Vert w \Vert}\right)^T x^{(i)} + \frac{b}{\Vert w \Vert}$$

  • In order to cover both “positive” and “negative” samples, we define the geometric margin of $(w,b)$ w.r.t. a training sample $(x^{(i)},y^{(i)})$ to be:

$$\gamma^{(i)} = y^{(i)}\left(\left(\frac{w}{\Vert w \Vert}\right)^T x^{(i)} + \frac{b}{\Vert w \Vert}\right)$$

  • Finally, given a training set $S = \{(x^{(i)}, y^{(i)}); i = 1, \ldots, m\}$, we also define the geometric margin of $(w,b)$ w.r.t. $S$ as the smallest of the geometric margins of the individual training examples:

$$\gamma = \min_{i=1,\ldots,m} \gamma^{(i)}$$
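As a quick numerical check of the derivation, the sketch below (toy dataset assumed for illustration) computes the geometric margin of a training set and confirms that, unlike the functional margin, it does not change when $(w, b)$ is rescaled:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """gamma = min_i y^(i)((w/||w||)^T x^(i) + b/||w||)."""
    norm = np.linalg.norm(w)
    return (y * (X @ w + b) / norm).min()

X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, -1])
w, b = np.array([1.0, -1.0]), 0.0

# The geometric margin is invariant to rescaling (w, b):
print(geometric_margin(w, b, X, y))            # sqrt(2)
print(geometric_margin(10 * w, 10 * b, X, y))  # same value
```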

4. The Optimal Margin Classifier

4.1. Intuition

Now, our aim is to find a decision boundary that maximizes the geometric margin, since this would reflect a very confident set of predictions on the training set and a good “fit” to the training data.

Specifically, this will result in a classifier that separates the positive and the negative training examples with a “gap” (the geometric margin).

4.2. Describe the aim in math:

  • Assume that we are given a training set that is linearly separable. Then, our aim is to find the hyperplane $(w,b)$ that maximizes the geometric margin. The optimization problem can be written as:

$$\max_{\gamma, w, b} \; \gamma \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \ge \gamma, \; i = 1, \ldots, m; \quad \Vert w \Vert = 1$$

  • If we could solve this optimization problem, we’d be done. However, the “$\Vert w\Vert=1$” constraint is nasty (non-convex), so we try to transform the problem into a nicer one:

$$\max_{\hat\gamma, w, b} \; \frac{\hat\gamma}{\Vert w \Vert} \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \ge \hat\gamma, \; i = 1, \ldots, m$$

  • This time, we’re going to maximize $\hat\gamma/\Vert w\Vert$, subject to the functional margins all being at least $\hat\gamma$. However, the objective function $\hat\gamma/\Vert w\Vert$ is still nasty (non-convex). Therefore, we have to continue to look for a better formulation.

  • The key idea we use here: we can impose an arbitrary scaling constraint on $w$ and $b$ without changing anything. Here, we introduce the scaling constraint that the functional margin of $(w,b)$ w.r.t. the training set must be 1:

$$\hat\gamma = 1$$

  • Plugging this into our problem above and noting that maximizing $\hat\gamma/\Vert w\Vert = 1/\Vert w\Vert$ is the same thing as minimizing $\Vert w\Vert^2$, we now have the following optimization problem:

$$\min_{w, b} \; \frac{1}{2}\Vert w \Vert^2 \quad \text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \ge 1, \; i = 1, \ldots, m$$

Now, the above is an optimization problem with a convex quadratic objective and only linear constraints. Its solution gives us the optimal margin classifier.
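As an illustration (a sketch on an assumed toy dataset, not the notes’ own method), this convex quadratic program can be handed to a general-purpose solver such as `scipy.optimize.minimize` with SLSQP, which accepts the linear inequality constraints directly:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training set (an assumption for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    """p = [w1, w2, b]; minimize (1/2)||w||^2."""
    w = p[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint per example: y^(i)(w^T x^(i) + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": (lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0)}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=constraints)
w_opt, b_opt = res.x[:2], res.x[2]

# Under the scaling hat{gamma} = 1, the geometric margin is 1/||w||.
print(w_opt, b_opt)
print(1.0 / np.linalg.norm(w_opt))
```

For this symmetric toy set the closest points are $(2,2)$ and $(-2,-2)$, so the maximum geometric margin is their distance to the separator $x_1 + x_2 = 0$, i.e. $2\sqrt{2}$.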

In the following, the aim is to find a good way to solve this “constrained optimization problem”.

5. Lagrange Duality

The method of Lagrange multipliers can be used to solve constrained optimization problems (like the problems shown in the last part).

Consider a problem of the following form:

$$\min_w \; f(w) \quad \text{s.t.} \quad h_i(w) = 0, \; i = 1, \ldots, l$$

To solve this constrained optimization problem, we define the Lagrangian:

$$\mathcal{L}(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$

Here, the $\beta_i$’s are called the Lagrange multipliers. Then we set $\mathcal{L}$’s partial derivatives to zero:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0,$$

and solve for $w$ and $\beta$, and we’d be done.
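A minimal worked instance of this recipe, assuming the toy problem $f(w) = w_1^2 + w_2^2$ subject to $w_1 + w_2 = 1$ (not from the notes): setting the partial derivatives of the Lagrangian to zero yields a linear system we can solve directly:

```python
import numpy as np

# Minimize f(w) = w1^2 + w2^2 subject to h(w) = w1 + w2 - 1 = 0.
# Lagrangian: L(w, beta) = w1^2 + w2^2 + beta * (w1 + w2 - 1).
# Setting the partial derivatives to zero gives a linear system:
#   dL/dw1   = 2*w1 + beta     = 0
#   dL/dw2   = 2*w2 + beta     = 0
#   dL/dbeta = w1 + w2 - 1     = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
w1, w2, beta = np.linalg.solve(A, rhs)
print(w1, w2, beta)  # 0.5 0.5 -1.0
```

As expected by symmetry, the minimizer splits the budget evenly: $w_1 = w_2 = 1/2$ with multiplier $\beta = -1$.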

In this section, we will generalize this to constrained optimization problems in which we may have inequality as well as equality constraints.

Consider the following, called the primal optimization problem:

$$\min_w \; f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \; i = 1, \ldots, k; \quad h_i(w) = 0, \; i = 1, \ldots, l$$

To solve it, we start by defining the generalized Lagrangian:

$$\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$

Here, the $\alpha_i$’s and $\beta_i$’s are the Lagrange multipliers. Consider the quantity:

$$\theta_\mathcal{P}(w) = \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$$

Here, the “$\mathcal{P}$” subscript stands for “primal”. Let some $w$ be given. If $w$ violates any of the primal constraints (i.e., if either $g_i(w) > 0$ or $h_i(w) \ne 0$ for some $i$), then you should be able to verify that:

$$\theta_\mathcal{P}(w) = \max_{\alpha, \beta : \alpha_i \ge 0} \left[ f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w) \right] = \infty$$

Conversely, if the constraints are indeed satisfied for a particular value of $w$, then $\theta_\mathcal{P}(w) = f(w)$. Hence,

$$\theta_\mathcal{P}(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise.} \end{cases}$$
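The case analysis above can be sketched for a single inequality constraint (the functions `f` and `g` below are toy assumptions): if $g(w) > 0$, taking $\alpha \to \infty$ drives $\mathcal{L}$ to $\infty$; otherwise the best choice is $\alpha = 0$, recovering $f(w)$:

```python
import numpy as np

def theta_P(f, g, w):
    """theta_P(w) = max_{alpha >= 0} [f(w) + alpha * g(w)] for a single
    inequality constraint g(w) <= 0.  If g(w) > 0 the max is +inf
    (alpha -> infinity); otherwise alpha = 0 is optimal, giving f(w)."""
    return np.inf if g(w) > 0 else f(w)

f = lambda w: w ** 2
g = lambda w: 1.0 - w        # encodes the constraint w >= 1

print(theta_P(f, g, 2.0))    # feasible: equals f(2) = 4.0
print(theta_P(f, g, 0.0))    # infeasible: inf
```

So minimizing $\theta_\mathcal{P}(w)$ over $w$ is the same problem as the constrained primal: infeasible points are penalized with $\infty$.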

Reference

  1. Stanford CS229 Lecture Notes, Part V: Support Vector Machines


