Linear Classifiers

From CS 231N. A linear classifier works like this:

A neural network (a single linear layer) produces a score for each class, and a loss function measures how well these scores ‘agree’ with the class label (low loss means good agreement).
We minimize the loss with respect to the parameters of the score function.
If you have $W\mathbf{x} + b$, that is like evaluating $K$ classifiers separately, one for each class. Each classifier is a row of $W$.
- This is like template matching using the dot product, or ‘one template’ KNN - where there is one image per class, and the distance is the (negative) dot product, not the L2 or L1 distance.
- It is much faster than KNN since you do not have to compute distances to all the training set points.
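
A minimal NumPy sketch of the "$K$ classifiers, one per row" view. The sizes and random values are made up for illustration; nothing here comes from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                      # hypothetical sizes: 3 classes, 4 input features
W = rng.normal(size=(K, D))      # one row (template) per class
b = rng.normal(size=K)           # one bias per class
x = rng.normal(size=D)           # a single flattened input

# All K class scores in one matrix-vector product.
scores = W @ x + b

# The same scores computed as K separate dot products, one per row of W.
scores_per_row = np.array([W[k] @ x + b[k] for k in range(K)])

# Prediction is the class whose template gives the highest score.
predicted_class = int(np.argmax(scores))
```

Prediction costs one matrix-vector product regardless of training set size, which is why it is cheaper than KNN at test time.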
The sign of each weight indicates the direction of the relationship for that pixel - for example, a positive weight on the blue channel means the class wants that pixel to be blue.
Each row of $W$ (with its bias) defines a hyperplane: the row is the normal vector, pointing in the direction in which the class score increases, and points on the hyperplane have score 0.
- (Is the template in the same direction as the average image?)
If you have multiple layers, maybe earlier layers detect specific cars (e.g. green, blue) and the NN computes a weighted sum of the individual car detectors.
Zero-mean centering is more important than scaling (why?).

SVM

Wants the correct class to have a bigger score than all the incorrect classes by some margin $\Delta$: $L_i = \sum_{j\neq y_i} \max (0, s_j - s_{y_i} + \Delta)$
There is a loss term for every score other than that of the correct class $y_i$, and each term is how much that score exceeds the correct class’s score minus $\Delta$.
If the score for an incorrect class is lower than the correct class’s score by at least $\Delta$, that term contributes no loss!
There is also the quadratic (squared) hinge loss, where each term in the sum is squared.
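
The per-example hinge loss can be sketched as follows. The function name `svm_loss_i` and the example scores are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def svm_loss_i(scores, y, delta=1.0, squared=False):
    """Multiclass hinge loss for one example.

    scores: 1-D array of class scores s_j
    y:      index of the correct class y_i
    """
    # max(0, s_j - s_{y_i} + delta) for every class j
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0              # the sum skips j == y_i
    if squared:
        margins = margins ** 2    # quadratic hinge variant
    return float(margins.sum())

scores = np.array([3.2, 5.1, -1.7])   # hypothetical scores for 3 classes
loss_wrong = svm_loss_i(scores, y=0)  # class 1 beats the correct class: about 2.9
loss_right = svm_loss_i(scores, y=1)  # every margin is satisfied: 0.0
```

With `y=1` the correct score beats both others by more than $\Delta = 1$, so every term is clipped to zero.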

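Weight magnitude is underdetermined: if $W$ classifies all examples correctly (zero loss), any scaled-up version of $W$ does too.
Resolve this by penalizing the (squared) L2 norm of the weights, e.g. $R(W) = \sum_k \sum_l W_{k,l}^2$
$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$
Don’t regularize the biases.
Regularization improves generalization by spreading the weight across all inputs instead of relying heavily on a few.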
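
A sketch of the full objective, assuming the squared-L2 penalty $R(W) = \sum W^2$ and a tiny made-up dataset. It also shows the scale ambiguity: doubling the weights leaves the data loss at zero, and only the regularization term tells the two solutions apart.

```python
import numpy as np

def svm_total_loss(W, b, X, y, lam, delta=1.0):
    """Return (data_loss, reg_loss) for a batch; L = data_loss + reg_loss."""
    N = X.shape[0]
    scores = X @ W.T + b                          # shape (N, K)
    correct = scores[np.arange(N), y][:, None]    # correct-class score per row
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0                # skip j == y_i
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W ** 2)               # biases are not regularized
    return float(data_loss), float(reg_loss)

# Hypothetical toy problem: two one-hot inputs, two classes.
X = np.eye(2)
y = np.array([0, 1])
W = 2.0 * np.eye(2)
b = np.zeros(2)

data_1, reg_1 = svm_total_loss(W, b, X, y, lam=0.1)
data_2, reg_2 = svm_total_loss(2.0 * W, b, X, y, lam=0.1)
# Both data losses are 0.0, but reg_2 is larger than reg_1.
```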

We can set $\Delta$ to $1.0$ and only tune $\lambda$: scaling the weights scales all the score differences, so $\Delta$ and $\lambda$ control the same tradeoff.

Softmax

$L_i = - \log \frac{e^{s_{y_i}}}{\sum_k e^{s_k}} = - s_{y_i} + \log \sum_k e^{s_k}$

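Scores are interpreted as unnormalized log probabilities.
For numerical stability, shift the scores by subtracting the max score before exponentiating; this does not change the loss.
L2 regularization corresponds to a Gaussian prior on the weight matrix (MAP estimation rather than maximum likelihood).
Large $\lambda$ means smaller weights and therefore more diffuse probabilities.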
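
A small sketch of the softmax loss with the max shift, plus an illustration of why smaller scores give more diffuse probabilities. The function names and numbers are assumptions for illustration; halving the scores stands in for the effect of stronger regularization shrinking $W$.

```python
import numpy as np

def softmax_probs(scores):
    shifted = scores - np.max(scores)   # largest entry becomes 0, so exp cannot overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

def softmax_loss_i(scores, y):
    # L_i = -s_{y_i} + log sum_k exp(s_k), evaluated on the shifted scores
    shifted = scores - np.max(scores)
    return float(-shifted[y] + np.log(np.sum(np.exp(shifted))))

# Scores this large would overflow np.exp without the shift.
big_scores = np.array([1000.0, 1001.0, 999.0])
big_probs = softmax_probs(big_scores)
big_loss = softmax_loss_i(big_scores, y=1)

# Shrinking the scores makes the distribution flatter.
scores = np.array([1.0, -2.0, 0.0])
peaked = softmax_probs(scores)
diffuse = softmax_probs(0.5 * scores)
```

The shift is safe because adding a constant to every score multiplies numerator and denominator by the same factor.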
The performance difference between SVM and softmax is usually small.
SVM is more local - it stops trying once the scores are good enough, and only cares about scores near the margin.
A car classifier shouldn’t focus on lowering the probability of classifying ducks as cars even further, it should focus on distinguishing between cars and trucks.
This can be an argument in favor of SVM.
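
The locality point can be illustrated with a short sketch (made-up scores, hypothetical helper names): once every margin is satisfied the hinge loss is exactly zero and stays there, while the softmax loss still decreases, very slightly, when an already-low score is pushed lower.

```python
import numpy as np

def hinge(scores, y, delta=1.0):
    m = np.maximum(0.0, scores - scores[y] + delta)
    m[y] = 0.0
    return float(m.sum())

def cross_entropy(scores, y):
    s = scores - scores.max()           # shift for numerical stability
    return float(-s[y] + np.log(np.exp(s).sum()))

# Correct class is 0 in both cases; only the score of class 1 differs.
scores_a = np.array([10.0, -2.0, 3.0])
scores_b = np.array([10.0, -100.0, 3.0])   # class 1 pushed much lower

hinge_a, hinge_b = hinge(scores_a, 0), hinge(scores_b, 0)
ce_a, ce_b = cross_entropy(scores_a, 0), cross_entropy(scores_b, 0)
# hinge_a == hinge_b == 0.0, while ce_b is slightly smaller than ce_a.
```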