Wednesday, January 11, 2017

How a machine learns (Or: It's the derivative, stupid)


Before I start, let me just say how excited I am about this course I'm sitting in on at MIT this week. I'll try my best to relay the most interesting tidbits I learn in a future post. I'm starting to worry about how long these posts are taking me, though, so for now let me just provide a link to this frickin' awesome toy model of a self-driving car. Click the "Run Training" button, and in 10 seconds a neural net (using reinforcement learning) teaches the toy car to "drive." (BTW, the code runs in the browser--I mean, WOW.)

I haven't personally explored the reading material on the Resources tab there, but it all looks useful. Also: the Deep Learning book (Goodfellow, Bengio, Courville) mentioned on the Resources tab was independently recommended by mathematical little bro' Jon Bloom. And two independent recommendations imply it must be good. That's like a theorem, right?

(Umm....no 'tisn't. But I do trust JB's recs about this stuff, so I feel confident passing them on.)

OK, back to basics. Last time, I talked about a particular type of problem a machine learning algorithm might want to tackle:

One has a bunch of data points in some high-dimensional parameter space $\mathbb{R}^n$ (where the $n$ parameters are whatever some particular human has decided would be useful for determining the answer to some yes/no question like "Is a particular e-mail spam?"), and these data points have been "pre-classified." That is, each data point has already been assigned a "yes/no" label (e.g., a human has gone through the data and manually labeled each data point "Yes-spam" or "No-not spam"). Here's a picture of what this might look like in a 2-dimensional parameter space (Red = "Y", Pencil = "N"). Sorry this is so low-tech, but I'm just a mathematician:



Now suppose we (humans) guess that the data is linearly separable, that is, we guess that the decision boundary between the "Y" region and the "N" region of the parameter space is just a hyperplane determined by some orthogonal vector $\vec{a} \in \mathbb{R}^n$ and shift $b \in \mathbb{R}$. (A reasonable assumption in the silly example pic above.)

The canonical first example of a machine learning algorithm (which you may as well think of as a computer program) is one that will use

  1. our pre-labeled data and
  2. our assumption that the data is linearly separable
to determine the decision boundary. Note that our assumption that the decision boundary is a hyperplane tells us that the output of our machine learning algorithm should just be the parameters $\vec{a} \in \mathbb{R}^n$ and $b \in \mathbb{R}$ specifying that hyperplane.
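
To make this concrete, here's a minimal sketch in Python (NumPy) of the objects involved. The data points, labels, and candidate hyperplane are all made up purely for illustration:

```python
import numpy as np

# Toy pre-labeled data: four points in R^2 (so n = 2 parameters per point).
X = np.array([[0.5, 1.2],
              [2.0, 3.1],
              [1.5, 0.3],
              [3.2, 2.8]])
y = np.array([1, 1, 0, 0])  # 1 = "Y" (e.g., spam), 0 = "N" (not spam)

# A candidate decision boundary: the hyperplane of points x with a . x = b.
a = np.array([1.0, -1.0])  # orthogonal vector in R^n
b = 0.0                    # shift

# Which side of the hyperplane a point falls on gives its predicted label.
predictions = (X @ a - b > 0).astype(int)
```

The algorithm's whole job is to choose $\vec{a}$ and $b$ well; everything below is about how.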

OK, so what should the machine do?

Well...what would a human do?

Cost function

That's right! Start by guessing something, then keep trying to improve. Of course, this raises the question, "Improve what?" Without getting too philosophical about it, we need to define some cost function that takes as input

  1. our pre-labeled data and
  2. our current guess $\vec{a} \in \mathbb{R}^n$ and $b \in \mathbb{R}$

and spits out a number that we refer to as the cost associated to the guess $\vec{a}, b$. The cost function should be defined in a reasonable enough way that we really feel it's measuring how well we're doing. (High cost = bad.) It should also treat all data points equally. That is, it should be an average over individual cost functions for each data point. (It also helps if it's easy to compute its partial derivatives with respect to the parameters, but we're getting ahead of ourselves.)

Minor but important notational digression

Remember we have our space $\mathbb{R}^n$ that parameterizes our data (in our spam example, points $\vec{x}$ in this space represent e-mails). But we also have our space $\mathbb{R}^{n+1}$ that parameterizes the possible separating hyperplanes (points $(b, \vec{a})$ in this space represent shifted hyperplanes in the data-parameterizing space $\mathbb{R}^n$).

At this point, it is notationally convenient to imbed the data-parameterizing space $\mathcal{D} \cong  \mathbb{R}^n$ into $\mathcal{D}' \cong \mathbb{R}^{n+1}$ via the map \[\vec{x} := (x_1, \ldots, x_n) \mapsto \vec{x}' := (1,x_1, \ldots, x_n).\] Doing this imbedding allows us to treat the shift (often called the bias in the literature) $b$ associated to a hyperplane on equal footing with the components $a_i$ of the vector $\vec{a} = (a_1, \ldots, a_n)$ specifying its orthogonal direction.

Now suppose we have a point $\vec{a}' := (a_0, a_1, \ldots, a_n) \in \mathbb{R}^{n+1}$. Then remembering that if $\vec{x} \in \mathcal{D}$ then $\vec{x}' = (1, \vec{x})$ denotes its imbedding into the $\{x_0 = 1\}$ slice of $\mathcal{D}'$, the hyperplane in $\mathcal{D}$ determined by $\vec{a}'$ is the set of points $\vec{x} \in \mathcal{D}$ for which $\vec{a}' \cdot \vec{x}' = 0$ (this is the hyperplane specified by orthogonal vector $(a_1, \ldots, a_n)$ and shift $-a_0$).
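
In code, this imbedding is just prepending a $1$ to each data point. Continuing the toy sketch from above:

```python
# Imbed each data point x = (x_1, ..., x_n) as x' = (1, x_1, ..., x_n).
X_prime = np.hstack([np.ones((X.shape[0], 1)), X])

# A single vector a' = (a_0, a_1, ..., a_n) now encodes the whole hyperplane:
# a' . x' = 0 exactly when a . x = -a_0, i.e., the shift is b = -a_0.
a_prime = np.concatenate([[-b], a])
```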

Cross-entropy cost function

A commonly used cost function for a problem of this type is the cross-entropy cost function. Before describing it, let me come clean and admit that because of my lack of prior training in statistics I don't yet feel in my bones why this is the "right" choice. I will say that Chapter 2: Classification and Logistic Regression in Andrew Ng's CS229 lecture notes here helped me, as did the discussion of the cross-entropy cost function in Chapter 3 of Michael Nielsen's Deep Learning book (esp. the animations). I may try to write more about this later when I'm smarter.

Relative ignorance disclosed, I proceed:

Remember that for each data point $\vec{x}$, we have a "Y/N" label. Let's turn these into "1/0" labels. That is, for each data point $\vec{x}$, define
\[y(\vec{x}) = \left\{\begin{array}{cl} 1 & \mbox{if the label on } \vec{x} \mbox{ is ``Y"}\\
0 & \mbox{if the label on } \vec{x} \mbox{ is ``N"}\end{array}\right.\]

Remember also that for a particular choice of parameters $\vec{a}' \in \mathbb{R}^{n+1}$ specifying the separating hyperplane (see notational digression above), the output guessed by the model on a point $\vec{x} \in \mathbb{R}^n$ is \[\sigma(\vec{a}' \cdot \vec{x}') \in (0,1),\] where $\sigma$ is the sigmoid function  \[\sigma(t) = \frac{1}{1 + e^{-t}}.\] We can (and should, I suppose) think of \[h_{\vec{a}'}(\vec{x}) := \sigma(\vec{a}'\cdot \vec{x}')\] as the probability that the label on $\vec{x}$ is "Y" (according to the model specified by the vector of parameters $\vec{a}' = (a_0, a_1, \ldots, a_n)$). 
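
In code (still continuing the toy sketch), the model's prediction is two short functions:

```python
def sigmoid(t):
    # The sigmoid function: maps all of R into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

def h(a_prime, x_prime):
    # Predicted probability that the label on x is "Y", per the model a'.
    return sigmoid(a_prime @ x_prime)
```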

The cross-entropy "cost" or "loss" associated to hyperplane parameters $\vec{a}'$ and a particular data point $\vec{x}$ is then defined as: \[J_{\vec{a}'}(\vec{x}) := -y(\vec{x})\ln(h_{\vec{a}'}(\vec{x})) - (1 - y(\vec{x})) \ln(1-h_{\vec{a}'}(\vec{x})),\] and if we have $N$ pre-labeled data points $\vec{x}^{(1)}, \ldots, \vec{x}^{(N)}$, we define the cost to be the average of the individual costs:
\[J_{\vec{a}'}\left(\vec{x}^{(1)}, \ldots, \vec{x}^{(N)}\right) := \frac{1}{N}\sum_{i=1}^N J_{\vec{a}'}(\vec{x}^{(i)}).\]
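
As a sanity check on the formula, here's the same cost in NumPy, vectorized over the $N$ data points (again just a sketch, reusing the sigmoid above):

```python
def cross_entropy_cost(a_prime, X_prime, y):
    # Average the per-point costs J_{a'}(x^(i)) over all N data points.
    p = sigmoid(X_prime @ a_prime)  # vector of predicted probabilities h
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```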

Let's pause for a sec, because it's worth noting how much less complicated this function is than it appears at first glance. Remember that $y(\vec{x})$ is only ever $0$ or $1$, so for each data point $\vec{x}^{(i)}$ one of the two terms in the sum drops out: if $y(\vec{x}^{(i)}) = 1$ the cost is $-\ln(h_{\vec{a}'}(\vec{x}^{(i)}))$, and if $y(\vec{x}^{(i)}) = 0$ it is $-\ln(1-h_{\vec{a}'}(\vec{x}^{(i)}))$. In both cases, this is just $-\ln$ of ($1$ minus the discrepancy between the actual label and the predicted label). As far as I can tell, the main reasons for the appearance of the natural log here are:

  1. The more wrong you are, the more you are penalized: Think of the behavior of $-\ln(t)$ for values of $t$ between $0$ and $1$. Its main feature is that as $t$ goes to $0$, $-\ln(t)$ goes to $\infty$, and as $t$ goes to $1$, $-\ln(t)$ goes to $0$. As a result, when the discrepancy $|y(\vec{x}) - h_{\vec{a}'}(\vec{x})|$ between the actual label and the predicted label is large (close to $1$), then $1$ minus this discrepancy is close to $0$, so when we take $-\ln$ of this number, we get something huge. The hugeness goes to $0$ as the discrepancy between the actual label and the predicted label goes to $0$. (See the worked numbers just after this list.)
  2. Having $\ln$ in the expression makes it easier to take partial derivatives of the cost function: Read the next section to understand why this is.
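
To put some numbers to point 1: if the true label is $1$ and the model predicts $h = 0.99$, the cost is $-\ln(0.99) \approx 0.01$; at $h = 0.5$ it's $-\ln(0.5) \approx 0.69$; at $h = 0.01$ it's $-\ln(0.01) \approx 4.6$. Being confidently wrong costs far more than being unsure.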


Gradient descent

Now that we have a way of measuring how well we've done for a particular choice of parameters $\vec{a}'$, calculus tells us how we might do a bit better (at least nearby).

Remember we have our space $\mathcal{D} = \mathbb{R}^n$ that parameterizes our data, and we have our space $\mathbb{R}^{n+1}$ that parameterizes the possible separating hyperplanes. We now want to think of our cost function as a function from our hyperplane-parameterizing space to $\mathbb{R}$ and locate the point(s) in our hyperplane-parameterizing space that minimize cost.

Here is where calculus helps us. Remember that if we have a real-valued smooth function of several parameters and we are interested in finding an absolute minimum of the function (and our domain is boundary-less), we know for sure that this absolute minimum will occur at a local minimum of the function. And every local minimum of the function will in particular have the property that all of the partial derivatives of the function vanish there.

This tells us that to find an absolute minimum of the cost function we should look for places where the partial derivatives of the cost function vanish (aka critical points of the function). As a first pass, this is a great idea as long as we remember that not every critical point is a local minimum, and not every local minimum is an absolute minimum. If you like Morse theory, you don't need to be convinced of those two statements. If you don't yet like Morse theory, the following two pics may convince you (again, low tech--sorry):

A portion of the graph of a real-valued function whose domain is 2-dimensional.
The partial derivatives at the black dot vanish, but the black dot is not a local minimum.


A portion of the graph of a real-valued function whose domain is 1-dimensional.
The black dot is a local minimum but not an absolute minimum.

Since it is not always easy (read: computationally feasible) to find the critical points of a cost function directly, what is done in practice is to compute the gradient of the cost as a function on the hyperplane-parameterizing space: \[\nabla(J) := \left(\frac{\partial J}{\partial a_0}, \frac{\partial J}{\partial a_1}, \ldots, \frac{\partial J}{\partial a_n}\right)\] and take small-ish steps in the negative gradient direction until your gradient gets close enough to $\vec{0}$ that you're pretty darned sure you're close to a critical point. This is called gradient descent, and it's what puts the "learning" in "machine learning." To first order, it's really that simple.
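
In Python, the loop is short enough to fit in your head. The learning rate, tolerance, and step cap below are arbitrary placeholder choices (not recommendations), and grad_J stands for the gradient function we derive in the next section:

```python
def gradient_descent(grad_J, a_prime, learning_rate=0.1, tol=1e-6, max_steps=10000):
    # Step in the negative gradient direction until the gradient is
    # close enough to the zero vector (or we run out of patience).
    for _ in range(max_steps):
        g = grad_J(a_prime)
        if np.linalg.norm(g) < tol:
            break
        a_prime = a_prime - learning_rate * g
    return a_prime
```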

(WARNING: Keep your wits about you. The gradient of the cost function is computed with respect to the hyperplane-parameterizing space NOT the data-parameterizing space. The machine is trying to learn the best hyperplane to fit the data, which it is treating as a constant.)

Of course, one of the big problems with gradient descent is that unless you happen to know in advance that your function has only one critical point and that it is definitely a local minimum, then the result of performing gradient descent will not necessarily find you a global minimum. I'll probably try to talk about this at greater length later. Let's ignore it for now, compute the gradient of the cross-entropy cost function, and call it a day.

Gradient of cross-entropy cost function (+ a useful property of the sigmoid function)

Now we just dust off the ole' chain rule, keeping in mind the following useful observation about the sigmoid function $\sigma(t) := \frac{1}{1 + e^{-t}}$: \[\frac{d\sigma}{dt} = \sigma(t)(1 - \sigma(t)).\] Verifying this fact also uses the chain rule (in the first step); the rest is just algebra:
\begin{eqnarray*}
 \frac{d\sigma}{dt} &=& \frac{e^{-t}}{(1+e^{-t})^2}\\
 &=& \frac{1}{1+e^{-t}}\cdot\frac{(1+e^{-t}) - 1}{1 + e^{-t}}\\
 &=& \sigma(t)(1 - \sigma(t))
\end{eqnarray*}
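
If you don't feel like checking the algebra, a quick finite-difference comparison (with an arbitrary test point and step size) should agree to many decimal places:

```python
t, eps = 0.7, 1e-6
numerical = (sigmoid(t + eps) - sigmoid(t - eps)) / (2 * eps)
exact = sigmoid(t) * (1 - sigmoid(t))
print(numerical, exact)  # the two values should nearly coincide
```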

Now it will turn out that \[\frac{\partial J}{\partial a_j} = \frac{1}{N} \sum_{i=1}^N (h_{\vec{a}'}(\vec{x}^{(i)}) - y(\vec{x}^{(i)}))\,x_j^{(i)},\] where we recall that $h_{\vec{a}'}(\vec{x}) := \sigma(\vec{a}' \cdot \vec{x}')$, and where $x_0^{(i)} := 1$ per our imbedding (so the $j = 0$ case handles the bias). This derivation has some slightly non-obvious steps, so I include it here. Note that the partial derivative of a sum is the sum of partial derivatives, so we just need to verify that the partial derivative with respect to $a_j$ of the individual cost associated to a particular data point $\vec{x}^{(i)}$ has the form we see inside the summation above. If $y(\vec{x}^{(i)}) = 1$, the chain rule and the fact above about the derivative of the sigmoid function tell us this is:

\begin{eqnarray*}
 &=& -y(\vec{x}^{(i)}) \cdot \frac{1}{\sigma(\vec{a}'\cdot(\vec{x}^{(i)})')}\cdot \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')\cdot (1 - \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')) \cdot x_j^{(i)}\\
 &=& y(\vec{x}^{(i)}) \cdot (\sigma(\vec{a}'\cdot(\vec{x}^{(i)})') - 1) \cdot x_j^{(i)}\\
&=& (h_{\vec{a}'}(\vec{x}^{(i)}) - y(\vec{x}^{(i)})) \cdot x_j^{(i)}
\end{eqnarray*}

where in the last step we are using the assumption that $y(\vec{x}^{(i)}) = 1$. Similarly, if $y(\vec{x}^{(i)}) = 0$, the summand reads:

\begin{eqnarray*}
 &=& (1-y(\vec{x}^{(i)})) \cdot \frac{1}{1 - \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')}\cdot \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')\cdot (1 - \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')) \cdot x_j^{(i)}\\
 &=& (1 - y(\vec{x}^{(i)})) \cdot \sigma(\vec{a}'\cdot(\vec{x}^{(i)})') \cdot x_j^{(i)}\\
&=& (h_{\vec{a}'}(\vec{x}^{(i)}) - y(\vec{x}^{(i)})) \cdot x_j^{(i)}
\end{eqnarray*}

where again in the last step, we are using the assumption that $y(\vec{x}^{(i)}) = 0$.
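
Putting the pieces together: the formula for $\frac{\partial J}{\partial a_j}$ vectorizes nicely, and feeding it to the gradient descent loop from earlier trains the model on our toy data. As always, this is a minimal sketch, not production code:

```python
def grad_cross_entropy(a_prime, X_prime, y):
    # Gradient of the average cross-entropy cost with respect to a'.
    errors = sigmoid(X_prime @ a_prime) - y  # h(x^(i)) - y(x^(i)) for each i
    return (X_prime.T @ errors) / len(y)     # averages (h - y) * x'_j over i

# Train: start from the zero vector and descend.
a_learned = gradient_descent(
    lambda ap: grad_cross_entropy(ap, X_prime, y),
    np.zeros(X_prime.shape[1]))
```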

OK, I'm tired now and probably you are too. I'm also sure the equations here are riddled with sign errors and other less forgivable mistakes. Please tell me if you notice any (forgivable or not). I'll try to have fewer equations in later posts.