Sunday, November 5, 2017

Neural networks extend boolean functions from the vertices to the interior of the unit n-cube

I'm fired up because I started a learning seminar on the mathematics of neural networks this semester at BC. Jeffery Breeding-Allison, a visiting assistant professor here who did the Insight Data Science Fellows program last summer, is my co-organizer.

We're focusing not as much on the practical computational aspects of neural networks as on the theoretical computer science/mathematical aspects. It's not that the practical aspects aren't important (they are, of course). It's just that at the moment I want to learn about the theoretical aspects.

And I do what I want. (Not really, but doesn't that sound fun?)

Before I start in on the math, let me just say: Man, it's nice to be talking regularly about this stuff with actual people in real life. Much of the perspective I take below (especially the connection to classical Morse theory) arose during chats with Nate B.

My goal for today is to convince you of the following cool fact.

Cool Fact 1 (CF1): Let $m,n \in \mathbb{Z}^+$. Any boolean function $f: \{0,1\}^n \rightarrow \{0,1\}^m$ can be approximated arbitrarily well by a finite feedforward neural network with sigmoid activation functions. 

More precisely, there exists a neural network with an $n$-node input layer and an $m$-node output layer whose associated smooth function $F: \mathbb{R}^n \rightarrow (0,1)^m$ can be extended continuously to a function $F: \mathbb{R}^n \rightarrow [0,1]^m$ that (approximately!) agrees with $f$ on the vertices, $\{0,1\}^n$, of the unit $n$-cube.

I should note first of all that this fact has been known for decades. Indeed, it is implied by the much harder universal approximation theorem for sigmoid feedforward neural networks (the google tells me this is attributed to Cybenko and Funahashi, who gave independent proofs, both published in 1989). The universal approximation theorem says that any continuous function $(0,1)^n \rightarrow (0,1)^m$ can be approximated arbitrarily closely by a sufficiently large feedforward neural network with sigmoid activation functions.

What I like about the more modest statement in bold (CF1) above is that
  • (CF1) is enough to tell you that, in principle, a sufficiently large trained neural network can at least do the sorts of calculations a computer program can. After all, at heart these calculations are nothing more than boolean functions.
  • Finding a sensible way to extend a boolean function from the vertices of an $n$-cube to its interior is a beautiful thing to do, since you are essentially finding a probabilistic function that extends a deterministic one. That is, you're replacing binary ("Yes/No") inputs/outputs with inputs/outputs in the interval $(0,1)$ ("maybe Yes, with probability $p \in (0,1)$").
  • Understanding why (CF1) is true gets you a long way towards a rudimentary picture of how a neural network is "reasoning". And that's what we're after.
As a side note, I have to say that I also like (CF1) for personal reasons. I'm a low-dimensional topologist by training, so I'm fond of handlebodies.

We can think of (CF1) as a neural network incarnation of the following fact. If you have a background in Morse theory, it should be straightforward to supply a proof that does not appeal to (CF1):

Cool Fact 2 (CF2): Let $m,n \in \mathbb{Z}^+$, let $B^n$ be the unit ball, and let $\mathcal{P}$ be a set of finitely many points on $\partial(B^n) = S^{n-1}$, the boundary of $B^n$.  Suppose $\mathcal{S}_1, \ldots, \mathcal{S}_m$ are $m$ subsets of $\mathcal{P}$. Then there exists a continuous function $f=(f_1, \ldots, f_m): B^n \rightarrow [0,1]^m$ that satisfies the property that $f_i$ sends $\mathcal{S}_i$ to $1$ and $\mathcal{P}\setminus \mathcal{S}_i$ to $0$ for all $i$. Moreover, each $f_i$ can be assumed to be Morse.

Here's an example in the case $n=2$ and $m=1$:

4 points on the boundary of a 2-dimensional disk, along with (the graph of) a Morse function to the unit interval that sends the blue points to 1 and the red points to 0. (The blue and red lines are there just to make the picture on the right easier to understand.) To believe (CF2) completely in the 2D case, imagine that the disk is made of pizza dough. Invite your friends to pick up all of the blue points and lift them to height 1 above a table. The red points will stay at height 0, and the pizza dough will look like the graph of a Morse function with the desired properties.


If we note that $B^n$ is homeomorphic to the unit $n$-cube, $[0,1]^n$, and we let $\mathcal{P}$ be the set of its vertices, $\{0,1\}^n \subseteq \partial([0,1]^n)$, (CF1) is saying that we can construct the function in (CF2) using a neural network.

Now that we're feeling motivated, let's get to it.

There are only four things we need to understand for everything to fall into place.
  1. Every boolean function $\{0,1\}^n \rightarrow \{0,1\}$ can be realized as a composition involving ANDs, NOTs, and ORs. (In fact, all you really need is NOT AND, often abbreviated NAND.)
  2. ANDs, NOTs, and ORs can all be modeled by perceptrons.
  3. Perceptrons can be approximated arbitrarily closely by sigmoid neurons.
  4. An $m$-output neural network can be constructed by stacking $m$ independent $1$-output neural networks.
In a certain sense, the first three points are the takeaways from my previous post about perceptrons. I definitely didn't appreciate the implications at that time, though, so I'll start by re-explaining some things quickly, then put it all together at the end.

Thing 1: Every $1$-output boolean function is a composition of ANDs, NOTs, and ORs

You can find this statement in the first chapter of any electrical engineering textbook, because it's the foundational fact underpinning circuit design. What does it mean?

Formally: Given any $n$-input, $1$-output boolean function 

$f: \{0,1\}^n \rightarrow \{0,1\}$ 

(there are $2^{2^n}$ possible such functions) we can realize $f$ as a composition of AND, NOT, & OR functions on the inputs. Recall:
  • The "AND" function takes two binary inputs and outputs a $1$ iff both inputs are $1$s. That is, if we have two binary inputs $*_1 \in \{0,1\}$ and $*_2 \in \{0,1\}$, then $(*_1 \mbox{ AND } *_2) = 1$ iff $*_1 = *_2 = 1$.
  • The "OR" function takes two binary inputs and outputs a $0$ iff both inputs are $0$. That is, $(*_1 \mbox{ OR } *_2) = 0$ iff $*_1 = *_2 = 0$.
  • The "NOT" function takes a single binary input and flips it. So $\mbox{NOT}(*) = 1$ iff $*=0$.
For the purposes of the remaining discussion, let's use the standard abbreviations of:
  • AND by $\wedge$, 
  • OR by $\vee$, and 
  • NOT by $\lnot$.
OK, so why is it that we can realize any $n$-input boolean function as a composition of the simple binary functions $\wedge$, $\vee$, and $\lnot$? Well, for a sort of stupid reason. One perfectly sensible way to describe a boolean function is just to provide a complete list of all of the inputs that map to $1$. This is the idea behind the so-called canonical representation of a boolean function. The existence of the canonical representation for every boolean function is the proof of Thing 1 we'll describe here.

It's easiest (and clearest) to describe the canonical representation by example. So I'll do that. 

Suppose we consider the boolean function $\{0,1\}^3 \rightarrow \{0,1\}$ that maps $(0,1,0)$ and $(1,1,1)$ to $1$ and every other binary $3$-tuple to $0$. This boolean function can be written as:

$(\lnot(*_1) \wedge (*_2) \wedge \lnot (*_3)) \vee (*_1 \wedge *_2 \wedge *_3)$,

which of course looks like gibberish when you look at it like that, so I'll draw an electrical "circuit" picture of it instead:

The canonical representation of the boolean function
that maps (0,1,0) and (1,1,1) to 1, and all other binary 3-tuples to 0

If it still seems like gibberish, here's the point. For any fixed binary $n$-tuple $(a_1, \ldots, a_n)$, there is a canonical expression in ANDs and AND NOTs that outputs $1$ iff the input, $(*_1, \ldots, *_n)$, is equal to $(a_1, \ldots, a_n)$. To create this expression, just concatenate the $*_i$'s using AND when $a_i = 1$ and using AND NOT when $a_i = 0$.

So, e.g., the function $\lnot(*_1) \wedge (*_2) \wedge \lnot(*_3)$ outputs $1$ iff $(*_1,*_2,*_3) = (0,1,0)$.

Now to form the canonical representation of a boolean function, just list all of the inputs that map to $1$, form the expressions as above that uniquely identifies each of these inputs, then concatenate all of these expressions using ORs. 
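Just to make this concrete, here's a quick Python sketch of the construction (the function names are mine, and this is meant as illustration, not efficient code):

def clause(a):
    # Returns the function that outputs 1 iff its input tuple equals a.
    return lambda x: int(all(xi == ai for xi, ai in zip(x, a)))

def canonical_representation(ones):
    # Build f by OR-ing together one clause per input that maps to 1.
    clauses = [clause(a) for a in ones]
    return lambda x: int(any(c(x) for c in clauses))

# The example above: (0,1,0) and (1,1,1) map to 1, everything else to 0.
f = canonical_representation([(0, 1, 0), (1, 1, 1)])
print(f((0, 1, 0)), f((1, 1, 1)), f((0, 0, 0)))  # 1 1 0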

Thing 2: AND, NOT, and OR can be modeled by perceptrons

Thing 2 was the main point of my post about perceptrons. An $n$-input, $1$-output perceptron uniquely specifies a function that assigns a $1$ to one side of an affine hyperplane in $\mathbb{R}^n$ and a $0$ to the other. Such functions are called threshold functions in the literature. Geometrically, the weights and bias associated to a perceptron are irrelevant, except insofar as they define the co-oriented affine hyperplane that separates $0$ outputs from $1$ outputs.

We can therefore understand Thing 2 by understanding the pair of statements:

  • the $2$-input boolean functions "AND" and "OR" can be realized by threshold functions, and 
  • we can realize the $1$-input boolean function "NOT" by reversing the co-orientation of the affine hyperplane defining the threshold function.
The second statement of the pair is clear from the definition of a threshold function, and the first statement becomes clear once you stare at the following two pictures while recalling the definitions of "AND" and "OR".

The threshold functions that map the points on the right (arrow-side) of each of these red affine
hyperplanes to 1 and the left (non-arrow-side) to 0 extend the 2-input boolean functions AND & OR

Thing 3: Perceptrons can be approximated arbitrarily closely by sigmoid neurons


Thing 3 was also explained briefly in my post about perceptrons (look at the pictures at the end, in the "And now back to that thought we were holding" section). The key observation is that the sigmoid functions

$\sigma(t) = \frac{1}{1+e^{-ct}}$

are nice smooth approximations of the step function

$S(t) = \left\{\begin{array}{cl} 0 & \mbox{if } t < 0\\1 & \mbox{if } t \geq 0\end{array}\right.$,

and these approximations get successively better as $c \rightarrow \infty$.

Indeed, the sigmoid functions converge pointwise to the Heaviside step function (really a better choice for a step function anyway, because of its symmetry):

$H(t) = \left\{\begin{array}{cl}0 & \mbox{if } t < 0\\\frac{1}{2} & \mbox{if } t = 0\\1 & \mbox{if } t > 0\end{array}\right.$

Once you understand this, all you need to appreciate is that an $n$-input sigmoid neuron is just a perceptron whose step function has been replaced by a sigmoid function.

More precisely, recall that a (co-oriented) affine hyperplane $H$ in $\mathbb{R}^n$ is determined (non-uniquely) by a weight vector $\vec{w}\in \mathbb{R}^n$ and a bias $b \in \mathbb{R}$ as follows:

$H_{\vec{w},b} = \{\vec{x} \in \mathbb{R}^n \,\,\vert\,\, \vec{w}\cdot\vec{x} - b = 0\}$.

The weight vector $\vec{w}$ is just a vector orthogonal to the hyperplane, pointing in the direction of the co-orientation. The bias $b \in \mathbb{R}$ determines the shift of this hyperplane from the origin (its absolute value is equal to the distance of $H$ away from the origin in the case $\vec{w}$ has length $1$).

With this understood, we now see that a threshold function is just the composition of the step function with an affine linear function. That is, the threshold function associated to $H_{\vec{w},b}$ is simply $S(\vec{w}\cdot \vec{x} -b)$. And this makes it clear why a corresponding sigmoid neuron $\sigma(\vec{w}\cdot \vec{x} -b)$ can be viewed as a smooth approximation to it. N.B.: this approximation can be made arbitrarily good (away from the hyperplane itself) by scaling $\vec{w}$ and $b$ by the same large $c>0$.
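You can watch this happen numerically. Here's a tiny Python check, with a weight vector, bias, and test point of my own choosing:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# My own toy numbers: here w.x - b = 0.5 > 0, so the threshold function outputs 1.
w, b = np.array([1.0, -2.0]), 0.5
x = np.array([1.0, 0.0])
for c in [1, 10, 100]:
    print(c, sigmoid(c * (np.dot(w, x) - b)))
# prints roughly 0.62, 0.993, 1.0: the scaled sigmoid neuron approaches S(w.x - b)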

Thing 4: We can construct an $m$-output neural network by "stacking" $m$ independent $1$-output NN's

This is a neural network analog of the statement that a function to a product can (should?) be defined component by component.

Informally, what I mean is the following. If you have $m$ $1$-output neural networks, all with the same $n$ inputs, and you want to build a neural network whose output layer is precisely those $m$ outputs, just stack the $m$ $1$-output neural networks vertically, then merge the $m$ identical copies of the input layer. Here are some pictures, to illustrate what I mean:

Two separate neural networks with the same inputs, each with
a single output, drawn one above the other

The two separate neural nets above have now been merged
into a single one, with two outputs


The key to keeping these $m$ parallel computations independent from each other is to force edges to have $0$ weight whenever you don't want the output of the edge to depend on the input. Here are another couple of pictures, to illustrate what I mean by that:

The blue nodes shouldn't interact with the red nodes

So we artificially set all weights on edges connecting them to zero

Putting it all together:

OK, that was long, so let's recap. We're trying to understand why a neural network is a robust enough architecture to extend any boolean function from the vertices to the interior of a unit $n$-cube. I.e., we're trying to sketch an argument of:

Cool Fact 1 (CF1): Let $m,n \in \mathbb{Z}^+$. Any boolean function $f: \{0,1\}^n \rightarrow \{0,1\}^m$ can be approximated arbitrarily well by a finite feedforward neural network with sigmoid activation functions. 

We proceed as follows. Look at the $m$ components $f_1, \ldots, f_m$ of $f$. 

Each of these is a $1$-output boolean function, $f_i: \{0,1\}^n \rightarrow \{0,1\}$. So we can represent it as a composition of ANDs, NOTs, & ORs (e.g., by using the canonical representation or, if we're lucky, in some other more succinct way). Now model each of these ANDs, NOTs, & ORs by threshold functions associated to perceptrons, then replace each perceptron by a sigmoid neuron that approximates it sufficiently closely. We now have $m$ independent $1$-output neural networks. Merge them into a single $m$-output neural network by merging the inputs and zeroing out weights on edges connecting nodes between different neural networks.
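To make the recipe concrete, here's a Python sketch of the whole pipeline for the example boolean function from Thing 1. The weights below are ones I chose by hand following the recipe; nothing about them is canonical:

import numpy as np

def neuron(w, b):
    # A sigmoid neuron: a smooth stand-in for the perceptron with weights w, bias b.
    return lambda x: 1.0 / (1.0 + np.exp(-(np.dot(w, x) - b)))

c = 20.0  # sharpness: scaling weights and bias by large c approximates the perceptron

# Hidden layer: one "detector" neuron per input that should map to 1.
# Weight +1 where the coordinate must be 1, -1 where it must be 0;
# bias (#ones - 1/2) makes the neuron fire only on an exact match.
detect_010 = neuron(c * np.array([-1.0, 1.0, -1.0]), c * 0.5)
detect_111 = neuron(c * np.array([1.0, 1.0, 1.0]), c * 2.5)

# Output layer: an OR of the two detectors.
OR = neuron(c * np.array([1.0, 1.0]), c * 0.5)

F = lambda x: OR(np.array([detect_010(x), detect_111(x)]))

for x in [(0, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0)]:
    print(x, round(F(np.array(x)), 3))  # approximately 1, 1, 0, 0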

And there you have it! If I were less pooped out, I'd close with some further questions. But maybe I'll just leave it there for today.

Monday, August 28, 2017

Neural Networks, Finite State Machines, and Binary Encoding by Affine Hyperplanes

I'll focus today on the relationship between neural networks and finite state machines (aka finite state automata, or FSAs for short). Why? Interweb searches have informed me that the ability of a neural network to "learn" how to model finite state machines is one reason people first got excited about neural networks. Even better, FSAs are reasonably robust models for computation. If you build one big enough, you have yourself something that looks (to first order) like a sentient being. Not a very smart sentient being, but it'll do as a start...

Before I begin, I want to mention a couple of people with whom I've been chatting semi-regularly about related stuff: Jesse Johnson and Tony Licata. They've both got a lot of other interesting things going on in their mathematical lives, but I've really benefitted from their turning their attention in this direction on occasion.

I will also link to a number of sources I found when I googled around about this. I am 100% sure I am missing some of the important references, since I am not an expert in this area. OTOH, if you're looking for other important references, they are very likely to be among the ones cited in the sources below. The point here is that none of these ideas are new to the world. They are just new to me:



Finite State Automata

OK, what's a finite state automaton? Informally, it's a collection of (finitely many) states and (finitely many) rules for transitioning between them.

More precisely/formally, the core structure of a finite state automaton is a triple $(\mathcal{S},\mathcal{T}, \mu)$, where

  • $\mathcal{S} = \{s_1, \ldots s_N\}$ is a set of finitely many states,
  • $\mathcal{T} = \{t_1, \ldots, t_M\}$ is a set of finitely many transitions, and
  • $\mu: \mathcal{S} \times \mathcal{T} \rightarrow \mathcal{S}$ is a function that assigns to a pair $(s_i, t_j)$ the result of applying transition $t_j$ to state $s_i$.
(The finite state automata one encounters in nature may have a bit more structure, e.g. a distinguished initial state and maybe an output function or something like that. But the core is what's described above.)

One common and natural way to describe a finite state automaton with $N$ states and $M$ transitions is via a directed, labeled graph constructed as follows.

  • The vertices of the graph are labeled by the states, $s_1, \ldots, s_N$. 
  • For each $i \in \{1, \ldots, N\}$ there are $M$ directed edges with tails on the vertex $s_i$.
  • These $M$ edges are labeled $t_1, \ldots, t_M$, and the head of edge $t_j$ with tail on $s_i$ is the state $\mu(s_i,t_j)$.

Let's come up with an example, so this isn't so boringly abstract. I'll describe an FSA modeling the finitely many emotional states of an extremely simple (but...let's face it...sort of interesting) human named Charlotte with extremely simple (but also...let's face it...sort of interesting) reactions to time spent with her two friends Alice and Bob.

The set of Charlotte's emotional states is $\{\mbox{Happy}, \mbox{Mad}, \mbox{Confused}, \mbox{Sad}\}$, and the set of transitions is $\{\mbox{spend time with Alice}, \mbox{spend time with Bob}\}$. Spending time with Alice generally makes Charlotte mad, and spending time with Bob generally makes Charlotte confused. That is, unless she's seen Alice or Bob more than once before seeing the other. If this happens, Charlotte becomes sad, and any further time spent with Alice or Bob does nothing for her. Here's the FSA graph modeling this situation:




Poor Charlotte. She needs some new friends.

OK, we have an example. But what can a finite state automaton do? Well, I'm gonna be honest...nothing too terribly exciting beyond telling you where you'll go next if you are at state $s_i$ and transition $t_j$ happens. On the other hand, a finite state automaton does provide one framework for solving decision problems.

How? Well, given an ordered list of transitions, it can calculate the ordered list of states it passes through via that ordered list of transitions. This provides a means of modeling a decision problem whose inputs are finite-length strings in a finite alphabet. Simply designate some subset of the states as "Accept" and the remainder as "Fail" states. Now output a "1" if the end state is an accept state and a "0" if the end state is a fail state.

Going back to the example of Charlotte's FSA, someone (her therapist?) might choose to designate "Sad" as the only fail state, and the other 3 states as accept states. It's now not so hard to see that given a finite length string describing Charlotte's interactions with Alice and Bob, the FSA will output a $1$ if and only if it's a non-stuttering string.

So, e.g., the string $ABABABABA$ will output a $1$ and the string $BBABA$ will output a $0$.
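Here's a minimal Python sketch of this FSA (assuming, say, that Charlotte starts out Happy; 'A' = spend time with Alice, 'B' = spend time with Bob):

mu = {('Happy', 'A'): 'Mad',      ('Happy', 'B'): 'Confused',
      ('Mad', 'A'): 'Sad',        ('Mad', 'B'): 'Confused',
      ('Confused', 'A'): 'Mad',   ('Confused', 'B'): 'Sad',
      ('Sad', 'A'): 'Sad',        ('Sad', 'B'): 'Sad'}

def accepts(word, state='Happy'):
    # Apply the transitions in order; output 1 iff the end state is an accept state.
    for t in word:
        state = mu[(state, t)]
    return int(state != 'Sad')

print(accepts('ABABABABA'))  # 1
print(accepts('BBABA'))      # 0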

BTW, while we're on the subject: finite state automata appear to play a prominent role in the theoretical study of how humans use and interpret language. And I should also note that this study overlaps with some areas of great interest to theoretical mathematicians. For example, the group theorists among you may notice that Poor Charlotte's FSA is precisely the one we'd use to answer the question of whether a word representing an element of the group $\mathbb{Z}/2\mathbb{Z} * \mathbb{Z}/2\mathbb{Z}$ is reduced or unreduced. That's not so hard to see. I learned about it from this amazing book.

Neural Networks as Functions: Basic Structure

The meaning of the words "neural network" depends very much on who is uttering them, and in what context.

When I say those words here, I will mean a function $F: \mathbb{R}^n \rightarrow \mathbb{R}^{n'}$ between two vector spaces (possibly of different dimensions) that is realized as a composition, $F = F_m \circ \cdots \circ F_1 \circ F_0$, of finitely many functions $F_k: \mathbb{R}^{n_k} \rightarrow \mathbb{R}^{n_{k+1}}$, each of which might be described informally as a smoothed binary encoding by affine hyperplanes.

I'll unpack that slowly.

Let me start with the kind of drawing of a neural network one often sees in the news, since it will organize us:



We represent a neural network by a bunch of nodes (neurons) and lines (connections). The nodes are arranged in columns called layers. The number of hidden layers (those layers between the input layer on the left and the output layer on the right) tells us how many smoothed binary encoding functions we need to compose to describe the function (one more than the number of hidden layers), and the number of nodes in each layer tells us the dimensions of the vector spaces we encounter along the way.

In the picture I drew above, the neural network is a function $F: \mathbb{R}^3 \rightarrow \mathbb{R}^5$, since the left-most layer has 3 nodes, and the right-most layer has 5. Moreover,

$F = F_2 \circ F_1 \circ F_0,$

where

  • $F_0: \mathbb{R}^3 \rightarrow \mathbb{R}^2$
  • $F_1: \mathbb{R}^2 \rightarrow \mathbb{R}^4$
  • $F_2: \mathbb{R}^4 \rightarrow \mathbb{R}^5$.

Now that we understand the basic structure of a neural network, let's focus in on these functions $F_i$. The right way to think about them is as smoothed versions of binary encoding functions, where the encoding is done with reference to a collection of affine hyperplanes.

Let me start by saying what I mean by a binary encoding function by affine hyperplanes, then explain why this is the right picture to have in mind to start understanding neural networks.

Binary Encoding by Affine Hyperplanes

First, when I say an affine hyperplane in a vector space $\mathbb{R}^n$, I mean just a linear subset of $\mathbb{R}^n$ of dimension $n-1$ (e.g., a line if $n=2$, a plane if $n=3$) that doesn't necessarily pass through the origin, $\vec{0} \in \mathbb{R}^n$. In other words, it's a linear subspace of $\mathbb{R}^n$ to which a translation has perhaps been applied.

If you've recently taken a course in linear algebra (or if you've read one of my earlier posts on this blog, e.g. this one), you know that an affine hyperplane is always (non-uniquely) describable as the set of solutions of a single generic linear equation. That is, fix a nonzero vector $\vec{a} \in \mathbb{R}^n$ and a number $b \in \mathbb{R}$. Then an affine hyperplane is describable as the set of $\vec{x} \in \mathbb{R}^n$ satisfying $\vec{a} \cdot \vec{x} = b$.

One important point is that any affine hyperplane in $\mathbb{R}^n$ divides its complement into two pieces. We can see this easily from the description of the hyperplane as the set of solutions to a linear equation. The points on the hyperplane defined by $\vec{a} \in \mathbb{R}^n, b \in \mathbb{R}$ are those $\vec{x} \in \mathbb{R}^n$ satisfying $\vec{a} \cdot \vec{x} = b$. The remainder of the vector space is separated into:

$H_{(\vec{a},b)}^- := \{\vec{x} \in \mathbb{R}^n \,\,|\,\, \vec{a}\cdot \vec{x} < b\}$ and
$H_{(\vec{a},b)}^+ := \{\vec{x} \in \mathbb{R}^n \,\,|\,\, \vec{a}\cdot \vec{x} > b\}$.

Here's an example of what I mean, for the hyperplane in $\mathbb{R}^2$ described as the set of points $\vec{x} \in \mathbb{R}^2$ satisfying $(1,1)\cdot \vec{x} = 1$:



Notice that we're repeating (in slightly different language) some things we said in earlier posts. Namely, an affine hyperplane $H$ linearly separates its complement into two pieces. So we have a natural binary function on the complement:

$\theta: \mathbb{R}^n \setminus H \rightarrow \{0,1\}$

which assigns a $0$ to points in $H^-$ and a $1$ to points in $H^+$. Great!

Now, suppose you have $m$ (pairwise different) hyperplanes $H_1, \ldots, H_m$ in $\mathbb{R}^n$, and you've chosen a pair $(\vec{a}_i,b_i)$ describing each. Then you can define a multi-dimensional binary encoding function

$\Theta: \mathbb{R}^n \setminus (H_1 \cup \ldots \cup H_m) \rightarrow \{0,1\}^m$ 

from the complement of the union of the hyperplanes to the set of binary $m$-tuples. Just define the $i$th component of the function to be 
  • $0$ if $\vec{a}_i \cdot \vec{x} < b_i$, and
  • $1$ if $\vec{a}_i \cdot \vec{x} > b_i$. 
Moreover, the value this function assigns to a vector $\vec{x} \in \mathbb{R}^n$ depends only on which component of the complement of the hyperplanes $\vec{x}$ lies in.
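In Python, the encoding function $\Theta$ is a one-liner (the hyperplanes below are ones I made up for illustration):

import numpy as np

def binary_encoding(hyperplanes):
    # hyperplanes is a list of pairs (a_i, b_i); component i of the output
    # is 1 on the positive side of hyperplane i and 0 on the negative side.
    return lambda x: tuple(int(np.dot(a, x) > b) for a, b in hyperplanes)

# Two hyperplanes (lines) in R^2, chosen arbitrarily.
Theta = binary_encoding([(np.array([1.0, 0.0]), 0.0),
                         (np.array([1.0, 1.0]), 1.0)])
print(Theta(np.array([2.0, 2.0])), Theta(np.array([-1.0, 0.5])))  # (1, 1) (0, 0)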

Here's an example of a binary encoding by 5 affine hyperplanes in $\mathbb{R}^2$. Rather than specifying an equation for each affine hyperplane, I've simply chosen a so-called co-orientation for each (an arrow that chooses which of the two sides is assigned a $1$ in the binary encoding):



Phew! OK, that took longer than expected to explain. But it's a simple idea, right?

Smoothing a Binary Encoding by Affine Hyperplanes

The maps between layers of a neural network are really smoothed versions of this binary encoding function. Rather than having a step function along the hyperplanes for each component of the function ($0$ on one side, $1$ on the other):



you have a sigmoid function $\frac{1}{1 + e^{-x}}$ that interpolates between $0$ and $1$ (but never really achieves either value):




as one passes from one side of a hyperplane to the other.

Indeed, this is precisely what the function between adjacent layers of a neural network is. Each component of the function $F_k: \mathbb{R}^{n_k} \rightarrow \mathbb{R}^{n_{k+1}}$ from the $k$th to the $(k+1)$st layer is of the form $\sigma(\vec{a} \cdot \vec{x} - b)$ for some $\vec{a} \in \mathbb{R}^{n_k}, b \in \mathbb{R}$. 
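Here's a sketch of the network drawn above in numpy, with random (untrained) weights standing in for real ones:

import numpy as np

def layer(A, b):
    # Component i of the output is sigmoid(a_i . x - b_i), where a_i is row i of A.
    return lambda x: 1.0 / (1.0 + np.exp(-(A @ x - b)))

# The network drawn above: R^3 -> R^2 -> R^4 -> R^5.
np.random.seed(0)
F0 = layer(np.random.randn(2, 3), np.random.randn(2))
F1 = layer(np.random.randn(4, 2), np.random.randn(4))
F2 = layer(np.random.randn(5, 4), np.random.randn(5))

F = lambda x: F2(F1(F0(x)))  # F = F_2 o F_1 o F_0
print(F(np.array([1.0, 0.0, -1.0])))  # a vector in (0,1)^5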

So that's what each layer of a neural network is: a smoothed version of a binary encoding function by affine hyperplanes. This is a (maybe) different way of understanding stuff we discussed before. Woot!

A (Recurrent) Neural Network for Poor Charlotte's FSA

Now suppose we are trying to model a finite state automaton with $N$ states and $M$ transitions in some natural way via a neural network. The right thing to aim for is a way of representing our states by points $s_1, \ldots, s_N$ in a vector space, $\mathbb{R}^k$, and our transitions by points $t_1, \ldots, t_M$ in a vector space, $\mathbb{R}^{\ell}$. (Where in both cases the points are sort of "fuzzy." Maybe it's better to think of them as small-ish open balls centered on points in a vector space.) We should then be looking for a function

$F: (\mathbb{R}^k \times \mathbb{R}^\ell) \rightarrow \mathbb{R}^k$ satisfying:

  • $F$ is a composition of smoothed binary encoding functions
  • $F(s_i,t_j) = \mu(s_i,t_j)$.

Note that we're abusing notation in the second line above. The inputs to the function $F$ on the left are "fuzzy" points in $\mathbb{R}^k \times \mathbb{R}^\ell$, while the inputs to the function $\mu$ on the right are the formal states and transitions defining the FSA.

Now notice that if we're given an ordered list of transitions, then these can be fed into the neural network above in sequence. Moreover, the output to the neural network at one stage becomes part of the input to the neural network at the next stage. People call this type of gadget a recurrent neural network.

Example! Example!

OK. Now we're ready to describe a recurrent neural network model for Poor Charlotte's FSA. Recall that Poor Charlotte's $4$ states are $\{\mbox{Happy}, \mbox{Mad}, \mbox{Confused}, \mbox{Sad}\}$ and her $2$ transitions are $\{\mbox{Spend time with Alice}, \mbox{Spend time with Bob}\}$.

Let's encode the states (in the same order) in $\mathbb{R}^2$ as $\{(0,0), (1,0), (0,1), (1,1)\}$ and the transitions (again, in the same order) in $\mathbb{R}^2$ as $\{(1,0), (0,1)\}$.

Then the binary-encoded version of Poor Charlotte's FSA is the function $F: \mathbb{R}^2 \oplus \mathbb{R}^2 \rightarrow \mathbb{R}^2$ that assigns the following values to the encoded (state, transition) pairs:

$(0,0,1,0) \mapsto (1,0)$
$(0,0,0,1) \mapsto (0,1)$
$(1,0,1,0) \mapsto (1,1)$
$(1,0,0,1) \mapsto (0,1)$
$(0,1,1,0) \mapsto (1,0)$
$(0,1,0,1) \mapsto (1,1)$
$(1,1,1,0) \mapsto (1,1)$
$(1,1,0,1) \mapsto (1,1)$

Here's a diagram of the encoded FSA, viewed as a directed labeled graph:




Now I claim that we can find $2$ affine hyperplanes in $\mathbb{R}^4$ that separate the "data" in the first column (inputs) as described in the second column (outputs). To prove this, we just need to argue that we can find $a_1, a_2, a_3, a_4, b \in \mathbb{R}$ such that:

\begin{eqnarray*}
a_3 &>& b\\
a_4 &<& b\\
a_1 + a_3 &>& b\\
a_1 + a_4 &<& b\\
a_2 + a_3 &>& b\\
a_2 + a_4 &>& b\\
a_1 + a_2 + a_3 &>& b\\
a_1 + a_2 + a_4 &>& b
\end{eqnarray*}

and $c_1, c_2, c_3, c_4, d \in \mathbb{R}$ such that:

\begin{eqnarray*}
c_3 &<& d\\
c_4 &>& d\\
c_1 + c_3 &>& d\\
c_1 + c_4 &>& d\\
c_2 + c_3 &<& d\\
c_2 + c_4 &>& d\\
c_1 + c_2 + c_3 &>& d\\
c_1 + c_2 + c_4 &>& d
\end{eqnarray*}

This is something you can check by hand in this case by, e.g., trying to plot numbers satisfying the above inequalities on a number line and seeing whether you succeed or fail. Here are two plots that look like they work for these two collections of inequalities:
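In the same spirit, here's a quick machine check of one candidate solution. The particular numbers below are ones I found by hand, so take them as an illustration of the separation rather than the values in the plots:

import numpy as np

# The 8 rows of the encoded transition table above: input -> output.
rows = [((0,0,1,0), (1,0)), ((0,0,0,1), (0,1)),
        ((1,0,1,0), (1,1)), ((1,0,0,1), (0,1)),
        ((0,1,1,0), (1,0)), ((0,1,0,1), (1,1)),
        ((1,1,1,0), (1,1)), ((1,1,0,1), (1,1))]

a, b = np.array([-0.5, 2.0, 1.0, -1.0]), 0.0  # hyperplane for the 1st output bit
c, d = np.array([2.0, 0.0, -1.0, 1.0]), 0.0   # hyperplane for the 2nd output bit

for x, (y1, y2) in rows:
    x = np.array(x, dtype=float)
    assert int(np.dot(a, x) > b) == y1 and int(np.dot(c, x) > d) == y2
print("both hyperplanes separate the encoded transition data")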



It's not a general method, true. But it at least gets us started thinking about how this might go in general. Farkas' Lemma (sometimes called Gale's Theorem) seems like the right result to use to understand this linear separability question in general, so maybe I'll say more about this later if I have the inclination. I learned of it (though not quite yet about it) from this awesome set of lecture notes on-line. And, of course, this tells us nothing about how to non-linearly separate a binary encoding. This would necessarily require a deeper recurrent neural network, i.e. one with more layers.

A topic for another time.






Tuesday, February 28, 2017

Perceptrons, Neurons, and Learning Logic


OK, fine. I'll tell you about perceptrons.

Warning 1: These are the guys most responsible for getting me hooked on this subject. They may hook you too.

And credit where credit's due: I learned pretty much everything written below from the sections on Perceptrons and Sigmoid Neurons in Chapter 1 of Michael Nielsen's fantastic free on-line Neural Networks and Deep Learning book. In fact, if you'd rather skip my post and just go read those sections now, I won't be offended.

I will, however, insist that afterward you check out the frickin' awesome neural network playground visualization tool created by Daniel Smilkov and Shan Carter as part of this larger project. You can fiddle with various parameters and see in real time how your fiddlings affect the way that a neural net learns to separate various non-linearly separable $2$-dimensional data sets. Well worth the $\geq 20$ minutes of your life it will cost you to do so.

Warning 2: It's going to seem for a moment that we're on a different planet, but bear with me. We'll be back on Earth again soon, I swear.

Perceptrons

Suppose you want to make a Yes/No decision based on your own personal weighting of answers to a finite collection of Yes/No questions.

(I want us all to pause for a moment and appreciate the following fact: This is basically how we do make decisions.)

The example Michael Nielsen gives in his book (Attend a cheese festival: Yes or No?) is far better than anything I can come up with, so I won't try. Instead I'll keep things abstract and just refer you there if you want an example.

A perceptron is a simple mathematical gadget that models this type of decision. People usually draw perceptrons like this:

An n-input perceptron


The $n$ nodes on the left (ordered from top to bottom, say, and labeled $x_1, \ldots, x_n$) represent the inputs. Each input will either be a "0" ("No") or a "1" ("Yes"), so there will be $2^n$ possible inputs to the perceptron.

The output to the perceptron will also be a "0" or "1." How does the perceptron choose? Well, the perceptron has $n$ weights, $w_1, \ldots, w_n \in \mathbb{R}$, assigned to the $n$ input nodes, along with an overall threshold, $b \in \mathbb{R}$. Its output is then given by:
\[\left\{\begin{array}{cl} 0 & \mbox{if } \sum_{i=1}^n w_ix_i < b, \mbox{ and}\\
1 & \mbox{if } \sum_{i=1}^n w_ix_i \geq b
\end{array}\right.\]

In other words, if the weighted sum of the inputs achieves or exceeds the threshold, output $1$. If it doesn't, output $0$.

If you read my post about linearly separating data, this should be starting to sound eerily familiar. Explicitly, a perceptron with $n$ inputs answers the following concrete question:

"Let ${\vec x} \in \{0,1\}^n \subset \mathbb{R}^n$ be a binary $n$-tuple, regarded as an $n$-dimensional real vector. On which side of the hyperplane defined by weight vector $\vec{w} := (w_1, \ldots, w_n)$ and shift $b$ does it lie?"

But when we think about it in these terms, the first thing that should jump to our minds is: Why on earth should a perceptron accept only binary inputs? There's nothing stopping it from accepting real-valued inputs. In fact, the geometric interpretation of what a perceptron is doing is somehow more natural when we allow our inputs to be any real numbers and not just $0$s and $1$s.

OK. Hold that thought while I tell you something else completely unrelated but even cooler.

Any logical sentence (or "circuit") can be built by concatenating perceptrons:

This is what blew my mind when I first heard about these guys.

If you've ever studied elementary logic (maybe you took an elementary proofs course or something), you've probably built yourself a truth table or two. If you haven't, don't worry. I'll tell you what they are and what they do.

(And if you're interested in what they have to do with mathematical proofs, you might like these notes written by Michael Hutchings, which I often use as a first reading assignment whenever I teach Intro to Proofs at Boston College.)

At its core, elementary logic is a language or "calculus" that allows us to specify how to take a finite, ordered tuple of $0$s and $1$s (a so-called binary string), perform a collection of operations on it, and obtain another ordered tuple of $0$s and $1$s. The $4$ basic operations in this language are:

  1. $*_1$ And $*_2$, 
  2. $*_1$ Or $*_2$, 
  3. If $*_1$ Then $*_2$,
  4. Not $*$
The *s above represent inputs. Note that the first three operations take two inputs, and the last takes one.

The binary output or value of each of the statements above is determined by a so-called truth table (it's a bit like a multiplication table). It tells you, e.g., that the output of the statement "$*_1$ And $*_2$" is $0$ unless both $*_1$ and $*_2$ are $1$. This should make sense to you since a statement like "I love cheese and it's 10 AM" is only true if the statements "I love cheese" and "It's 10 AM" are both true.

So the truth table for ($*_1$ And $*_2$) looks like this:

\[\begin{array}{c|cc} \mbox{And} & *_2 = 0 & *_2 = 1\\ \hline *_1 = 0 & 0 & 0\\ *_1 = 1 & 0 & 1\end{array}\]

The inputs $*_1$ and $*_2$ are along the left and the top, and the outputs for the various corresponding combinations are recorded in the interior. You can easily find the truth tables for the other logical operations online (or in Hutchings' notes), so I won't record those here.

OK, great. The point now is that you can model any of these operations using a perceptron.

I'll do "And," then leave the other three as exercises. Here's the picture:



So e.g. if we let $w_1 = w_2 = 10$, and $b = 15$, then the perceptron will return $1$ iff both inputs are $1$, as desired.

Note that there are (infinitely) many choices of weights $w_1, w_2$ and shift $b$ that work to model this "And" perceptron; all you need is a hyperplane separating the point $(1,1)$ from the points $(0,0), (1,0), (0,1) \in \mathbb{R}^2$. This is pretty easy to do, and also gives you an idea how to solve the exercises above once you find their truth tables online!
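Here's that "And" perceptron in a few lines of Python, in case you want a harness for checking your answers to the exercises:

def perceptron(w, b):
    # Output 1 iff the weighted sum of the inputs meets or exceeds the threshold b.
    return lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) >= b)

AND = perceptron([10, 10], 15)  # the weights and threshold from the text
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, AND(x))  # only (1, 1) outputs 1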

Once you've solved the exercises above, you have all the tinker toys you need to build any simple logical statement out of perceptrons (mathematicians also like using quantifiers like "for all" and "there exists," but let's ignore those). Connect the inputs to outputs in the appropriate way, and you've got your statement. For example, here's a perceptron model of the statement

"If ($*_1$ And $*_2$) Then ($*_3$)":



For fun, you might now imagine letting $*_1, *_2, *_3$ represent the truth values of the statements,

  1. "You're happy," 
  2. "You know it," 
  3. "You clap your hands," resp. 
(Credit to Josh Greene for this.)

Then the perceptron circuit will output a $1$ if, e.g., you're sad and you don't know it and you clap your hands, but a $0$ if you're happy and you know it and you don't clap your hands.

And you can go crazy with it. You can, e.g., make a perceptron circuit that looks like this:



N.B.: It looks like our perceptrons have multiple outputs in this picture. They don't. Each perceptron can have multiple inputs but only a single output. The multiple lines emanating from each perceptron in the pic above correspond to sending that perceptron's output to multiple places.

But I hope that the picture of a perceptron circuit I drew poorly above is starting to look a leeeeeetle bit like the pictures of neural networks we've seen in the news.

And Now Back to That Thought we were Holding:

Perceptron circuits are awesome, but they're pretty rigid. Their inputs and outputs are binary. Not to mention that the rule each perceptron uses to decide between outputting a $0$ and a $1$ is a step function: output $1$ if you're on or to one side of a hyperplane, $0$ if you're on the other side.

Perceptrons use step functions


The rigidity and jumpiness of this set-up doesn't seem ideally-suited to learning, does it?

(It's not.)

BUT what if we fuzz things out a bit, like we were doing before? That is, let's allow real-valued inputs and outputs, and replace our step functions with sigmoid functions. In other words, let's replace each perceptron with a sigmoid neuron. In case you haven't already guessed, a sigmoid neuron still has associated to it an $n$-dimensional weight vector $\vec{w} \in \mathbb{R}^n$ and a threshold $b \in \mathbb{R}$. But it allows its $n$-dimensional input vector $\vec{x}$ to live anywhere in $\mathbb{R}^n$, and its output will interpolate between $0$ and $1$ using a sigmoid function. That is, its output will be $\sigma(\vec{w}\cdot \vec{x} - b) \in (0,1)$. Recall that $\sigma(t) = \frac{1}{1+e^{-t}}$.

Sigmoid neurons use sigmoid functions

Sigmoid functions look an awful lot like step functions if you squint hard at them (increasingly so if we use a sigmoid function like $\frac{1}{1+e^{-ct}}$ and let $c \rightarrow \infty$)


Well, by George, we have a neural network. Moreover, it can, in principle, learn logic. At least the elementary models of logic I teach in Math 216.

All we need now is:

  1.  a reasonable cost function,
  2.  a reasonable way of computing its partial derivatives with respect to the weights and thresholds associated to all of the sigmoid neurons,
  3. buttloads of data for it to learn from.
We'll talk more about that next time.

BTW: I should thank the Isaac Newton Institute in Cambridge for its hospitality, since that's where I am at the moment!


















Wednesday, February 1, 2017

Idiot's Guide (written by an actual Idiot)

I still haven't told you what a neural net is or how one works. (Don't worry. We'll get there soon.)

You may be itching to play around with one all the same.

Unfortunately (or maybe fortunately...) you can't build a neural net with paper clips and string in your living room. And I wouldn't recommend writing much code from scratch either, even if you're already an amazing programmer (I'm not).

Luckily, some actual amazing programmers have already done all the hard work for you. Unluckily, finding which people have done it best and actually understanding how to get your hands on what they've done can sometimes feel harder than just doing the damn thing yourself.

(I don't recommend doing the damn thing yourself.)

The purpose of this post is to record roughly what I did to get myself up and running. Maybe provide a link or two or three. The post is almost entirely selfish, because I know I will never remember how to do this unless I write it down somewhere. May as well write it somewhere where it has a chance of helping others too.

User beware: I have close to zero programming experience and even less experience mucking around with Unix (besides using pine as an undergrad--ugh, dates me terribly, I know...). I can't guarantee that the steps below are maximally efficient or even close. All I know is that I can do what I need to do on my computer now. And the recipe I followed is below.

Ingredient list:

1) Python: This is the programming language everything is written in. Don't know anything about this language? That's OK. There are buttloads of resources on-line for learning the syntax and basic structure, etc. When I'm writing a program and google a python syntax question, I often end up here. I also spent a good amount of time at the beginning going through this set of lecture notes. Or you could take one of the 10,000 MOOCs on it if you want more direction.
2) Jupyter (iPython) Notebook: This is a web-based environment that will allow you to write python code and execute it in the notebook so you can troubleshoot your code while you're writing it instead of after you've done it all wrong.
3) TensorFlow: This is the machine learning package developed by Google. People seem to like it! As do I, so far!
4) Keras: This is a front end for TensorFlow (it can also work with Theano as a back end--this is an alternative to TensorFlow that I have heard good things about but never used) that is (precisely) one zillion times more user-friendly. I was dreading building a neural net using TensorFlow until someone (Mark Hughes) told me about Keras.


Instructions (for a Mac, which is what I have):

0) Open a Terminal (look in your Applications folder if you don't know what I mean)
1) Install Anaconda (a python distribution with a package manager called "conda" that can be used to install various other packages) using the link in the "Anaconda Installation instructions" found on the TensorFlow website here. Keep in mind that there are lots of different versions of python, and the syntax actually varies quite a bit (annoyingly) among them. I went ahead and got the most recent version of python 3.
2) Create a conda environment called tensorflow by typing (where you may replace "3.5" with the python version number you are using):
      $ conda create -n tensorflow python=3.5
3) Activate the tensorflow environment by typing:
$ source activate tensorflow 
(tensorflow)$  # Your prompt should change
4) Install tensorflow using conda:
# Linux/Mac OS X, Python 2.7/3.4/3.5, CPU only:
(tensorflow)$ conda install -c conda-forge tensorflow
5) At this point, I followed an amazingly helpful answer on Quora found here to the question "How can I work with Keras on a Jupyter notebook with TensorFlow as a backend?" One needs to install ipython, Jupyter, and Keras (in that order) inside the tensorflow environment by typing:
  1. (tensorflow) username$ conda install ipython
  2. (tensorflow) username$ pip install jupyter
  3. (tensorflow) username$ pip install keras
6) Now deactivate and reactivate the tensorflow environment (not sure why you need to do this):

  1. (tensorflow)username$ source deactivate tensorflow
  2. username$ source activate tensorflow

7)  And open the Jupyter notebook (it will open in a browser: I think Safari is the default, at least that's what opens on my computer):
(tensorflow)username$ jupyter notebook
8)  Once the Jupyter notebook opens, navigate to whichever directory you want to use to store your ipython notebooks and click (upper right) New-->Python3 and you'll be in a Jupyter notebook. You can execute python code by doing a Shift-Return. You might try writing "print('hello, world')" to make sure you've got it. (It should output "hello, world" obvs).

9) To shut everything down, save your notebook, close the browser window, do a Ctrl-c at your terminal and answer "Y" when it asks you whether you want to shut down your Jupyter notebook.

10) You'll be back at the tensorflow prompt, at which point you deactivate the tensorflow environment:
  1. (tensorflow)username$ source deactivate tensorflow

11) Type "Exit" to close your terminal and you're all done.

12) Now whenever you want to open a Jupyter Notebook and use Keras/TensorFlow, you just do:
$ source activate tensorflow 
(tensorflow)$  # Your prompt should change
 then
(tensorflow)username$ jupyter notebook
and once you're all finished with your jupyter notebook:
(tensorflow)username$ source deactivate tensorflow
13) BTW, here's the documentation for Keras. I followed this great tutorial to understand how to build an LSTM (a particular neural network architecture that is good for analyzing sequences--will write more about this later) using Keras. Really cool thing I didn't realize until I did this tutorial: you can use Keras to download interesting data sets (this tutorial uses some imdb movie reviews and classifies them into "positive" or "negative")--I haven't explored which ones.
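And to give you a taste of the payoff, here's roughly what a minimal Keras model looks like once everything above is installed. The architecture and data here are arbitrary placeholders of my own choosing (they're not from the tutorial):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy data: learn the AND function from its four binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

model = Sequential()
model.add(Dense(4, activation='sigmoid', input_dim=2))  # one hidden layer
model.add(Dense(1, activation='sigmoid'))               # one output neuron
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(X, y, epochs=2000, verbose=0)
print(model.predict(X))  # should be roughly 0, 0, 0, 1 (training is stochastic)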

Wednesday, January 11, 2017

How a machine learns (Or: It's the derivative, stupid)


Before I start, let me just say how excited I am about this course I'm sitting in on at MIT this week. I'll try my best to sort of repeat the most interesting tidbits I learn in a future post. I'm starting to worry about how long these posts are taking me, though, so for now, let me just provide a link to this frickin' awesome toy model of a self-driving car. Click the "Run Training" button, and in 10 seconds a neural net (using reinforcement learning) teaches the toy car to "drive" (BTW, the code runs in the browser--I mean WOW.) 

I haven't personally explored the reading material on the Resources tab there, but it all looks useful. Also: the Deep Learning book (Goodfellow, Bengio, Courville) mentioned on the Resources tab was independently recommended by mathematical little bro' Jon Bloom. And two independent recommendations implies it must be good. That's like a theorem, right?

(Umm....no 'tisn't. But I do trust JB's recs about this stuff, so feel confident passing them on.)

OK, back to basics. Last time, I talked about a particular type of problem a machine learning algorithm might want to tackle:

One has a bunch of data points in some high-dimensional parameter space $\mathbb{R}^n$ (where the $n$ parameters are whatever some particular human has decided would be useful to determine the answer to some yes/no question like "Is a particular e-mail spam?"), and these data points have been "pre-classified." That is, each data point has already been assigned a "yes/no" label (e.g., a human has gone through the data somehow and labeled each data point manually as "Yes-spam" or "No-not spam"). Here's a picture of what this might look like in a 2-dimensional parameter space (Red = "Y", Pencil = "N"). Sorry this is so low-tech, but I'm just a mathematician:



Now suppose we (humans) guess that the data is linearly separable, that is, we guess that the decision boundary between the "Y" region and the "N" region of the parameter space is just a hyperplane determined by some orthogonal vector $\vec{a} \in \mathbb{R}^n$ and shift $b \in \mathbb{R}$. (A reasonable assumption in the silly example pic above.)

The canonical first example of a machine learning algorithm (which you may as well think of as a computer program) is one that will use

  1. our pre-labeled data and
  2. our assumption that the data is linearly separable
to determine the decision boundary. Note that our choice to assume that our decision boundary is a hyperplane tells us that the output of our machine learning algorithm should just be the parameters $\vec{a} \in \mathbb{R}^n$ and $b$ specifying the hyperplane.

OK, so what should the machine do?

Well...what would a human do?

Cost function

That's right! Start by guessing something, then keep trying to improve. Of course, this raises the question, "Improve what?" Without getting too philosophical about it, we need to define some cost function whose input is

  1. our pre-labeled data and
  2. our current guess $\vec{a} \in \mathbb{R}^n$ and $b \in \mathbb{R}$

and whose output is a number that we refer to as the cost associated to the guess $\vec{a}, b$. The cost function should be defined in a reasonable enough way that we really feel it's measuring how well we're doing. (High cost = bad.) It should also treat all data points equally. That is, it should be an average over individual cost functions for each data point. (It also helps if it's easy to compute its partial derivatives with respect to the parameters, but we're getting ahead of ourselves.)

Minor but Important notational digression

Remember we have our space $\mathbb{R}^n$ that parameterizes our data (in our spam example, points $\vec{x}$ in this space represent e-mails). But we also have our space $\mathbb{R}^{n+1}$ that parameterizes the possible separating hyperplanes (points $(b, \vec{a})$ in this space represent shifted hyperplanes in the data-parameterizing space $\mathbb{R}^n$).

At this point, it is notationally convenient to imbed the data-parameterizing space $\mathcal{D} \cong  \mathbb{R}^n$ into $\mathcal{D}' \cong \mathbb{R}^{n+1}$ via the map \[\vec{x} := (x_1, \ldots, x_n) \mapsto \vec{x}' := (1,x_1, \ldots, x_n).\] Doing this imbedding allows us to treat the shift (often called the bias in the literature) $b$ associated to a hyperplane on equal footing with the components $a_i$ of the vector $\vec{a} = (a_1, \ldots, a_n)$ specifying its orthogonal direction.

Now suppose we have a point $\vec{a}' := (a_0, a_1, \ldots, a_n) \in \mathbb{R}^{n+1}$. Then remembering that if $\vec{x} \in \mathcal{D}$ then $\vec{x}' = (1, \vec{x})$ denotes its imbedding as the $1$ level in $\mathcal{D}'$, the hyperplane in $\mathcal{D}$ determined by $\vec{a}'$ is the set of points $\vec{x} \in \mathcal{D}$ for which $\vec{a}' \cdot \vec{x}' = 0$ (this is the hyperplane specified by orthogonal vector $(a_1, \ldots, a_n)$ and shift $-a_0$).

Cross-entropy cost function

A commonly used cost function for a problem of this type is the cross-entropy cost function. Before describing it, let me come clean and admit that because of my lack of prior training in statistics I don't yet feel in my bones why this is the "right" choice. I will say that Chapter 2: Classification and Logistic Regression in Andrew Ng's CS229 lecture notes here helped me, as did the discussion of the cross-entropy cost function in Chapter 3 of Michael Nielsen's Deep Learning book (esp. the animations). I may try to write more about this later when I'm smarter.

Relative ignorance disclosed, I proceed:

Remember that for each data point $\vec{x}$, we have a "Y/N" label. Let's turn these into "1/0" labels. That is, for each data point $\vec{x}$, define
\[y(\vec{x}) = \left\{\begin{array}{cl} 1 & \mbox{if the label on } \vec{x} \mbox{ is ``Y"}\\
0 & \mbox{if the label on } \vec{x} \mbox{ is ``N"}\end{array}\right.\]

Remember also that for a particular choice of parameters $\vec{a}' \in \mathbb{R}^{n+1}$ specifying the separating hyperplane (see notational digression above), the output guessed by the model on a point $\vec{x} \in \mathbb{R}^n$ is \[\sigma(\vec{a}' \cdot \vec{x}') \in (0,1),\] where $\sigma$ is the sigmoid function  \[\sigma(t) = \frac{1}{1 + e^{-t}}.\] We can (and should, I suppose) think of \[h_{\vec{a}'}(\vec{x}) := \sigma(\vec{a}'\cdot \vec{x}')\] as the probability that the label on $\vec{x}$ is "Y" (according to the model specified by the vector of parameters $\vec{a}' = (a_0, a_1, \ldots, a_n)$). 

The cross-entropy "cost" or "loss" associated to hyperplane parameters $\vec{a}'$ and a particular data point $\vec{x}$ is then defined as: \[J_{\vec{a}'}(\vec{x}) := -y(\vec{x})\ln(h_{\vec{a}'}(\vec{x})) - (1 - y(\vec{x})) \ln(1-h_{\vec{a}'}(\vec{x})),\] and if we have $N$ pre-labeled data points $\vec{x}^{(1)}, \ldots, \vec{x}^{(N)}$, we define the cost to be the average of the individual costs:
\[J_{\vec{a}'}\{\vec{x}^{(1)}, \ldots, \vec{x}^{(N)}\} := \frac{1}{N}\sum_{i=1}^N J_{\vec{a}'}(\vec{x}^{(i)}).\]
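In Python, with the imbedded data points stored as the rows of a matrix, the whole thing is a couple of lines (a sketch; the names are mine):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(a_prime, X_prime, y):
    # Rows of X_prime are the imbedded data points x' = (1, x_1, ..., x_n);
    # y holds the 0/1 labels. Returns the average cross-entropy.
    h = sigmoid(X_prime @ a_prime)  # predicted probabilities h_{a'}(x)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))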

Let's pause for a sec, because it's worth noting how much less complicated this function is than it appears at first glance. Remember that $y(\vec{x})$ is only ever $0$ or $1$, so for each data point $\vec{x}^{(i)}$ one of the two terms in the sum drops out and the other is just the negative of the natural log of an expression involving the difference between the actual label (which is either $0$ or $1$) and the predicted label (which is some real number between $0$ and $1$). As far as I can tell, the main reason for the appearance of the natural log here is because:

  1. The more wrong you are, the more you are penalized: Think of the behavior of $-\ln(t)$ for values of t between 0 and 1. Its main feature is that as $t$ goes to $0$, $-\ln(t)$ goes to $\infty$, and as $t$ goes to $1$, $-\ln(t)$ goes to $0$. As a result, when the difference $y(\vec{x}) - h_{\vec{a}'}(\vec{x})$ between the actual label and the predicted label is large (close to $1$), then $1$ minus this difference is close to $0$, so when we take $-\ln$ of this number, we get something huge. The hugeness goes to $0$ as the discrepancy between the actual label and the predicted label goes to $0$.
  2. Having $\ln$ in the expression makes it easier to take partial derivatives of the cost function: Read the next section to understand why this is.


Gradient descent

Now that we have a way of measuring how well we've done for a particular choice of parameters $\vec{a}'$, calculus tells us how we might do a bit better (at least nearby).

Remember we have our space $\mathcal{D} = \mathbb{R}^n$ that parameterizes our data, and we have our space $\mathbb{R}^{n+1}$ that parameterizes the possible separating hyperplanes. We now want to think of our cost function as a function from our hyperplane-parameterizing space to $\mathbb{R}$ and locate the point(s) in our hyperplane-parameterizing space that minimize cost.

Here is where calculus helps us. Remember that if we have a real-valued smooth function of several parameters and we are interested in finding an absolute minimum of the function (and our domain is boundary-less), we know for sure that this absolute minimum will occur at a local minimum of the function. And every local minimum of the function will in particular have the property that all of the partial derivatives of the function vanish there.

This tells us that to find an absolute minimum of the cost function we should look for places where the partial derivatives of the cost function vanish (aka critical points of the function). As a first pass, this is a great idea as long as we remember that not every critical point is a local minimum, and not every local minimum is an absolute minimum. If you like Morse theory, you don't need to be convinced of those two statements. If you don't yet like Morse theory, the following two pics may convince you (again, low tech--sorry):

A portion of the graph of a real-valued function whose domain is 2-dimensional.
The partial derivatives at the black dot vanish, but the black dot is not a local minimum.


A portion of the graph of a real-valued function whose domain is 1-dimensional.
The black dot is a local minimum but not an absolute minimum.

Since it is not always easy (read: computationally advantageous) to find the critical points of a cost function directly, what is done in practice is to compute the gradient of the cost as a function on the hyperplane-parameterizing space: \[\nabla(J) := \left(\frac{\partial J}{\partial a_0}, \frac{\partial J}{\partial a_1}, \ldots, \frac{\partial J}{\partial a_n}\right)\] and take small-ish steps in the negative gradient direction until your gradient gets close enough to $\vec{0}$ that you're pretty darned sure you're close to a critical point. This is called gradient descent, and it's what puts the "learning" in "machine learning." To first order, it's really that simple.

(WARNING: Keep your wits about you. The gradient of the cost function is computed with respect to the hyperplane-parameterizing space NOT the data-parameterizing space. The machine is trying to learn the best hyperplane to fit the data, which it is treating as a constant.)

Of course, one of the big problems with gradient descent is that unless you happen to know in advance that your function has only one critical point and that it is definitely a local minimum, then the result of performing gradient descent will not necessarily find you a global minimum. I'll probably try to talk about this at greater length later. Let's ignore it for now, compute the gradient of the cross-entropy cost function, and call it a day.

Gradient of cross-entropy cost function (+ a useful property of the sigmoid function)

Now we just dust off the ole' chain rule, keeping in mind the following useful observation about the sigmoid function $\sigma(t) := \frac{1}{1 + e^{-t}}$: \[\frac{d\sigma}{dt} = \sigma(t)(1 - \sigma(t)).\] This fact also uses the chain rule in the first step. The rest is just algebra:
\begin{eqnarray*}
 \frac{d\sigma}{dt} &=& \frac{e^{-t}}{(1+e^{-t})^2}\\
 &=& \frac{1}{1+e^{-t}}\cdot\frac{(1+e^{-t}) - 1}{1 + e^{-t}}\\
 &=& \sigma(t)(1 - \sigma(t))
\end{eqnarray*}

Now it will turn out that \[\frac{\partial J}{\partial a_j} = \frac{1}{N} \sum_{i=1}^N (h_{\vec{a}'}(\vec{x}^{(i)}) - y(\vec{x}^{(i)}))x_j^{(i)},\] where we recall that $h_{\vec{a}'}(\vec{x}) := \sigma(\vec{a}' \cdot \vec{x}')$ (and where $x_j^{(i)}$ denotes the $j$th coordinate of the imbedded point $(\vec{x}^{(i)})'$, so $x_0^{(i)} = 1$). This derivation has some slightly non-obvious steps, so I include it here. Note that the partial derivative of a sum is the sum of partial derivatives, so we just need to verify that the partial derivative $\frac{\partial J}{\partial a_j}$ associated to a particular data point $\vec{x}^{(i)}$ has the form we see inside the summation above. If $y(\vec{x}^{(i)}) = 1$, the chain rule and the fact above about the derivative of the sigmoid function tell us this is:

\begin{eqnarray*}
 &=& -y(\vec{x}^{(i)}) \cdot \frac{1}{\sigma(\vec{a}'\cdot(\vec{x}^{(i)})')}\cdot \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')\cdot (1 - \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')) \cdot x_j^{(i)}\\
 &=& y(\vec{x}^{(i)}) \cdot (\sigma(\vec{a}'\cdot(\vec{x}^{(i)})') - 1) \cdot x_j^{(i)}\\
&=& (h_{\vec{a}'}(\vec{x}^{(i)}) - y(\vec{x}^{(i)})) \cdot x_j^{(i)}
\end{eqnarray*}

where in the last step we are using the assumption that $y(\vec{x}^{(i)}) = 1$. Similarly, if $y(\vec{x}^{(i)}) = 0$, the summand reads:

\begin{eqnarray*}
 &=& (1-y(\vec{x}^{(i)})) \cdot \frac{1}{1 - \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')}\cdot \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')\cdot (1 - \sigma(\vec{a}'\cdot(\vec{x}^{(i)})')) \cdot x_j^{(i)}\\
 &=& (1- y(\vec{x}^{(i)})) \cdot \sigma(\vec{a}'\cdot(\vec{x}^{(i)})') \cdot x_j^{(i)}\\
&=& (h_{\vec{a}'}(\vec{x}^{(i)}) - y(\vec{x}^{(i)})) \cdot x_j^{(i)}
\end{eqnarray*}

where again in the last step, we are using the assumption that $y(\vec{x}^{(i)}) = 0$.
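Putting all of the pieces together, here's a sketch in Python of the whole learning loop (the data, learning rate, and step count are placeholders I made up):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_hyperplane(X, y, lr=0.5, steps=5000):
    # Logistic regression by gradient descent, using the gradient derived above.
    Xp = np.hstack([np.ones((len(X), 1)), X])  # imbed each x as x' = (1, x)
    a = np.zeros(Xp.shape[1])                  # initial guess for a'
    for _ in range(steps):
        h = sigmoid(Xp @ a)                    # predicted labels h_{a'}(x^(i))
        a -= lr * Xp.T @ (h - y) / len(y)      # step in the negative gradient direction
    return a

# Toy pre-labeled data in R^2, linearly separable by design.
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_hyperplane(X, y))  # returns a' = (a_0, a_1, ..., a_n); the hyperplane is a'.x' = 0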

OK, I'm tired now and probably you are too. I'm also sure the equations here are riddled with sign errors and other less forgivable mistakes. Please tell me if you notice any (forgivable or not). I'll try to have fewer equations in later posts.