Monday, August 28, 2017

Neural Networks, Finite State Machines, and Binary Encoding by Affine Hyperplanes

I'll focus today on the relationship between neural networks and finite state machines (aka finite state automata, or FSAs for short). Why? Interweb searches have informed me that the ability of a neural network to "learn" how to model finite state machines is one reason people first got excited about neural networks. Even better, FSAs are reasonably robust models for computation. If you build one big enough, you have yourself something that looks (to first order) like a sentient being. Not a very smart sentient being, but it'll do as a start...

Before I begin, I want to mention a couple of people with whom I've been chatting semi-regularly about related stuff: Jesse Johnson and Tony Licata. They've both got a lot of other interesting things going on in their mathematical lives, but I've really benefitted from their turning their attention in this direction on occasion.

I will also link to a number of sources I found when I googled around about this. I am 100% sure I am missing some of the important references, since I am not an expert in this area. OTOH, if you're looking for other important references, they are very likely to be among the ones cited in the sources below. The point here is that none of these ideas are new to the world. They are just new to me:



Finite State Automata

OK, what's a finite state automaton? Informally, it's a collection of (finitely many) states and (finitely many) rules for transitioning between them.

More precisely/formally, the core structure of a finite state automaton is a triple $(\mathcal{S},\mathcal{T}, \mu)$, where

  • $\mathcal{S} = \{s_1, \ldots, s_N\}$ is a set of finitely many states,
  • $\mathcal{T} = \{t_1, \ldots, t_M\}$ is a set of finitely many transitions, and
  • $\mu: \mathcal{S} \times \mathcal{T} \rightarrow \mathcal{S}$ is a function that assigns to a pair $(s_i, t_j)$ the result of applying transition $t_j$ to state $s_i$.
(The finite state automata one encounters in nature may have a bit more structure, e.g. a distinguished initial state and maybe an output function or something like that. But the core is what's described above.)
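For concreteness, here's a minimal sketch of how one might store such a triple in Python. The names (states, transitions, mu, step) and the particular two-state toy machine are mine, not anything standard:

```python
# A minimal sketch of the triple (S, T, mu): a set of states, a set of
# transitions, and the transition function stored as a lookup table keyed
# by (state, transition) pairs.  The names and the toy machine are mine.

states = {"s1", "s2"}          # the finite set S
transitions = {"t1", "t2"}     # the finite set T

# mu : S x T -> S
mu = {
    ("s1", "t1"): "s2",
    ("s1", "t2"): "s1",
    ("s2", "t1"): "s1",
    ("s2", "t2"): "s2",
}

def step(state, transition):
    """Apply a single transition to a state."""
    return mu[(state, transition)]
```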

One common and natural way to describe a finite state automaton with $N$ states and $M$ transitions is via a directed, labeled graph constructed as follows.

  • The vertices of the graph are labeled by the states, $s_1, \ldots, s_N$. 
  • For each $i \in \{1, \ldots, N\}$ there are $M$ directed edges with tails on the vertex $s_i$.
  • These $M$ edges are labeled $t_1, \ldots, t_M$, and the head of the edge labeled $t_j$ with tail on $s_i$ is the vertex labeled $\mu(s_i,t_j)$.
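In code, this labeled directed graph is nothing more than the list of triples (tail, label, head), i.e. $(s_i, t_j, \mu(s_i,t_j))$. A sketch (the function name fsa_graph is mine):

```python
def fsa_graph(states, transitions, mu):
    """Return the labeled directed edges (tail, label, head) of the FSA graph."""
    return [(s, t, mu[(s, t)]) for s in states for t in transitions]
```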

Let's come up with an example, so this isn't so boringly abstract. I'll describe an FSA modeling the finitely many emotional states of an extremely simple (but...let's face it...sort of interesting) human named Charlotte with extremely simple (but also...let's face it...sort of interesting) reactions to time spent with her two friends Alice and Bob.

The set of Charlotte's emotional states is $\{\mbox{Happy}, \mbox{Mad}, \mbox{Confused}, \mbox{Sad}\}$, and the set of transitions is $\{\mbox{spend time with Alice}, \mbox{spend time with Bob}\}$. Spending time with Alice generally makes Charlotte mad, and spending time with Bob generally makes Charlotte confused. That is, unless she spends time with the same friend twice in a row before seeing the other. If that happens, Charlotte becomes sad, and any further time spent with Alice or Bob does nothing for her. Here's the FSA graph modeling this situation:




Poor Charlotte. She needs some new friends.

OK, we have an example. But what can a finite state automaton do? Well, I'm gonna be honest...nothing too terribly exciting beyond telling you where you'll go next if you are at state $s_i$ and transition $t_j$ happens. On the other hand, a finite state automaton does provide one framework for solving decision problems.

How? Well, given an ordered list of transitions (and a state to start from), it can calculate the ordered list of states it passes through. This provides a means of modeling a decision problem whose inputs are finite-length strings in a finite alphabet. Simply designate some subset of the states as "Accept" states and the remainder as "Fail" states. Now output a "1" if the end state is an accept state and a "0" if it's a fail state.
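In code, this decision procedure is a short loop: start somewhere, follow the transitions one by one, and report whether you land in an accept state. Here's a sketch; the function names run_fsa and accepts are mine, and the designated start state is an extra input, since the bare triple $(\mathcal{S},\mathcal{T},\mu)$ doesn't come with one:

```python
def run_fsa(mu, start, transition_string):
    """Return the ordered list of states visited, starting from `start`."""
    visited = [start]
    for t in transition_string:
        visited.append(mu[(visited[-1], t)])
    return visited

def accepts(mu, start, accept_states, transition_string):
    """Output 1 if the final state is an accept state, 0 otherwise."""
    return 1 if run_fsa(mu, start, transition_string)[-1] in accept_states else 0
```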

Going back to the example of Charlotte's FSA, someone (her therapist?) might choose to designate "Sad" as the only fail state, and the other 3 states as accept states. It's now not so hard to see that given a finite-length string describing Charlotte's interactions with Alice and Bob, the FSA will output a $1$ if and only if it's a non-stuttering string (i.e., one in which no letter appears twice in a row).

So, e.g., the string $ABABABABA$ will output a $1$ and the string $BBABA$ will output a $0$.
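As a sanity check, here's Charlotte's transition table spelled out and the two strings above run through the accepts sketch from a few paragraphs up. Writing "A" for "spend time with Alice" and "B" for "spend time with Bob" matches the strings above; taking "Happy" as the start state (Charlotte hasn't seen anyone yet) is my assumption, since the FSA as described doesn't single one out:

```python
# Charlotte's transition function mu, with "A" = spend time with Alice and
# "B" = spend time with Bob.  Starting from "Happy" is my assumption.
charlotte_mu = {
    ("Happy", "A"): "Mad",       ("Happy", "B"): "Confused",
    ("Mad", "A"): "Sad",         ("Mad", "B"): "Confused",
    ("Confused", "A"): "Mad",    ("Confused", "B"): "Sad",
    ("Sad", "A"): "Sad",         ("Sad", "B"): "Sad",
}
accept_states = {"Happy", "Mad", "Confused"}   # "Sad" is the only fail state

print(accepts(charlotte_mu, "Happy", accept_states, "ABABABABA"))  # -> 1
print(accepts(charlotte_mu, "Happy", accept_states, "BBABA"))      # -> 0
```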

BTW, while we're on the subject: finite state automata appear to play a prominent role in the theoretical study of how humans use and interpret language. And I should also note that this study overlaps with some areas of great interest to theoretical mathematicians. For example, the group theorists among you may notice that Poor Charlotte's FSA is precisely the one we'd use to answer the question of whether a word representing an element of the group $\mathbb{Z}/2\mathbb{Z} * \mathbb{Z}/2\mathbb{Z}$ is reduced or unreduced. That's not so hard to see. I learned about it from this amazing book.

Neural Networks as Functions: Basic Structure

The meaning of the words "neural network" depends very much on who is uttering them, and in what context.

When I say those words here, I will mean a function $F: \mathbb{R}^n \rightarrow \mathbb{R}^{n'}$ between two vector spaces (possibly of different dimensions) that is realized as a composition, $F = F_m \circ \cdots \circ F_1 \circ F_0$, of finitely many functions $F_k: \mathbb{R}^{n_k} \rightarrow \mathbb{R}^{n_{k+1}}$, each of which might be described informally as a smoothed binary encoding by affine hyperplanes.

I'll unpack that slowly.

Let me start with the kind of drawing of a neural network one often sees in the news, since it will organize us:



We represent a neural network by a bunch of nodes (neurons) and lines (connections). The nodes are arranged in columns called layers. The hidden layers (those between the input layer on the left and the output layer on the right) tell us how many smoothed binary encoding functions we need to compose to describe the function (one function for each pair of adjacent layers, so one more than the number of hidden layers), and the number of nodes in each layer tells us the dimensions of the vector spaces we encounter along the way.

In the picture I drew above, the neural network is a function $F: \mathbb{R}^3 \rightarrow \mathbb{R}^5$, since the left-most layer has 3 nodes, and the right-most layer has 5. Moreover,

$F = F_2 \circ F_1 \circ F_0,$

where

  • $F_0: \mathbb{R}^3 \rightarrow \mathbb{R}^2$
  • $F_1: \mathbb{R}^2 \rightarrow \mathbb{R}^4$
  • $F_2: \mathbb{R}^4 \rightarrow \mathbb{R}^5$.

Now that we understand the basic structure of a neural network, let's focus on the functions $F_k$. The right way to think about them is as smoothed versions of binary encoding functions, where the encoding is done with reference to a collection of affine hyperplanes.

Let me start by saying what I mean by a binary encoding function by affine hyperplanes, then explain why this is the right picture to have in mind to start understanding neural networks.

Binary Encoding by Affine Hyperplanes

First, when I say an affine hyperplane in a vector space $\mathbb{R}^n$, I mean just a flat subset of $\mathbb{R}^n$ of dimension $n-1$ (e.g., a line if $n=2$, a plane if $n=3$) that doesn't necessarily pass through the origin, $\vec{0} \in \mathbb{R}^n$. In other words, it's a linear subspace of $\mathbb{R}^n$ to which a translation has perhaps been applied.

If you've recently taken a course in linear algebra (or if you've read one of my earlier posts on this blog, e.g. this one), you know that an affine hyperplane is always (non-uniquely) describable as the solution set of a single linear equation. That is, fix a nonzero vector $\vec{a} \in \mathbb{R}^n$ and a number $b \in \mathbb{R}$. Then an affine hyperplane is describable as the set of $\vec{x} \in \mathbb{R}^n$ satisfying $\vec{a} \cdot \vec{x} = b$.

One important point is that any affine hyperplane in $\mathbb{R}^n$ divides its complement into two pieces. We can see this easily from the description of the hyperplane as the set of solutions to a linear equation. The points on the hyperplane defined by $\vec{a} \in \mathbb{R}^n, b \in \mathbb{R}$ are those $\vec{x} \in \mathbb{R}^n$ satisfying $\vec{a} \cdot \vec{x} = b$. The remainder of the vector space is separated into:

$H_{(\vec{a},b)}^- := \{\vec{x} \in \mathbb{R}^n \,\,|\,\, \vec{a}\cdot \vec{x} < b\}$ and
$H_{(\vec{a},b)}^+ := \{\vec{x} \in \mathbb{R}^n \,\,|\,\, \vec{a}\cdot \vec{x} > b\}$.

Here's an example of what I mean, for the hyperplane in $\mathbb{R}^2$ described as the set of points $\vec{x} \in \mathbb{R}^2$ satisfying $(1,1)\cdot \vec{x} = 1$:



Notice that we're repeating (in slightly different language) some things we said in earlier posts. Namely, an affine hyperplane $H$ linearly separates its complement into two pieces. So we have a natural binary function on the complement:

$\theta: \mathbb{R}^n \setminus H \rightarrow \{0,1\}$

which assigns a $0$ to points in $H^-$ and a $1$ to points in $H^+$. Great!

Now, suppose you have $m$ (pairwise distinct) hyperplanes $H_1, \ldots, H_m$ in $\mathbb{R}^n$, and you've chosen a pair $(\vec{a}_i,b_i)$ describing each. Then you can define a multi-dimensional binary encoding function

$\Theta: \mathbb{R}^n \setminus (H_1 \cup \ldots \cup H_m) \rightarrow \{0,1\}^m$ 

from the complement of the union of the hyperplanes to the set of binary $m$-tuples. Just define the $i$th component of the function to be 
  • $0$ if $\vec{a}_i \cdot \vec{x} < b_i$, and
  • $1$ if $\vec{a}_i \cdot \vec{x} > b_i$. 
Moreover, the value this function assigns to a vector $\vec{x} \in \mathbb{R}^n$ depends only on which connected component of the complement of the hyperplanes $\vec{x}$ lies in.
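In numpy this encoding is essentially a one-liner: stack the vectors $\vec{a}_i$ as the rows of a matrix and compare. A sketch (the name binary_encode is mine, and the second hyperplane in the example is just something I added alongside the $(1,1)\cdot \vec{x} = 1$ line from the earlier picture):

```python
import numpy as np

def binary_encode(A, b, x):
    """Binary encoding of x by the affine hyperplanes a_i . x = b_i.

    A is an (m, n) matrix whose rows are the vectors a_i, b is a length-m
    vector of the constants b_i, and x is a point of R^n lying on none of
    the hyperplanes.  The i-th output is 1 if a_i . x > b_i and 0 if < b_i.
    """
    return (A @ x > b).astype(int)

# Example: two lines in R^2, namely x + y = 1 and x - y = 0.
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 0.0])
print(binary_encode(A, b, np.array([2.0, 0.5])))   # -> [1 1]
```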

Here's an example of a binary encoding by 5 affine hyperplanes in $\mathbb{R}^2$. Rather than specifying an equation for each affine hyperplane, I've simply chosen a so-called co-orientation for each (an arrow that chooses which of the two sides is assigned a $1$ in the binary encoding):



Phew! OK, that took longer than expected to explain. But it's a simple idea, right?

Smoothing a Binary Encoding by Affine Hyperplanes

The maps between layers of a neural network are really smoothed versions of this binary encoding function. Rather than having a step function along the hyperplanes for each component of the function ($0$ on one side, $1$ on the other):



you have the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, which interpolates between $0$ and $1$ (but never actually achieves either value):




as one passes from one side of a hyperplane to the other.
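For reference, here's the sigmoid as a two-line function, with a few sample values showing how it smooths out the 0/1 step:

```python
import numpy as np

def sigmoid(x):
    """The logistic sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))   # ~0.0067, 0.5, ~0.9933
```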

Indeed, this is precisely what the function between adjacent layers of a neural network is. Each component of the function $F_k: \mathbb{R}^{n_k} \rightarrow \mathbb{R}^{n_{k+1}}$ from the $k$th to the $(k+1)$st layer is of the form $\sigma(\vec{a} \cdot \vec{x} - b)$ for some $\vec{a} \in \mathbb{R}^{n_k}, b \in \mathbb{R}$. 

So that's what each layer of a neural network is: a smoothed version of a binary encoding function by affine hyperplanes. This is a (maybe) different way of understanding stuff we discussed before. Woot!
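Putting the pieces together, here's a sketch of a single layer map $\vec{x} \mapsto \sigma(W\vec{x} - \vec{b})$ (each row of $W$, together with the corresponding entry of $\vec{b}$, contributes one component $\sigma(\vec{a}\cdot\vec{x} - b)$), composed using the layer dimensions $3 \rightarrow 2 \rightarrow 4 \rightarrow 5$ from the picture above. The random weights are placeholders standing in for whatever a trained network would actually use:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(W, b):
    """Return the map x -> sigmoid(W x - b): a smoothed binary encoding by
    the affine hyperplanes given by the rows of W and the entries of b."""
    return lambda x: sigmoid(W @ x - b)

rng = np.random.default_rng(0)
dims = [3, 2, 4, 5]            # layer sizes from the picture

# One layer map F_k per consecutive pair of layers, with placeholder weights.
layers = [layer(rng.standard_normal((dims[k + 1], dims[k])),
                rng.standard_normal(dims[k + 1]))
          for k in range(len(dims) - 1)]

def F(x):
    """The whole network: F = F_2 o F_1 o F_0."""
    for f in layers:
        x = f(x)
    return x

print(F(np.array([1.0, -0.5, 2.0])).shape)   # -> (5,)
```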

A (Recurrent) Neural Network for Poor Charlotte's FSA

Now suppose we are trying to model a finite state automaton with $N$ states and $M$ transitions in some natural way via a neural network. The right thing to aim for is a way of representing our states by points $s_1, \ldots, s_N$ in a vector space, $\mathbb{R}^k$, and our transitions by points $t_1, \ldots, t_M$ in a vector space, $\mathbb{R}^{\ell}$. (Where in both cases the points are sort of "fuzzy." Maybe it's better to think of them as small-ish open balls centered on points in a vector space.) We should then be looking for a function

$F: (\mathbb{R}^k \times \mathbb{R}^\ell) \rightarrow \mathbb{R}^k$ satisfying:

  • $F$ is a composition of smoothed binary encoding functions
  • $F(s_i,t_j) = \mu(s_i,t_j)$.

Note that we're abusing notation in the second line above. The inputs to the function $F$ on the left are "fuzzy" points in $\mathbb{R}^k \times \mathbb{R}^\ell$, while the inputs to the function $\mu$ on the right are the formal states and transitions defining the FSA.

Now notice that if we're given an ordered list of transitions, these can be fed into the neural network above in sequence. Moreover, the output of the neural network at one stage becomes part of the input to the neural network at the next stage. People call this type of gadget a recurrent neural network.
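Here's a sketch of that recurrent loop, assuming we already have some network F in hand that takes the concatenation of a state vector and a transition vector and returns the next state vector (the function name run_recurrently is mine):

```python
import numpy as np

def run_recurrently(F, initial_state, transition_vectors):
    """Feed a sequence of transition vectors through F, carrying the state.

    F is assumed to map the concatenation of the current state vector and a
    transition vector to the next state vector.
    """
    state = initial_state
    visited = [state]
    for t in transition_vectors:
        state = F(np.concatenate([state, t]))
        visited.append(state)
    return visited
```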

Example! Example!

OK. Now we're ready to describe a recurrent neural network model for Poor Charlotte's FSA. Recall that Poor Charlotte's $4$ states are $\{\mbox{Happy}, \mbox{Mad}, \mbox{Confused}, \mbox{Sad}\}$ and her $2$ transitions are $\{\mbox{Spend time with Alice}, \mbox{Spend time with Bob}\}$.

Let's encode the states (in the same order) in $\mathbb{R}^2$ as $\{(0,0), (1,0), (0,1), (1,1)\}$ and the transitions (again, in the same order) in $\mathbb{R}^2$ as $\{(1,0), (0,1)\}$.

Then the binary-encoded version of Poor Charlotte's FSA is the function $F: \mathbb{R}^2 \oplus \mathbb{R}^2 \rightarrow \mathbb{R}^2$ that assigns the following values to the encoded (state, transition) pairs:

$(0,0,1,0) \mapsto (1,0)$
$(0,0,0,1) \mapsto (0,1)$
$(1,0,1,0) \mapsto (1,1)$
$(1,0,0,1) \mapsto (0,1)$
$(0,1,1,0) \mapsto (1,0)$
$(0,1,0,1) \mapsto (1,1)$
$(1,1,1,0) \mapsto (1,1)$
$(1,1,0,1) \mapsto (1,1)$

Here's a diagram of the encoded FSA, viewed as a directed labeled graph:




Now I claim that we can find $2$ affine hyperplanes in $\mathbb{R}^4$ that separate the "data" in the first column (inputs) as described in the second column (outputs). To prove this, we just need to argue that we can find $a_1, a_2, a_3, a_4,b \in \mathbb{R}$ such that:

\begin{eqnarray*}
a_3 &>& b\\
a_4 &<& b\\
a_1 + a_3 &>& b\\
a_1 + a_4 &<& b\\
a_2 + a_3 &>& b\\
a_2 + a_4 &>& b\\
a_1 + a_2 + a_3 &>& b\\
a_1 + a_2 + a_4 &>& b
\end{eqnarray*}

and $c_1, c_2, c_3, c_4, d \in \mathbb{R}$ such that:

\begin{eqnarray*}
c_3 &<& d\\
c_4 &>& d\\
c_1 + c_3 &>& d\\
c_1 + c_4 &>& d\\
c_2 + c_3 &<& d\\
c_2 + c_4 &>& d\\
c_1 + c_2 + c_3 &>& d\\
c_1 + c_2 + c_4 &>& d
\end{eqnarray*}

This is something you can check by hand in this case by, e.g., trying to plot numbers satisfying the above inequalities on a number line and seeing whether you succeed or fail. Here are two plots that look like they work for these two collections of inequalities:



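Since the plots are just one way to eyeball a solution, here's a numpy check of one concrete choice that seems to satisfy all sixteen inequalities; these particular numbers are mine and not necessarily the ones in the plots: $(a_1,a_2,a_3,a_4) = (1,3,1,-2)$ with $b = 0$, and $(c_1,c_2,c_3,c_4) = (3,1,-2,1)$ with $d = 0$.

```python
import numpy as np

# The eight encoded (state, transition) inputs and their target outputs,
# copied from the table above.
X = np.array([
    [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 1, 0], [1, 0, 0, 1],
    [0, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0], [1, 1, 0, 1],
])
Y = np.array([
    [1, 0], [0, 1], [1, 1], [0, 1],
    [1, 0], [1, 1], [1, 1], [1, 1],
])

# One candidate pair of co-oriented hyperplanes in R^4 (my choice):
# rows are (a_1, a_2, a_3, a_4) and (c_1, c_2, c_3, c_4).
W = np.array([[1.0, 3.0, 1.0, -2.0],
              [3.0, 1.0, -2.0, 1.0]])
thresholds = np.array([0.0, 0.0])   # b and d

predictions = (X @ W.T > thresholds).astype(int)
print(np.array_equal(predictions, Y))   # -> True
```

Replacing the hard comparison with the sigmoid, i.e. taking $F(\vec{x}) = \sigma(W\vec{x})$, then gives a one-layer recurrent network whose outputs approximate the table above, and scaling $W$ up pushes those outputs as close to $0$ and $1$ as we like.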
It's not a general method, true. But it at least gets us started thinking about how this might go in general. Farkas' Lemma (sometimes called Gale's Theorem) seems like the right result to use to understand this linear separability question in general, so maybe I'll say more about this later if I have the inclination. I learned of it (though not quite yet about it) from this awesome set of lecture notes on-line. And, of course, none of this tells us what to do when the binary-encoded data is not linearly separable. Handling that case would necessarily require a deeper recurrent neural network, i.e. one with more layers.

A topic for another time.