We're focusing not as much on the practical computational aspects of neural networks as on the theoretical computer science/mathematical aspects. It's not that the practical aspects aren't important (they are, of course). It's just that at the moment I want to learn about the theoretical aspects.
And I do what I want. (Not really, but doesn't that sound fun?)
Before I start in on the math, let me just say: Man, it's nice to be talking regularly about this stuff with actual people in real life. Much of the perspective I take below (especially the connection to classical Morse theory) arose during chats with Nate B.
My goal for today is to convince you of the following cool fact.
Cool Fact 1 (CF1): Let $m,n \in \mathbb{Z}^+$. Any boolean function $f: \{0,1\}^n \rightarrow \{0,1\}^m$ can be approximated arbitrarily well by a finite feedforward neural network with sigmoid activation functions.
More precisely, there exists a neural network with an $n$-node input layer and an $m$-node output layer whose associated smooth function $F: \mathbb{R}^n \rightarrow (0,1)^m$ can be extended continuously to a function $F: \mathbb{R}^n \rightarrow [0,1]^m$ that (approximately!) agrees with $f$ on the vertices, $\{0,1\}^n$, of the unit $n$-cube.
I should note first of all that this fact has been known for decades. Indeed, it is implied by the much harder universal approximation theorem for sigmoid feedforward neural networks (the google tells me this is attributed to Cybenko and Funahashi, who gave independent proofs both published in 1989). The universal approximation theorem says that any smooth function from $(0,1)^n \rightarrow (0,1)^m$ can be approximated arbitrarily closely by a sufficiently large feedforward neural network with sigmoid activation functions.
What I like about the more modest statement in bold (CF1) above is that
- (CF1) is enough to tell you that, in principle, a sufficiently large trained neural network can at least do the sorts of calculations a computer program can. After all, at heart these calculations are nothing more than boolean functions.
- Finding a sensible way to extend a boolean function from the vertices of an $n$-cube to its interior is a beautiful thing to do, since you are essentially finding a probabilistic function that extends a deterministic one. That is, you're replacing binary ("Yes/No") inputs/outputs with inputs/outputs in the interval $(0,1)$ ("maybe Yes, with probability $p \in (0,1)$").
- Understanding why (CF1) is true gets you a long way towards a rudimentary picture of how a neural network is "reasoning". And that's what we're after.
As a side note, I have to say that I also like (CF1) for personal reasons. I'm a low-dimensional topologist by training, so I'm fond of handlebodies.
We can think of (CF1) as a neural network incarnation of the following fact. If you have a background in Morse theory, it should be straightforward to supply a proof that does not appeal to (CF1):
Cool Fact 2 (CF2): Let $m,n \in \mathbb{Z}^+$, let $B^n$ be the unit ball, and let $\mathcal{P}$ be a set of finitely many points on $\partial(B^n) = S^{n-1}$, the boundary of $B^n$. Suppose $\mathcal{S}_1, \ldots, \mathcal{S}_m$ are $m$ subsets of $\mathcal{P}$. Then there exists a continuous function $f=(f_1, \ldots, f_m): B^n \rightarrow [0,1]^m$ that satisfies the property that $f_i$ sends $\mathcal{S}_i$ to $1$ and $\mathcal{P}\setminus \mathcal{S}_i$ to $0$ for all $i$. Moreover, each $f_i$ can be assumed to be Morse.
Here's an example in the case $n=2$ and $m=1$:
If we note that $B^n$ is homeomorphic to the unit $n$-cube, $[0,1]^n$, and we let $\mathcal{P}$ be the set of its vertices, $\{0,1\}^n \subseteq \partial([0,1]^n)$, (CF1) is saying that we can construct the function in (CF2) using a neural network.
Now that we're feeling motivated, let's get to it.
We can think of (CF1) as a neural network incarnation of the following fact. If you have a background in Morse theory, it should be straightforward to supply a proof that does not appeal to (CF1):
Cool Fact 2 (CF2): Let $m,n \in \mathbb{Z}^+$, let $B^n$ be the unit ball, and let $\mathcal{P}$ be a set of finitely many points on $\partial(B^n) = S^{n-1}$, the boundary of $B^n$. Suppose $\mathcal{S}_1, \ldots, \mathcal{S}_m$ are $m$ subsets of $\mathcal{P}$. Then there exists a continuous function $f=(f_1, \ldots, f_m): B^n \rightarrow [0,1]^m$ that satisfies the property that $f_i$ sends $\mathcal{S}_i$ to $1$ and $\mathcal{P}\setminus \mathcal{S}_i$ to $0$ for all $i$. Moreover, each $f_i$ can be assumed to be Morse.
Here's an example in the case $n=2$ and $m=1$:
If we note that $B^n$ is homeomorphic to the unit $n$-cube, $[0,1]^n$, and we let $\mathcal{P}$ be the set of its vertices, $\{0,1\}^n \subseteq \partial([0,1]^n)$, (CF1) is saying that we can construct the function in (CF2) using a neural network.
Now that we're feeling motivated, let's get to it.
There are only four things we need to understand for everything to fall into place.
- Every boolean function $\{0,1\}^n \rightarrow \{0,1\}$ can be realized as a composition involving ANDs, NOTs, and ORs. (In fact, all you really need is NOT AND, often abbreviated NAND.)
- ANDs, NOTs, and ORs can all be modeled by perceptrons.
- Perceptrons can be approximated arbitrarily closely by sigmoid neurons.
- An $m$-output neural network can be constructed by stacking $m$ independent $1$-output neural networks.
In a certain sense, the first three points are the takeaways from my previous post about perceptrons. I definitely didn't appreciate the implications at that time, though, so I'll start by re-explaining some things quickly, then put it all together at the end.
Thing 1: Every $1$-output boolean function is a composition of ANDs, NOTs, and ORs
You can find this statement in the first chapter of any electrical engineering textbook, because it's the foundational fact underpinning circuit design. What does it mean?
Formally: Given any $n$-input, $1$-output boolean function
$f: \{0,1\}^n \rightarrow \{0,1\}$
(there are $2^{2^n}$ possible such functions) we can realize $f$ as a composition of AND, NOT, & OR functions on the inputs. Recall:
- The "AND" function takes two binary inputs and outputs a $1$ iff both inputs are $1$s. That is, if we have two binary inputs $*_1 \in \{0,1\}$ and $*_2 \in \{0,1\}$, then $(*_1 \mbox{ AND } *_2) = 1$ iff $*_1 = *_2 = 1$.
- The "OR" function takes two binary inputs and outputs a $0$ iff both inputs are $0$. That is, $(*_1 \mbox{ OR } *_2) = 0$ iff $*_1 = *_2 = 0$.
- The "NOT" function takes a single binary input and flips it. So $\mbox{NOT}(*) = 1$ iff $*=0$.
For the purposes of the remaining discussion, let's use the standard abbreviations of:
- AND by $\wedge$,
- OR by $\vee$, and
- NOT by $\lnot$.
OK, so why is it that we can realize any $n$-input boolean function as a composition of the simple binary functions $\wedge$, $\vee$, and $\lnot$? Well, for a sort of stupid reason. One perfectly sensible way to describe a boolean function is just to provide a complete list of all of the inputs that map to $1$. This is the idea behind the so-called canonical representation of a boolean function. The universal existence of the canonical representation of a boolean function is the proof of (S1) we'll describe here.
It's easiest (and clearest) to describe the canonical representation by example. So I'll do that.
Suppose we consider the boolean function $\{0,1\}^3 \rightarrow \{0,1\}$ that maps $(0,1,0)$ and $(1,1,1)$ to $1$ and every other binary $3$-tuple to $0$. This boolean function can be written as:
$(\lnot(*_1) \wedge (*_2) \wedge \lnot (*_3)) \vee (*_1 \wedge *_2 \wedge *_3)$,
which of course looks like jibberish when you look at it like that, so I'll draw an electrical "circuit" picture of it instead:
If it still seems like jibberish, here's the point. For any fixed binary $n$-tuple $(a_1, \ldots, a_n)$, there is a unique expression in ANDs and AND NOTs that outputs $1$ iff the input, $(*_1, \ldots, *_n)$, is equal to $(a_1, \ldots, a_n)$. To create this expression, just concatenate the $*_i$'s using AND when $a_i = 1$ and using AND NOT when $a_i = 0$.
So, e.g., the function $\lnot(*_1) \wedge (*_2) \wedge \lnot(*_3)$ outputs $1$ iff $(*_1,*_2,*_3) = (0,1,0)$.
Now to form the canonical representation of a boolean function, just list all of the inputs that map to $1$, form the expressions as above that uniquely identifies each of these inputs, then concatenate all of these expressions using ORs.
$(\lnot(*_1) \wedge (*_2) \wedge \lnot (*_3)) \vee (*_1 \wedge *_2 \wedge *_3)$,
which of course looks like jibberish when you look at it like that, so I'll draw an electrical "circuit" picture of it instead:
The canonical representation of the boolean function that maps (0,1,0) and (1,1,1) to 1, and all other binary 3-tuples to 0 |
If it still seems like jibberish, here's the point. For any fixed binary $n$-tuple $(a_1, \ldots, a_n)$, there is a unique expression in ANDs and AND NOTs that outputs $1$ iff the input, $(*_1, \ldots, *_n)$, is equal to $(a_1, \ldots, a_n)$. To create this expression, just concatenate the $*_i$'s using AND when $a_i = 1$ and using AND NOT when $a_i = 0$.
So, e.g., the function $\lnot(*_1) \wedge (*_2) \wedge \lnot(*_3)$ outputs $1$ iff $(*_1,*_2,*_3) = (0,1,0)$.
Now to form the canonical representation of a boolean function, just list all of the inputs that map to $1$, form the expressions as above that uniquely identifies each of these inputs, then concatenate all of these expressions using ORs.
Thing 2: AND, NOT, and OR can be modeled by perceptrons
Thing 2 was the main point of my post about perceptrons. An $n$-input, $1$-output perceptron uniquely specifies a function that assigns a $1$ to one side of an affine hyperplane in $\mathbb{R}^n$ and a $0$ to the other. Such functions are called threshold functions in the literature. Geometrically, the weights and bias associated to a perceptron are irrelevant, except insofar as they define the co-oriented affine hyperplane that separates $0$ outputs from $1$ outputs.
We can therefore understand Thing 2 by understanding the pair of statements:
Thing 3: Perceptrons can be approximated arbitrarily closely by sigmoid neurons
Thing 3 was also explained briefly in my post about perceptrons (look at the pictures at the end, in the "And now back to that thought we were holding" section). The key observation is that the sigmoid functions
$\sigma(t) = \frac{1}{1+e^{-ct}}$
are nice smooth approximations of the step function
$S(t) = \left\{\begin{array}{cl} 0 & \mbox{if } t < 0\\1 & \mbox{if } t \geq 0\end{array}\right.$,
and these approximations get successively better as $c \rightarrow \infty$.
Indeed, the sigmoid functions converge pointwise to the Heaviside step function (really a better choice for a step function anyway, because of its symmetry):
$H(t) = \left\{\begin{array}{cl}0 & \mbox{if } t < 0\\\frac{1}{2} & \mbox{if } t = 0\\1 & \mbox{if } t > 0\end{array}\right.$
Once you understand this, all you need to appreciate is that an $n$-input sigmoid neuron is just a perceptron whose step function has been replaced by a sigmoid function.
More precisely, recall that a (co-oriented) affine hyperplane $H$ in $\mathbb{R}^n$ is determined (non-uniquely) by a weight vector $\vec{w}\in \mathbb{R}^n$ and a bias $b \in \mathbb{R}$ as follows:
$H_{\vec{w},b} = \{\vec{x} \in \mathbb{R}^n \,\,\vert\,\, \vec{w}\cdot\vec{x} - b = 0\}$.
The weight vector $\vec{w}$ is just a vector orthogonal to the hyperplane, pointing in the direction of the co-orientation. The bias $b \in \mathbb{R}$ determines the shift of this hyperplane from the origin (its absolute value is equal to the distance of $H$ away from the origin in the case $\vec{w}$ has length $1$).
With this understood, we now see that a threshold function is just the composition of the step function with an affine linear function. That is, the threshold function associated to $H_{\vec{w},b}$ is simply $S(\vec{w}\cdot \vec{x} -b)$. And this makes it clear why a corresponding sigmoid neuron $\sigma(\vec{w}\cdot \vec{x} -b)$ can be viewed as a smooth approximation to it. N.B.: this approximation can be made arbitrarily good by scaling $\vec{w}$ and $b$ by the same large $c>0$.
Thing 4: We can construct an $m$-output neural network by "stacking" $m$ independent $1$-output NN's
This is a neural network analog of the statement that a function to a product can (should?) be defined component by component.
Informally, what I mean is the following. If you have $m$ $1$-output neural networks, all with the same $n$ inputs, and you want to build a neural network whose output layer is precisely those $m$ outputs, just stack the $m$ $1$-output neural networks vertically, then merge the $m$ identical copies of the input layer. Here are some pictures, to illustrate what I mean:
The key to keeping these $m$ parallel computations independent from each other is to force edges to have $0$ weight whenever you don't want the output of the edge to depend on the input. Here are another couple of pictures, to illustrate what I mean by that:
Thing 2 was the main point of my post about perceptrons. An $n$-input, $1$-output perceptron uniquely specifies a function that assigns a $1$ to one side of an affine hyperplane in $\mathbb{R}^n$ and a $0$ to the other. Such functions are called threshold functions in the literature. Geometrically, the weights and bias associated to a perceptron are irrelevant, except insofar as they define the co-oriented affine hyperplane that separates $0$ outputs from $1$ outputs.
We can therefore understand Thing 2 by understanding the pair of statements:
- the $2$-input boolean functions "AND" and "OR" can be realized by threshold functions, and
- we can realize the $1$-input boolean function "NOT" by reversing the co-orientation of the affine hyperplane defining the threshold function.
The threshold functions that map the points on the right (arrow-side) of each of these red affine hyperplanes to 1 and the left (non-arrow-side) to 0 extend the 2-input boolean functions AND & OR |
Thing 3: Perceptrons can be approximated arbitrarily closely by sigmoid neurons
Thing 3 was also explained briefly in my post about perceptrons (look at the pictures at the end, in the "And now back to that thought we were holding" section). The key observation is that the sigmoid functions
$\sigma(t) = \frac{1}{1+e^{-ct}}$
are nice smooth approximations of the step function
$S(t) = \left\{\begin{array}{cl} 0 & \mbox{if } t < 0\\1 & \mbox{if } t \geq 0\end{array}\right.$,
and these approximations get successively better as $c \rightarrow \infty$.
Indeed, the sigmoid functions converge pointwise to the Heaviside step function (really a better choice for a step function anyway, because of its symmetry):
$H(t) = \left\{\begin{array}{cl}0 & \mbox{if } t < 0\\\frac{1}{2} & \mbox{if } t = 0\\1 & \mbox{if } t > 0\end{array}\right.$
Once you understand this, all you need to appreciate is that an $n$-input sigmoid neuron is just a perceptron whose step function has been replaced by a sigmoid function.
More precisely, recall that a (co-oriented) affine hyperplane $H$ in $\mathbb{R}^n$ is determined (non-uniquely) by a weight vector $\vec{w}\in \mathbb{R}^n$ and a bias $b \in \mathbb{R}$ as follows:
$H_{\vec{w},b} = \{\vec{x} \in \mathbb{R}^n \,\,\vert\,\, \vec{w}\cdot\vec{x} - b = 0\}$.
The weight vector $\vec{w}$ is just a vector orthogonal to the hyperplane, pointing in the direction of the co-orientation. The bias $b \in \mathbb{R}$ determines the shift of this hyperplane from the origin (its absolute value is equal to the distance of $H$ away from the origin in the case $\vec{w}$ has length $1$).
With this understood, we now see that a threshold function is just the composition of the step function with an affine linear function. That is, the threshold function associated to $H_{\vec{w},b}$ is simply $S(\vec{w}\cdot \vec{x} -b)$. And this makes it clear why a corresponding sigmoid neuron $\sigma(\vec{w}\cdot \vec{x} -b)$ can be viewed as a smooth approximation to it. N.B.: this approximation can be made arbitrarily good by scaling $\vec{w}$ and $b$ by the same large $c>0$.
Thing 4: We can construct an $m$-output neural network by "stacking" $m$ independent $1$-output NN's
This is a neural network analog of the statement that a function to a product can (should?) be defined component by component.
Informally, what I mean is the following. If you have $m$ $1$-output neural networks, all with the same $n$ inputs, and you want to build a neural network whose output layer is precisely those $m$ outputs, just stack the $m$ $1$-output neural networks vertically, then merge the $m$ identical copies of the input layer. Here are some pictures, to illustrate what I mean:
Two separate neural networks with the same inputs, each with a single output, drawn one above the other |
The two separate neural nets above have now been merged into a single one, with two outputs |
The key to keeping these $m$ parallel computations independent from each other is to force edges to have $0$ weight whenever you don't want the output of the edge to depend on the input. Here are another couple of pictures, to illustrate what I mean by that:
The blue nodes shouldn't interact with the red nodes |
So we artificially set all weights on edges connecting them to zero |
OK, that was long, so let's recap. We're trying to understand why a neural network is a robust enough architecture to extend any boolean function from the vertices to the interior of a unit $n$-cube. I.e., we're trying to sketch an argument of:
Cool Fact 1 (CF1): Let $m,n \in \mathbb{Z}^+$. Any boolean function $f: \{0,1\}^n \rightarrow \{0,1\}^m$ can be approximated arbitrarily well by a finite feedforward neural network with sigmoid activation functions.
We proceed as follows. Look at the $m$ components $f_1, \ldots, f_m$ of $f$.
Each of these is a $1$-output boolean function, $f_i: \{0,1\}^n \rightarrow \{0,1\}$. So we can represent it as a composition of ANDs, NOTs, & ORs (e.g., by using the canonical representation, or- if we're lucky-in some other, more succinct, way). Now model each of these ANDs, NOTs, & ORs by threshold functions associated to perceptrons, then replace each perceptron by a sigmoid neuron that approximates it sufficiently closely. We now have $m$ independent $1$-output neural networks. Merge them into a single $m$-output neural network by merging the inputs and zeroing out weights on edges connecting nodes between different neural networks.
And there you have it! If I were less pooped out, I'd close with some further questions. But maybe I'll just leave it there for today.
This comment has been removed by a blog administrator.
ReplyDeleteNice Blog, The way you have explained about the things is very well and the suggestions are the best.
ReplyDeleteKeep it up!!!
Read What are Browser hijacker.
This company will always be there to withdraw your web money, perfect money and bitcoin fund through bitcoin atm card . This company not only facilitates you with its services but also understands the stress that you go through while exchanging your e-money.
ReplyDeleteThank for Baì viết hữu ích. Mình cũng muốn giới thiệu về một Công ty dịch thuật uy tín - Công ty CP dịch thuật miền trung - MIDtrans địa chỉ 02 Hoàng Diệu, TP Đồng Hới, tỉnh Quảng Bình có Giấy phép kinh doanh số 3101023866 cấp ngày 9/12/2016 là đơn vị chuyên cung cấp dịch vụ dịch thuật, phiên dịch dành các cá nhân. Hệ thống thương hiệu và các Công ty dịch thuật con trực thuộc: văn phòng dịch thuật sài gòn 247 địa chỉ 47 Điện Biên Phủ, Phường Đakao, Quận 1 TP HCM, dịch thuật bình bình phước : địa chỉ 100 , Lê lợi, TX Bình Long là nhà cung ứng dịch vụ dịch thuật uy tín hàng đầu tại bình phước vietnamese translate : dịch vụ dịch thuật cho người nước ngoài có nhu cầu, giao diện tiếng Anh dễ sử dụng; dịch thuật công chứng quận 11 (mười một) : nhà cung ứng dịch vụ dịch vụ dịch thuật phiên dịch hàng đầu tại Quận 11 (mười một), TP HCM; công ty dịch thuật Đà Nẵng : Địa chỉ 54 Đinh Tiên Hoàng, Quận Hải Châu, TP Đà Nẵng chuyên cung cấp dịch vụ dịch thuật công chứng, dịch thuật chuyên ngành tại Đà Nẵng. Chúng tôi chuyên cung cấp các dịch vụ biên dịch và phiên dịch, dịch thuật công chứng chất lượng cao hơn 50 ngôn ngữ khác nhau như tiếng Anh, Nhật, Hàn, Trung, Pháp, Đức, Nga, Tây Ban Nha, Bồ Đào Nha, Ý, Ba Lan, Phần Lan, Thái Lan, Hà Lan, Rumani, Lào, Campuchia, Philippin, Indonesia, La Tinh, Thụy Điển, Malaysia, Thổ Nhĩ Kỳ..vv... Dịch thuật MIDtrans tự hào với đội ngũ lãnh đạo với niềm đam mê, khát khao vươn tầm cao trong lĩnh vực dịch thuật, đội ngũ nhân sự cống hiến và luôn sẵn sàng cháy hết mình. Chúng tôi phục vụ từ sự tậm tâm và cố gắng từ trái tim những người dịch giả.Tự hào là công ty cung cấp dịch thuật chuyên ngành hàng đầu với các đối tác lớn tại Việt nam trong các chuyên ngành hẹp như: y dược (bao gồm bệnh lý), xây dựng (kiến trúc), hóa chất, thủy nhiệt điện, ngân hàng, tài chính, kế toán. Các dự án đã triển khai của Công ty dịch thuật chuyên nghiệp MIDtrans đều được Khách hàng đánh giá cao và đạt được sự tín nhiệm về chất lượng biên phiên dịch đặc biệt đối với dịch hồ sơ thầu , dịch thuật tài liệu tài chính ngân hàng, dịch thuật tài liệu y khoa đa ngữ chuyên sâu. Đó là kết quả của một hệ thống quản lý chất lượng dịch thuật chuyên nghiệp, những tâm huyết và kinh nghiệm biên phiên dịch nhiều năm của đội ngũ dịch giả của chúng tôi. Hotline: 0947688883. email: info@dichthuatmientrung.com.vn . Các bạn ghé thăm site ủng hộ nhé. Cám ơn nhiều
ReplyDeleteReally Liked the information you have provided. I have an article relaed to it. I was searching about it on the internet and I found an amazing article on Microsoft Official Site. The provided Article was about a site that provides working modded android apps. The name of the site was “Fineapkapps”. The Article was very halpful, You should read that. Click Here to reach that amazing article: Free Modded Apps.
ReplyDeleteAre you looking for repairing services for your electronic gadgets? SD Computers aim to offer modified Computer repairing services with the help of our IT specialists. Moreover, we are reliable to provide Laptop repairing services for your software and hardware issues. No matter what kind of problems you are facing, we are here to find solutions.
ReplyDeleteComputer repairing services
Laptop repairing services