What Is ChatGPT Doing? Chapter 5: Neural Nets
【This chapter covers ChatGPT's "neural wiring"】
OK, so how do our typical models for tasks like image recognition actually work? The most popular—and successful—current approach uses neural nets. Invented—in a form remarkably close to their use today—in the 1940s, neural nets can be thought of as simple idealizations of how brains seem to work.

In human brains there are about 100 billion neurons (nerve cells), each capable of producing an electrical pulse up to perhaps a thousand times a second. The neurons are connected in a complicated net, with each neuron having tree-like branches allowing it to pass electrical signals to perhaps thousands of other neurons. And in a rough approximation, whether any given neuron produces an electrical pulse at a given moment depends on what pulses it's received from other neurons—with different connections contributing with different "weights". 【Connections that are used often become highways; those that aren't stay dirt roads】

When we "see an image" what's happening is that when photons of light from the image fall on ("photoreceptor") cells at the back of our eyes they produce electrical signals in nerve cells. These nerve cells are connected to other nerve cells, and eventually the signals go through a whole sequence of layers of neurons. And it's in this process that we "recognize" the image, eventually "forming the thought" that we're "seeing a 2" (and maybe in the end doing something like saying the word "two" out loud).

The "black-box" function from the previous section is a "mathematicized" version of such a neural net. It happens to have 11 layers (though only 4 "core layers"). There's nothing particularly "theoretically derived" about this neural net; it's just something that—back in 1998—was constructed as a piece of engineering, and found to work. (Of course, that's not much different from how we might describe our brains as having been produced through the process of biological evolution.)

OK, but how does a neural net like this "recognize things"? The key is the notion of attractors. Imagine we've got handwritten images of 1's and 2's. We somehow want all the 1's to "be attracted to one place", and all the 2's to "be attracted to another place". Or, put a different way, if an image is somehow "closer to being a 1" than to being a 2, we want it to end up in the "1 place" and vice versa.

As a straightforward analogy, let's say we have certain positions in the plane, indicated by dots (in a real-life setting they might be positions of coffee shops). Then we might imagine that starting from any point on the plane we'd always want to end up at the closest dot (i.e. we'd always go to the closest coffee shop).
We can represent this by dividing the plane into regions ("attractor basins") separated by idealized "watersheds". We can think of this as implementing a kind of "recognition task" in which we're not doing something like identifying what digit a given image "looks most like"—but rather we're just, quite directly, seeing what dot a given point is closest to. (The "Voronoi diagram" setup we're showing here separates points in 2D Euclidean space; the digit recognition task can be thought of as doing something very similar—but in a 784-dimensional space formed from the gray levels of all the pixels in each image.)

So how do we make a neural net "do a recognition task"? Let's consider this very simple case: our goal is to take an "input" corresponding to a position {x, y}—and then to "recognize" it as whichever of the three points it's closest to. Or, in other words, we want the neural net to compute a function of {x, y} like the one sketched below.
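To make the target concrete, here is a minimal Python sketch of the "nearest point" function itself, written as ordinary code rather than a neural net. The three dot positions are made-up values for illustration; the nets below are judged by how closely they approximate this function:

```python
import math

# Three illustrative "coffee shop" positions (hypothetical values for this example)
DOTS = [(1.0, 2.0), (3.0, 0.5), (-1.0, -1.5)]

def nearest_dot(x, y):
    """Return the index of the dot that the point (x, y) is closest to."""
    distances = [math.hypot(x - dx, y - dy) for dx, dy in DOTS]
    return distances.index(min(distances))

print(nearest_dot(0.9, 1.8))  # -> 0, i.e. the basin of the first dot
```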
So how do we do this with a neural net? Ultimately a neural net is a connected collection of idealized "neurons"—usually arranged in layers—with a simple example being a small net just a few layers deep. Each "neuron" is effectively set up to evaluate a simple numerical function. And to "use" the network, we simply feed numbers (like our coordinates x and y) in at the top, then have neurons on each layer "evaluate their functions" and feed the results forward through the network—eventually producing the final result at the bottom.

In the traditional (biologically inspired) setup each neuron effectively has a certain set of "incoming connections" from the neurons on the previous layer, with each connection being assigned a certain "weight" (which can be a positive or negative number). The value of a given neuron is determined by multiplying the values of "previous neurons" by their corresponding weights, then adding these up and adding a constant—and finally applying a "thresholding" (or "activation") function. In mathematical terms, if a neuron has inputs x = {x1, x2, ...} then we compute f[w . x + b], where the weights w and constant b are generally chosen differently for each neuron in the network; the function f is usually the same. Computing w . x + b is just a matter of matrix multiplication and addition. The "activation function" f introduces nonlinearity (and ultimately is what leads to nontrivial behavior). Various activation functions commonly get used; here we'll just use Ramp (or ReLU).

For each task we want the neural net to perform (or, equivalently, for each overall function we want it to evaluate) we'll have different choices of weights. (And—as we'll discuss later—these weights are normally determined by "training" the neural net using machine learning from examples of the outputs we want.)

Ultimately, every neural net just corresponds to some overall mathematical function—though it may be messy to write out. For the example above, it would be a nested composition of these f[w . x + b] pieces. The neural net of ChatGPT also just corresponds to a mathematical function like this—but effectively with billions of terms.

But let's go back to individual neurons. A neuron with two inputs (representing coordinates x and y) can compute a range of different functions depending on its choices of weights and constants (with Ramp as activation function). But what about the larger network from above? What it computes is not quite "right", but it's close to the "nearest point" function we showed above.
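To make the feed-forward computation concrete, here is a minimal numpy sketch of a tiny layered net of this kind. The layer sizes, weights, and biases are made-up illustrative values, not the trained weights behind the results described in the text:

```python
import numpy as np

def ramp(z):
    """Ramp (ReLU) activation: max(0, z), applied elementwise."""
    return np.maximum(0, z)

def layer(values, weights, biases, activation=ramp):
    """One layer: every neuron computes f(w . x + b) over the previous layer."""
    return activation(weights @ values + biases)

# Made-up weights and biases for a tiny 2 -> 3 -> 3 net (illustrative only;
# real values would come from training, as discussed later)
W1 = np.array([[ 0.5, -1.2],
               [ 1.0,  0.3],
               [-0.7,  0.8]])
b1 = np.array([0.1, -0.2, 0.0])
W2 = np.array([[ 0.9, -0.4,  0.2],
               [-0.3,  0.6,  1.1],
               [ 0.5,  0.5, -0.8]])
b2 = np.array([0.0, 0.1, -0.1])

def net(x, y):
    """Feed {x, y} in at the top and read the result off the bottom layer."""
    hidden = layer(np.array([x, y]), W1, b1)
    return layer(hidden, W2, b2, activation=lambda z: z)  # final layer linear

print(net(0.9, 1.8))  # three outputs; the largest would pick the "nearest dot"
```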
Let's see what happens with some other neural nets. In each case, as we'll explain later, we're using machine learning to find the best choice of weights (a rough preview sketch of that step appears at the end of this passage), and then looking at what the net with those weights computes. Bigger networks generally do better at approximating the function we're aiming for. And in the "middle of each attractor basin" we typically get exactly the answer we want. But at the boundaries—where the neural net "has a hard time making up its mind"—things can be messier.

With this simple mathematical-style "recognition task" it's clear what the "right answer" is. But in the problem of recognizing handwritten digits, it's not so clear. What if someone wrote a "2" so badly it looked like a "7", etc.? Still, we can ask how a neural net distinguishes digits—and this gives an indication. Can we say "mathematically" how the network makes its distinctions? Not really. It's just "doing what the neural net does". But it turns out that that normally seems to agree fairly well with the distinctions we humans make.

Let's take a more elaborate example. Let's say we have images of cats and dogs, and we have a neural net that's been trained to distinguish them. On some examples it's now even less clear what the "right answer" is. What about a dog dressed in a cat suit? Etc. Whatever input it's given, the neural net is generating an answer—and, it turns out, doing so in a way that's reasonably consistent with what humans might do. As I've said above, that's not a fact we can "derive from first principles". It's just something that's empirically been found to be true, at least in certain domains. But it's a key reason why neural nets are useful: they somehow capture a "human-like" way of doing things.

Show yourself a picture of a cat, and ask "Why is that a cat?". Maybe you'd start saying "Well, I see its pointy ears, etc." But it's not very easy to explain how you recognized the image as a cat. It's just that somehow your brain figured that out. But for a brain there's no way (at least yet) to "go inside" and see how it figured it out. What about for an (artificial) neural net? Well, it's straightforward to see what each "neuron" does when you show it a picture of a cat. But even to get a basic visualization is usually very difficult.

In the final net that we used for the "nearest point" problem above there are 17 neurons. In the net for recognizing handwritten digits there are 2190. And in the net we're using to recognize cats and dogs there are 60,650. Normally it would be pretty difficult to visualize what amounts to 60,650-dimensional space.
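Here is the promised preview of the training step: a minimal sketch using scikit-learn's MLPClassifier, with illustrative dot positions and layer sizes. Points deep inside a basin come out right; a point on a watershed is where the net can "have a hard time making up its mind":

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

DOTS = np.array([(1.0, 2.0), (3.0, 0.5), (-1.0, -1.5)])  # illustrative dots

# Label random points in the plane by which dot each is genuinely closest to
rng = np.random.default_rng(0)
points = rng.uniform(-4.0, 4.0, size=(5000, 2))
labels = np.argmin(((points[:, None, :] - DOTS[None, :, :]) ** 2).sum(axis=-1),
                   axis=1)

# "Training" searches for weights that make the net reproduce those labels
net = MLPClassifier(hidden_layer_sizes=(10, 10), activation="relu",
                    max_iter=2000, random_state=0)
net.fit(points, labels)

print(net.predict([[0.9, 1.8]]))   # deep inside a basin: reliably the right dot
print(net.predict([[2.0, 1.25]]))  # on the watershed between the first two
                                   # dots: the answer here can be messier
```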
But because this is a network set up to deal with images, many of its layers of neurons are organized into arrays, like the arrays of pixels it's looking at. And if we take a typical cat image then we can represent the states of neurons at the first layer by a collection of derived images—many of which we can readily interpret as being things like "the cat without its background", or "the outline of the cat" (a rough sketch of how such a derived image can arise appears at the end of this section). By the 10th layer it's harder to interpret what's going on. But in general we might say that the neural net is "picking out certain features" (maybe pointy ears are among them), and using these to determine what the image is of.

But are those features ones for which we have names—like "pointy ears"? Mostly not. Are our brains using similar features? Mostly we don't know. But it's notable that the first few layers of a neural net like the one we're showing here seem to pick out aspects of images (like edges of objects) that seem to be similar to ones we know are picked out by the first level of visual processing in brains.

But let's say we want a "theory of cat recognition" in neural nets. We can say: "Look, this particular net does it"—and immediately that gives us some sense of "how hard a problem" it is (and, for example, how many neurons or layers might be needed). But at least as of now we don't have a way to "give a narrative description" of what the network is doing. And maybe that's because it truly is computationally irreducible, and there's no general way to find out what it does except by explicitly tracing each step. Or maybe it's just that we haven't "figured out the science", and identified the "natural laws" that allow us to summarize what's going on.

We'll encounter the same kinds of issues when we talk about generating language with ChatGPT. And again it's not clear whether there are ways to "summarize what it's doing". But the richness and detail of language (and our experience with it) may allow us to get further than with images.
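Returning to those first-layer "derived images": here is a minimal numpy sketch of the general mechanism, a single convolution filter followed by a Ramp. The edge-detecting kernel values and the random stand-in image are purely illustrative, not taken from the network discussed in the text:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2D convolution (valid padding) followed by a Ramp (ReLU)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(0, out)

# An edge-detecting kernel, the sort of feature first conv layers often learn
edge_kernel = np.array([[-1.0, -1.0, -1.0],
                        [-1.0,  8.0, -1.0],
                        [-1.0, -1.0, -1.0]])

cat = np.random.rand(64, 64)        # stand-in for a grayscale cat photo
outline = conv2d(cat, edge_kernel)  # one "derived image": roughly the outlines
print(outline.shape)                # (62, 62)
```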