Seven years ago, back in 2015, one major development in AI research was automated image captioning.
Machine learning algorithms could already label objects in images, and now they learned to put those labels into natural language descriptions.
And it made one group of researchers curious.
What if you flipped that process around?
We could do image-to-text.
Why not try doing text to images and see how it works?
It was a more difficult task.
They didn't want to retrieve existing images the way Google Search does.
They wanted to generate entirely novel scenes that didn't happen in the real world.
So they asked their computer model for something it would have never seen before.
Like all the school buses you've seen are yellow.
But if you write "the red or green school bus", would it actually try to generate something green?
And it did that.
It was a 32 by 32 tiny image.
And then all you could see is like a blob of something on top of something.
They tried some other prompts like "A herd of elephants flying in the blue skies".
"A vintage photo of a cat." "A toilet seat sits open in the grass field." And "a bowl of bananas is on the table." Maybe not something to hang on your wall, but the 2016 paper from those researchers showed the potential for what might become possible in the future.
And uh… the future has arrived.
It is almost impossible to overstate how far the technology has come in just one year.
By leaps and bounds. - Leaps and bounds.
Yeah, it's been quite dramatic.
I don't know anyone who hasn't immediately been like "What is this? What is happening here?" Could I say like watching waves crashing?
Party hat guy. Seafoam dreams. A coral reef. Cubism. Caterpillar. A dancing taco… My prompt is Salvador Dali painting the skyline of New York City.
You may be thinking, wait, AI-generated images aren't new.
You might have heard about this generated portrait going for over $400,000 at auction back in 2018. Or this installation of morphing portraits, which Sotheby's sold the following year.
It was created by Mario Klingemann, who explained to me that that type of AI art required him to collect a specific dataset of images and train his own model to mimic that data.
Let's say, Oh, I want to create landscapes, so I collect a lot of landscape images.
I want to create portraits, I trained on portraits.
But then the portrait model would not really be able to create landscapes.
Same with those hyperrealistic fake faces that have been plaguing LinkedIn and Facebook — those come from a model that only knows how to make faces.
Generating a scene from any combination of words requires a different, newer, bigger approach.
Now we kind of have these huge models, which are so huge that somebody like me actually cannot train them anymore on their own computer.
But once they are there, they are really kind of — they contain everything.
I mean, to a certain extent.
What this means is that we can now create images without having to actually execute them with paint or cameras or pen tools or code.
The input is just a simple line of text.
And I'll get to how this tech works later in the video, but to understand how we got here, we have to rewind to January 2021. That's when a major AI company called OpenAI announced DALL-E, which they named after these guys.
They said it could create images from text captions for a wide range of concepts.
They recently announced DALL-E 2, which promises more realistic results and seamless editing.
But they haven't released either version to the public.
So over the past year, a community of independent, open-source developers built text-to-image generators out of other pre-trained models that they did have access to.
And you can play with those online for free.
Some of those developers are now working for a company called Midjourney, which created a Discord community with bots that turn your text into images in less than a minute.
Having basically no barrier to entry to this has made it like a whole new ballgame.
I've been up until like two or three in the morning.
Just really trying to change things, piece things together.
I've done about 7,000 images. It's ridiculous.
Midjourney currently has a wait-list for subscriptions, but we got a chance to try it out.
"Go ahead and take a look." "Oh wow. That is so cool" "It has some work to do. I feel like it can be — it's not dancing and it could be better." The craft of communicating with these deep learning models has been dubbed "prompt engineering".
“看看吧。”“哦哇。这太酷了。”“它还可以有一些改进。我觉得它可以——这不是跳舞,它还可以更好。”与这些深度学习模型进行交流的技术被称为“即时工程”。
What I love about prompting for me, it's kind of really that has something like magic where you have to know the right words for that, for the spell.
对我来说,我喜欢提示的地方是,它真的有点像魔法,你必须知道正确的词,将其作为咒语。
You realize that you can refine the way you talk to the machine.
你意识到你可以改进你和机器说话的方式。
It becomes a kind of a dialog.
它变成了一种对话。
You can say like "Octane render, Blender 3D".
"Made with Unreal Engine… certain types of film lenses and cameras… 1950s, 1960s… dates are really good… lino cut or wood cut…" "Coming up with funny pairings, like a Faberge Egg McMuffin." "A monochromatic infographic poster about typography depicting Chinese characters." Some of the most striking images can come from prompting the model to synthesize a long list of concepts.
“用虚幻引擎制作……某些类型的电影镜头和相机……20世纪50年代、60年代……很不错地日子……浮雕或木刻……”“想出有趣的搭配,比如费伯奇蛋松饼。”“一幅单色信息图海报,描绘汉字排版。”一些最引人注目的图像可能来自于促使模型综合一长串概念。
It's kind of like having a very strange collaborator to bounce ideas off of and get unpredictable ideas back.
I love that!
My prompt was "chasing seafoam dreams," which is a lyric from the Ted Leo and the Pharmacists song "Biomusicology." Can I use this as the album cover for my first album?
"Absolutely." - Alright.
“当然可以”。 - 好的。
For an image generator to be able to respond to so many different prompts, it needs a massive, diverse training dataset.
为了让图像生成器能够响应如此多不同的提示词,它需要一个庞大、多样的训练数据集。
Like hundreds of millions of images scraped from the internet, along with their text descriptions.
比如从网上搜集的数亿张图片,以及它们的文字描述。
Those captions come from things like the alt text that website owners upload with their images, for accessibility and for search engines.
这些标题来自于网站所有者上传图片时的 alt 文本,以方便访问和搜索引擎。
So that's how the engineers get these giant datasets.
这就是工程师们获得这些巨大数据集的方式。
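As a rough sketch of where those image-caption pairs come from, here's a minimal standard-library Python example; the page snippet and class name are made up for illustration, and real crawls work at web scale with far more filtering.

```python
# Minimal sketch: harvesting (image URL, alt text) pairs, the raw material
# for image-caption training datasets. Hypothetical example page below.
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collects (src, alt) pairs from <img> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt"):  # keep only captioned images
                self.pairs.append((a["src"], a["alt"]))

page = '<img src="bananas.jpg" alt="a bowl of bananas is on the table">'
collector = AltTextCollector()
collector.feed(page)
print(collector.pairs)  # [('bananas.jpg', 'a bowl of bananas is on the table')]
```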
But then what do the models actually do with them?
We might assume that when we give them a text prompt, like "a banana inside a snow globe from 1960," they search through the training data to find related images and then copy over some of those pixels.
But that's not what's happening.
The new generated image doesn't come from the training data, it comes from the "latent space" of the deep learning model.
That'll make sense in a minute; first, let's look at how the model learns.
If I gave you these images and told you to match them to these captions, you'd have no problem.
But what about now? This is what images look like to a machine: just pixel values for red, green, and blue.
You'd just have to make a guess, and that's what the computer does too at first.
But then you could go through thousands of rounds of this and never figure out how to get better at it.
Whereas a computer can eventually figure out a method that works — that's what deep learning does.
In order to understand that this arrangement of pixels is a banana, and this arrangement of pixels is a balloon, it looks for metrics that help separate these images in mathematical space.
So how about color?
If we measure the amount of yellow in the image, that would put the banana over here and the balloon over here in this one-dimensional space.
But then what if we run into this: Now our yellowness metric isn't very good at separating bananas from balloons.
We need a different variable.
So let's add an axis for roundness.
Now we've got a two-dimensional space with the round balloons up here and the banana down here.
But if we look at more data we may come across a banana that's pretty round, and a balloon that isn't.
So maybe there's some way to measure shininess.
Balloons usually have a shiny spot.
Now we have a three-dimensional space.
And ideally, when we get a new image, we can measure those 3 variables and see whether it falls in the banana region or the balloon region of the space.
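Here's a toy version of that banana-versus-balloon example in code, with invented feature scores; a real deep learning model learns its own variables instead of being handed these three.

```python
# Toy nearest-centroid classifier in a hand-built 3-D feature space:
# (yellowness, roundness, shininess), each scored 0-1. Values are invented.
from math import dist

training = {
    "banana":  [(0.9, 0.3, 0.1), (0.8, 0.5, 0.2)],
    "balloon": [(0.9, 0.9, 0.8), (0.3, 0.8, 0.9)],
}

def centroid(points):
    """Average position of one class in the 3-D feature space."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(features):
    """A new image lands in whichever region's centroid is closest."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

print(classify((0.85, 0.4, 0.15)))  # falls in the banana region -> 'banana'
```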
But what if we want our model to recognize, not just bananas and balloons, but… all these other things.
Yellowness, roundness, and shininess don't capture what's distinct about these objects.
We need better variables, and we need a lot more of them.
That's what deep learning algorithms do as they go through all the training data.
They find variables that help improve their performance on the task, and in the process they build out a mathematical space with way more than three dimensions.
We are incapable of picturing multidimensional space, but Midjourney's model offered this and I like it.
So we'll say this represents the latent space of the model.
And it has more than 500 dimensions.
Those 500 axes represent variables that humans wouldn't even recognize or have names for, but the result is that the space has meaningful clusters.
A region that captures the essence of banana-ness.
A region that represents the textures and colors of photos from the 1960s.
An area for snow, an area for globes, and snow globes somewhere in between.
Any point in this space can be thought of as the recipe for a possible image.
And the text prompt is what navigates us to that location.
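A toy numerical sketch of that navigation step: fixed random word vectors stand in for a trained text encoder (an assumption, since the video doesn't name one). The point is only that a prompt becomes a single point in a roughly 500-dimensional space, and overlapping prompts land near each other.

```python
# Toy sketch: a prompt is mapped to one point in a high-dimensional
# latent space. Random word vectors stand in for a learned text encoder.
import numpy as np

LATENT_DIM = 512                     # "more than 500 dimensions"
rng = np.random.default_rng(seed=0)  # fixed seed: same toy space every run

VOCAB = ["banana", "snow", "globe", "1960", "balloon"]
word_vectors = {w: rng.normal(size=LATENT_DIM) for w in VOCAB}

def embed(prompt):
    """Average the known word vectors into one latent-space coordinate."""
    hits = [word_vectors[w] for w in prompt.lower().split() if w in word_vectors]
    v = np.mean(hits, axis=0)
    return v / np.linalg.norm(v)     # a unit-length "recipe" for an image

a = embed("a banana inside a snow globe from 1960")
b = embed("a snow globe with a banana")
print(a.shape)                 # (512,): one point, one possible image
print(round(float(a @ b), 2))  # similar prompts sit close together
```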
But then there's one more step.
Translating a point in that mathematical space into an actual image involves a generative process called diffusion.
It starts with just noise and then, over a series of iterations, arranges pixels into a composition that makes sense to humans.
Because of some randomness in the process, it will never return exactly the same image for the same prompt.
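A heavily simplified sketch of that loop: it starts from random noise and repeatedly nudges the pixels, with a fixed target array standing in for the corrections a trained denoising network would predict (the part this toy omits).

```python
# Toy diffusion loop: begin with pure noise, refine over many iterations.
# A real model predicts each correction with a trained network; here a
# fixed target image stands in for that prediction.
import numpy as np

rng = np.random.default_rng()            # unseeded: new noise every run
target = np.zeros((32, 32, 3))           # stand-in for the model's goal
image = rng.normal(size=(32, 32, 3))     # step 0: just noise

STEPS = 50
for t in range(STEPS):
    blend = 1.0 / (STEPS - t)            # move 1/(steps left) of the way
    image = (1 - blend) * image + blend * target

# Different starting noise each run is why the same prompt never returns
# exactly the same image.
print(np.abs(image - target).max())      # ~0: composition has converged
```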
And if you enter the prompt into a different model designed by different people and trained on different data, you'll get a different result.
Because you're in a different latent space.
No way. That is so cool. What the heck?
They're like brush strokes, the color palette. That's fascinating. I wish I could like — I mean he's dead, but go up to him and be like, "Look what I have!" "Oh that's pretty cool.
Probably the only Dali that I could afford anyways." The ability of deep learning to extract patterns from data means that you can copy an artist's style without copying their images, just by putting their name in the prompt.