Scientific American EP377 | Can You Spot a Fake Better Than a Machine?



Sarah Vitak: This is Scientific American’s 60 Second Science. I’m Sarah Vitak. 


Early last year a TikTok of Tom Cruise doing a magic trick went viral.


[CLIP: Deepfake of Tom Cruise says, “I’m going to show you some magic. It’s the real thing. I mean, it’s all the real thing.”]


Vitak: Only, it wasn’t the real thing. It wasn’t really Tom Cruise at all. It was a deepfake.


Matt Groh: A deepfake is a video where an individual's face has been altered by a neural network to make an individual do or say something that the individual has not done or said.


Vitak: That is Matt Groh, a Ph.D. student and researcher at the M.I.T. Media Lab. (Just a bit of full disclosure here: I worked at the Media Lab for a few years, and I know Matt and one of the other authors on this research.)


Groh: It seems like there’s a lot of anxiety and a lot of worry about deepfakes and our inability to, you know, know the difference between real or fake.


Vitak: But he points out that the videos posted on the Deep Tom Cruise account aren’t your standard deepfakes.


The creator, Chris Umé, went back and edited individual frames by hand to remove any mistakes or flaws left behind by the algorithm. It takes him about 24 hours of work for each 30-second clip. It makes the videos look eerily realistic. But without that human touch, a lot of flaws show up in algorithmically generated deepfake videos.


Being able to discern between deepfakes and real videos is something that social media platforms in particular are really concerned about as they need to figure out how to moderate and filter this content.


You might think, “Okay, well, if the videos are generated by an AI, can’t we just have an AI that detects them as well?”


Groh: The answer is kind of yes but kind of no. And so I can go—you want me to go into, like, why that? Okay, cool. So the reason why it’s kind of difficult to predict whether video has been manipulated or not is because it’s actually a fairly complex task. And so AI is getting really good at a lot of specific tasks that have lots of constraints to them. And so AI is fantastic at chess. AI is fantastic at Go. AI is really good at a lot of different medical diagnoses, not all, but some specific medical diagnoses, AI is really good at. But video has a lot of different dimensions to it. 


Vitak: But a human face isn’t as simple as a game board or a clump of abnormally growing cells. It’s three-dimensional, varied. Its features create morphing patterns of shadow and brightness. And it’s rarely at rest.


Groh: And sometimes you can have a more static situation, where one person is looking directly at the camera, and much stuff is not changing. But a lot of times people are walking. Maybe there’s multiple people. People’s heads are turning.


Vitak: In 2020 Meta (formerly Facebook) held a competition where they asked people to submit deepfake detection algorithms. The algorithms were tested on a “holdout set,” which was a mixture of real videos and deepfake videos that fit some important criteria.


Groh: So all these videos are 10 seconds. And all these videos show actors, unknown actors, people who are not famous in nondescript settings, saying something that’s not so important. And the reason I bring that up is because it means that we’re focusing on just the visual manipulations. So we’re not focusing on “Do”—like, “Do you know something about this politician or this actor?” and, like, “That’s not what they would have said. That's not like their belief” or something. “Is this, like, kind of crazy?” We’re not focusing on those kinds of questions.


Vitak: The competition had a cash prize of $1 million that was split between top teams. The winning algorithm was only able to get 65 percent accuracy.


Groh: That means that 65 out of 100 videos, it predicted correctly. But it’s a binary prediction. It’s either deepfake or not. And that means it’s not that far off from 50–50. And so the question then we had was, “Well, how well would humans do, relative to this best AI on this holdout set?”
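To make the 50–50 comparison concrete, here is a minimal Python sketch (with made-up labels and a simulated coin-flip detector, not the competition's data) of how accuracy on a binary deepfake-or-not task is scored:

```python
import random

def accuracy(predictions, labels):
    """Fraction of binary deepfake-or-not calls that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up holdout set: 1 = deepfake, 0 = real (illustrative, not the competition data).
labels = [random.randint(0, 1) for _ in range(1000)]

# A coin-flip "detector" lands near 0.50 on a two-way call...
coin_flip = [random.randint(0, 1) for _ in labels]
print("chance baseline:", accuracy(coin_flip, labels))

# ...which is why a winning score of 0.65 is real signal, but only a
# modest improvement over guessing.
```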


Vitak: Groh and his team had a hunch that humans might be uniquely suited to detect deepfakes—in large part because all deepfakes are videos of faces.


Groh: People are really good at recognizing faces. Just think about how many faces you see every day. Maybe not that much in the pandemic, but generally speaking, you see a lot of faces. And it turns out that we actually have a special part in our brains for facial recognition. It’s called the fusiform face area. And not only do we have this special part in our brain, but babies are even—like, have proclivities to faces versus nonface objects. 


Vitak: Because deepfakes themselves are so new (the term was coined in late 2017), most of the research so far around spotting deepfakes in the wild has really been about developing detection algorithms: programs that can, for instance, detect visual or audio artifacts left by the machine-learning methods that generate deepfakes. There is far less research on humans' ability to detect deepfakes. There are several reasons for this, but chief among them is that designing this kind of experiment for humans is challenging and expensive. Most studies that ask humans to do computer-based tasks use crowdsourcing platforms that pay people for their time. It gets expensive very quickly.


The group did do a pilot with paid participants but ultimately came up with a creative, out-of-the-box solution to gather data.


Groh: The way that we actually got a lot of observations was hosting this online and making this publicly available to anyone. And so there’s a Web site, detectfakes.media.mit.edu, where we hosted it, and it was just totally available and there were some articles about this experiment when we launched it. And so we got a little bit of buzz from people talking about it; we tweeted about this. And then we made this. It’s kind of high on the Google search results when you’re looking for deepfake detection and just curious about this thing. And so we actually had about 1,000 people a month come visit the site.


Vitak: They started with putting two videos side by side and asking people to say which was a deepfake.


Groh: And it turns out that people are pretty good at that, about 80 percent on average. And then the question was “Okay, so they’re significantly better than the algorithm on this side-by-side task. But what about a harder task, where you just show a single video?”


Vitak: When people judged single videos one at a time against the same set of test videos, the algorithm came out slightly ahead: individuals correctly identified deepfakes around 66 to 72 percent of the time, whereas the top algorithm was getting 80 percent.


Groh: Now, that’s one way. But another way to evaluate the comparison—and a way that makes more sense for how you would design systems for flagging misinformation and deepfakes—is crowdsourcing. And so there’s a long history that shows when people are not amazing at a particular task or when people have different experiences and different expertise, when you aggregate their decisions along a certain question, you actually do better than the individuals by themselves.


Vitak: And they found that the crowdsourced results actually had very similar accuracy rates to the best algorithm.
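A minimal simulation of the aggregation idea Groh describes, using made-up numbers rather than the study's data: if each viewer is right roughly 70 percent of the time and their mistakes are not all on the same videos, a simple majority vote over a small crowd is right more often than any single viewer.

```python
import random

def majority_vote(guesses):
    """Aggregate binary guesses (1 = deepfake, 0 = real) by simple majority."""
    return int(sum(guesses) > len(guesses) / 2)

def simulate(n_videos=2000, n_people=15, individual_accuracy=0.70):
    """Compare one viewer against a majority vote of a small crowd."""
    single_correct = 0
    crowd_correct = 0
    for _ in range(n_videos):
        truth = random.randint(0, 1)
        # Each viewer is independently right with probability individual_accuracy.
        guesses = [truth if random.random() < individual_accuracy else 1 - truth
                   for _ in range(n_people)]
        single_correct += guesses[0] == truth      # track one representative viewer
        crowd_correct += majority_vote(guesses) == truth
    return single_correct / n_videos, crowd_correct / n_videos

single, crowd = simulate()
print(f"single viewer: {single:.2f}, crowd of 15: {crowd:.2f}")
```

The key assumption here is that errors are at least partly independent; a crowd that all misses the same artifact gains nothing from voting.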


Groh: And now there are differences again, because it depends what videos we’re talking about. And it turns out that, on some of the videos that were a bit more blurry and dark and grainy, that’s where the AI did a little bit better than people. And, you know, it kind of makes sense that people just didn’t have enough information, whereas there’s the visual information that was encoded in the AI algorithm. And, like, graininess isn’t something that necessarily matters so much, they just—the AI algorithm sees the manipulation, whereas the people are looking for something that deviates from your normal experience when looking at someone—and when it’s blurry and grainy and dark—your experience already deviates. So it’s really hard to tell. But the thing is, actually, the AI was not so good on some things that people were good on.


Vitak: One of those things that people were better at was videos with multiple people. And that is probably because the AI was “trained” on videos that only had one person.


And another thing that people were much better at was identifying deepfakes when the videos showed famous people doing outlandish things (another thing the model was not trained on). They used some videos of Vladimir Putin and Kim Jong-un making provocative statements.


Groh: And it turns out that when you run the AI model on either the Vladimir Putin video or the Kim Jong-un video, the AI model says it’s essentially very, very low likelihood that’s a deepfake. But these were deepfakes. And they are obvious to people that they were deepfakes or at least obvious to a lot of people. Over 50 percent of people were saying, “This is, you know, this is a deepfake.”


Vitak: Lastly, they also wanted to experiment with trying to see if the AI predictions could be used to help people make better guesses about whether something was a deepfake or not.


So the way they did this was they had people make a prediction about a video. Then they told people what the algorithm predicted, along with a percentage of how confident the algorithm was. Then they gave people the option to change their answers. And amazingly, this system was more accurate than either humans alone or the algorithm alone. But, on the downside, sometimes the algorithm would sway people’s responses incorrectly.
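The revise-after-seeing-the-model workflow can be sketched with a hypothetical decision rule; the function name, threshold, and numbers below are illustrative assumptions, not the procedure used in the study. It also previews the failure mode Groh describes next: a confidently wrong model can flip a correct human answer.

```python
def final_call(human_guess, model_p_fake, confidence_threshold=0.90):
    """Hypothetical revision rule (illustrative, not the study's procedure):
    keep the human's answer (1 = deepfake, 0 = real) unless the model
    disagrees and is very confident, in which case defer to the model."""
    model_guess = int(model_p_fake >= 0.5)
    confident = (model_p_fake >= confidence_threshold
                 or model_p_fake <= 1 - confidence_threshold)
    if confident and model_guess != human_guess:
        return model_guess
    return human_guess

# A confident, correct model can rescue a wrong human call...
print(final_call(human_guess=0, model_p_fake=0.97))  # prints 1
# ...but a confident, wrong model also overrides a correct one.
print(final_call(human_guess=1, model_p_fake=0.02))  # prints 0
```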


Groh: And so not everyone adjusts their answer. But it's quite frequent that people do adjust their answer. And in fact, we see that when the AI is right, which is the majority of the time, people do better also. But the problem is that when the AI is wrong, people are doing worse.


Vitak: Groh sees this, in part, as a problem with the way the AI’s prediction is presented.


Groh: So when you present it as simply a prediction, the AI predicts 2 percent likelihood, then, you know, people don’t have any way to introspect what’s going on, and they’re just like, “Oh, okay, like, the AI thinks it’s real, but, like, I thought it was fake. But I guess, like, I’m not really sure. So I guess I’ll just go with it.” But the problem is that that’s not how, like, we have conversations as people. Like, if you and I were trying to assess, you know, whether this is a deepfake or not, I might say, “Oh, like, did you notice the eyes? Those don’t really look right to me,” and you’re like, “Oh, no, no, like, that—that person has, like, just, like, brighter green eyes than normal. But that’s totally cool.” But in the deepfake, like, you know, AI collaboration space, you just don’t have this interaction with the AI. And so one of the things that we would suggest for future development of these systems is trying to figure out ways to explain why the AI is making a decision.


Vitak: Groh has several ideas in mind for how you might design a system for collaboration that also allows the human participants to better utilize the information they get from the AI.


Ultimately, Groh is relatively optimistic about finding ways to sort and flag deepfakes—and also about how influential deepfakes of false events will be.


Groh: And so a lot of people know “Seeing is believing.” What a lot of people don’t know is that that’s only half the aphorism. The second half of aphorism goes like this: “Seeing is believing. But feeling is the truth.” And feeling does not refer to emotions there. It’s experience. When you’re experiencing something, you have all the different dimensions that’s, you know, of what’s going on. When you’re just seeing something, you have one of the many dimensions. And so this is just to get up this idea that, you know, that that seeing is believing to some degree. But we also have to caveat it with: there’s other things beyond just our visual senses that help us identify what’s real and what’s fake.


Vitak: Thanks for listening. For Scientific American’s 60 Second Science, I’m Sarah Vitak.



