Audio Simulation of the Late Anthony Bourdain's Voice Raises Questions About AI Ethics

Melissa Harris-Perry: It's The Takeaway, I'm Melissa Harris-Perry in for Tanzina Vega. In February, a series of videos featuring a man who looked a lot like Tom Cruise popped up on TikTok.

Speaker 1: What's up, TikTok? You guys cool if I play some sports?

Melissa Harris-Perry: Now, to be clear, that was not actually Tom Cruise. It was a video put together by a special effects professional to star a Cruise impersonator. But for people just going through TikTok, how are they supposed to know that the footage was actually what's known as a deepfake? In recent years, deepfake videos have become increasingly convincing, but in many of those videos, the audio has felt, well, fake.

Then last week, a debate was ignited about the ethics of using artificial intelligence to create convincing audio of words that a person never actually said. New Yorker writer Helen Rosner published an article revealing that for the new documentary Roadrunner, filmmaker Morgan Neville had used AI technology to create audio simulating the voice of the late Anthony Bourdain, reading an email that he wrote to a friend. The decision was sharply criticized by many on social media, particularly because viewers of the film would not have known that the audio was manipulated if Neville hadn't disclosed the fact during his interview.

In response to Rosner's initial questions, Neville said, "We can have a documentary ethics panel about it later." Well, I'm thinking maybe we should have an ethics panel now. Audio manipulation isn't exactly new. Remember the tear-jerker track of the late Nat King Cole posthumously singing a duet with his daughter, Natalie?

Nat King Cole: That's why, darling, it's incredible

That someone so unforgettable

Thinks that I am

Unforgettable too.

Melissa Harris-Perry: When the departed Nat King Cole harmonizes with the beloved daughter, I get butterflies in my stomach, but when the voice of departed Anthony Bourdain is chopped into place to make him speak words that he never spoke, well, I get a little queasy in my stomach. What are the ethics of audio and what responsibilities do creators have in presenting this work to the public? Here to help me answer that question is Karen Hao, senior AI editor at the MIT Technology Review. Great to have you here, Karen.

Karen Hao: Thank you so much for having me, Melissa.

Melissa Harris-Perry: Before we get to the ethics, can you explain to me how this kind of AI works?

Karen Hao: Yes. You brought up some great examples of deepfakes, both video and audio. Deepfake, it's a very broad category of synthesized media that is specifically generated by artificial intelligence. Back in the day, if you wanted to touch up a video, if you wanted to touch up a photo or touch up some audio, you're manually editing things in Photoshop or you're manually mixing the sound to try and tweak and tune different elements of it to make it sound more like someone or to make it look more like someone. It's a very laborious process, and you need a significant amount of training and to be a professional in that space to achieve that high-quality effect.

With AI technology today, you can essentially just take thousands and thousands of audio clips of someone's voice or thousands and thousands of photos of that person's face, feed it into an algorithm, the algorithm learns what makes someone look Melissa-ish or sound Melissa-ish, and then it will start pumping out new pictures, new videos or new audio that really captures the likeness of that person, to the point where today it's really hard to detect sometimes if what you're seeing or what you're hearing is actually real or fake.

Melissa Harris-Perry: That's a fascinating technology. I could see how it could be valuable, if we had informed consent, if we had a living subject, who said, "All right, I wrote this as an email. I've never actually said it, but sure, take time to use the AI to make that voice." I can even imagine for some folks who lose their voice related to disease or disability, that it's a way to produce a voice of them that sounds closer to their own voice before it's lost, but it feels really different when someone can't give consent for it.

Karen Hao: I think you touch on one of the core ethical challenges here. I think there's two that come into play when we're talking about the Bourdain example. One is consent. Obviously, this was done posthumously. There's a question of, "Okay, well, did he really want to portray himself that way?" Sure, he wrote that email, those are his words, but in the same way that a writer might write a piece of fiction and then they have a particular vision in their mind for how that would be presented, it feels a little weird that someone just took creative license to present his words in the way that they wanted, not the way that he wanted.

I feel like the second issue is disclosure. The public was feeling really riled up, in part, because they would have never found out that this was fake audio unless Helen Rosner had actually asked Neville how he actually obtained the audio of the email. Then it came out through the interview that he had used the synthetic method. Because we're treading very new territory here, there are very few examples of this technology being used in mainstream media production. I think we're just starting to learn how much people can really care about the consent of a person and the disclosure to the audience, those two elements.

Melissa Harris-Perry: I keep thinking, there is a political possibility here. If I could just make former President Trump or current President Joe Biden say anything because I've got plenty of audio to feed into my algorithm, that strikes me as really dangerous.

Karen Hao: Absolutely. I think that is one of the most terrifying aspects of deepfakes. There's definitely a very big worry that even if we have these technologies developed by companies that are responsible, that will only take on certain projects after they've vetted it for ethics, the technology already exists out there and anyone can use it at any time. That means bad actors as well. There's both a concern of political misinformation, where we see or hear video and audio of politicians saying something, believe that it's true and it's actually completely fake, and there's also the flip concern that we see something that's completely real, but because this technology exists, we now start to doubt its veracity and believe that it might be fake.

This was really interesting because I don't know if people remember, when the Access Hollywood tape came out, probably a few months after that came out, Trump actually said, "Oh, well, these days there are so many fake voice technologies out there now." He tried to evoke the idea that it might have been synthesized. Both of those situations are rather dangerous because they slowly erode away our trust of media.

Melissa Harris-Perry: All right, Karen, I want you to just take a minute and listen to something that the crew and I came up with this morning.

[music]

T-Pain: Baby girl, what's your name?

Melissa Harris-Perry: I'm Melissa Harris-Perry.

T-Pain: Let me buy you a drink.

Melissa Harris-Perry: Listen, wait, I don't think this is the time.

T-Pain: I'm T-Pain, you know me.

Melissa Harris-Perry: Okay, T, that's all good, but right now I'm guest-hosting The Takeaway.

Lyrics: Buy you a drink.

Melissa Harris-Perry: Look, we're all just having a little fun here with T-Pain's 2008 smash hit Buy You a Drink. As part of the Netflix series This Is Pop, T-Pain revealed suffering years of depression after an encounter with the pop singer Usher, who claimed that because T-Pain makes use of the pitch correction software Auto-Tune--

T-Pain: He was like, "Man, you kind of fucked up music."

Melissa Harris-Perry: Karen, we were just talking about this notion of plausible deniability or realism. I know it seems odd, but I want to go to the T-Pain Auto-Tune for a moment, because part of what happens with T-Pain is he turns the Auto-Tune up so much that you know there is a distortion happening, but so much of how Auto-Tune is actually used is to make us believe that what we're hearing is someone singing in pitch when in fact, they're not.

Karen Hao: This is such a great example. I haven't actually thought about-- Yes, that's true. The fact that we're synthesizing audio today is an extension of many of the audio editing technologies that we've had before. I want to go back to images as an analog because when deepfakes first came out, people thought, "Oh my gosh, now that we have these synthetic faces that look identical to the original person, this is a completely new technology that we've never seen before." Then a lot of experts came and said, "Actually, no, we've had Photoshop." There are similar ethical questions around Photoshop. It's just that because it's a more familiar technology to us and to the public, we now have general guidelines for understanding how to Photoshop images in a way that doesn't mislead or doesn't completely falsify certain things.

I think it's the same thing with Auto-Tune and deepfake voice technology now, where Auto-Tune was the predecessor, where we were just beginning to play with the stuff. The distinction that you made is very real, which is that with both Photoshop and Auto-Tune, there is still a certain element of detection, or because the public is aware of these things, they know to look out for it. With deepfakes, not only do people not really know that this technology exists right now, but it's also so much harder to actually figure out whether or not it's been used. There's a completely different set of norms that we're going to have to start developing for filmmakers, or musicians, or whatever to figure out, what is the actual responsible way of disclosing this to the public?

Melissa Harris-Perry: I appreciate the language you're using of norms, of guidelines because this doesn't particularly feel to me like the thing that can be legislated. At least not yet. This is a swiftly moving technology. This is really about those folks who have, what is in fact, a pretty incredible power with audio. How do norms arise when we think about something like artificial intelligence and how it gets used in media spaces?

Karen Hao: I feel like because AI moves so fast, the norms really do arise by just trial and error. In this example, Neville trialed something, and then clearly the public did not take it well. Now we know that there's probably some norm that was crossed and we should refine what that boundary is. There's an interesting example of deepfakes being used in a different documentary. This time it was deepfake faces, not voice. It was the 2020 documentary called Welcome to Chechnya. It's about these LGBTQ activists in Chechnya who are trying to push for human rights and hold the government accountable for human rights abuses against the queer community.

The documentary filmmaker wanted to tell these people's stories, but was caught in this moral dilemma because if he showed their faces, they could be at risk of being killed for their work. He used deepfake faces to actually change their face, but it still retained their emotional expressiveness, it still retained the authenticity of their day-to-day real-life experiences. What he did was, he left a little bit of a halo or glowing effect around the people whose faces were altered, and also, I believe, said in the beginning of the documentary, a disclosure that most of the characters in the film were speaking with faces that were not their own.

In that case, it was very widely celebrated. Again, he trialed something and now, it did go well. We now understand, "Okay, this might be a possible solution moving forward for other filmmakers who are trying to experiment with the same technology." I think we're still at such a nascent stage that it's really just about trialing responsibly and talking with different people, different stakeholders to make sure that you understand the different questions or different concerns that will arise every time you use the technology.

Melissa Harris-Perry: It may be nascent, but I do feel like you're beginning to map for us some key issues. Disclosure, consent, purpose of the use. Is it for a creative license or is it to advance something that is problematic or nefarious? Then also this language of collaboration, that it's not a decision an individual makes, but one that is made as part of an open process. I feel like all of that is a useful way for us to think about audio ethics more broadly, whether it's about pop music or documentaries or what we do on the radio every day.

Karen Hao: Absolutely. I think one of the things, when you engage with such new technologies, is you need to be attentive to the different conversations that are happening in the public and in expert communities who have been studying these technologies or interacting with these technologies for a while. This is certainly a nascent technology in the sense that we've never really had deepfake audio that has achieved such high quality and been used in a mainstream setting. It's not emerging in the sense that there have been experts studying this for several years now.

There are certain previous examples that can be used as analogs to understand, "Okay, this time around, even if the technology has advanced even more, what can we learn from the ethics of Auto-Tune? What can we learn from other audio editing projects that didn't necessarily use AI, but achieved the same effect?" Through engaging in a lot of that discourse, that's how you can make decisions with more wherewithal, I suppose, instead of just secretly making it yourself and then announcing it to the world and hoping that people take it well.

Melissa Harris-Perry: Karen Hao is the senior AI editor at the MIT Technology Review. Karen, thanks so much.

Karen Hao: Thank you so much for having me, Melissa.

Melissa Harris-Perry: Folks, we want to hear from you on this. How do you feel about the emergence of artificial intelligence to doctor audio, video, photo? What concerns do you have? Tell us by recording a voice message. No, really, you record it in your voice and send it to takeawaycallers@gmail.com. Again, takeawaycallers@gmail.com.

New York Public Radio transcripts are created on a rush deadline, often by contractors. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of New York Public Radio’s programming is the audio record.

Produced by Ethan Oberman

Hosted by Dr. Melissa Harris-Perry

Produced by PRI and WGBH