This month's article is a long read, but you can listen to me narrate the piece below…
It’s a strange feeling when you realise you’ve been cloned. Hearing a voice that sounds just like yours, saying lines you’ve never said, on a website you don’t recognise and have no connection to, is disorienting to say the least. And that’s exactly what happened to me a few weeks ago…
I’m a voiceover artist, and there I was on a Wednesday morning enjoying my first cup of coffee and checking my email, when I opened a marketing message from a company I didn’t immediately recognise offering voiceover services. “I must have signed up for their mailing list”, I thought, and clicked through to find out more about them. “Any voice! Any language!” read the blurb. My first reaction was to wonder why I wasn’t already working with them, and I noticed that they had a page for prospective talent like me to sign up. Great, I thought. But I decided to do my homework first and see who else they might already have on their roster, so I clicked through to the voice samples page, selected “English (UK)” (because I’m British) then “Male”, and clicked play.
And there I was! It was obviously me, but very much an “Uncanny Valley” version of me: my vocal tones, but with a slightly odd range of prosody, and a cadence that wasn’t my own. In truth, it was more like a version of me who wasn’t that great a voiceover artist and needed some coaching. And it was very obviously machine-generated speech, rather than a recording I’d made. But there the voice was: available to buy, and with no obvious mention anywhere that I was listening to a synthetic voice.
My brow furrowed… how was this possible? Who were this company, and how were they using my voice? I had no recollection of ever agreeing to this, and a cascade of emotions - ranging from puzzlement to anger to betrayal - began to wash over me.
There was one clue: the pseudonym they’d given the voice on the site jogged a memory. I’d worked with a semi-regular client for several years, and the scripts they sent across had used the same pseudonym at the top of the page. I’d noticed this in the past, but it isn’t always a signal of nefarious behaviour. Some clients, wary of voice talent who they fear might steal their customers to cut out the middleman, will put this kind of smokescreen in place between the end client and the voice talent, to make it difficult for the end client to know who the voice really is. It’s not something I like, but personally I have no wish to poach anyone’s customers and - if the material is primarily for use within a company and not for wider public consumption, where I’d want the recognition of it being me - it’s something I’ve generally turned a blind eye to. After all, if my clients want to work in a culture of fear, that’s up to them. As long as I get paid, and no one’s profiting by dint of the mislabelling, I’m still putting food on the table - even if I acknowledge that my slightly bruised ego doesn’t particularly like it.
Putting my coffee aside and putting on my deerstalker, I delved into my email archive. A quick search for previous correspondence with, well, let’s call them “Acme Voices” (because an NDA prevents me from giving away their real identity) revealed that they’d been bought out a couple of years ago by an AI company (whom I’ll refer to as “Abominable AI”). The email explaining the merger talked about the exciting new opportunities this would mean for voice talent, and urged anyone with questions or concerns to get in touch. The VP listed their contact details at the bottom of the email, so I immediately sent an email, left a voicemail - saying that I’d sent an email and would appreciate a call to talk about events since the merger - and, because I was connected to the VP on LinkedIn, I sent them a LinkedIn message for good measure. (None of these messages have been replied to at time of writing, despite my email tracking telling me the emails were opened.)
Trawling back further, to our original correspondence back in 2016, I found a contract - and an NDA. And then the penny dropped… I’d basically signed away not only the copyright in the recordings I’d supplied, but also the right to reuse them in any form, forever. There was even a clause that specifically mentioned use in “TTS” (Text-to-Speech). The NDA added a well-fitting lid to the pot, sealing it all in and requiring that I didn’t talk about the contract or my relationship with the client. Yes, you’re right: I was an idiot. You might even draw the conclusion that I deserved everything I got.
But let’s back up a little. The truth is (regardless of what you may hear to the contrary) I’m not actually an idiot. I’d looked at the work I was doing for the client, which was almost exclusively short telephone prompts (“Thanks for calling. If you’d like technical support, press 1” - that sort of thing), each running to only a few sentences. In 2016, the potential to reuse these sorts of brief recordings in any exploitative way was negligible. It wasn’t as if these were broadcast ads for Coca-Cola, for example, where I’d need to set time limits and renewal terms.
Let’s be realistic: in an ideal world we’d all be able to negotiate every contract to our satisfaction. But the reality is that the larger the client, the more likely it is they’ll have their own contract for you to sign. It’s usually a contract that their legal department has drafted, and producers and managers are, more often than not, unwilling or unable to tweak it for individual requests. Personal experience, multiple times, has taught me that pushing back and requesting revisions often leads to a “no”, so deciding whether to sign comes down to a little “risk analysis”: weighing the terms against how much you want the work. Basically, most of the time you can take the contract “as is” or move along. So, I signed…
In hindsight, it looks very much like this contract was designed to enable exactly this type of future exploitation. Every point of potential objection around reuse was legally covered, leaving me no grounds for complaint. It was all totally legit legally, even if the ethics sucked.
Crucially, and despite the clause about TTS, in 2016 it wasn’t even possible to create a voice model from small samples of audio like this. Seven years ago, TTS models needed hours of purpose-written and painstakingly recorded audio to do anything useful (and usually it still came out sounding slightly robotic and artificial at the end of it). But technology has changed in the interim: it’s now possible to take just a minute or so of anyone’s voice and create a workable model that sounds like the original speaker. It works by layering the tones and timbre of a recorded voice over an AI model that’s already been trained to replicate the pacing and prosody of human speech. That’s why the sample on the site wasn’t quite me: it had my tones, but not my “flow”.
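To give you a sense of just how low the barrier now is, here’s a minimal sketch of the kind of few-shot cloning I’m describing, using the open-source Coqui TTS library and its XTTS model. My choice of library and the file names are purely illustrative - I have no idea what Abominable AI actually uses:

```python
# A minimal sketch of few-shot voice cloning with the open-source
# Coqui TTS library (pip install TTS). The reference clip and output
# paths below are placeholders.
from TTS.api import TTS

# Load a pretrained multilingual voice-cloning model (XTTS v2).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# The model layers the reference speaker's tone and timbre over its
# own learned pacing and prosody - which is why the result can have
# the speaker's "sound" without quite having their "flow".
tts.tts_to_file(
    text="Thanks for calling. If you'd like technical support, press 1.",
    speaker_wav="reference_sample.wav",  # a minute or so of the target voice
    language="en",
    file_path="cloned_prompt.wav",
)
```

That’s it: a dozen lines of code and one short, clean recording.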
OK, you might think, but if it’s not a great-sounding model, and it doesn’t really sound like me, then what’s the harm - other than the idea that someone’s pimping out a voice that’s basically mine and not paying me for it?
Well, what happens when that technology gets better (more on that in a moment) and the voice is given a script to read that’s at odds with my own moral code? Something that I’d decline to record myself… something that’s politically extreme… what about hate speech? What if it’s used to “phish” for financial misdeeds? Abominable AI would, of course, claim that they have safeguards in place, and that they wouldn’t allow such misuse. But as we’ve seen with social media companies, expecting corporations to act responsibly as judge and jury around online behaviour and ethics is a little like leaving the fox in charge of the henhouse. Unless we know and trust the company concerned, and have explicitly granted them the right to police what our “clones” say, I’d contend that the final arbiters of what should be spoken in our voices should always be ourselves.
But in some ways the whole argument about assigning rights to recordings is, at this point, moot. Like many narrators, I have countless hours of audiobook material out there that can be harvested. We’re beginning to see AI developers offer untenable justifications for having used audio that they’re quick to point out is “publicly available”, but we need to remember that publicly available isn’t the same thing as being in the public domain, or free for commercial use. We already have a term for this: it’s called copyright theft. Plagiarism for profit has met its match in court many times, with successful lawsuits around illegal bootlegging and sampling in music being recent examples. We’re also beginning to see fake “auditions” posted on online casting sites that are supposedly for business usage, but where it’s obvious from reading the script that the poster is really looking for clean audio from which to create an AI model. For voiceover artists, it’s hard to see a way to prevent any of these types of misuse.
But it’s not just voice talent like me for whom this is a threat. Remember: it’s now possible to take a minute or so of anyone’s speech and model it for AI purposes, with or without their permission. Whoever you are, your voice can be sampled in a phone call, a Zoom meeting or just about anywhere else at this point and turned into a model of you. If there’s a recording of your voice online, you’re even more of a potential target. What happens when your mother, your brother, your partner gets a call claiming there’s an emergency - from someone who’s apparently you - and in the rush and confusion hands over sensitive information or money to a phishing attack? A schoolfriend of mine, who’s now paid to dream about these sorts of things for a major IT and big data corporation, told me the other day that it’s already possible for someone with the requisite lack of scruples to offer “Phishing as a Service”. A Generative AI chatbot, he told me, connected to an AI speech model, can hold a conversation with you in real time. And (here’s the moment where my jaw hit the floor) it can do a better job of it than someone in a foreign call centre who speaks English as a second language.
The argument around artifice - the idea that you can tell it’s an AI - will also soon be moot. A colleague who’s been creating legitimate TTS models for some years told me that this new generation of Generative AI voice models is amazingly realistic and natural-sounding. “Forget what you think you know from listening to Siri and Alexa”, he said. The truth is, you’ll never be able to tell, at least in the context of a conversation, that you’re not talking to a real human being.
So, what can we do and what can we learn here? In some ways, I appreciate that this story - apart from being a cautionary tale - raises more questions than there are currently answers. I can almost hear the pennies dropping in the minds of some readers, who may be realising they’ve signed away their rights in the past in circumstances similar to mine. Anyone who’s signed up with an online casting site, voice directory or production company in the last few years ought to be checking those contracts very carefully at this point. And obviously, anyone who’s asked to sign a contract for voiceover services going forward should, at the very least, be checking the terms and - with the benefit of knowing what’s now possible - pushing back more firmly against terms that might enable later misuse. Those periodic updates to Terms and Conditions, which we’re primed to gloss over and disregard, might just be worth reading after all.
Some of us are beginning to add an “AI rider”, like the one that NAVA (the National Association of Voice Actors) has on its website, to our paperwork. Part of that conversation will likely need to include educating the client about where this leaves talent like us exposed. But there’s also a line to tread here, to avoid pointing fingers pre-emptively at innocent people who have no desire to do anything iniquitous with our recordings, or running around sounding like Chicken Licken (or Henny Penny, depending on where you grew up) and claiming that the sky is falling in. After all, most clients are decent; not everyone is out to steal your voice; and we do need to guard against allowing ourselves to live in fear.
Then again, what happens when - as in my case - the company you sign away your rights to gets bought out by another company with other ideas? We also need to be aware that there are forces at work which aren’t adhering to the gentleman’s agreement. I’ve called Abominable AI “developers” here, but the truth is that many of these companies are much larger than the couple of enterprising nerds the word might initially suggest. These startups are often funded by venture capitalists, with the money on the table – sometimes running to millions of dollars – enough to tempt all but the most principled of voiceover production company and voice directory owners away from the moral high ground. What’s happened to me, and to the other people on Abominable’s site, has doubtless happened already to others, and will continue to happen.
It’s said that hindsight is 20/20 vision, and looking again at my original contract it’s hard to discount the idea that Acme Voices were setting themselves up to be taken over. For all I know they were tying everything up neatly ahead of time, then hawking themselves around AI companies (of which there are thousands) offering their library for exploitation. And on the other side, there are doubtless companies out there buying up similar libraries of voice recordings, like Acme’s, with the rights already assigned and the talent blissfully unaware of what they’ve unwittingly enabled. An audiobook producer, Findaway Voices, was recently called out by members of its own narrator pool for a clause in its terms and conditions that many had missed, which allowed Apple to use recordings “for machine learning training and models”. Findaway was acquired by Spotify last June (remember what I said about big money?). Even users of popular audio recording software are beginning to notice clauses which permit their supposedly private recordings to be harvested for such purposes – particularly if the audio is stored on the cloud or processed remotely. We should all, it seems, be taking more trouble to read those pesky T&Cs…
From a legal point of view, how should we regard the legitimacy of a contract where one party may have knowingly misled the other – the second party making a judgement based on the state of technology at the time of signing, while the first party knew what was coming a few years down the line? (It’s not unlike the world of insider trading…) Is an NDA that prevents someone like me from “whistleblowing” - i.e., telling my colleagues who the client really is, so that they can check whether their own voices have been taken and modelled, and so that we might collectively organise to challenge it in the courts - really a fair contract? And as the technology reaches a point where telling genuine speech from AI speech becomes difficult, if not impossible, who would be liable if someone made my clone read hate speech or slander someone publicly – and how might I prove in court that I hadn’t made the recording myself?
It’s clear to me that where we are now is just the tip of the iceberg regarding AI in relation to moral conduct, copyright theft and more. My friend and voiceover colleague, Bev Standing, settled out of court with TikTok after the social media giant began selling a model of her voice without consent. Getty Images is, at time of writing, suing a company called Stability AI, claiming it unlawfully scraped millions of images from its site for reuse by generative AI. And AI developer ElevenLabs is fighting a rear-guard action after deepfakes generated with its technology made an AI version of actor Emma Watson read Adolf Hitler’s “Mein Kampf”, while another made an AI version of President Biden make sexist and transphobic comments. And this is before we even get into deepfake videos…
It seems we’re in the Wild West here, and in a territory where things are moving very fast indeed. What happens when developers begin offering “blended” voices, created by taking different samples and mixing them, so it’s no longer clear whose voice the model was based on? (In some ways this might actually help, as it could mean fewer conflicts over attribution and liability.) I foresee a time when you’ll be able to go onto a website and – using something akin to the graphic equaliser on your old hi-fi system – move sliders for pitch, pace, prosody, projection, accent and more, to generate a completely new voice in real time – and get it to say whatever you want it to.
This morning, as I was preparing to sit down and write this piece, another AI developer (with whom I’ve openly been working for some time, on equitable terms) sent me a clip from yet another website. And there I am again - or at least another, slightly drunken-sounding version of me. So far, I have no idea of this one’s provenance or how it got there. But having been at this voiceover thing for some time, it seems that having Google surface my name whenever someone searches for “British male voiceover artist” may have become as much a curse as a blessing when it comes to my voice clips being “found”.
One thing is clear: when it comes to AI, copyright, and ethics, the horse is very much out of the stable. And in case you’re in any doubt: as a horse owner, I can tell you that a loose horse is a very dangerous thing…
Send in the clones? Don’t bother, they’re here.