OpenAI has unveiled Voice Engine, a model that generates speech in a speaker's own voice from an audio sample about 15 seconds long. It was first developed in 2022 and has been tested with small groups to explore uses such as reading assistance, translation, support for non-verbal users, and help for people with speech-related disabilities.
Many services already create voices using AI, but none of them come close to this quality. As with ChatGPT, OpenAI seems to have made a habit of raising a technology to the point where it is useful to the public and then spreading it widely.
You can listen to samples in OpenAI's blog post. Given a 15-second sample and a paragraph of text, the model reads the text naturally. It turns short passages on a range of topics into audio that carries the characteristics of an individual's voice. From a sample read in English, it can also generate speech in Spanish, Chinese, German, French, or Japanese.
Natural-sounding audio can be produced even from samples of people with speech impairments, so they can say what they need for everyday communication in their own voice. Unlike existing text-to-speech, it does not sound like a machine reading aloud, which should make it more comfortable to use.
OpenAI emphasizes that it is working with companies that can contribute to human values in areas such as education and health, but this may be the riskiest technology of all. Text has always been easy to copy, so people are already wary of its authenticity, and video is still difficult to make natural enough to pass as a complete fake.
But voice is different. It is all too easy to obtain a 15-second sample, and much harder for a listener to judge whether what they hear is authentic. Voice authentication, for example when opening a bank account, looks likely to become difficult in the future. Not long ago, a robocall circulated in the United States that imitated President Joe Biden's voice and urged New Hampshire Democrats not to vote.
Voice Engine is currently available only to a limited number of companies. Creating a voice requires consent from the person whose voice is used, and listeners must be told that the voice was generated by AI. OpenAI also says that watermarks are added to audio clips to identify their origin and track their distribution.
But I don't think this is enough. There seems to be an urgent need to protect intangible private property, and to prepare for a crisis of diversity in which differences between individuals are erased and people are pushed toward uniform answers.