Have you ever tried to describe the sound of a malfunctioning car engine or imitated the meow of a neighbor’s cat? Vocal imitation, akin to quickly sketching an idea, is a powerful way to convey concepts when words fall short. Instead of drawing with a pencil, we utilize our vocal cords to express sounds that resonate with our experiences. Though it may seem challenging, vocal imitation is an instinctual skill we all possess. Test it out by mimicking the wail of an ambulance siren, a crow’s caw, or the chime of a bell!
Inspired by cognitive science, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system capable of generating human-like vocal imitations without ever being trained on, or exposed to, a human vocal impression. The system works by modeling the mechanics of the human vocal tract to simulate sound production, mirroring the way sounds are shaped by our throat, tongue, and lips.
The research team used a cognitively inspired AI algorithm to control their vocal tract model, enabling it to produce imitations suited to the context in which humans communicate sounds. The model excels at transforming sounds from our environment, whether the rustling of leaves, a snake’s hiss, or an approaching ambulance siren, into human-like imitations. It can also work in reverse, interpreting a human vocal imitation and inferring the real-world sound it refers to, much as some AI systems can infer a detailed image from a rough sketch, and it can discern subtle variations, such as a human’s ‘meow’ versus a ‘hiss.’
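To make the idea concrete, here is a minimal, hypothetical Python sketch of what a vocal-tract-style synthesizer paired with an analysis-by-synthesis imitation loop could look like. It is not the CSAIL team’s code: the glottal pulse source, the two formant resonators, and the brute-force parameter search below are illustrative assumptions standing in for their far richer model.

```python
# Toy "vocal tract" sketch: a glottal pulse train shaped by formant resonators,
# plus a crude search for the settings whose output best matches a target sound.
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz (arbitrary choice for this sketch)

def formant_filter(x, freq, bandwidth, sr=SR):
    """Apply one second-order resonator (a single 'formant') to signal x."""
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2 * np.pi * freq / sr
    a = [1.0, -2 * r * np.cos(theta), r * r]  # resonator pole coefficients
    return lfilter([1.0], a, x)

def synthesize(pitch_hz, formants, duration=0.5, sr=SR):
    """Impulse-train 'glottal' source passed through a chain of formant filters."""
    n = int(duration * sr)
    source = np.zeros(n)
    source[::int(sr / pitch_hz)] = 1.0  # glottal pulses at the chosen pitch
    y = source
    for freq, bw in formants:
        y = formant_filter(y, freq, bw, sr)
    return y / (np.max(np.abs(y)) + 1e-9)

def spectral_distance(a, b):
    """Compare two sounds by the gap between their log-magnitude spectra."""
    n = min(len(a), len(b))
    A = np.log1p(np.abs(np.fft.rfft(a[:n])))
    B = np.log1p(np.abs(np.fft.rfft(b[:n])))
    return float(np.mean((A - B) ** 2))

def imitate(target, sr=SR):
    """Analysis-by-synthesis: pick the vocal-tract settings whose output
    sounds most like the target (brute-force grid search for clarity)."""
    best, best_cost = None, np.inf
    for pitch in (90, 120, 180):
        for f1 in (300, 500, 700):
            for f2 in (900, 1500, 2200):
                candidate = synthesize(pitch, [(f1, 120), (f2, 200)], sr=sr)
                cost = spectral_distance(candidate, target)
                if cost < best_cost:
                    best, best_cost = candidate, cost
    return best
```

The real system presumably uses a much more detailed simulation of the throat, tongue, and lips, and a smarter, cognitively inspired controller in place of this grid search, but the synthesize-compare-adjust loop conveys the general pattern.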
In the coming years, this model could pave the way for intuitive sound design interfaces, enhance human-like interactions in virtual environments, and even aid in language learning processes for students.
In a paper co-authored by MIT CSAIL PhD students Kartik Chandra and Karima Ma, along with undergraduate researcher Matthew Caren, the researchers acknowledge that while aesthetic realism often guides visual expression, vocal expression does not necessarily conform to this principle. Chandra explains, “Over the years, advancements in sketching algorithms have unveiled new tools for artists, propelling AI and computer vision, and enhancing our understanding of human cognition. Similarly, our method encapsulates the abstract ways in which humans vocalize sounds, illustrating the art of auditory abstraction.”
“The goal of this project has been to understand and computationally model vocal imitation, recognized as the auditory counterpart to sketching in visual art,” Caren states.
The Three Phases of Imitation Technology
The team designed three increasingly nuanced versions of their model and compared each against human vocal imitations. They began with a baseline model that simply aimed to reproduce real-world sounds as accurately as possible, but this version did not align well with how humans actually imitate.
To improve on this, the researchers built a second, more “communicative” model, which focuses on the attributes of a sound that are most distinctive to a listener. For instance, when mimicking a motorboat, you would likely exaggerate the rumble of its engine, its most recognizable feature, rather than the splash of the water. This model improved upon the baseline, but the team sought even greater realism.
The final model adds one more layer of reasoning. “Imitating sounds involves varying degrees of effort; producing flawless renditions isn’t always feasible,” Caren explains. The refined model accounts for the fact that, in conversation, people avoid imitations that are uncomfortably fast, loud, or extreme in pitch, which results in more authentic-sounding vocal expressions.
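As an informal illustration of how the three versions differ, here is a small scoring sketch in Python. It is our own simplification, not the authors’ code: the feature names, weights, and thresholds are invented for the example. The baseline rewards acoustic match alone, the “communicative” version up-weights the target’s most recognizable features, and the final version also subtracts a penalty for imitations that would take unrealistic effort.

```python
# Illustrative scoring of a candidate imitation under the three model variants
# described above (all numbers are made-up weights for the sake of the sketch).
from dataclasses import dataclass

@dataclass
class Features:
    pitch_hz: float           # average pitch of the imitation
    loudness: float           # relative loudness, 0..1
    rate: float               # articulations per second
    match: float              # overall acoustic similarity to the target, 0..1
    distinctive_match: float  # similarity on the target's signature features, 0..1

def baseline_score(f: Features) -> float:
    # Version 1: just reproduce the sound as faithfully as possible.
    return f.match

def communicative_score(f: Features) -> float:
    # Version 2: listeners care most about a sound's signature features
    # (e.g., a motorboat's engine rumble), so weight those more heavily.
    return 0.3 * f.match + 0.7 * f.distinctive_match

def full_score(f: Features) -> float:
    # Version 3: also discount imitations that demand unrealistic effort,
    # i.e., ones that are too loud, too fast, or too extreme in pitch.
    effort = 0.0
    effort += max(0.0, f.loudness - 0.8)                # uncomfortably loud
    effort += max(0.0, (f.rate - 6.0) / 6.0)            # unnaturally fast
    effort += max(0.0, abs(f.pitch_hz - 160.0) / 400.0) # extreme pitch
    return communicative_score(f) - 0.5 * effort
```

Under this kind of scoring, an imitation that captures a motorboat’s engine rumble at a comfortable pitch and volume would beat a technically closer rendition that demands shouting.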
Following the development of their model, the team conducted behavioral experiments to see whether human judges rated the AI-generated or human-generated vocal imitations more favorably. Remarkably, participants preferred the AI-generated imitations as much as 25 percent of the time overall, and as much as 75 percent of the time for a motorboat and 50 percent for a gunshot.
Advancing Expressive Sound Technology
With a passion for blending technology with artistic expression, Caren envisions potential applications for this model to assist artists in conveying sounds to computational systems, support filmmakers by generating contextually nuanced AI sounds, and help musicians swiftly navigate sound databases through imitations—especially for the harder-to-describe sounds!
As they continue refining their model, the team is also exploring its implications across various domains, including language acquisition, the development of speech in infants, and imitation in birds like parrots and songbirds.
However, challenges remain: the model still struggles with certain consonant sounds, and it cannot yet replicate how humans imitate speech or song, which varies from language to language. Linguistics expert Robert Hawkins notes the complexity of language evolution, emphasizing that words like ‘meow’ only loosely approximate the sounds cats actually make rather than replicating them. He highlights the model’s potential to illuminate the processes linking vocal imitation, physiological constraints, and social communication.
This research, co-authored by Caren, Chandra, and Ma along with their colleagues Jonathan Ragan-Kelley and Joshua Tenenbaum, was supported by the Hertz Foundation and the National Science Foundation and was presented at SIGGRAPH Asia in early December.