Researchers use AI to identify a new way mothers change voice when talking to babies

Who’s a good scientist? You’re a good scientist! Yes you are!

That’s what we say to the group of Princeton University neuroscientists who discovered a new feature of how mothers shift their voices when they speak to infants. In a paper published today in the journal Current Biology, the team reports that mothers across 10 different languages shift the timbre of their voice in similar ways when talking to their babies. This finding will help researchers understand what kind of speech keeps a baby’s attention and helps her learn.

Timbre is the flavor of music and speech. It’s not a distinct pitch or loudness, but rather the unique collection of frequencies produced by a person or instrument. Timbre is what makes sound distinct: It’s why you can tell a violin from a guitar even if they are playing the same note, or Bob Dylan from Jimi Hendrix even if they are both singing “All Along the Watchtower”.

Timbre is tied to the physical structure of the object producing the sound. Certain tones resonate more fully on a violin than on a guitar, and that resonance allows overtones to color the sound in different ways. Each person’s voice box is also an instrument with a unique timbre, though it is malleable and can shift slightly.
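To make that idea concrete, here is a toy sketch (not from the paper; the harmonic "recipes" are invented) that synthesizes two tones with the exact same pitch but different overtone strengths. They are the same note, yet the waveforms, and the resulting sound, differ: that difference is timbre.

```python
import numpy as np

SR = 16_000                      # sample rate (Hz)
F0 = 220.0                       # same fundamental pitch for both "instruments"
t = np.arange(SR) / SR           # one second of audio

def tone(harmonic_weights):
    """Sum harmonics of F0 with the given amplitude weights."""
    wave = sum(w * np.sin(2 * np.pi * F0 * (k + 1) * t)
               for k, w in enumerate(harmonic_weights))
    return wave / np.max(np.abs(wave))   # normalize to [-1, 1]

# Two made-up harmonic recipes: same pitch, different timbre.
bright = tone([1.0, 0.8, 0.6, 0.5, 0.4])   # strong overtones, violin-like
mellow = tone([1.0, 0.2, 0.05])            # mostly the fundamental

# Both spectra peak at the same fundamental frequency (the "note"),
# even though the waveforms themselves differ.
def peak_hz(x):
    return np.argmax(np.abs(np.fft.rfft(x))) * SR / len(x)

print(peak_hz(bright), peak_hz(mellow))   # both 220.0
```

Played through a speaker, both tones register as the same A below middle C; only their color differs.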

WATCH: You can see the different resonances due to the shape of objects in this classic experiment called a Chladni plate (mind your ears!).

To imitate the distinct, nasally voice of Donald Duck, says lead author Dr. Elise Piazza, “I might draw back my lips and tighten the back of my throat to create a different tone color.”

It is known that mothers in many languages raise their pitch, slow down their speech and repeat phrases more often when they are trying to attract a baby’s attention. This is known as infant-directed speech, and Piazza and her colleagues wondered if it might cause shifts in timbre as well.

To test this, the team collected snippets of adult-directed and infant-directed speech from 24 mothers as they either talked to an adult interviewer or interacted with their baby. They chose only mothers in order to minimize the range of audio frequencies they had to deal with (though the team believes the results extend to fathers as well).

“We usually Skype with my parents,” was one phrase spoken to an adult interviewer, while another phrase spoken to an infant was, “Let’s not eat the kitty cat.” You can almost hear the difference just by reading those quips.

LISTEN: Phrases of adult-directed and infant-directed speech from a participant of the study. Credit: Piazza et al.

In fact, the supplemental information section of this paper, which features a compendium of these phrases, is probably the most adorable one ever put together for a scientific paper. Some phrases even rhyme, which makes the whole thing look like a poem.

A sample of infant-directed utterances from Piazza’s paper. Credit: Current Biology.

To quantify the change in timbre, Piazza and her fellow researchers converted the recorded sound into spectrograms, which show the strength of audio frequencies over time. These spectrograms were then fed into a statistical analysis that produces something called mel-frequency cepstral coefficients (MFCCs).
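A spectrogram can be computed by sliding a short window along the audio and taking the frequency content of each slice. Below is a minimal numpy sketch (not the study's analysis pipeline; the test signal and window sizes are illustrative) that builds a magnitude spectrogram and checks that it tracks a tone changing over time.

```python
import numpy as np

def spectrogram(signal, win=256, hop=128):
    """Magnitude spectrogram: frequency strength over time (freq bins x frames)."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T   # rows: freq bins, cols: time

# Synthetic test signal: a 300 Hz tone followed by a 900 Hz tone.
sr = 8000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.sin(2 * np.pi * 300 * t),
                         np.sin(2 * np.pi * 900 * t)])

S = spectrogram(signal)
freqs = np.fft.rfftfreq(256, 1 / sr)
# The dominant frequency jumps when the tone changes.
print(freqs[S[:, 0].argmax()], freqs[S[:, -1].argmax()])
```

Each column of `S` is a snapshot of which frequencies are strong at that moment, which is exactly what the pictured spectrogram of “nineteenth century” displays.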

MFCCs tease out the strength of audio frequencies while taking into account how humans hear sounds. For example, our ears cannot distinguish frequencies that are very close together, so we perceive them as one tone. We also perceive loudness unevenly: high-pitched and low-pitched tones sound quieter than middle pitches played at the same strength.

When used on a voice, MFCCs are like a vocal fingerprint, since they reveal how a voice, especially the makeup of its frequencies, its timbre, is perceived by others. No wonder MFCCs are considered state of the art for voice recognition software.
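For the curious, here is a compact, simplified MFCC computation in numpy (a textbook-style sketch, not the authors' code; filter counts and frame sizes are conventional defaults, not values from the paper). It warps frequencies onto the mel scale, which mimics the ear's blurring of nearby frequencies, before extracting the coefficients.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1 kHz, logarithmic above,
    # mirroring how close frequencies blur together for the ear.
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(power_frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs for one power-spectrum frame (a single spectrogram column)."""
    n_fft = (len(power_frame) - 1) * 2
    # Triangular filters spaced evenly on the mel scale, not in raw Hz.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power_frame)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energy = np.log(fbank @ power_frame + 1e-10)
    # DCT-II decorrelates the filter outputs; keep only the first few
    # coefficients, which summarize the broad spectral shape (the timbre).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energy

# Example: coefficients for one windowed frame of a 300 Hz tone.
sr = 8000
frame = np.sin(2 * np.pi * 300 * np.arange(512) / sr) * np.hanning(512)
coeffs = mfcc(np.abs(np.fft.rfft(frame)) ** 2, sr)
print(coeffs.shape)   # (13,)
```

Each frame of speech is reduced to a short vector of coefficients, and it is vectors like these that capture the "fingerprint" of a voice.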

An example of a spectrogram. Here, a male voice is saying the phrase “nineteenth century”. Credit: Wikimedia.

Comparing the MFCCs of infant-directed phrases to those of adult-directed phrases, the researchers found a shift in timbre across ten different languages. Piazza says it’s tough to characterize, but “it likely combines several features, such as brightness, breathiness, purity, or nasality.”

Using this data, the team wrote a machine learning algorithm and trained it to use timbre to classify whether a particular phrase was infant-directed or adult-directed.
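The paper's exact classifier isn't described here, so as a stand-in, the sketch below trains a simple nearest-centroid rule on synthetic MFCC-like feature vectors (the data, cluster positions, and feature dimension are all invented for illustration): each phrase is labeled by whichever class average its timbre features sit closer to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: each row plays the role of an MFCC summary
# for one phrase. (The real study used mothers' recordings.)
adult  = rng.normal(loc=0.0, scale=1.0, size=(50, 13))
infant = rng.normal(loc=1.5, scale=1.0, size=(50, 13))

X = np.vstack([adult, infant])
y = np.array([0] * 50 + [1] * 50)   # 0 = adult-directed, 1 = infant-directed

# Nearest-centroid classifier: one mean feature vector per class.
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(features):
    return int(np.argmin(np.linalg.norm(centroids - features, axis=1)))

preds = np.array([classify(x) for x in X])
print("training accuracy:", (preds == y).mean())
```

The cross-language result in the article corresponds to computing the centroids on one language's phrases and then classifying phrases from another language with them.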

“We were most surprised that this timbre shift between adult-directed and infant-directed speech exhibited such a consistent pattern across such diverse languages,” Piazza said. “In addition to English, we included Spanish, Russian, Polish, Hungarian, German, French, Hebrew, Mandarin and Cantonese.”

This consistent pattern across languages was picked up by their algorithm even when the training data set contained only English phrases. The reverse was also true: when they trained the algorithm on other languages, it was able to use timbre shifts to classify speech in English. This means the shifts in timbre are likely present in many, if not most, languages.

“That classifier is really effective,” said Dr. Anne-Michelle Tessier, a linguist with the University of Michigan’s Center for Human Growth and Development who was not involved with the study. “Are humans quite so effective? I don’t know.” The machine might be better at finding these patterns in speech, a capability that could be useful for other studies, Tessier said, but it is hard to tell whether infants pick up on these patterns as well as the algorithm does.

A mother and child play at the Princeton Baby Lab. Credit: Sameer Khan (Fotobuddy Photography).

Piazza thinks it is likely that we are. “Previous studies have shown that babies can perceive timbre differences between musical instruments,” she said. “Future work will be needed to determine exactly how babies perceive and use this information, and whether babies can pick up on this shift even in foreign languages.”

Tessier agreed that this research is clear indication that our brain is intimately tuned to discerning language, even from an early age: “Babies are really focused on attending to speech around them, and noticing and storing patterns and distributions in that speech.”

So, is there something deep inside of us, something we are perhaps born with, that helps us focus on and pick up languages even as babies? Are we innately able to latch onto the grammatical interplay of the sounds we humans use to communicate?

This research feels like it is pointing in that direction, but both Piazza and Tessier were reluctant to discuss the idea. They said the research only looked at how mothers produce the sounds and not at how the babies perceived them, meaning we shouldn’t draw conclusions about some sort of universal underlying language structure.

Fair enough. It sure seems relevant, though. This sort of research would help with the fierce ongoing debate over whether something like that, often called “universal grammar”, exists. Noam Chomsky invented the idea as a graduate student in the 1950s and turned the field of linguistics upside down. Chomsky re-laid the foundation of an entire field of science as a twenty-something; his theory of universal grammar was exalted almost to the status of natural law, and he was enshrined as one of the most brilliant intellectuals of contemporary times.

Recently, however, there’s been a backlash against the universal grammar theory, especially after intrepid linguists such as Daniel Everett embedded themselves in hard-to-reach native communities in the Amazon jungle and discovered that their communication doesn’t follow Chomsky’s rules. For an amazing summary of the situation, read this fantastic article by Tom Wolfe in Harper’s Magazine.

The war for the very foundation of linguistics is being fought right now, and picking your battles is important. The scientists we talked to chose to sidestep the issue and focus on their interesting findings about timbre.

While the researchers intend to continue exploring this newfound phenomenon, Piazza thinks the find might prove useful for educational purposes. She envisions “having virtual teachers or cartoon characters imitate infant-directed timbre to optimally engage with babies.”

“Our work also invites future explorations of how speakers adjust their timbre to accommodate a wide variety of audiences, such as superiors, political constituents, students, and romantic partners,” Piazza said.

So who’s the good reader of our site? You’re a good reader!! Yes you are!

A version of this story appeared on the PBS NewsHour website. Banner image credit: Elise Piazza.

