Humans are emotionally complex. Not just in how we feel, but in how we convey those feelings. What you say and what you do communicate only a fraction of your emotions. Thousands of other cues in body language, facial expressions, and especially in voice can paint a vivid picture of how you feel at any particular moment.
This is an important area of emerging research, especially as AI systems are programmed to start recognizing human emotion and responding in kind. A big part of that is recognizing where and how humans can “hide” or skew their emotions. In particular, we now know that voice is much more difficult to fake than facial expressions, and in general, both are difficult to fake for long periods of time. It takes a conscious, concerted effort to change your voice and physical expressions so that they don’t match how you actually feel. When it comes to voice, many of these nuances are noticeable in conversation, however subtle they might be.
The Ability of Humans to Capture Subtle Nuance in Speech
The same innocuous phrase can be expressed and perceived in dozens of different ways. A spouse asking, “Did you remember the milk?” could be posing a simple question without any emotion – a tick on a box in a mental to-do list – or it could be judgmental, voicing disapproval at perceived incompetence. Or anything in between.
Humans can capture these small nuances in voice, and in many cases intentionally encode them. The research around this is constantly developing. Until recently, facial expressions and body language were identified as the most important elements of non-linguistic communication – the nonverbal cues that could define intent in a conversation.
But a recent study by Michael Kraus (among others) identifies our sense of hearing as being more acute at detecting emotion in a conversation. His study showed that people identified emotions more accurately when they only heard a voice, not just compared with only seeing facial expressions, but also compared with both hearing the voice and seeing the face. When isolated, a voice is loaded with information that the human ear is particularly good at understanding.
Consider some of the simple cues that can convey a spectrum of emotion in a conversation. Quick breathing, clipped words, and excessive pauses might point to anxiety or being upset. A slow, monotone voice or a quieter tone than normal could point to exhaustion or illness. Faster, slightly louder speech could point to excitement.
It goes further than just dissecting and understanding someone’s base emotional state from their speech, though. As discussed in a recent article from the Berkeley Greater Good Science Center, research by Emiliana Simon-Thomas and Dacher Keltner showed that humans can capture small nuances in speech, distinguishing between sadness, anger, repulsion, and exhaustion, for example. And many of these cues are language-independent. People are able to determine emotional state even when they are not fluent in the language being spoken.
Empathy in an Increasingly Digital World
Until recently, technology had allowed us to largely abandon face-to-face conversations in our daily lives. People spend an average of more than two hours per day on social media, use email far more than their phones, and have shifted even short, quick in-office conversations to communal platforms like Slack.
When so much of how we interact is tied up in our connection through speech, what impact does this digital transition have on our ability to feel empathy and truly connect with one another? Research on this is still developing, but based on what we are learning about the role of vocal cues in understanding emotions and connecting with one another on a very basic human level, a lot is getting lost in text messages and emails.
Emotions are driven by our speech and our actions, more so than we previously realized. A recent study by Jean-Julien Aucouturier at CNRS in France asked people to read and record a short, innocuous story. Their voices were then altered, and when the recordings were played back, many participants felt different based on what they heard. If their voice was sped up and the pitch raised, they felt more excited. If it was slowed down with pauses added, they felt a bit more tired or unsure of themselves. It’s an interesting experiment that highlights the deep emotional impact our voices have on us. So, when we don’t use our voices to communicate, it raises several questions about how effectively we are connecting with others.
What This Means for Voice Assistants and AI Technology
More than 1 billion smart devices now have some form of voice assistant – from dedicated voice-activated devices at home to smartphones and tablets with Siri or Google voice tools. These are the tip of the iceberg in how voice AI is being integrated into our lives, and a big part of how effective these tools will be is their ability to recognize emotion in the human voice.
The words in a conversation are the tip of an iceberg – only a small percentage of the conversation we are really having. That’s where Emotion AI can play an outsized role, bridging the gap between what can be perceived and what is just under the surface in a conversation. An AI system can be programmed to automatically gather and analyze information encoded in the human voice. And increasingly, this goes beyond objective behavioral content such as what someone said or what they were doing when they said it. Today’s Emotion AI can automate the detection of a host of subjective signals – recognizing not just that someone is frustrated, but how frustrated they are on a spectrum drawn from millions of data points in conversations it has analyzed.
While humans detect these cues automatically, we are not always accurate in decoding their meaning. At the same time, we are notoriously bad at recognizing such signals in our own voices. With the use of Emotion AI, we can improve the performance of customer service and sales teams, build more responsive personal voice assistants, and leverage unstructured data in new and exciting ways across all industries. This is all possible with the advances in technology around voice AI.