Gartner estimated, prematurely as it turned out, that by the end of 2019, 20% of all user interactions with smartphones would be through voice user interfaces (VUIs) like Siri. We’re not there yet, but we are getting closer as mobile voice assistants improve. Across a diverse range of industries and in consumer homes alike, the VUI is becoming a ubiquitous tool for getting quick access to vital information. But for voice to truly replace the traditional visual interface, it first needs to become smarter. Emotion AI is being developed to this end, helping to identify not only the content of speech but also its context and subtext.
But there is a problem with the underlying technology being used to inform emotion AI in voice. These tools are designed by humans, who use the concepts and impressions that are available to them. In this case, that means a narrow understanding of emotions based on decades-old research. Originally presented by Paul Ekman in the 1960s, the six basic human emotions are described as anger, fear, disgust, sadness, happiness, and surprise. This specific understanding of emotion has been used in Western psychology, research studies, and now technology projects as a baseline against which emotions are quantified and measured.
Ekman’s discoveries have been the driving force behind emotion detection in law enforcement, research, and technology for the better part of 50 years. The problem, however, is that these findings (which were previously thought to be nearly universal) have never before been tested at the kind of scale that voice assistants have put them through. With hundreds of millions of devices now in hands around the globe, cracks are starting to show in technology built on these six core emotions.
Emotion Is More Complex Than We Understand It to Be
While there are certainly commonalities in human emotion, and Ekman’s work has been instrumental in many developments over the last half-century, there are also nuances that were not previously measured. Take, for instance, the fact that emotions are consistently defined in different ways by different psychologists. Even the practice of calling them “emotions” is relatively new.
Another major issue is that, while Ekman went out of his way to test his six hypothesized emotions on individuals who had never been exposed to Western media, it’s now unclear whether that condition was actually met. The tribe he met with had previously encountered missionaries, and the quality of the translation may not have been validated.
Finally, there is the issue of nuance. In most emotion studies, exaggerated facial expressions and vocal patterns are used. If someone scowls and barks at the test subject, the subject is very likely to recognize that individual’s anger. But what about subtler, quieter experiences that may not be immediately evident as anger? Is there not a spectrum between anger and sadness? What does it look like when someone is angry…at themselves?
While Ekman’s research would indicate that the core trigger for these emotions is the same across all of these situations, we’re starting to find that six basic archetypes don’t hold up to scrutiny when applied to a sample size of millions across dozens of cultures and languages. There are far more nuances than the model accounts for.
How Language, Culture, and Dialect Reflect Emotion
Humans don’t share any one universal language. There are certain shared emotions that we all feel, hard-coded into our DNA as a species, but many others are shaped by millennia of cultural, geographic, and linguistic influence. The very nature of a language can change how someone expresses how they feel, which in turn impacts the root of that feeling.
This leads us to the question of context. For Emotion AI to work it needs to understand not just what someone says but why they said it and the situation in which they used a certain tone of voice. Is someone actually angry or did they just get out of the car after driving through heavy traffic? What impact does a single unpleasant experience have on how emotion is perceived by an Emotion AI engine?
At a macro level, what impact does the language being evaluated have on the perception of emotion? Emotion recognition technology is, in theory, language agnostic; by default, it does not analyze the words or language being used. It looks for intonation and vocal cues to capture and measure emotions within the context of a situation.
But different languages and different cultures use different words to convey different meanings. Some languages are louder, with meanings that shift based not just on syllables but on the tone in which words are spoken. Others are layered with meaning based on the culture, even if the language is similar. For an Emotion AI engine to accurately analyze the same sentence spoken by someone in English versus someone speaking in Chinese, for example, thousands of data points related to both language and cultural cues need to be taken into consideration.
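To make the distinction between measuring vocal cues and interpreting them concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular Emotion AI engine works: the function names are hypothetical, a synthetic tone stands in for recorded speech, and the baseline numbers are invented placeholders rather than measured values. It shows that loudness and pitch can be computed without looking at any words, while the meaning of a given pitch depends on the baseline it is compared against.

```python
import math

RATE = 16000  # samples per second (an assumed sampling rate)

def sine_wave(freq_hz, duration_s=0.5, amplitude=0.5):
    """Synthetic tone used as a stand-in for a voiced speech segment."""
    n = int(duration_s * RATE)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n)]

def rms_energy(samples):
    """Root-mean-square energy, a rough proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples):
    """Crude pitch estimate: a periodic signal crosses zero twice per cycle."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings * RATE / (2 * len(samples))

def z_score(value, baseline_mean, baseline_std):
    """How unusual a value is relative to a given baseline."""
    return (value - baseline_mean) / baseline_std

segment = sine_wave(220.0)
pitch = estimate_pitch_hz(segment)   # close to 220 Hz for this clean tone
loudness = rms_energy(segment)       # close to 0.5 / sqrt(2)

# Hypothetical baselines (mean pitch, standard deviation) for two groups
# of speakers. These numbers are placeholders, not real measurements.
baseline_a = (180.0, 20.0)
baseline_b = (220.0, 30.0)

print("pitch (Hz):", round(pitch, 1), "loudness (RMS):", round(loudness, 3))
print("relative to baseline A:", round(z_score(pitch, *baseline_a), 2))
print("relative to baseline B:", round(z_score(pitch, *baseline_b), 2))
```

The same measured pitch sits well above the first baseline and almost exactly on the second, so a system that treats raised pitch as a sign of agitation would reach different conclusions depending on which baseline it assumes. Real speech would need a far more robust pitch tracker and many more features than this.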
How This Applies to the Use of AI
While this throws a wrench in some of the promise of voice user interface “intelligence” and its ability to fully understand what someone is saying and their state of mind when saying it, we still aren’t that far off. The current platforms provide a baseline against which to measure emotion. The next step is applying layers of context that account for factors outside of user input.
There’s a definite line right now at which the AI breaks down sharply. Studies have shown how consumer virtual assistants can fall short because they lack the ability to go beyond pre-programmed response routing. They aren’t yet able to delve into the context of a situation, recognize the emotion in a user’s voice, and respond in a way that is helpful to that degree. Sentiment analysis is being developed to address this, taking into account language, dialect, slang, sarcasm, abbreviations and acronyms, and much more to better understand how people speak and emote.
As the technology advances and as Emotion AI becomes more nuanced in its ability to evaluate the meaning behind user input, VUI technology will become more robust and capable of parsing human emotion in a meaningful way.