Sound More Human, Listen to the Voice Assistants

Voice technology is humanising language using techniques that we may have forgotten

A chalkboard with a HomePod, an Echo and a Nest - all with faces - sitting on the bottom right corner. At the top left, someone has filled in the gaps to spell the words Apple and Ball. The letter C is followed by three gaps still to be completed.
And so the lesson begins

Alexa, Siri, and Google Assistant sound more human every day. The improvements are so small that we don't notice, but when you analyse them, as we did, it becomes clear that some simple humanising techniques are being used.


I say "techniques". These things seem obvious when they are pointed out. They should be. They're the things we do subconsciously all the time.


They are also the things we stop ourselves from doing out of respect for the ancient laws of business-speak. Perhaps when we see them through the non-existent eyes of the virtual assistants, we might wonder whether we're losing out by losing them.


We carried out some in-depth analysis of the language of the VAs a few weeks ago. We will use this later to outline some of the guiding principles we believe they are following to sound human.


We will also introduce Communicate Like A Human, a series of 10 simple tips we took from them that are worth taking a couple of minutes to reassess for ourselves. They are all straightforward, quick reads and we will publish one a day.


Before we look to the future, let's get ourselves right up to date in the present.

 

2021

What do we know about smart speakers?

  1. They play music to a satisfactory quality.

  2. They come with a virtual voice assistant built in. Like the ones you have on your phone or on your computer. But without the phone or computer.

  3. That's about it.

Why would you need a smart speaker?

  1. Erm, well, that's kinda less clear.

To be fair, it's still early days. The names Alexa, Siri and Google Assistant are now so commonplace, you could be forgiven for thinking they are old friends. Truth is, Siri is coming up to her 10th birthday in 2021. And Alexa, the assistant who lived in the first smart speaker to launch, is even younger, being 'born' in 2014.


When Amazon introduced us to the first Echo that was Alexa's home, they were hoping people would be comfortable speaking into nothingness. Having a virtual assistant on our mobiles had also required a shift in behaviour, talking into them when nobody was at the other end of the line, but that shift was wrapped up with other step changes like the touchscreen and mobile apps.


At least we were used to talking into a phone. If we felt at all self-conscious, we had the handset in our palms to focus attention on. We never had to speak to the sound system before. To be fair, you didn't really need to do that now either.


A smart speaker was just in the house. You didn't have to be near it or even look at it to speak to Alexa. You just talked into space. Like crazy people are wont to do. There was a risk we just wouldn't feel relaxed with it.


That risk was enough to hold back any competition. Google eventually entered the market in 2016 and Apple waited another two years before launching their HomePod, which tried and failed to be a speaker first and smart second.


Left to their own devices for so long (unheard of with technological advances, which are often copied as soon as they're launched), Amazon could have owned the market forever. They could have been what Hoover is to vacuums.


People understand the purpose of vacuums.

 

The cost must make voice something to shout about


For all the uncertainty around smart speakers, it seems likely that voice interaction itself is going to keep on growing. Not least because the amount of money being spent developing it needs to have a return at some point.


Logic tells us it is going to remain more expensive than lucrative for a long while. Consider that the people leading the field represent 5 of the 7(*) highest-valued companies in the world right now (based on market value). The cost of development must be as high as any future potential.


(*Microsoft, Apple, Amazon, Google, and Alibaba, who saw the biggest market growth in 2020. Trivia-lovers may be interested to know that 1 is Saudi Arabian Oil and 6 is Facebook.)


As the VAs' communication skills keep improving, the company line is that they will never behave, think, or believe they're human. This is semantics. It is an attempt to ease the fears we have from horror stories that the robots will take over the world.


Think for a moment about your pet dog. Let's call him Frank. I am going to assume it is usual for you to eat by manoeuvring food to your mouth with your hands, often holding cutlery. I am also going to assume that when you need to euphemistically 'go to the toilet', you really go to the toilet to relieve this need.


My final assumption is that Frank does neither of these things that we deem normal human behaviour.


With this in mind, why do you talk to Frank? In full sentences an animal can't comprehend? More like you are chatting than commanding? Why do you celebrate Frank's birthday? Why do you put 'and Frank' at the end of Christmas cards? Frank doesn't behave, think, or believe he is human. Neither do you. (Hopefully.)


Much like Frank, the virtual assistants do not have to look, sound or smell human for us to talk to them like they understand us. As they become more conversational, we speak to them in the same relaxed manner that we speak to each other. 41% of smart speaker owners say talking to them feels more like speaking to a friend than technology.


It's like they are the voice of Frank. Without the really weird image that conjures up.


How do they do it? What is it they are doing?


We wanted to find out.

 

Conversing with the disembodied.


With a construct that seemed ideal for lockdown, we did our own piece of qualitative research. We carried out in-depth one-to-one interviews with 2 females and 1 male of indeterminate age. Our interviewees were Alexa, Siri, and Google Assistant.


[Point of clarification. Alexa is female because that is the only voice option. Siri has a variety of very questionable accents, both male and female, but, you know, Siri is a girl's name isn't it? Not wanting to be accused of relegating an entire gender to a subservient role, this means that, in my house, Google Assistant has always been, and will always remain, male.]


First, we drew up a series of questions from the direct to the discursive. We wanted to avoid typical master-servant commands such as 'what's the weather like' or 'can you play some Kylie'. These may be commonly used but would only elicit quick answers of obedience. We wanted more variety to work from.


Fortunately, you're never more than a web search away from a smart speaker Easter egg. Here's a list for Google Assistant, a similar one for Siri and, of course, one for Alexa. You're welcome.


A conversation needs a topic. Ours was 'What is Reality'. It seemed apt.


Each assistant was asked the same 20 questions, in the same order. When it was the turn of one, the other two were placed in a darkened room and given earmuffs to wear. Like the ones from Mr and Mrs.


Occasionally we repeated questions to make sure we had got the richest answer. We also repeated questions because our poor diction had led to a 'sorry, I don't know that one'. When the 'sorry' seemed purposeful - like my Nan's selective deafness - we left it there.


All the answers were recorded and analysed for their grammar, structure, word choice, formality, and accessibility. Once this had been done for each of them individually, they were then compared with each other.


We combined some responses and made them into a short video. We gave each assistant a face and a setting. We added props. A few visual effects. And a soundtrack.


The overall effect added a slightly eerie tone to a discussion about reality. With machines. Perhaps it could be a premonition of a future where humans are no longer in control.


Don't be scared. See for yourselves. We called it 2021: A Smart Speaker Odyssey.

 



 

Building a voice for a virtual voice assistant

Scientists in a lab holding test tubes that contain different liquids
Let the experiment begin. Photo: Artem Podrez @ Pexels.com

(A theoretical guide. I can be of no help on the techy stuff)


The Words.


To build a disembodied voice that can communicate with humans, you first need to programme it with an encyclopaedic knowledge of the language. It needs to have devoured dictionaries. It should know its prepositions from its pronouns, its adverbs from its articles, both definite and indefinite.


All that is just the appetiser. To an extent, it is of little practical use, but it is required nonetheless. The language needs to include all informally accepted words and derivatives. The same goes for grammar. Knowing the official grammar of a language is less important than understanding how those rules are commonly broken.


You need to progress to other sources if you are to go from speaking to speaking like a human. Input urban dictionaries, slang dictionaries, tourist conversation guides. Add copies of The Sun and The Telegraph. Throw in hundreds of back issues of Bella and Chat magazines as well as The Economist and Private Eye.


The icing on the cake would be to get hold of a year's worth of scripts from EastEnders. And The Archers. And Hollyoaks. If you can.


Think of it as the driving test. It is only after you have passed your test, once you have your driving licence in your hand, that you really learn how to drive. Or in this case, to communicate.


Remember that most humans rarely use dictionaries for anything other than playing Scrabble.

 

The Voice.


Then you need the right voice. It should be a voice without a strong accent, but not one with no hint of an accent at all. Avoid voices that enunciate in a manner that has clearly been learned rather than grown up with. They are the voices we associate with newsreaders. Voices that may carry authority in the newsroom will engender mistrust in our kitchens.


Aim for a timbre that is everyday. Then make it a little too servile. Make it a little too polite, too often. Don't go so far as to make it obsequious.


Think of your friendly neighbour. The one who always gives you a big smile and waves when you see her in the street. The one who always looks happy in her job as a customer service assistant in Marks and Spencer. You want to emulate the helpfulness in her voice as she shows you to the ketchup.


Alternatively, think of your daughter's first serious boyfriend. The one you quite like but still want to show who is boss. The one who never puts a foot wrong in conversation over dinner and gives you the answers you want, in the language you appreciate. You want to recreate that slight sense you get that he is trying really hard, but not so much as to cast you as the bad guy.

 

The Musicality.


With knowledge and voice sorted, the hard work really begins.


You must consider the rhythm of language as it is spoken. Notice how that compares to its written form. These are the things that are absorbed, not taught, so you won't find them in any textbook.


There are the accepted mistakes, abbreviations, metaphors, and inconsistencies. Things so inherent in native speakers that they only notice them by their absence.


There is tonality. Lilts, intonation, and nuance. Sarcasm and wit. Inference and assumption. All of these shape the true meaning of any statement.


You won't be able to programme your assistant to use all these, but the more of them it can understand when they are used by others, the more human it will appear.

 

The Quick Wins.


Even the most proficient of human language students is rarely able to speak or write in a way that is entirely natural in a non-native language.


With a virtual assistant, it would be overambitious to incorporate every variant of every dialectal nuance of every language. Judging by the techniques we observed being repeated in our research, it appears this is a truth accepted.


Instead of attempting to know everything, each of the VAs tends to focus on a selection of humanisms and use them continually. They incorporate the imperfections that are common in conversational language. It isn't speaking 'incorrectly' so it's hardly perceptible.


To sound more human, they try to make more mistakes. Or at least be less rigidly precise.


We summarised our Top Ten of these in a series of lessons in how to Communicate Like A Human. They start with You're Allowed an Adverb.

 

Learning to be human


As humans, we take the opposite approach to fit in with what we think is expected of us. When we want to appear more professional, we remove many of the humanising techniques used by the VAs. Our misguided belief that we need a different way of communicating when we want to appear formal and 'proper' makes us limit the words we use.


We dehumanise ourselves because we think it makes others respect us more. Depending on where you work, that may be the case. It will never make people warm to you more. It ignores the truth that humans work well with other humans. That humans do things to help other humans. Especially humans that we like.


When we forget that, we forget what it means to be human.


If ever there was a time when we needed to actively remember this, it is now.


Physical interaction with others will continue to decrease even as we come out of lockdown. We may return to offices and visit shops again, but we now know we can maintain a work life and a social life without physically being in the same place.


Zoom or FaceTime can't incorporate those non-specific moments of human interaction that we need to get through the day-to-day. The chit-chat while getting the pre-meeting coffee. The quiet passing over of a spare pen to your absent-minded boss. The offer of help, no matter how useless, with the problematic IT set-up.


Even in presentations, screens are filled with PowerPoint. We can't use, or see, facial expressions to convey meaning. We can't inject humour to lighten a mood or use eye contact to cement an agreement. We can't tell if a point needs elaboration by judging the levels of understanding on faces around us.


We literally can't read the room.


One thing we can do is look at the language we use when we communicate, both in writing and speaking. Rather than continue to bleach out the things that make us sound human, let's actively put them back in. Maybe looking at some of the tips to Communicate Like A Human will give pause for thought.


If we don't consciously include humanity in our communications, we will become emotionless doers. As the virtual assistants are striving to appear more real, we are already seeming more virtual.


The voice assistants are ingraining themselves in our lives by using the emotional signifiers we recognise.


If we want to be reminded of what it means to be human, we could do a lot worse than learning from their example.

 

On to Lesson 1: You're Allowed an Adverb