A View From AVIOS
The Future of Intelligent Voice Assistants
By Phil Shinn
In the future, will you use one intelligent voice assistant (IVA) to mediate and connect you to the world, or will you prefer to use direct connections to various organizations’ agents? In a recent virtual panel discussion, AVIOS members debated the pros and cons of these different models and likely future developments.
One IVA as Majordomo
Here are the arguments in favor of the single IVA theory:
• Some very big companies are investing a lot to make this model happen.
• From the user’s perspective, one IVA is easier to learn than many. Consistency is simplicity.
• To be effective, an IVA majordomo needs significant personal information that would be difficult to transmit to every one of its minion assistants. Think email, contacts, shopping lists.
• Because personal information is in one place, it is safer.
• Authentication is much easier when it is simply one IVA rather than many.
• Having an assistant that works for you and not the enterprise removes conflicts of interest.
• In the future we will have teachable intelligent assistants who will learn about you, your history, your details, your relationships with other services, your preferences, even perhaps your goals.
And here are the mitigating arguments:
• The flip side of item No. 4 above: Because personal information is in one place, it could be less safe.
• Having a personal assistant that works for a global tech giant creates a conflict of interest.
• Skill discovery could be a challenge: “Alexa, tell Jeep to start my Jeep.”
• General IVAs do not offer consistent conversations across channels because the tech stack (automatic speech recognition [ASR], natural language understanding [NLU], dialogue management, text-to-speech [TTS]) differs, for example, between iOS, Android, POTS, and chat/SMS.
• Enterprises can’t control the brand. The branding is the general assistant.
• How do you transfer to a live person?
The Many-Minions Model
Again, let’s start with the pros:
• The enterprise can control branding and the persona.
• The costs of using the generic IVA and its cloud back end are unpredictable, and if it’s down, it’s down.
• From the user’s perspective, why would you trust a global tech giant any more than a global financial institution?
• The enterprise likely understands its own business better than the global tech giant does and is able to provide you with unique services and deals that the majordomo wouldn’t even know about.
• If you have a couple of different agents, you have some backup. Maybe one can find you a better deal than another. You have competition.
• Enterprise assistants offer a consistent omnichannel conversational experience because of a unified technology stack (i.e., you can use the same ASR across iOS, Android, and the phone, as well as the same NLU for these, plus chat).
• Reuse of these assets across channels saves investment by the enterprise.
• This model supports human-assisted understanding.
• Specialization of functions is complex and better suited to this model, as enterprise assistants can be optimized for specific domains (think healthcare).
• This model allows for consistent branding across channels.
And here are the cons:
• Enterprises would need to build support for multiple channels/agents.
• Cognitive load: Do I need to learn their names? Also, adding more apps doesn’t scale.
• If I go from one agent to another, they live in parallel but separate universes and have no idea what happened with the other one. They don’t talk to each other. If you’re stupid enough to rent cars from two different agents for the same trip, there’s no one to alert you.
Six of One…
Asking which model is better is like asking which car is better, a Honda or a Beemer: It depends. For whom? For what? It’s similar to the question of whether an independent insurance agent is better than a direct agent (one who works for only one insurance company). Or how about a financial adviser who works for a brokerage, sells you only that brokerage’s products, and is paid by the brokerage, versus a wealth manager who takes a cut from your account for their unbiased advice? For some enterprise tasks, say package tracking, using a general assistant is easy. For a complex task, like healthcare, you may want to see a specialist.
And the majordomo versus many-minions dichotomy is also false, because I might use Siri on my phone, Alexa in the kitchen, and Google in the car. So there are at least three personal agents working for me.
Nevertheless, it’s worthwhile to think through these things and educate yourself about what’s best for your situation, either as an individual or as an enterprise.
Phil Shinn is chief technology officer of ImmunityHealth and principal of the IVR Design Group.
Bots & Assistants Conference Summary: Major Trends and Insights
By William Meisel
At AVIOS’s Bots & Assistants Conference in November, I led a panel discussion that got at a major issue with conversational systems: Though the accuracy of speech recognition and natural language understanding will continue to advance rapidly (thanks, in part, to the increasing amount of data available to drive machine learning), “answer technology”—which governs the ability of digital assistants to answer user questions—remains a big challenge.
The panelists—Jeff Blankenburg, principal technical evangelist for Amazon Alexa, and David Nahamoo, chief technology officer at Pryon—noted that breakthroughs with conversational technology are happening all the time, including the ability to personalize digital assistants so that they retain information about specific users and past conversations (though this could raise privacy concerns, like with healthcare data).
But both cautioned that there’s a bottleneck when it comes to providing automated answers to user questions. Answer technology is the part of a conversational system that connects to knowledge sources to address the intent of users. Today, digital assistants with a visual interface default to a web search when they don’t have a direct answer, providing a list of websites that might contain it. The goal of answer technology should be to provide a direct answer or perhaps ask a clarifying question that enables it to provide a direct answer.
Companies are motivated to provide customers with automated answers to reduce costs and provide faster customer service. Nahamoo also noted that employees are often overwhelmed by the amount of knowledge they need to do their jobs and need a quick way to discover that knowledge.
Today, the major type of user request that can yield direct answers is the “frequently asked question” (FAQ). These, Blankenburg indicated, are often addressed by decision trees generated by humans, with the trees driven by keywords in the request. Decision trees can also be generated by machine learning rather than by humans, but this requires a large database of labeled data, with requests for specific answers phrased in many different ways, which in practice largely limits the approach to FAQs.
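The keyword-driven FAQ routing described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual system; the questions, keywords, and canned answers are invented:

```python
# Minimal sketch of keyword-driven FAQ routing: match the user's request
# against per-answer keyword sets and return the best-overlapping answer.
# All rules below are invented for illustration.

FAQ_RULES = [
    ({"reset", "password"}, "To reset your password, visit the account page."),
    ({"hours", "open"}, "We are open 9 a.m. to 5 p.m., Monday through Friday."),
    ({"refund", "return"}, "Refunds are processed within 5 business days."),
]

def answer(request: str) -> str:
    words = set(request.lower().split())
    # Pick the rule whose keywords overlap the request the most.
    best_response, best_overlap = None, 0
    for keywords, response in FAQ_RULES:
        overlap = len(keywords & words)
        if overlap > best_overlap:
            best_response, best_overlap = response, overlap
    # No keyword hit: fall back to search, as today's assistants do.
    return best_response or "Sorry, I don't have a direct answer. Here are some search results."

print(answer("how do I reset my password"))
```

Real deployments replace the keyword overlap with a trained classifier, but the shape of the decision (match a request to one of a fixed set of answers, else fall back to search) is the same.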
And the panelists noted a further difficulty: Much of the source data that contains answers is unstructured text, often distributed across websites, reports, books, magazines, newsletters, and other highly variable formats. Simply searching such documents for keywords doesn’t sufficiently distinguish appropriate answers.
As conversational technology grows as a popular alternative to manual web searches, it will motivate the assembling of source data in a form that makes it easier to find answers. One approach we sometimes see, which can at least narrow the set of documents searched, is a list of keywords and phrases that identify the main context of the article.
Another approach, which allows drilling down further into a document determined to be relevant, is to use informative headings and subheadings for sections of unstructured documents, headings that, in effect, act as labels for potential answers. An automated system could search headings first, with the knowledge that these are more than simply words in the document; they are indicators of major content.
Such headings would have to be identified in the text as such. Microsoft Word has this kind of labeling as an option with its “Style” feature, where you can choose a format for headings and subheadings. These labels are searchable, as evidenced by Word’s ability to use them to create a table of contents or an outline.
A similar feature in other text formats, including Acrobat PDF files, could make this a more universal option. Perhaps a standard method of denoting headings could be formalized.
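The heading-first retrieval idea above can be sketched simply: search the labeled headings before the body text, treating a heading match as a stronger signal. The document contents here are invented for illustration:

```python
# Sketch of heading-first retrieval: headings act as labels for potential
# answers, so they are searched before body text. The documents are invented.

docs = [
    {"heading": "Resetting Your Password", "body": "Go to Settings and choose Security"},
    {"heading": "Billing and Invoices", "body": "Invoices are emailed monthly"},
]

def find_section(query: str):
    terms = set(query.lower().split())
    # Pass 1: match against headings only, since they signal major content.
    for doc in docs:
        if terms & set(doc["heading"].lower().split()):
            return doc
    # Pass 2: fall back to scanning body text.
    for doc in docs:
        if terms & set(doc["body"].lower().split()):
            return doc
    return None

match = find_section("password reset")
```

A production system would score and rank rather than return the first hit, but the two-pass structure (labels first, full text second) is the point.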
Answer technology could be the missing link to making conversational technology a universal approach to delivering knowledge. But to be fully effective, it might require a cooperative effort between those developing the digital assistant technology and those providing the sources of answers.
The final talk at the conference, given by William Meisel, executive director of AVIOS, summarized the conference content.
Google’s Duplex Lets a Bot Be Your Voice
Thanks to the Duplex technology, Google Assistant can make simple calls on your behalf. But how much automation is too much?
By Phil Shinn
This summer Google rolled out Duplex, its technology for conducting two-way real-world conversations over the phone, with Google Assistant holding up your end of a call. It was demoed at the Google I/O developer conference by Google’s CEO, Sundar Pichai, who noted that 60% of small businesses don’t have automated reservation systems. So if you need to make a haircut appointment or get your car fixed or book a table at a restaurant, more often than not you’ll have to call. The star of the demo was Google Assistant, which will make that call for you. Pichai played two recordings: a call to a hair salon to book an appointment, and a call to a restaurant to reserve a table.
Initial press was positive, but some critics focused on the two recordings, which were not live, insinuating that they were staged. There was a dial tone in the audio, but the human answerers didn’t say the name of their business, there was no ambient background noise, and neither the salon nor the restaurant asked for the caller’s phone number or other contact info.
You don’t need to be in the speech space long to learn that you might want to avoid antagonizing the live demo gods, so I for one have no problem with recordings. And one could presume the business names were edited for privacy. But what if the demo was scripted and/or edited? Why, I am sure this has never, ever been done before!
What bothered other folks was that the system did not disclose its lack of humanity. In the demo recordings, there was no “I’m a bot,” no earcon, no asking for DTMF input. Critics pointed to clever features the designers threw in—like ums and ers and other conversational dialogue markers—as “deception by design.”
Here was Google’s response: “We understand and value the discussion around Google Duplex—as we've said from the beginning, transparency in the technology is important. We are designing this feature with disclosure built-in, and we’ll make sure the system is appropriately identified. What we showed at I/O was an early technology demo, and we look forward to incorporating feedback as we develop this into a product.”
VUI designers have been wrestling with agent transparency for a long time. There’s actually a law that you have to tell people they are being recorded, which is why “Your call may be monitored…” is played to humans a billion times a day.
Should a bot use first-person pronouns? It’s interesting that a lot of identity claims over the phone start with first names only—“Hi, this is Julie, how can I help you?” Should we put in earcons at the start to let users know what they’re dealing with? Or use a flat-affect TTS, like Data from Star Trek?
In the early days of bots, when a voice bot asked you to punch buttons, it was pretty clear a bot was on the other end of the line. Later we got speech recognition and natural language processing grammars, so it wasn’t as clear. Some designers did in fact recommend playing an earcon at the start of the interaction to make the automation apparent to users, mainly because computers were still pretty lame when it came to having a dialogue and it was a good idea to set realistic expectations.
Now it’s like when you know three sentences of French, go to Paris, use them, and fail. The trouble with being too clever by half is how do you unwind it when it breaks? Just because you can pass the Turing Test doesn’t mean you have a right to.
Personally I like to know who or what I’m talking to up front. This should be the fourth law of robotics. Maybe I need a bot to answer calls to figure out if it’s a bot calling. Soon people will propose working out the details of meetings or projects by saying, “My bot will touch base with your bot.”
We know machines are stupid now when it comes to real conversation. (Not to mention ethics. When I get a call from “Barbara” pitching timeshares in the Bahamas, I hang up. This is why nobody answers the phone anymore.) Eventually, however, bots will probably get so smart that they’ll start feigning stupidity in order to get to talk to a person—and pass the Turing Test with flying colors.
Phil Shinn, who has been building bots since 1984, is on the board of AVIOS.
Speech Professionals Should Think Big but Meet Local
Local meet-ups foster much-needed community and connections.
By Sara Basson, Accessibility Evangelist, Google - Jan 30, 2018
The Applied Voice Input/Output Society (AVIOS) was founded in 1981 as a professional society for speech application development. Its mission was to provide education, increase successful deployments, and offer a bridge between the industry and academia. In the early days of speech technology, most of the attention was on interactive voice response (IVR) and call center applications, and AVIOS supplied a forum for exploring how to successfully design these services. The big focus now is conversational and multimodal interactions, with an emphasis on dialogue and natural language understanding (NLU). AVIOS has become an interactive hub for NLU researchers, designers, and technologists.
From the outset, AVIOS recognized that technical development is only one part of the equation; offering a forum for discussion and brainstorming across multiple disciplines was just as critical. After all, speech technology is just one part of a bigger picture that incorporates designers, user experience, marketing, and the whole conversational interaction ecosystem. The community that we serve also needs to include students; inspiring students to get involved in this exciting area and ensuring that the community remains fresh and incorporates the latest research approaches are important priorities. AVIOS strives to reflect the diverse communities we want to attract: established industry players, start-ups, university students, and faculty.
Annual conferences were a successful convening point, but it became clear through community feedback that AVIOS needed to create multiple points of contact, and this led to the formation of local chapters. Local chapters created the opportunity for more frequent meetings of people from different disciplines interested in speech technology applications, enabling them to address real-world challenges from different perspectives. When events are held locally, it also enables the exchange of ideas among speech and language enthusiasts worldwide, participants who might otherwise be unable to attend a single conference in a distant location. It allows AVIOS to bring the discussions to wherever local professionals, students, and interested parties can be found.
Over the past decade, local chapter events have been held in Australia, Brazil, and Canada, as well as stateside in cities such as Boulder and Seattle. There are active local chapters in New York/New Jersey, New England, Silicon Valley, and Israel. In New York, we have held meetings at company sites (IBM, Nuance, and AT&T) as well as at universities (CUNY and Columbia). Speakers have described long-term research, as well as near-term deployments, with multidisciplinary themes. Speakers of note in New York have included David Nahamoo, Michael Johnston, Raj Tumuluri, Bruce Balentine, and Roberto Pieraccini.
The focus at a recent local chapter event in Israel was human-machine interaction for special populations, with international experts sharing best practices from a global perspective. The New England chapter, meanwhile, has hosted Mike Phillips, Julia Hirschberg, and Alborz Geramifard, with topics ranging from the Amazon Echo to Jibo and social robotics to speech morphing and voice disorder detection. Silicon Valley is the most recently launched local chapter, and host companies have included Google, Ford Research, GE Digital, and Oracle, with a robust pipeline of leading companies eager to host in the future.
The more intimate gatherings often result in more productive proceedings. “The majority of persons in an AVIOS local chapter meeting are there for serious business, not light networking or job hunting,” says Sue Reager, president of Translate Your World (and a Speech Technology contributor). “Out of one small AVIOS meeting, five or 10 resulting conversations are highly likely. And because attendees seem to have harmonious products and services, local chapters are a good place to find ways to enhance whatever you yourself are developing.”
But they also offer speech professionals valuable camaraderie. “As someone in a start-up with a tiny design team, it’s great to have a place for me to go to be around other designers in the field, so I can hear about others’ ideas, successes, failures, and frustrations,” says Cathy Pearl, director of user experience at Sensely.
“For me the networking is the best part of the meetings,” agrees Homayoon Beigi, president of Recognition Technologies. “I am mostly in my own corner developing algorithms or writing code. AVIOS meetings give me an opportunity to get my head out of the water and see what is going on around me, not to mention meeting good people with similar interests in the same field.”
If you are interested in attending (or launching!) an AVIOS local chapter in your area, please contact email@example.com.
Finding Harmony Between Human and Artificial Intelligence: Symbiosis of Human and Machine
By Michael Johnston, Director of Research and Innovation, Interactions Corporation - August 27, 2017
Currently, artificial intelligence (AI) technologies are having an increasing impact on many aspects of daily life. Artificial intelligence refers to the capability of a machine to mimic or approximate the capabilities of humans. Examples include tasks such as recognizing spoken words (speech recognition), visual classification and perception (computer vision), understanding user meaning (natural language understanding), and conducting a conversation (dialog). Increasingly, systems combining constellations of AI technologies that previously were found only in research prototypes are coming into daily use by consumers in applications such as mobile and in-home virtual assistants (e.g., Siri, Cortana, and Alexa). Despite these successes, significant challenges remain in the application of AI, especially in language applications, as we scale from simpler information-seeking and control tasks (“play David Bowie,” “turn on the lights”) to more complex tasks involving richer language and dialog (e.g., troubleshooting for technical support, booking multi-part travel reservations, giving financial advice). Among enterprise applications of AI, one approach that is gaining popularity is to forgo the attempt to create a fully autonomous AI-driven solution in favor of leveraging an effective blend of human and machine intelligence.
Human intelligence has always played a critical role in machine learning. Specifically, in supervised learning, human intelligence is generally applied to assign labels, or richer annotations, to examples used for training AI models, which are then deployed in fully automated systems. Effective solutions are now emerging that involve the symbiosis of human and artificial intelligence in real time. These approaches vary in whether a human agent or an artificial agent is the driver of the interaction. In the case of an artificial agent fielding calls, text messages, or other inputs from a user, human intelligence can be engaged in real time to provide live supervision of the behavior of the automated solution at various levels (human-assisted AI). For example, human agents can listen to audio and assist with hard-to-recognize speech inputs, assigning a transcription and/or semantic interpretation to the input. They can also assist with higher-level decisions, such as which path to take in an interactive dialog flow, or how best to generate an effective response to the user. In these cases, the goal is to contain the interaction in what appears to the customer to be an automated solution, but one that leverages human intelligence just enough to maintain robustness and a high quality of interaction.
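The human-assisted AI pattern described above can be sketched as a confidence gate: the automated agent handles inputs it is confident about and escalates the rest to a live human, while the caller continues to experience a single automated assistant. The field names and threshold below are illustrative, not any vendor's actual interface:

```python
# Sketch of human-assisted AI: defer low-confidence recognitions to a human.
# The dict fields and the 0.80 threshold are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.80

def interpret(asr_result: dict, ask_human) -> str:
    """Return a transcription, deferring to a human agent below the threshold."""
    if asr_result["confidence"] >= CONFIDENCE_THRESHOLD:
        return asr_result["text"]
    # Low confidence: a human agent listens to the audio and supplies the
    # transcription; the caller never leaves the automated experience.
    return ask_human(asr_result["audio"])

# Usage with a stand-in for the human agent:
human = lambda audio: "transfer four hundred dollars to checking"
print(interpret({"text": "transfer for hundred...", "confidence": 0.41,
                 "audio": b"..."}, human))
```

The same gate can sit at higher levels too, e.g., on the dialog-flow decision rather than the transcription.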
In contrast, in AI-assisted human interaction, the driver of the interaction is a human agent, and the user’s perception is that they are interacting with a person. The role of the AI is to assist the human agent in order to optimize and enhance their performance. For example, an AI solution assisting a contact center agent might suggest a possible response to return in text or read out to a customer. Several companies have recently explored the application of sequence-to-sequence models using deep neural networks to formulate one or more responses that an agent can adopt or edit. One of the great advantages of this setting for applying new machine learning algorithms is reduced risk of failure, as the human agent maintains the final say on whether to adopt the suggested response or use another. In addition, human decisions to adopt, reject, or edit suggested responses provide critical feedback for improving the AI models making the suggestions. Another example of AI-assisted human interaction is the application of predictive models, based on user profiles and interaction history, to support a financial advisor with suggestions they can make to a client, or to assist a salesperson in recommending the optimal strategy for up-selling a product. Still further applications of AI empowering human agents include within-call analytics that track customer or agent emotion and provide live feedback to the human agent on their own emotional state or that of the customer.
Perhaps the best solutions for customer care will combine both humans assisting AI and AI assisting humans: Customers will first engage with automated virtual assistants that respond to their calls, texts, messages and other inputs, and human assistance will play a role in optimizing performance. Then, if the call requires transfer to a human agent, that agent will be supported by an AI-enabled solution which quickly brings them up to speed on the history of the interaction and can assist them in real time as they respond to and engage with the customer.
Speech in the Connected Car: Embedded versus the Cloud
By Thomas Schalk, Vice President, Voice Technology, SiriusXM - February 1, 2017
The speech experience in the car is transforming from basic command and control to natural interactions with automotive assistants.
When a vehicle has internet connectivity through a smartphone or an embedded modem in the vehicle, asking for nearby parking options is just one example of a realistic speech experience. Connected cars can leverage cloud-based speech recognition, such as Apple’s or Google’s speech technology, as well as off-board content and intelligent processing such as NLU and reasoning technology. Cars without connectivity rely on factory-installed embedded speech recognition, which is quite limited in comparison to cloud-based speech recognition.
Embedded Speech: Embedded speech recognition started appearing in vehicles during the early 1990s, and most new vehicles are equipped with it. Initially used for voice dialing, embedded speech systems have evolved to support music management, navigation, and various vehicle features such as climate control. However, if you pay attention to recent J.D. Power user satisfaction scores for embedded speech in the car, you almost get the impression that simple speech commands are difficult for the embedded technology to recognize.
The primary reason for these low user satisfaction scores is that in-vehicle recognizers often hear audio that is not expected, which leads to unexpected things happening. Drivers are expected to speak structured commands, as outlined in the vehicle owner’s manual – which many drivers ignore. For example, a driver can simply say a station number to change channels on the radio. But if the driver says an address without saying “address” first, anything can happen. Why force drivers to follow rules that have to be learned? Smartphones don’t behave this way because they are designed to handle naturally spoken requests.
Historically, embedded speech has faced the following constraints:
• Usage behavior cannot be monitored, and audio is not logged.
• Consumers are not inclined to learn the proper interaction processes.
• The vehicle’s CPU processing and memory are quite limited, even today.
• Infotainment systems continue to evolve and become more complex to control.
• Updating the recognizer is not practical after a new car is sold.
Embedded speech recognizers can’t adapt to the driving population’s voice patterns, which is a significant disadvantage in terms of optimizing the user experience. The speech algorithms and supporting statistical data are limited due to hardware constraints. Once installed in the car, the recognizer is frozen. The result: limited performance – especially at handling out-of-grammar commands. However, speech is very important for safety and drivers want it. And to be fair, after learning what to say and a little practice, most people have no problem using speech in the car.
Cloud Speech: Cloud-based automotive speech recognition became a reality a couple of years ago when cars were launched that supported Apple’s CarPlay and Google’s Android Auto. These infotainment solutions rely on the driver’s smartphone for the app and connectivity, and also rely on the vehicle’s display. Both offer a better navigation experience than best-in-class embedded vehicle infotainment systems because people are used to their smartphone experience, where entering a destination is so intuitive. The overall cloud speech experience is quite impressive and goes far beyond what drivers have become accustomed to in the car. Not only does the cloud enable highly accurate, very adaptable speech recognition, it also provides access to content and off-board intelligence that can be leveraged for the best possible user experience in the car. Dialing, texting, full navigation, music management, messaging, and Siri-like assistance are all supported. The user interface is suitable for driving, and speech input is key to the experience. Neither CarPlay nor Android Auto relies on embedded speech recognition; however, without internet connectivity, the speech function is lost.
The user interaction style for cloud speech is natural and quite different from the embedded speech style. For example, with cloud speech you never hear the classic car prompt: “Please say a command <beep>.” Instead, the user hears a chime to start the experience, and if needed, audio prompting without beeps is used. And with the natural interaction style that the cloud brings, we expect user satisfaction scores to improve dramatically. We are still waiting for confirmed data on this topic.
Hybrid Speech: Earlier this year at CES, Nuance announced that its Dragon Drive – an embedded-cloud speech recognizer with NLU – had launched in BMW’s 2016 7 Series. First to hit the market, the hybrid speech solution features a conversational user interface that enables intuitive access to in-car functions and connected services while minimizing distraction. Ideally, a hybrid speech solution offers seamless access to all services and content – no matter where you’re driving. But for this to be true, the vehicle has to stay updated with relevant infotainment information such as nearby points of interest and other navigation information.
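One plausible arbitration policy for a hybrid recognizer like the one described above (this is a generic sketch under stated assumptions, not Nuance's actual Dragon Drive design) is to prefer the cloud recognizer when the car is connected and fall back to the embedded one otherwise, or when the network request fails mid-drive:

```python
# Sketch of hybrid embedded/cloud speech arbitration. The function names and
# the ConnectionError fallback are illustrative assumptions, not a real API.

def recognize(audio: bytes, connected: bool, cloud_asr, embedded_asr) -> str:
    if connected:
        try:
            # Cloud path: broader vocabulary, adapts to the driving population.
            return cloud_asr(audio)
        except ConnectionError:
            pass  # e.g., the car drove out of coverage mid-request
    # Embedded path: limited grammar, but always available offline.
    return embedded_asr(audio)
```

Keeping the embedded path as the guaranteed fallback is what delivers the “seamless access no matter where you’re driving” goal the article describes.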
Conclusion: The trend is clearly toward cloud speech, but what about the embedded speech solutions? It would seem that with some clever user interface improvements and focusing only on managing the phone, navigation, and music, the user satisfaction scores could be improved. But let’s face it, we can all expect Google to be the infotainment system as the connected car ultimately drives toward autonomous driving.
Speech Technologies Are at a Tipping Point
Thanks to a convergence of technologies, seamless touch-talk-type interactions are now within reach
By Emmett Coin, Industrial Poet, ejTalk - August 1, 2016
The use of speech in applications ranging from interactive voice response (IVR) to desktop dictation to smartphone voice search has grown steadily and significantly over the past couple of decades. And AVIOS has been a part of that journey, chronicling the paths and pitfalls of these challenging and exciting technologies. AVIOS surveys the technological horizon and examines the future trends for speech and natural language, and the present trajectory of the underlying AI technologies is the reason that AVIOS has refined its focus. Our upcoming annual conference—at the end of January 2017, titled “Conversational Interaction”—will explore those trends.
The rapid convergence of two technologies in particular has brought our industry to a tipping point.
Parallel to the development of speech-only interaction (IVR) is the evolution of text-based and touch-based styles of interaction, which have become ubiquitous on the desktop and the smartphone in the form of chat windows. Lately, these types of text interactions have evolved into what are commonly called “chatbots” (“textbot” seems more accurate to me). We have all received tech support via a “chat window” on a web page. Granted, it all started as a real human helping us on the other side of that window, but the interaction felt normal in the way that texting has come to feel normal. When developers began automating this new interaction paradigm, it was well understood that typing had a much lower error rate than speech recognition. So it is not surprising that early efforts to automate a chat window style of interaction were built upon natural language processing/natural language understanding (NLP/NLU) text analytics and state machines (similar to what is used for IVR applications) to manage the interaction flow.
Initially, what these systems did most reliably was classify human intent into predefined subcategories and then transfer the user interface experience to an existing page that provided additional detailed information about that subcategory. In fact, most virtual agent–based chat windows today do precisely that kind of simple category detection followed by a redirection to more detailed information.
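The classify-then-redirect pattern described above can be sketched as a keyword scorer over a few predefined intents, followed by a handoff to an existing page. The intents, keywords, and URLs here are invented for illustration:

```python
# Sketch of simple category detection followed by redirection, as in early
# virtual agent chat windows. All intents, keywords, and URLs are invented.

INTENT_KEYWORDS = {
    "billing": {"bill", "invoice", "charge", "payment"},
    "shipping": {"ship", "delivery", "track", "package"},
    "returns": {"return", "refund", "exchange"},
}

INTENT_PAGES = {
    "billing": "/help/billing",
    "shipping": "/help/shipping",
    "returns": "/help/returns",
}

def route(utterance: str) -> str:
    words = set(utterance.lower().split())
    scores = {intent: len(kw & words) for intent, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "/help"  # no category detected; fall back to the help home page
    return INTENT_PAGES[best]

print(route("where is my package"))  # → /help/shipping
```

Modern NLU services replace the keyword sets with trained intent classifiers, but the overall flow (classify intent, then redirect to existing detailed content) is exactly what most virtual agent chat windows still do.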
Concurrent with the chatbot evolution, speech-based interactions continued to develop on the telephone. Because of the early limitations of speech recognition, these systems focused heavily on extracting details in small chunks. Talking to our bank IVR, we could say “checking” or “savings” to direct the system. Later these systems supported short, well-formed directives such as “transfer $400 to checking.” But if you said “move $400 out of my savings into my other account,” it would most likely fail because neither the speech recognition nor the NLU was robust enough to handle utterances that open-ended. (One vague utterance opens the door to a vast number of potential utterances that the system must anticipate.) The handcrafted grammars and NLU analytics at that time were not up to the task.
But powerful advances have emerged over the past decade. We would not have predicted that by now the average person could get near-dictation-quality speech recognition on a cell phone or with the built-in microphone on an inexpensive laptop from five feet away. Speech recognition is still far from being as good as a human, but it is good enough to do conversational transcription over less-than-ideal audio channels. While NLP/NLU has not made such dramatic advances, it has become good enough to do the needed analytics at conversational speed. One clear sign that NLU intent analysis is improving is that it’s available from multiple vendors as a RESTful microservice.
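Calling such an NLU microservice typically amounts to POSTing an utterance and getting back an intent label with a confidence score. The endpoint, payload shape, and response fields below are hypothetical; each vendor defines its own API, but the shape is broadly similar:

```python
import json
from urllib import request

# Hedged sketch of a client for a hypothetical RESTful intent-classification
# microservice. The URL and JSON fields are invented for illustration.

def build_intent_request(utterance, url="https://nlu.example.com/v1/intent"):
    """Build an HTTP POST request carrying the utterance as JSON."""
    body = json.dumps({"text": utterance}).encode("utf-8")
    return request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify_intent(utterance):
    """Send the utterance and return the service's parsed JSON reply,
    e.g. {"intent": "transfer_funds", "confidence": 0.93}."""
    req = build_intent_request(utterance)
    with request.urlopen(req) as resp:  # network call
        return json.load(resp)
```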
The major convergence of these technologies, along with multimodal fusion, has resulted in a natural synergy that gives us a seamless touch-talk-type multimodal interaction. Instead of being led through a dialogue, users want to be part of a richer conversation. They want to be part of a natural interaction, not simply micromanage an app. Rich interaction does not need to be a long, chatty conversation. It just needs to be aware:
Human: I’m leaving work at two today.
Computer: I’ll send a note to your team. Should I set your home thermostat for 2 p.m. arrival?
Human: Sure, thanks.
Computer: Okay, later.
Human: Oh, let Megan know, too.
Computer: Sure, I’ll text your wife you’re heading home.
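The exchange above hinges on shared context: the assistant must keep the departure event alive across turns and resolve "Megan" against what it knows about the user. A toy sketch of the state such an assistant carries (contacts and actions are invented):

```python
# Toy context store behind the dialogue above: follow-ups like "let Megan
# know, too" attach to the still-active departure event, and names are
# resolved against the user's contacts. All data is invented.

CONTACTS = {"megan": {"name": "Megan", "relation": "wife"}}

class AssistantContext:
    def __init__(self):
        self.active_event = None

    def leaving_work(self, time):
        """Record the departure and propose context-aware actions."""
        self.active_event = {"type": "departure", "time": time}
        return [f"notify team: leaving at {time}",
                f"offer: set thermostat for {time} arrival"]

    def also_notify(self, name):
        """Tie a follow-up request to the active event."""
        contact = CONTACTS.get(name.lower())
        if contact is None or self.active_event is None:
            return "Who should I tell, and about what?"
        return (f"text {contact['relation']} ({contact['name']}): "
                f"heading home at {self.active_event['time']}")
```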
Visit us at “Conversational Interaction” in January and meet the people who are creating this future.
Emmett Coin is the founder of ejTalk, where he researches and creates engines for human-computer conversation systems. Coin has been working in speech technology since the 1960s, when he investigated early ASR and TTS at MIT with Dennis Klatt.
By Deborah Dahl, Ph.D., Principal, Conversational Technologies - February 15, 2016
What do a voice-activated calorie counter, a grandmother’s guide to technology, a hotline for finding out course grades, a workout log, and learning when the next bus will arrive have in common? These are all applications that have been entered into AVIOS’s annual student speech programming contest. The 2016 contest is the tenth in a row. Every year since 2007, AVIOS has organized this contest, which is open to high school, undergraduate, and graduate students. The contest showcases creative and innovative speech applications developed by the next generation of developers.
The types of submissions AVIOS has received over the years reflect the evolution of speech technology and the speech industry. In the early years of the contest, submissions consisted primarily of voice applications using VoiceXML, as well as a few desktop multimodal applications. A lot has changed since 2007, when iPhones and Android phones had just come out and it was not yet possible to develop apps for either phone. In recent years, submissions have more often been native multimodal applications on mobile devices, along with some browser-based applications, mirroring the overall industry movement toward mobile apps.
Another interesting characteristic of the student contest submissions is that they have nearly all been end-user oriented applications designed to make everyday life easier. In contrast to the fast-moving technical landscape, everyday life hasn’t changed that much since 2007. Consequently, many of the applications submitted 10 years ago are still relevant. Applications entered into the very first student contest — a recipe reader and organizer, an algebra practice tutor, and a travel kiosk — would still be useful today. More recent winners have continued the theme of everyday life. One recent application was a storytelling application that let a caregiver record a story for a child in their own voice, with the ability to control the book by voice. Another application helped the user shop for a new house.
The contest is announced early in the fall, submissions are due in early January, and the winners are announced at the Mobile Voice Conference. After some challenging installation issues with desktop applications in the early years (the judges have to be able to install the applications to run them), the entries have now been limited to applications that can be accessed from a phone number, run on common mobile devices, or run in a browser. These options nevertheless give students plenty of scope for creativity.
Applications are judged by a panel of industry experts. The six judging criteria are robustness, usefulness, technical superiority, user friendliness, innovation, and creativity. Information enabling the public to access the winning applications is posted on the AVIOS website after the winners are announced.
The student winners receive support for travel to Mobile Voice, and they are always excited to get a chance to see commercial applications and interact with speech technology professionals. This year AVIOS is very fortunate to have two generous sponsors of the contest. Google is sponsoring monetary prizes for the student winners and Openstream is donating funding to help cover the students’ travel expenses to Mobile Voice. In addition, Openstream is offering additional awards to any winners who use its Cue-me™ platform. Google is also inviting the winners to visit Google and present their application to the Google voice/speech team.
Submissions in the past have generally been developed as class projects in speech application classes, but that’s not a requirement. For example, applications could be developed by students working on their own, as a class project in a general programming class, or as part of a hackathon. The only requirement is that the applications be developed by students, either as individuals or as part of a team of up to five members. If you know any students interested in voice-enabled applications, encourage them to enter the contest. And if you’re a university or high school computer science instructor, why not include a module about voice-enabled applications in your courses? That way, your students can learn how to develop speech-enabled apps and enter their projects in the AVIOS contest. Previous winners and faculty who have participated in the contest have reported a universally positive experience. The contest encourages students to develop creative ideas aimed at solving a real problem, which can be very different from work done in most college classes. Students interested in speech applications should definitely consider submitting an application for the 2017 contest, which will be announced in fall of 2016.
Keep your eyes on the AVIOS website, www.avios.org, for the 2016 contest winners and the 2017 contest announcement!
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.