How Alexa Conversations Works
- GA: en-US, en-AU, en-CA, en-IN, en-GB, de-DE, ja-JP, es-ES, es-US
- Beta: it-IT, fr-CA, fr-FR, pt-BR, es-MX, ar-SA, hi-IN
Alexa Conversations uses an AI-driven dialog management model to enable your skill to respond to a wide range of phrases and unexpected conversational flows to achieve a goal. The Alexa Conversations dialog management model includes the following three stages:
- Authoring – You provide annotated dialogs that represent the different conversational experiences your skill can support. For details on authoring, which involves creating and annotating dialogs, see Write Dialogs for Alexa Conversations and Best Practices for Authoring Dialogs in Alexa Conversations.
- Building – Alexa Conversations uses a dialog simulator to expand annotated dialogs into dialog variants that train the dialog management model. Dialog variants are natural ways in which a dialog can occur.
- Runtime – Alexa Conversations evaluates the trained dialog management model to process incoming events and predict actions. For details on the science behind the Alexa Conversations modeling architecture, see Science innovations power Alexa Conversations dialogue management.
This page focuses on the dialog simulator and what happens during runtime. We assume you're familiar with basic Alexa Conversations concepts, such as dialog acts, described in About Alexa Conversations.
Dialog simulator
The dialog simulator generates training data by generalizing your annotated dialogs to cover various ways a user might interact with your skill. For example, a user might say variations of utterances to invoke specific functionality, provide requested information out of order, or change previously provided information.
The dialog simulator generates the training data by expanding your annotated dialogs — including slot types, API definitions, utterance sets, and responses — into tens of thousands of dialog variants, phrasing variations, and uncommon alternatives to create a much wider range of possible dialog paths. This expansion improves the robustness of the dialog management model and enables you to focus on the functionality of your skill instead of on identifying and coding every possible way users might engage with your skill.
The following sections describe the dialog expansion methods:
- Utterance variations
- Slot value variations
- Requesting and informing missing API arguments
- User correction of slot values
- Confirming APIs
- Confirming arguments
- Invoking multiple APIs in a single turn
- Proactive offers
- Contextual carryover
Utterance variations
When you configure your Alexa Conversations skill, you create utterance sets to group different ways your user might respond to Alexa. For each utterance set, you provide a list of sample utterances. The dialog simulator uses these sample utterances to generate dialog variants.
Example
You provide a dialog to find a movie. Your utterance set contains both "Who directed {movie}?" and "Who's the director of {movie}?" In a variant, the dialog simulator replaces the user utterance with another sample utterance from the utterance set.
Dialog you provide | Example variant from the dialog simulator
---|---
User: Who directed Inception? (Invoke API …) | User: Who's the director of Inception? (Invoke API …)
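Conceptually, this expansion amounts to substituting any other sample utterance from the same utterance set into the dialog turn. The following Python sketch illustrates the idea; the utterance set and function name are illustrative, not the simulator's actual implementation.

```python
# Hypothetical utterance set: different ways a user might ask for a director.
# "{movie}" marks a slot, as in the console's sample utterances.
get_director_utterances = [
    "Who directed {movie}?",
    "Who's the director of {movie}?",
]

def utterance_variants(turn_slots, utterance_set):
    """Yield one dialog-turn variant per sample utterance in the set."""
    for template in utterance_set:
        yield template.format(**turn_slots)

# The annotated dialog used "Who directed Inception?"; the simulator can
# swap in any other sample utterance with the same slots filled.
for variant in utterance_variants({"movie": "Inception"}, get_director_utterances):
    print(variant)
# Who directed Inception?
# Who's the director of Inception?
```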
Slot value variations
You provide slots by selecting built-in slot types, extending built-in slot types with slot values, or creating custom slot types with values. The dialog simulator randomly samples these slot values to generate dialog variants.
Example
You provide a dialog to recommend a movie. In the variant, the dialog simulator replaces the slot values "crime" and "Quentin Tarantino" with "comedy" and "Guy Ritchie".
Dialog you provide | Example variant from the dialog simulator
---|---
User: I'd like to watch a crime movie by Quentin Tarantino. (Invoke API …) | User: I'd like to watch a comedy movie by Guy Ritchie. (Invoke API …)
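Conceptually, the simulator samples alternative values for each slot type when it builds a variant. The following is a minimal sketch of that idea, with made-up slot catalogs rather than the real sampling logic.

```python
import random

# Hypothetical slot catalogs: built-in or custom slot types and their values.
slot_catalog = {
    "Genre": ["crime", "comedy", "drama"],
    "Director": ["Quentin Tarantino", "Guy Ritchie", "Greta Gerwig"],
}

def sample_slot_variant(template, slots, rng=random):
    """Replace each annotated slot value with a randomly sampled value of the same type."""
    sampled = {name: rng.choice(slot_catalog[slot_type]) for name, slot_type in slots.items()}
    return sampled, template.format(**sampled)

template = "I'd like to watch a {genre} movie by {director}."
values, variant = sample_slot_variant(template, {"genre": "Genre", "director": "Director"})
print(variant)  # e.g. "I'd like to watch a comedy movie by Guy Ritchie."
```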
Requesting and informing missing API arguments
You must provide annotated dialogs that demonstrate requesting and informing all arguments required to invoke an API. In doing so, you must create a response with the Request Args dialog act for each individual argument. For example, for a weather API that requires the city and date, you must create a response such as "What city?" and a response such as "What date?"
The dialog simulator automatically expands the dialogs to create variants that request missing API arguments. These dialog variants cover cases where the user doesn't provide all the requested slots in a single turn (underfilling) or gives more information than requested (overfilling).
Example
You provide a dialog to request a director and genre to recommend a movie. Your dialog includes responses for requesting the director and the genre. The dialog simulator generates dialog variants that request missing API arguments for RecommendMovie.
Dialog you provide | Example variant from the dialog simulator
---|---
User: Can you recommend a movie? (Invoke API …) | User: Can you recommend a movie? (Invoke API …) Another variant might be as follows. User: Can you recommend a movie? (Invoke API …)
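One way to picture this expansion: given the required arguments and the Request Args response you authored for each one, the simulator can generate variants in which the user supplies the arguments across any number of turns. The sketch below enumerates an underfilling and an overfilling variant for a hypothetical RecommendMovie API; it illustrates the idea, not the simulator's actual algorithm.

```python
# Hypothetical API signature and the Request Args responses you authored.
required_args = ["genre", "director"]
request_prompts = {"genre": "What genre?", "director": "Which director?"}

def variant(first_utterance, args_filled_by_first_turn):
    """Build one dialog variant; Alexa requests only the arguments still missing."""
    dialog = [f"User: {first_utterance}"]
    for arg in required_args:
        if arg in args_filled_by_first_turn:
            continue                                    # already provided on the first turn
        dialog.append(f"Alexa: {request_prompts[arg]}")  # request the missing argument
        dialog.append(f"User: <{arg} value>")            # user informs it on a later turn
    dialog.append("(Invoke API RecommendMovie.)")
    return dialog

# Underfilling: the user provides neither argument up front, so Alexa requests each one.
print("\n".join(variant("Can you recommend a movie?", set())))
# Overfilling: the user names the director up front, so Alexa requests only the genre.
print("\n".join(variant("Can you recommend a movie by Guy Ritchie?", {"director"})))
```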
User correction of slot values
You don't need to provide annotated dialogs that demonstrate users correcting slot values. The dialog simulator automatically expands your dialogs to create these variants.
Example
Without you providing a dialog with user corrections, the dialog simulator generates dialog variations such as the following.
Dialog you provide | Example variant from the dialog simulator
---|---
(None that provide user corrections.) | User: What's a good comedy movie? (Invoke API …)
Confirming APIs
You provide dialogs with Confirm API / Affirm dialog acts to indicate Alexa must confirm API arguments with the user before invoking the API. Each dialog that invokes a specific API must precede the API Success / API Failure dialog act with turns that confirm the API arguments. You aren't required to provide annotated dialogs with the Confirm API / Deny dialog acts. The dialog simulator automatically generates dialog variants that render the built-in reqmore response if the user denies the confirmation. However, you can provide annotated dialogs with Confirm API / Deny if you want to support alternative dialog flows.
Example
You provide a dialog to recommend and then purchase a movie, and a dialog confirming that the user wants to purchase the movie. The dialog simulator generates dialog variations that confirm the ReserveMovie API before invoking it.
Dialog you provide | Example variant from the dialog simulator
---|---
User: I'd like to watch a movie from Guy Ritchie. (Invoke API …) (Invoke API …) You provide another dialog that includes confirmation of the … User: I'd like to watch a movie from Guy Ritchie. (Invoke API …) Alexa: Just to confirm, you want to purchase the movie Snatch? (Invoke API …) | User: I'd like to watch a movie from Guy Ritchie. (Invoke API …)
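The generated variants can be thought of as inserting a confirmation exchange before the API call, with an affirm branch that proceeds to the invocation and a deny branch that falls back to the built-in reqmore response. The following sketch of that branching is illustrative only; it is not the simulator's actual output format.

```python
def confirm_api_variant(confirmation_prompt, api_name, user_affirms):
    """Insert a Confirm API exchange before the API invocation."""
    dialog = [f"Alexa: {confirmation_prompt}"]
    if user_affirms:
        dialog.append("User: Yes.")                  # Confirm API / Affirm
        dialog.append(f"(Invoke API {api_name}.)")   # invoke only after the affirm
    else:
        dialog.append("User: No.")                   # Confirm API / Deny
        dialog.append("Alexa: <built-in reqmore response>")
    return dialog

prompt = "Just to confirm, you want to purchase the movie Snatch?"
print("\n".join(confirm_api_variant(prompt, "ReserveMovie", user_affirms=True)))
print("\n".join(confirm_api_variant(prompt, "ReserveMovie", user_affirms=False)))
```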
Confirming arguments
You provide dialogs with Confirm Args / Affirm to indicate Alexa must confirm the arguments after requesting them. Each dialog that requests a specific argument must follow the Request Args dialog act with a turn that confirms the argument. If the dialog simulator attempts to generate a dialog variant where both Confirm API and Confirm Args are applicable before invoking an API, it prioritizes Confirm API for that turn and confirms all required arguments with the user. You aren't required to provide annotated dialogs with the Confirm Args / Deny dialog acts. The dialog simulator automatically generates dialog variants that re-request the argument if the user denies the confirmation. However, you can provide annotated dialogs with Confirm Args / Deny to provide alternative dialog flows.
Example
You provide a dialog to request a director and genre to recommend a movie, and a dialog that confirms the director. The dialog simulator generates dialog variations that confirm the director before proceeding.
Dialog you provide | Example variant from the dialog simulator
---|---
User: Can you recommend a movie? … You provide another dialog that includes the Confirm Args dialog act, such as the following. User: Can you recommend a movie? … | User: Can you recommend a movie? …
Invoking multiple APIs in a single turn
You can provide annotated dialogs with a single Alexa turn with API Success and API Failure dialog acts that invoke multiple APIs. When creating variants, the dialog simulator treats the sequence of these APIs as deterministic — that is, it doesn't change the order.
Example
You provide a dialog that invokes two APIs in a single turn to first get user preferences with the user's default city and then get the weather for that city before rendering the result.
Dialog you provide | Example variant from the dialog simulator
---|---
User: What's the weather today? … Alexa: Today in Seattle, it's a high of 70 degrees with a low of 60 degrees. | User: What's the weather today? …
Proactive offers
You can provide annotated dialogs that proactively offer a dialog flow to invoke a new API after the user invokes the original API and receives the result. You can offer a new API by extending a dialog that ends with an API Success / API Failure turn with an Offer Next API turn, which can include requesting arguments on the same turn and/or passing in arguments from a prior API invocation.
You aren't required to complete the dialog after the Offer Next API turn. The dialog simulator automatically completes the dialog when creating dialog variants as long as there is another dialog that invokes the new API. However, you can complete the dialog after the Offer Next API turn to provide alternative dialog flows. Proactive offers are non-deterministic; dialog variants can include proactively offering different APIs and not proactively offering any APIs.
You aren't required to provide annotated dialogs with the Offer Next API / Deny dialog acts. The dialog simulator automatically generates dialog variants that render the built-in reqmore response if the user denies the offer. However, you can provide annotated dialogs with Offer Next API / Deny to provide alternative dialog flows.
Example
You provide a dialog to reserve a table at a restaurant. You extend this dialog with an Offer Next API for an API to book an Uber.
Dialog you provide | Example variant from the dialog simulator
---|---
…(previous lines of dialog)… (Invoke API …) You extend this dialog with an Offer Next API such as the following. …(previous lines of dialog)… (Invoke API …) | …(previous lines of dialog)… (Invoke API …) The dialog simulator also generates dialog variations for Offer Next API / Deny such as the following. …(previous lines of dialog)… (Invoke API …)
Contextual carryover
You don't need to provide annotated dialogs that demonstrate contextual carryover. The dialog simulator supports this feature by introducing turns with pronouns (for example, "Please purchase it") when generating dialog variants from annotated dialogs. The carryover occurs at a later runtime stage, argument filling, when Alexa Conversations considers all slots that the user and Alexa mention across the entire dialog for filling API arguments.
Example
You provide a dialog to purchase a movie. The dialog simulator introduces a turn with a pronoun.
Dialog you provide | Example variant from the dialog simulator
---|---
User: I'd like to watch a movie from Guy Ritchie. (Invoke API …) (Invoke API …) | User: I'd like to watch a movie from Guy Ritchie. (Invoke API …) (Invoke API …)
Runtime
The Alexa Conversations runtime uses several components, including an inference engine, to evaluate the trained dialog management model. The inference engine receives events from the outside world (for example, the user says "Find showtimes for the Star Wars movie"), maintains conversation history for each session, manages dialog context, maintains the dialog state, and orchestrates information across different components within the runtime.
The runtime processes events and predicts the actions that should take place. For example, the action might be to respond to the user, call an AWS Lambda function, perform a calculation, and so on. The runtime either runs the predicted actions implicitly or transforms them to invoke an API.
The following diagram is a conceptual model of the Alexa Conversations inference engine, which hosts a machine-learning-trained dialog management model, processes events, and produces actions.
The inference engine uses context memory to manage dialog context and track the state, as well as the following three domain-specific models trained by machine learning:
- Named entity recognition – This model tags slots in the user utterance.
- Action prediction – This model predicts the action that should occur next.
- Argument filling – This model fills action arguments with entities from the context. An entity is a slot the user mentioned or a return value from a previous API. The inference engine performs entity resolution after argument filling is complete.
Context memory
Context memory is short-term memory that tracks user utterances and model results, such as entities (that is, a slot that the user mentioned or a return value from a previous API), predicted actions, and Alexa responses. Context memory maintains dialog context for each session and is constantly updated throughout the session.
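A rough way to picture context memory is as a per-session record of utterances, entities, predicted actions, and Alexa responses that grows as the dialog progresses. The dataclass sketch below is purely illustrative; the field names are assumptions, not the runtime's internal schema.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Entity:
    """A slot the user mentioned, or a value returned by a previous API call."""
    name: str        # for example, "title" or "time"
    slot_type: str   # for example, "MovieTitle" or "AMAZON.Time"
    value: Any

@dataclass
class ContextMemory:
    """Short-term, per-session dialog context, updated on every turn."""
    utterances: List[str] = field(default_factory=list)
    entities: List[Entity] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)     # predicted API/Response actions
    responses: List[str] = field(default_factory=list)   # rendered Alexa responses

memory = ContextMemory()
memory.utterances.append("Find showtimes for the Star Wars movie")
memory.entities.append(Entity("title", "MovieTitle", "Star Wars"))
memory.actions.append("FindShowtimes")
```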
Named entity recognition
Named entity recognition is the first step after the inference engine receives a verbal utterance (for example, the user says "Find showtimes for the Star Wars movie"). Named entity recognition segments user utterances into words and phrases that correspond to slot types.
Named entity recognition interprets these phrases as slots and stores the slots in context memory. Later, the dialog management model uses these slots to fulfill actions such as invoking APIs or rendering Alexa responses. The debugging capability of the Alexa simulator, which is in the Test tab of the developer console, shows the phrases and slot types of named entity recognition. For details, see Debug an Alexa Conversations Skill Model.
Example
You design your skill to book movie tickets. Your skill has an API definition for FindShowtimes, which has an argument, title, of type MovieTitle.
If a user says "Find showtimes for the Star Wars movie," named entity recognition recognizes "Star Wars", extracts "Star Wars" as a phrase, and labels this phrase as a MovieTitle slot type as follows.
{Find|Other} {showtimes|Other} {for|Other} {the|Other} {Star Wars|MovieTitle} {movie|Other}
Other means that named entity recognition didn't recognize a specific slot type.
At a later runtime stage, the argument filling model fills the MovieTitle argument for the API definition FindShowtimes with "Star Wars", which in this case is a user-mentioned slot.
Action prediction
Next, action prediction processes the current conversation context and predicts the next action type and action name to run. The three action types are as follows:
- API – Invokes an API in the skill endpoint (for example, perform a ticket purchase transaction).
- Response – Renders a response to the user (for example, inform of a transaction result or request more information).
- System – Waits for the next user utterance. This is an internal system action that indicates all tasks for the turn have run.
The action name can be an API definition name or a response name. The inference engine might run action prediction multiple times in a single turn until it predicts the System action type. The debugging capability of the Alexa simulator shows the API and response action types. For details, see Debug an Alexa Conversations Skill Model.
Example
You design your skill to book movie tickets. Your skill has an API definition for FindShowtimes, which has an argument, title, of type MovieTitle. A user says "Find showtimes for the Star Wars movie" in the following dialog.
User: Find showtimes for the Star Wars movie.
(Invoke API FindShowtimes.)
Alexa: It is playing at 10:00pm at Downtown Seattle AMC.
Action prediction runs three times on the user utterance. The first run predicts the API action type with name FindShowtimes and invokes the API. The second run predicts the Response action type with name InformMovieShowtimes and renders the response. The third run predicts the System action type, which terminates action prediction, ends the current turn, and waits for the next user utterance.
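The turn-level control flow can be sketched as a loop that keeps asking the model for the next action until it predicts System. The sketch below stubs out the model with the fixed predictions from this example so it runs as-is; in the real runtime the predictions come from the trained dialog management model.

```python
# Stubbed action predictions for the example turn; the real model predicts these
# from the conversation context held in context memory.
scripted_predictions = iter([
    ("API", "FindShowtimes"),
    ("Response", "InformMovieShowtimes"),
    ("System", None),
])

def predict_next_action():
    return next(scripted_predictions)

def run_turn():
    """Run action prediction repeatedly until the System action ends the turn."""
    while True:
        action_type, action_name = predict_next_action()
        if action_type == "API":
            print(f"Invoke API {action_name}")        # call the skill endpoint
        elif action_type == "Response":
            print(f"Render response {action_name}")   # speak to the user
        else:  # "System"
            print("End of turn; wait for the next user utterance")
            break

run_turn()
```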
Argument filling
When action prediction predicts an API action or a Response action, the next step is to determine how to fill the arguments with entities. An entity is a user-mentioned slot or a return value from a previous API. Argument filling uses context memory data to access all available entities. Argument filling supports contextual carryover because it considers slots mentioned by the user and Alexa across the entire dialog. Argument filling then selects the most likely entities of the matching type to fill the arguments, which the inference engine uses when invoking actions.
Example
You design your skill to book movie tickets. Your skill has an API definition for FindShowtimes, which has an argument, title, of type MovieTitle, and returns slot type ShowTimeInfo, which has properties time (slot type AMAZON.Time) and theaterName (slot type TheaterName). The user says "Find showtimes for the Star Wars movie" in the following dialog.
User: Find showtimes for the Star Wars movie.
(Invoke API FindShowtimes.)
Alexa: It is playing at 10:00pm at Downtown Seattle AMC.
The inference engine takes the following steps:
- Named entity recognition labels "Star Wars" as a MovieTitle slot type and stores the "Star Wars" slot in context memory as a title entity.
- Action prediction, on its first run, predicts the API action with name FindShowtimes.
- Argument filling uses the title entity, "Star Wars", to fill the title argument of the FindShowtimes API and then invokes the API.
- Action prediction, on its second run, predicts the Response action with name InformMovieShowtimes.
- Argument filling uses the time entity (a property from the API return) to fill the time argument of the InformMovieShowtimes response, uses the theaterName entity (another property from the API return) to fill the theaterName argument of the InformMovieShowtimes response, and then renders the response.
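Argument filling can be pictured as matching each argument's slot type against the entities currently in context memory (user-mentioned slots and prior API return values) and choosing the most likely one. The sketch below uses a trivial "most recent entity of the same type" rule as a stand-in for the trained model:

```python
# Entities accumulated in context memory for the example dialog: the user-mentioned
# title plus the properties of the FindShowtimes return value.
context_entities = [
    {"name": "title", "slot_type": "MovieTitle", "value": "Star Wars"},
    {"name": "time", "slot_type": "AMAZON.Time", "value": "10:00pm"},
    {"name": "theaterName", "slot_type": "TheaterName", "value": "Downtown Seattle AMC"},
]

def fill_arguments(required_args):
    """Fill each argument with the most recent context entity of the same slot type.
    (A placeholder heuristic; the real model scores candidate entities.)"""
    filled = {}
    for arg_name, slot_type in required_args.items():
        for entity in reversed(context_entities):
            if entity["slot_type"] == slot_type:
                filled[arg_name] = entity["value"]
                break
    return filled

# Arguments of the InformMovieShowtimes response from the example.
print(fill_arguments({"time": "AMAZON.Time", "theaterName": "TheaterName"}))
# {'time': '10:00pm', 'theaterName': 'Downtown Seattle AMC'}
```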
Entity resolution
If action prediction predicts the API action type, the inference engine performs entity resolution after argument filling is complete. For each entity that fills an API argument, entity resolution searches against build-time entities (or runtime entities in the case of dynamic entities) and resolves phrases into canonical values if there is a match. The inference engine inserts the entity resolution result as a separate payload in the API-invoking request to the skill. For details, see Receiving requests.
For details on entity resolution, see Define Synonyms and IDs for Slot Type Values (Entity Resolution). For details on dynamic entities, see Use Dynamic Entities for Customized Interactions.
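In essence, entity resolution matches the filled phrase against the slot type's values and synonyms and, on a match, attaches the canonical value and its ID to the request sent to the skill. The following simplified sketch uses a made-up synonym catalog to show the idea:

```python
# Hypothetical slot-type catalog with canonical values, IDs, and synonyms, similar in
# spirit to what you define for entity resolution at build time.
movie_title_values = [
    {"id": "SW_EP4", "value": "Star Wars: A New Hope", "synonyms": ["Star Wars", "A New Hope"]},
    {"id": "SNATCH", "value": "Snatch", "synonyms": []},
]

def resolve(phrase, catalog):
    """Return the canonical value and ID if the phrase matches a value or synonym."""
    needle = phrase.lower()
    for entry in catalog:
        names = [entry["value"], *entry["synonyms"]]
        if any(needle == name.lower() for name in names):
            return {"phrase": phrase, "resolved": entry["value"], "id": entry["id"]}
    return {"phrase": phrase, "resolved": None, "id": None}  # no match: keep the raw phrase

print(resolve("Star Wars", movie_title_values))
# {'phrase': 'Star Wars', 'resolved': 'Star Wars: A New Hope', 'id': 'SW_EP4'}
```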
Related topics
- About Alexa Conversations
- Dialog Act Reference for Alexa Conversations
- Get Started with Alexa Conversations
Last updated: Nov 27, 2023