This patent describes a method for enabling more natural conversations with automated assistants. The core problem addressed is the rigid, turn-based nature of current automated assistant interactions, which doesn’t reflect how humans converse. The patented invention introduces a “soft endpointing” mechanism that allows the assistant to distinguish a user who has merely paused from one who has completed their utterance, enabling more fluid dialog. The system can provide natural conversational outputs like “Mmhmm” or proactively ask for clarification, rather than immediately fulfilling potentially incomplete requests, thus improving the user experience.
Main Themes and Important Ideas:
The Problem of Turn-Based Dialogs: The document highlights the limitations of traditional turn-based dialog sessions where the assistant only responds after a user completely stops speaking. This is deemed unnatural because humans often provide multiple utterances with pauses in between to convey a single thought, and understanding may require considering the context across these utterances. The document states: “However, these turn-based dialog sessions, from a perspective of the user, may not be natural since they do not reflect how humans actually converse with one another.” ([0002])
Soft Endpointing: The core innovation is the concept of “soft endpointing,” where the automated assistant doesn’t rigidly wait for a definitive end-of-speech signal. Instead, it analyzes audio-based characteristics and natural language understanding (NLU) output to determine if a user has merely paused or has finished speaking. This allows the assistant to react more intelligently to user input that might be fragmented or include natural pauses.
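As a rough illustration of this decision, the pause-versus-completion call can be framed as combining an audio-derived continuation score with an NLU-derived completeness score. The following Python sketch is an assumption about how such a combiner might look; none of the function names, scores, or thresholds come from the patent itself:

```python
from dataclasses import dataclass

@dataclass
class EndpointDecision:
    user_paused: bool    # treat input as a mid-utterance pause
    user_finished: bool  # treat the utterance as complete

def soft_endpoint(audio_continuation_score: float,
                  nlu_completeness_score: float,
                  pause_threshold: float = 0.6,
                  complete_threshold: float = 0.7) -> EndpointDecision:
    """Hypothetical combiner for soft endpointing.

    audio_continuation_score: likelihood the user will keep speaking,
        inferred from cues such as elongated syllables or intonation.
    nlu_completeness_score: how actionable the partial utterance is,
        e.g. whether all required slot values are already present.
    """
    # If the audio cues suggest more speech is coming, treat this as a
    # pause even when the fragment is already actionable.
    if audio_continuation_score >= pause_threshold:
        return EndpointDecision(user_paused=True, user_finished=False)
    # Otherwise, declare completion only if the request is actionable.
    actionable = nlu_completeness_score >= complete_threshold
    return EndpointDecision(user_paused=False, user_finished=actionable)
```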
Utilizing Audio-Based Characteristics: The system analyzes various audio cues like “intonation, tone, stress, rhythm, tempo, pitch, elongated syllables, pause, grammar(s) associated with pause” ([0005], [0039]) to infer the user’s intent and whether they are likely to continue speaking. Elongated syllables, for example, might indicate uncertainty and a pause for thought rather than the end of an utterance.
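As a minimal sketch of one such cue, elongated syllables can be flagged by comparing observed phone durations against typical durations; the inputs and the stretch factor below are illustrative assumptions, not values from the patent:

```python
def elongation_score(phone_durations_ms, typical_durations_ms,
                     stretch_factor=2.0, tail_len=3):
    """Fraction of the final phones that are unusually long.

    A high score on the trailing phones (e.g. the drawn-out 'l' in
    "Arnolllld's") suggests hesitation and a likely continuation,
    rather than the end of the utterance.
    """
    tail = list(zip(phone_durations_ms, typical_durations_ms))[-tail_len:]
    if not tail:
        return 0.0
    elongated = sum(1 for actual, typical in tail
                    if actual > stretch_factor * typical)
    return elongated / len(tail)
```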
Leveraging Natural Language Understanding (NLU): The NLU output, including predicted intents and slot values, is also used to determine if an utterance is complete enough for processing or if the user is likely to provide more information. For instance, an utterance with a missing required slot value might indicate a pause rather than completion.
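A sketch of that check, assuming a hypothetical intent schema that lists the required slots per intent:

```python
# Hypothetical schema: slots that must be filled before fulfillment
# of each intent can reasonably begin.
REQUIRED_SLOTS = {
    "make_reservation": {"restaurant", "party_size", "time"},
    "call_contact": {"contact"},
}

def utterance_looks_complete(intent: str, filled_slots: dict) -> bool:
    """True if every required slot for the predicted intent is filled.

    A missing required slot (e.g. no resolved contact for
    "call Arnolllld's") signals that the user has likely paused
    and more input is coming.
    """
    filled = {k for k, v in filled_slots.items() if v is not None}
    return REQUIRED_SLOTS.get(intent, set()).issubset(filled)
```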
Natural Conversation Output: When the system determines that the user has paused rather than completed the utterance, it refrains from initiating fulfillment (even where the NLU output would already support it) and instead provides “natural conversation output” to the user. Examples include “Mmhmm” or “Uhhuhh” ([0006], [0008]), which signal that the assistant is still listening and waiting for the user to continue, creating a more natural conversational flow. The document explains: “…refraining from initiating fulfillment of the spoken utterance and rather determining natural conversation output to be provided to the user to indicate the automated assistant is waiting for the user to complete providing the spoken utterance…” ([0006])
Fulfillment Initiation: The system initiates fulfillment (e.g., performing an action or providing a response) once it determines, from the audio-based characteristics and/or the NLU output, that the user has completed the utterance. A simplified turn handler combining the preceding signals is sketched below.
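In this sketch, `speak` and `fulfill` are injected placeholders for the assistant's speech output and fulfillment pipeline, not APIs named in the patent:

```python
import random

# Natural conversation outputs cited in the patent ([0006], [0008]).
BACKCHANNELS = ["Mmhmm", "Uhhuhh"]

def handle_turn(decision, intent, filled_slots, speak, fulfill):
    """Backchannel to keep the floor open, or fulfill the request."""
    if decision.user_paused:
        # The user appears mid-thought: signal that the assistant is
        # listening rather than acting on a possibly incomplete request.
        speak(random.choice(BACKCHANNELS))
    elif decision.user_finished:
        fulfill(intent, filled_slots)
```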
Handling Uncertainty and Ambiguity: The examples provided illustrate how the system handles cases where the user is unsure (e.g., “call Arnolllld’s”). The elongated syllable triggers soft endpointing, preventing a premature or incorrect action. The system can then use natural conversation output to give the user room to clarify (e.g., by saying “Arnold’s Trattoria”).
Managing Calendar Lookups with Incomplete Information: Another example (“what’s on my calendar forrrr”) shows how the system can react to incomplete date information. It might still provide a response based on an inferred current date while remaining ready for further clarification or interruption from the user.
Visual and Other Feedback: Besides audible natural conversation output, the system can also use visual cues (e.g., streaming transcription with bouncing ellipses on a display) or LED illumination to indicate it is waiting for the user to continue speaking.
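A toy sketch of the display-side counterpart, appending an animated ellipsis to the streaming transcription while the assistant waits (the `render` callback is invented for illustration):

```python
import itertools
import time

def show_waiting(transcript: str, render, frames=6, interval_s=0.3):
    """Cycle a 'bouncing' ellipsis after the partial transcript."""
    for dots in itertools.islice(itertools.cycle([".", "..", "..."]), frames):
        render(f"{transcript} {dots}")  # update the device display
        time.sleep(interval_s)
```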
Partial Fulfillment: The document also mentions the possibility of “partially fulfilling” requests while waiting for the user to complete their utterance to reduce latency. This could involve establishing connections or pre-processing data.
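One way to realize this, sketched here under the assumption of an async fulfillment stack, is to start the latency-heavy setup as soon as an intent is predicted, while the endpointing decision is still pending; all names below are illustrative:

```python
import asyncio

async def open_telephony_channel():
    # Stand-in for the latency-heavy step of establishing a call
    # connection; the 0.5 s delay merely simulates that cost.
    await asyncio.sleep(0.5)
    return "telephony-channel"

async def prewarm(intent: str):
    """Partial fulfillment: begin setup early, but hold the result."""
    if intent == "call_contact":
        return await open_telephony_channel()
    return None

async def on_partial_utterance(intent, wait_for_completion):
    # Kick off setup immediately; the user never sees its effects
    # unless the utterance is later confirmed complete.
    setup = asyncio.create_task(prewarm(intent))
    if await wait_for_completion():  # soft-endpointing decision
        return await setup           # reuse the pre-warmed work
    setup.cancel()                   # user changed course; discard it
```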
Temporal Considerations: The system can incorporate time thresholds to determine when to provide natural conversation output and how long to wait for the user to resume speaking before prompting or initiating fulfillment.
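These thresholds can be expressed as a simple timer ladder; the specific durations below are invented for illustration:

```python
import time

CUE_AFTER_S = 1.5      # pause length before emitting a backchannel
FULFILL_AFTER_S = 4.0  # pause length before treating the turn as done

def monitor_pause(pause_started_at, user_resumed, speak, on_timeout):
    """Poll a pause: backchannel first, then fall back to fulfillment.

    user_resumed is a callable returning True once new speech arrives;
    speak and on_timeout stand in for the assistant's outputs.
    pause_started_at is a time.monotonic() timestamp.
    """
    cued = False
    while not user_resumed():
        elapsed = time.monotonic() - pause_started_at
        if not cued and elapsed >= CUE_AFTER_S:
            speak("Mmhmm")   # signal: still listening
            cued = True
        if elapsed >= FULFILL_AFTER_S:
            on_timeout()     # prompt for the missing info or fulfill
            return
        time.sleep(0.05)
```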
Key Examples:
“call Arnolllld’s” followed by “Arnold’s Trattoria”:
This illustrates how soft endpointing, triggered by the elongated syllable, prevents a premature call to a generic “Arnold” and allows the user to specify the intended contact. The assistant might provide a “Mmhmm” to indicate it’s listening.
“what’s on my calendar forrrr” followed by a pause:
This demonstrates how the system might provide a natural conversation cue and then potentially fulfill the request with an inferred date (e.g., today) if the user doesn’t provide more information, balancing efficiency with potential for incorrect assumptions.
“Assistant, make a reservation tonight at Arnold’s Trattoria for six people” followed by a pause:
Here, even with missing information (time), the assistant might initiate the reservation process and prompt the user for the missing slot value using a more explicit natural conversation output like “For what time?”.