chromium/docs/accessibility/os/select_to_speak.md

# Select to Speak (for developers)

Select to Speak is a Chrome OS feature to read text on the screen out loud.


There are millions of users who greatly benefit from some text-to-speech but
don’t quite need a full screen reading experience where everything is read
aloud each step of the way. For these users, whether they are low vision,
dyslexic, neurologically diverse, or simply prefer to listen to text read
aloud instead of visually reading it, we have built Select-to-Speak.

## Using Select to Speak

Go to Chrome settings, Accessibility settings, “Manage accessibility Features”,
and enable “Select to Speak”. You can adjust the preferred voice, highlight
color, and access text-to-speech preferences from the settings page.

With this feature enabled, you can read text on the screen in one of three ways:

- Hold down the Search key, then use the touchpad or external mouse to tap or
drag a region to be spoken

- Tap the Select-to-Speak icon in the status tray and use the mouse or
touchscreen to select a region to be spoken

- Highlight text and use Search+S to speak only the selected text.

Read more on the
[Chrome help page](https://support.google.com/chromebook/answer/9032490?hl=en)
under “Listen to part of a page”.

## Reporting bugs

Use bugs.chromium.org, filing bugs under the component
[UI>Accessibility>SelectToSpeak](https://bugs.chromium.org/p/chromium/issues/list?sort=-opened&colspec=ID%20Pri%20M%20Stars%20ReleaseBlock%20Component%20Status%20Owner%20Summary%20OS%20Modified&q=component%3AUI%3EAccessibility%3ESelectToSpeak%20&can=2).

## Developing

*Select to Speak will be abbreviated STS in this section.*

### Code location

STS code lives mainly in three places:

- A component extension to do the bulk of the logic and processing,
chrome/browser/resources/chromeos/accessibility/select_to_speak/

- An event handler, ash/events/select_to_speak_event_handler.h

- The status tray button, ash/system/accessibility/select_to_speak/select_to_speak_tray.h

- Floating panel, system/accessibility/select_to_speak_menu_bubble_controller.h

In addition, there are settings for STS in
chrome/browser/resources/ash/settings/os_a11y_page/select_to_speak_subpage.*

### Tests

Tests are in ash_unittests and in browser_tests:

```
out/Release/ash_unittests --gtest_filter=”SelectToSpeak*”
out/Release/browser_tests --gtest_filter=”SelectToSpeak*”
```
### Debugging

Developers can add log lines to any of the C++ files and see output in the
console. To debug the STS extension, the easiest way is from an external
browser. Start Chrome OS on Linux with this command-line flag:

```
out/Release/chrome --remote-debugging-port=9222
```

Now open http://localhost:9222 in a separate instance of the browser, and
debug the Select to Speak extension background page from there.

## How it works

Like [Chromevox](chromevox.md), STS is implemented mainly as a component
Chrome extension which is always loaded and running in the background when
enabled, and unloaded when disabled. The only STS code outside of the
extension is an EventRewriter which forwards keyboard and mouse events to
the extension as needed, so that the extension can get events systemwide.

The STS extension does the following, at a high level:

1. Tracks key and mouse events to determine when a user has either:

    a. Held down “search” and clicked & dragged a rectangle to specify a
    selection

    b. Used “search” + “s” to indicate that selected text should be read

    c. Has requested speech to be canceled by tapping ‘control’ or ‘search’
    alone

2. Determines the Accessibility nodes that make up the selected region

3. Sends utterances to the Chrome Text-to-Speech extension to be spoken

4. Tracks utterance progress and updates the focus ring and highlight as needed.

### Select to Speak extension structure

Most STS logic takes place in
[select_to_speak.js](https://cs.chromium.org/chromium/src/chrome/browser/resources/chromeos/accessibility/select_to_speak/select_to_speak.js).

#### User input

Input to the extension is handled by input_handler.js, which handles user
input from mouse, keyboard, and touchscreen events. Most logic here revolves
around keeping track of state to see if the user has requested text using
one of the three ways to activate the feature, search + mouse, tray button
+ mouse, or search + s.

#### Determining selected content

Once input_handler determines that the user did request text to be spoken,
STS must determine which part of the page to read. To do this it requests
information from the Automation API, and then generates a list of
AutomationNodes to be read.

##### With mouse or touchpad

select_to_speak.js fires a HitTest to the Automation API at the center of
the rect selected by the user. When the API gets a result it returns via
SelectToSpeak.onAutomationHitTest_. This function walks up from the hit
test node to the nearest container to find a root, then back down through
all the root’s children to find ones that overlap with the selected rect.
Walking back down through the children occurs in NodeUtils.findAllMatching,
and results in a list of AutomationNodes that can be sent for speech.

If the rect size is below a certain threshold, all nodes within overlapped
block parent are selected.

##### With search + s

select_to_speak.js requests focus information from the Automation API. The
focus result is sent to SelectToSpeak.requestSpeakSelectedText_, which
uses Automation selection to determine which nodes are selected. The
complexity of logic here is converting between Automation selection and
its deep equivalent, i.e. from parent nodes and offsets to their leaves.
This occurs in NodeUtils.getDeepEquivalentForSelection. When the first and
last nodes in selection are found, SelectToSpeak.readNodesInSelection_ is
used to determine the entire list of AutomationNodes which should be sent
for speech.

#### Speaking selected content

SelectToSpeak.startSpeechQueue_ takes a list of AutomationNodes, determines
their text content, and sends the result to the Text to Speech API for
speech. It begins by mapping the text content of the nodes to the nodes
themselves, so that STS can speak smoothly across node boundaries (i.e.
across line breaks) and follow speech progress with a highlight. The mapping
between text and nodes occurs in repeated calls to
ParagraphUtils.buildNodeGroup to build lists of nodes that should be spoken
smoothly.


Each node group is sent to the Text to Speech API, with callbacks to allow
for speech progress tracking, enabling the highlight to be dynamically
updated with each word.

#### Highlighting content during speech

On each word boundary event, the TTS API sends a callback which is handled
by SelectToSpeak.onTtsWordEvent_. This is used to check against the list of
nodes being spoken to see which node is currently being spoken, and further
check against the words in the node to see which word is spoken.

#### Edge cases

STS must also handle cases where:

- Nodes become invalid during speech, i.e. if a page was closed. Speech
should continue, but highlight stops.

- Nodes disappear and re-appear during speech (a user may have switched
tabs and switched back, or scrolled). Highlight should resume.

This occurs in SelectToSpeak.updateFromNodeState_.

### Communication with SelectToSpeakTray

STS runs in the extension process, but needs to communicate its three states
(Inactive, Selecting, and Speaking) to the STS button in the status tray.
It also needs to listen for users requesting state change using the
SelectToSpeakTray button. The STS extension uses the AccessibitilityPrivate
method setSelectToSpeakState to inform the SelectToSpeakTray of a
status change, and listens to onSelectToSpeakStateChangeRequested to know
when a user wants to change state. The STS extension is the source of truth
for STS state.

### Navigation features

STS will display a floating control panel when activated. The control panel
hosts controls for pause/resume, updating reading speed, navigating by sentence
or paragraph, and deactivating STS.

#### Floating control panel

The panel is implemented as a native ASH component
[select_to_speak_menu_bubble_controller.h](https://source.chromium.org/chromium/chromium/src/+/main:ash/system/accessibility/select_to_speak/select_to_speak_menu_bubble_controller.h).
Similar to focus rings, the STS component extension communicates with the panel
via the `chrome.accessibilityPrivate` API. The
`chrome.accessibilityPrivate.updateSelectToSpeakPanel` API controls the
visibility and button states, and panel actions are communicated back to the
extension by adding a listener to
`chrome.accessibilityPrivate.onSelectToSpeakPanelAction`.

When the panel is displayed, STS will no longer dismiss itself when TTS
playback is complete. The user must quit STS either from the panel or
the tray button.

##### Keyboard shortcuts

When the panel is displayed, it is initially focused and captures keypresses to
implement keyboard shortcuts:

*  Space - activates currently focused button, which is 'Pause/Resume'
   initially.
*  Left Arrow - Navigate to previous sentence (for RTL languages, this is Right
   Arrow)
*  Right Arrow - Navigate to next sentence (for RTL languages, this is Left
   Arrow)
*  Up Arrow - Navigate to previous paragraph
*  Down Arrow - Navigate to next paragraph

If the panel loses focus, keyboard shortcuts will no longer work. User can press
Search+S keyboard shortcut (with no text selection) to restore focus to the
panel.

##### Disallowed nodes

The panel is not shown when STS is activated on nodes where navigation features
do not add value, such as in system UI or top-level windows.

*  System UI nodes - any nodes that have a `root` with role `desktop`
*  Root nodes that are children of the root `desktop` node

#### Pause/Resume

Since `chrome.tts.pause` and `chrome.tts.resume` are not consistently
implemented across all TTS engines, STS implements pause/resume functionality
using the `chrome.tts.stop` and `chrome.tts.speak` APIs. While TTS is playing,
STS keeps track of the current word offset, and when TTS is resumed, it will
call `speak` with text trimmed to the start of the last spoken word.

Resuming TTS behaves differently depending on the context:

*  If TTS was paused within the user-selected text, resuming will play until
   the end of the selected text.
*  If TTS stopped when it reached the end of the selected text, but before the
   end of the paragraph, resuming will continue from that point to the end of
   the paragraph.
*  If TTS stopped when it reached the end of a paragraph, resuming will speak
   the next paragraph.

#### Paragraph navigation

Users can navigate to adjacent paragraphs from the current block parent when
Select-to-speak is active. A 'paragraph' is any block element as defined by
[ParagraphUtils.isBlock](https://source.chromium.org/chromium/chromium/src/+/main:chrome/browser/resources/chromeos/accessibility/select_to_speak/paragraph_utils.js)
and the navigation occurs in DOM-order.

#### Sentence navigation

Paragraphs are split into sentences based on the `sentenceStarts` property of
an AutomationNode. Users can skip to previous and next sentences using similar
technique as pause/resume (`stop` then `speak` with trimmed text). See
[sentence_utils.js](https://source.chromium.org/chromium/chromium/src/+/main:chrome/browser/resources/chromeos/accessibility/select_to_speak/sentence_utils.js)
for logic on breaking node groups into sentences.

#### Reading speed

Users can slow down or speed up TTS speaking rate using the floating control
panel. The rate the user selects in the panel is multiplied by the system
default TTS rate. So if the user selects 1.2x reading speed in the panel and
has a system default of 2.0x, the effective TTS rate will be 2.4x.

When users adjust reading speed, `chrome.tts.stop` is called, and
`chrome.tts.speak` is then called with text trimmed to the current word
position, passing in the new effective TTS rate as an option.

### Special case: Google Drive apps

Google Drive apps require a few work-arounds to work correctly with STS.

- Any time a Google Drive document is loaded (such as a Doc, Sheet or Slides
document), the script
[select_to_speak_gdocs_script](https://cs.chromium.org/chromium/src/chrome/browser/resources/chromeos/accessibility/select_to_speak/select_to_speak_gdocs_script.js?q=select_to_speak_gdocs_script.js+file:%5Esrc/chrome/browser/resources/chromeos/accessibility/select_to_speak/+package:%5Echromium$&dr)
must be executed to remove aria-hidden from the content container.

- Using search+s to read highlighted text uses the clipboard to get text data
from Google Docs, as selection information may not be available in the
Automation API. This happens mostly in input_handler.js.

### Enhanced network voices

As of M94, Select-to-speak supports natural, server-generated voices. When
enhanced network voices are enabled, Select-to-speak passes the user's selected
natural voice name to `chrome.tts.speak`. The TTS request is handled by the
[Enhanced Network TTS engine](https://source.chromium.org/chromium/chromium/src/+/main:chrome/browser/resources/chromeos/accessibility/enhanced_network_tts/).
The TTS engine then passes the request to native code
([ EnhancedNetworkTts](https://source.chromium.org/chromium/chromium/src/+/main:chromeos/ash/components/enhanced_network_tts/enhanced_network_tts_impl.h)),
which in turn sends a network request to the ReadAloud API, which produces
synthesized audio.

For instructions on how to add new voices, see
[go/chromeos-natural-voices](go/chromeos-natural-voices).


## For Googlers

For more, Googlers could check out the Select to Speak feature design docs
for more details on design as well as UMA.

- Overall product design, [go/select-to-speak-design](go/select-to-speak-design)

- On-Screen UI for touch and tablet modes,
[go/chromeos-sts-on-screen-ui](go/chromeos-sts-on-screen-ui)

- Reading text at keystroke,
[go/chromeos-sts-selection-keystroke](go/chromeos-sts-selection-keystroke)

- Reading text at keystroke in Google Drive apps, [go/sts-selection-in-drive](go/sts-selection-in-drive)

- Per word highlighting,
[go/chrome-sts-sentences-and-words](go/chrome-sts-sentences-and-words) and
[go/chromeos-sts-highlight](go/chromeos-sts-highlight)

- Navigation features, [go/enhanced-sts-dd](go/enhanced-sts-dd)

- Enhanced network voices, [go/wavenet-chromeos-dd](go/wavenet-chromeos-dd)