TABLE OF CONTENTS
- What is TTS?
- How does TTS work?
- Could we manage text-to-speech at title level?
- How is TTS different from Audiobooks or using a Screen Reader?
- Why would someone choose to use TTS over an audiobook or using the screen reader?
- TTS vs SMIL vs Screen Readers
- Kobo TTS Demo
What is TTS?
Text-to-speech (TTS) is a reading system feature that allows users to listen to book content through computer-generated audio. TTS is an essential feature for many readers with disabilities, particularly those with vision or cognitive disabilities.
How does TTS work?
TTS is built into the reading system as an optional feature for users to activate and customize to their needs. When TTS is turned on, the underlying engine ingests the book content and generates an audio version of the content that is then read out on the device. The audio version of the content is generated on the device and only when requested by the user. The audio version is not saved; it is always generated as needed.
In Kobo's implementation, we will be using the voice options provided to us by the platforms the applications are built for. This means our iOS app will use the default voices built into iOS, Android will use the voices built into Android, and so on. This method will give us the best opportunity to support a wide variety of languages and preferences, as most platforms have built-in voice options for multiple languages and locales. This method also means this feature will be able to work online and offline.
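To illustrate how a reading system might match a book's language to the voices a platform offers, here is a minimal sketch in Python. It is not Kobo's implementation: the function name `pick_voice` and the list of installed voice tags are hypothetical, and the fallback rule (exact language and region first, then any voice in the same primary language) is an assumption made for the example.

```python
from typing import List, Optional


def pick_voice(book_language: str, installed_voices: List[str]) -> Optional[str]:
    """Return the installed voice tag that best matches the book's language tag.

    Hypothetical helper: real platforms expose their own voice lists and APIs.
    """
    wanted = book_language.strip().lower().replace("_", "-")
    by_tag = {voice.lower(): voice for voice in installed_voices}

    # 1. Prefer an exact language + region match, e.g. "zh-tw" -> "zh-TW".
    if wanted in by_tag:
        return by_tag[wanted]

    # 2. Otherwise accept any voice with the same primary language,
    #    e.g. "fr-CA" -> "fr-FR" when no Canadian French voice is installed.
    primary = wanted.split("-")[0]
    for voice in installed_voices:
        if voice.lower().split("-")[0] == primary:
            return voice

    # 3. No suitable voice is installed on this device.
    return None


voices = ["en-US", "en-GB", "fr-FR", "zh-TW"]  # assumed platform voice list
print(pick_voice("zh-tw", voices))  # zh-TW
print(pick_voice("fr-CA", voices))  # fr-FR
print(pick_voice("de", voices))     # None
```

Because the voices come from the device itself, a lookup like this needs no network access, which is consistent with the feature working offline.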
Could we manage text-to-speech at title level?
As per the extract below from the Marrakesh Treaty, disabling TTS will be prohibited in the EU.
From the Official text of the Marrakesh Treaty – EAA (European Accessibility Act):
« The interoperability in terms of accessibility should optimize the compatibility of those files with the user agents and with current and future assistive technologies. […] It is recognized that persons with disabilities continue to face barriers to accessing content which is protected by copyright and related rights, and that certain measures have already been taken to address this situation for example through the adoption of Directive »
And, in the section “Accessibility requirements for Products and Services” (Section I):
« (ii) e-readers shall provide for text-to-speech technology; »
And, in Section IV, “Additional accessibility requirements related to specific services”:
« (ii) ensuring that e-book digital files do not prevent assistive technology from operating properly; […] (iv) allowing alternative renditions of the content and its interoperability with a variety of assistive technologies, in such a way that it is perceivable, understandable, operable and robust; (v) making them discoverable by providing information through metadata about their accessibility features; (vi) ensuring that digital rights management measures do not block accessibility features. »
As such, you can manage TTS at the title level only outside the EU, using the following ONIX element:
<EpubUsageConstraint>
  <EpubUsageType>05</EpubUsageType><!-- List 145 (Usage type): 05 = Text-to-speech -->
  <EpubUsageStatus>01</EpubUsageStatus><!-- List 146 (Usage status): 01 = Permitted unlimited -->
</EpubUsageConstraint>
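For readers who want to check what a given ONIX record says about TTS, the following Python sketch reads the constraint with the standard library. It is an illustration under stated assumptions, not a description of how Kobo ingests ONIX: the function name `tts_status` is hypothetical, the fragment is assumed to use ONIX reference tag names without an XML namespace, and only three List 146 status codes are mapped.

```python
import xml.etree.ElementTree as ET
from typing import Optional

USAGE_TYPE_TTS = "05"  # List 145: text-to-speech

# List 146 status codes covered by this sketch; others are reported as unknown.
USAGE_STATUS = {
    "01": "permitted unlimited",
    "02": "permitted subject to limit",
    "03": "prohibited",
}

# Assumed sample record fragment in which the publisher prohibits TTS.
SAMPLE = """
<DescriptiveDetail>
  <EpubUsageConstraint>
    <EpubUsageType>05</EpubUsageType>
    <EpubUsageStatus>03</EpubUsageStatus>
  </EpubUsageConstraint>
</DescriptiveDetail>
"""


def tts_status(onix_fragment: str) -> Optional[str]:
    """Return the TTS usage status stated in the fragment, or None if absent."""
    root = ET.fromstring(onix_fragment)
    # iter() also visits the root, so a bare <EpubUsageConstraint> works too.
    for constraint in root.iter("EpubUsageConstraint"):
        if constraint.findtext("EpubUsageType") == USAGE_TYPE_TTS:
            code = constraint.findtext("EpubUsageStatus")
            return USAGE_STATUS.get(code, f"unknown status code {code}")
    return None  # the record says nothing about text-to-speech


print(tts_status(SAMPLE))  # prohibited
```

A record with no text-to-speech constraint returns `None`, which the sketch keeps distinct from an explicit "prohibited" so that the caller decides how to treat missing data.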
How is TTS different from Audiobooks or using a Screen Reader?
TTS is a reading system feature for ebooks that reads the text content of the book using computer-generated audio. Audiobooks are an audio-only version of a book, usually read by human narrator(s). Screen readers are a type of assistive technology used by people with disabilities to perceive, navigate, and operate software by rendering the interface in audio or tactile (braille) formats. Each one has different use cases and serves different needs for users. Some users will use just one, others will switch between all three.
TTS most commonly reads the text one word at a time at a speed and voice selected by the user. The word that is being read is often highlighted so users who are looking at the screen can follow along. TTS can also be configured to read out additional content like image descriptions, footnotes, or print page numbers. Users may choose a voice option that suits their preferences, matches the gender of the book's narrator, or can pronounce the content in their language correctly.
Screen readers also use computer-generated speech, similar to TTS, and will highlight words or phrases as they are being read. Screen readers have significantly more features, such as allowing the user to select text, spell out words, and make annotations. Screen readers also enable the user to control the UI, and provide content like incoming notifications, alerts, and other device functionality.
Why would someone choose to use TTS over an audiobook or using the screen reader?
There are a variety of reasons a user may choose to use TTS over the existing options. First, not all ebooks have audiobook equivalents, so users who require or strongly prefer audio may have no audiobook option available to them.
Screen readers are powerful tools for users with vision and cognitive disabilities, but they are also complex and take practice to learn and use. Screen reader users also use TTS features when available, depending on the type of reading they are doing. Kobo has conducted informational interviews with users with reading disabilities and many of them used TTS, screen readers, and audiobooks interchangeably. A screen reader user who is reading a book for entertainment may use TTS or an audiobook, depending on availability, because they just want to listen to the content. They may switch to their screen reader for non-fiction or educational content in order to interact through annotations or other features.
Readers with cognitive disabilities like dyslexia or ADHD may use TTS as an aid while reading the text in order to help them focus and understand the content. TTS has also been cited as helpful for children as they learn to read, and for language learners trying to build comprehension skills.
TTS vs SMIL vs Screen Readers
What is the difference between text-to-speech, SMIL, and screen readers for books?
All three features provide the same functionality to the reader: text content that is transformed into audio. All three features can be used interchangeably by readers as well; it's entirely possible a screen reader user may occasionally use text-to-speech, or intentionally purchase a book with SMIL (media overlays) included. However, the features differ from one another in how they work, who provides the functionality, and the reader experience.
Text-to-speech, or TTS, is a feature provided by the reading system that takes the textual content of an EPUB and feeds it into a speech engine to produce an audio version of the text. TTS experiences can be built in a variety of ways, and may include features like changing the "voice" of the audio, adjusting the speed, providing a visual indicator of the reading position, and customizing what parts of the book are included (e.g. image descriptions, page numbers). There are a wide variety of speech engines out there, and many operating systems for mobile devices and computers have built-in options. TTS speech engines produce audio that generally sounds computer-generated.
SMIL, or media overlays, is a feature of EPUB that allows EPUB creators to include synchronized audio content alongside the textual content of a book. SMIL has a number of features, including highlighting the text as the audio is played and auto-page turning. The EPUB creator has full control of and is responsible for providing the audio, text, and highlighting styles when using SMIL in their books. EPUB files that include SMIL only play the synchronized audio in a reading system that supports SMIL functionality.
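To make the synchronization idea concrete, the Python sketch below parses a small media overlay document and lists which text fragment is paired with which audio clip. The sample overlay, its file names, and the function name `list_sync_points` are invented for illustration; only the SMIL namespace and the `par`, `text`, and `audio` elements with `src`, `clipBegin`, and `clipEnd` attributes follow the EPUB Media Overlays format.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"

# Assumed example overlay: two sentences mapped to clips of one audio file.
SAMPLE_OVERLAY = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="par1">
      <text src="chapter1.xhtml#sentence1"/>
      <audio src="audio/chapter1.mp3" clipBegin="0s" clipEnd="3.2s"/>
    </par>
    <par id="par2">
      <text src="chapter1.xhtml#sentence2"/>
      <audio src="audio/chapter1.mp3" clipBegin="3.2s" clipEnd="7.5s"/>
    </par>
  </body>
</smil>"""


def list_sync_points(smil_xml):
    """Return (text fragment, audio file, clip begin, clip end) for each <par>."""
    root = ET.fromstring(smil_xml)
    points = []
    for par in root.iter(SMIL_NS + "par"):
        text = par.find(SMIL_NS + "text")
        audio = par.find(SMIL_NS + "audio")
        if text is None or audio is None:
            continue  # skip pairs that are missing either half
        points.append(
            (text.get("src"), audio.get("src"),
             audio.get("clipBegin"), audio.get("clipEnd"))
        )
    return points


for fragment, audio_file, begin, end in list_sync_points(SAMPLE_OVERLAY):
    print(f"{fragment} plays {audio_file} from {begin} to {end}")
```

Each pair is authored by the EPUB creator, which is why the audio, the text, and their timing are the creator's responsibility rather than the reading system's.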
Screen readers are utilities that are either built into the operating system of the device, such as VoiceOver on macOS/iOS, or provided through software that can be downloaded or purchased by the user, such as NVDA or JAWS. Screen readers communicate visual content through audio by generating a textual view of the screen or application based on its contents and whether they are correctly marked up. Screen readers allow users to read books similarly to TTS, and the audio output is similarly "robotic" sounding. Screen readers also offer a number of advanced features that differentiate them from TTS, such as the ability to navigate and control interactive elements in an application, the ability to navigate content at different levels (headings, paragraphs, words, letters), and user control through customization.
All three features have been around for a long time, with very little change. The biggest change has been the evolution and improvement of available voice engines for TTS. Newer speech engines, especially those that use AI, have been developed to have more "natural" sounding speech output, to replicate well-known voices, and to better handle things like regional accents and pronunciation. While these services have grown in popularity, one drawback to these improved engines is the cost of generating the audio: the content needs to be processed through specialized platforms, or the speech engine has to be downloaded to a device for a fee. Specialized platforms often charge per word, which can be costly at scale. Relying on these platforms also limits what readers can do when offline.
TTS, SMIL, and screen readers are not substitutes for one another. A reader may use one or more of them at the same time or interchangeably, but that does not mean one replaces the need for another.
Kobo TTS Demo
https://cdn.kobo.com/downloads/help_videos/FAQ/Kobo_tts_demo_12-12-2024.mp4
https://cdn.kobo.com/downloads/help_videos/FAQ/Kobo_tts_1min_demo.mp4
Note: The default language of text-to-speech is pulled from the EPUB metadata. The metadata can also provide the region, as in the case of Taiwanese Chinese, “zh-tw”. Based on what we receive, we set the default language as follows: