Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

Mike Young - Apr 11 - Dev Community

This is a Plain English Papers summary of a research paper called Conformer-Based Speech Recognition On Extreme Edge-Computing Devices. If you like these kinds of analysis, you should subscribe to the AImodels.fyi newsletter or follow me on Twitter.

Introduction

The paper discusses optimizations to enable end-to-end (E2E) fully neural network-based automatic speech recognition (ASR) systems on resource-constrained devices like smartphones, wearables, and home automation devices. E2E ASR systems have several advantages over conventional hybrid hidden Markov model-based ASR, including a simplified training procedure, better word error rate performance, and the ability to better utilize hardware accelerators like GPUs, TPUs, and Apple's Neural Engine for high throughput and energy efficiency.

Because they avoid the computationally expensive decoders of conventional ASR systems, E2E ASR systems can operate fully offline on the device, saving cloud computing resources and providing stronger user privacy. However, deploying these systems on resource-constrained devices presents challenges due to hardware limitations.

The paper explores multidisciplinary solutions, such as memory-aware network transformation, model structural adjustment, and numerical optimizations to address inference stability. It specifically focuses on leveraging the inference efficiency of specialty hardware accelerators.

The authors derive a theory to numerically stabilize the computation of layer normalization on hardware accelerators. This stabilization technique does not require model retraining and is applicable to the computation of any Lp-norm.
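
To see why this can work without retraining, note that layer normalization is invariant to positive rescaling of its input. A sketch of the underlying identity (generic notation; the paper's derivation treats the epsilon term and the optimal scaling constant more carefully):

$$
\mathrm{LN}\!\left(\frac{x}{s}\right) = \frac{x/s - \mu(x)/s}{\sqrt{\sigma^2(x)/s^2 + \epsilon}} \approx \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} = \mathrm{LN}(x), \qquad s > 0,
$$

and since $\lVert x/s \rVert_p = \lVert x \rVert_p / s$, the same rescaling argument carries over to any Lp-norm: choosing $s$ so the rescaled entries stay inside the low-precision format's range keeps intermediate squares from overflowing without changing the output.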

Prior Work

Improving the computational efficiency of the Transformer architecture has garnered significant interest. The paper highlights several notable works that optimize the model architecture itself.

Linear Transformer (Katharopoulos et al. 2020) is a key technique that addresses the computationally expensive softmax function within the attention mechanism, which is also susceptible to numerical overflow issues. Alternative normalization methods, such as those discussed in Hoffer et al. (2018) and Zhang and Sennrich (2019), are explored to enhance computational efficiency and numerical stability in low-precision environments.
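
As a concrete illustration of the linear-attention idea, here is a minimal, non-causal sketch in PyTorch, assuming the elu+1 feature map proposed by Katharopoulos et al. (illustrative only, not the implementation used in the paper under discussion):

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: softmax(QK^T)V is replaced by phi(Q)(phi(K)^T V),
    cutting cost from O(S^2) to O(S) in sequence length and avoiding the
    overflow-prone exponential entirely.  q, k, v: (batch, seq, dim)."""
    phi = lambda t: F.elu(t) + 1                       # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bsd,bse->bde", k, v)            # key/value summary
    z = 1.0 / (torch.einsum("bsd,bd->bs", q, k.sum(dim=1)) + eps)
    return torch.einsum("bsd,bde,bs->bse", q, kv, z)   # normalized output
```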

The paper also builds on the principles for optimizing Transformers on Apple hardware outlined in Apple (2022), which generalize to similar devices.

In the domain of speech recognition, Squeezeformer (Kim et al. 2022) is a seminal work that focuses on efficiency optimization, particularly for the Conformer architecture. It employs depthwise separable convolution subsampling, inspired by MobileNet (Howard et al. 2017), to substantially reduce computation.

The majority of prior work improves efficiency by modifying the model architecture, which requires retraining the model to realize the gains. In contrast, this paper focuses on post-training, inference-only optimizations, avoiding model retraining whenever possible.

Model Architecture

The paper describes the foundational architecture of their model, which is built on the Conformer connectionist temporal classification (CTC) automatic speech recognition (ASR) system. The key components are:

Acoustic Encoder:

  • Alternately stacks transformer and convolution layers to capture both long-term dependencies and local patterns in the speech frames.
  • Uses relative sinusoidal positional encoding in the transformer layers.
  • Adopts a chunk-based attention strategy, balancing accuracy against dependence on future audio frames for streaming on edge devices (sketched below).
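
A minimal sketch of what a chunk-based attention mask can look like, written in PyTorch with hypothetical chunk sizes (the paper's exact chunking configuration is not given in this summary):

```python
import torch

def chunk_attention_mask(seq_len, chunk_size, num_left_chunks=1):
    """Each frame attends to frames in its own chunk plus a limited number of
    past chunks, so the encoder never waits on far-future audio.
    Returns a (seq_len, seq_len) boolean mask; True = may attend."""
    chunk_idx = torch.arange(seq_len) // chunk_size       # chunk id per frame
    q, k = chunk_idx.unsqueeze(1), chunk_idx.unsqueeze(0)
    return (k <= q) & (k >= q - num_left_chunks)

# 8 frames, chunks of 4: frames 0-3 see only chunk 0; frames 4-7 see chunks 0-1.
print(chunk_attention_mask(8, 4))
```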

Objective Functions:

  • Trains the model with a multitask objective combining Connectionist Temporal Classification (CTC) and an Attention-based Encoder-Decoder (AED); see the sketch after this list.
  • Uses only CTC for decoding.
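
A minimal sketch of such a joint objective in PyTorch; the interpolation weight and the `decoder` module are hypothetical placeholders, not values from the paper:

```python
import torch.nn.functional as F

def multitask_loss(log_probs, enc_out, targets, input_lens, target_lens,
                   decoder, ctc_weight=0.3):
    """Joint CTC + attention (AED) training objective.
    log_probs: (T, B, V) frame-level encoder outputs for the CTC branch;
    decoder: an attention decoder producing (B, U, V) logits for the AED branch.
    At inference time only the CTC branch is used for decoding."""
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    dec_logits = decoder(enc_out, targets)
    aed = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return ctc_weight * ctc + (1.0 - ctc_weight) * aed
```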

To reduce computational requirements for resource-constrained devices, the paper substitutes the vanilla 2D convolution in the subsampling module with depthwise separable convolution. This change shrinks the subsampling module's share of the total computation from 32.8% to 4.0% while maintaining word error rate (WER) performance.

The paper notes that while depthwise separable convolution is known for its computational efficiency and small memory footprint, its effect on reducing the dynamic range of outputs needs further study. Reducing dynamic range is important for low-precision hardware accelerators in edge devices to avoid overflow issues.
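
For intuition, here is a minimal PyTorch sketch of a depthwise separable subsampling block; channel counts, kernel size, and stride are illustrative rather than the paper's configuration:

```python
import torch.nn as nn

class DWSSubsampling(nn.Module):
    """Depthwise separable replacement for vanilla Conv2d subsampling: a KxK
    depthwise conv (one filter per input channel) followed by a 1x1 pointwise
    conv, costing roughly 1/C_out + 1/K^2 of the vanilla conv's multiply-adds."""
    def __init__(self, in_ch=1, out_ch=256, k=3, stride=2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride=stride,
                                   padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU()

    def forward(self, x):              # x: (B, 1, T, F) log-mel features
        return self.act(self.pointwise(self.depthwise(x)))
```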

Performance Optimizations

The section describes specific optimizations implemented to achieve high-performance execution of the modified Conformer model on smartphones and wearable devices. It follows four principles for optimizing transformers on the Apple Neural Engine (ANE):

  1. Picking the right data format: Using the (B, C, 1, S) {Batch, Channel, 1, Sequence} tensor shape to align with ANE's architecture.

  2. Chunking large intermediate tensors: Using split and concatenation operations to divide tensors into smaller chunks for better cache utilization.

  3. Minimizing memory copies: Minimizing reshape and transpose operations, and representing batch matrix multiplications using Einstein summation layers.

  4. Handling bandwidth-boundness: Carefully benchmarking performance with different batch sizes and sequence lengths to account for memory fetch costs.

The authors adhered to these principles, including transposing inputs, using split/concat, and replacing batch matrix multiplications with Einstein summations.
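
A rough sketch of what principles 1-3 look like in practice (PyTorch is used here purely for illustration; the deployed model runs through Apple's toolchain):

```python
import torch

B, C, S = 1, 256, 128                          # batch, channels, sequence length
x = torch.randn(B, S, C)                       # conventional (B, S, C) layout

# Principle 1: keep activations as (B, C, 1, S), the layout the ANE prefers.
x_ane = x.transpose(1, 2).unsqueeze(2)         # (B, C, 1, S)

# Principle 3: express a batched matmul (e.g. attention scores) as an Einstein
# summation instead of reshape/transpose followed by torch.bmm.
q = torch.randn(B, C, 1, S)
k = torch.randn(B, C, 1, S)
scores = torch.einsum("bchq,bchk->bhqk", q, k)  # (B, 1, S, S), no extra copies

# Principle 2: split a large intermediate along the sequence axis into tiles
# small enough to stay cache-resident (a real kernel would process each tile).
chunks = torch.split(scores, 32, dim=-1)
scores_again = torch.cat(chunks, dim=-1)
```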

It then describes an optimal low-precision pre-normalizer technique to combat numerical instabilities in layer normalization on low-precision hardware. Theoretical analysis shows this pre-normalizer maps any input distribution to the precise maximum value supported by the hardware's low-precision format.
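
The gist, sketched in PyTorch: because layer norm is invariant to positive rescaling of its input, we can divide by a per-vector scale before computing the statistics. This sketch uses a simple max-abs (L-infinity) scale for illustration; the paper derives an optimal constant tied to the format's maximum representable value:

```python
import torch

def stable_layernorm_fp16(x, gamma, beta, eps=1e-5):
    """Layer norm with a pre-normalizer.  Rescaling first keeps the squared
    terms of the variance inside FP16 range; in exact arithmetic the output
    is unchanged (up to the epsilon term), so no retraining is needed."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    xs = x / scale                                   # entries now in [-1, 1]
    mu = xs.mean(dim=-1, keepdim=True)
    var = xs.var(dim=-1, unbiased=False, keepdim=True)
    return (xs - mu) / torch.sqrt(var + eps) * gamma + beta
```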

Finally, it introduces conditional re-scaling for softmax layers when hardware lacks native exponential operation support. This re-scales inputs into a range where lookup tables can provide accurate approximations.
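
A minimal sketch of the idea, exploiting the shift invariance softmax(x) = softmax(x + c); the lookup-table range below is hypothetical:

```python
import torch

def conditional_softmax(x, lut_lo=-10.0, lut_hi=0.0):
    """Softmax with conditional re-scaling: rows whose maximum falls outside
    the range the hardware's exp lookup table covers accurately are shifted
    back into [lut_lo, lut_hi]; in-range rows skip the extra pass."""
    x_max = x.amax(dim=-1, keepdim=True)
    out_of_range = (x_max < lut_lo) | (x_max > lut_hi)
    shift = torch.where(out_of_range, x_max - lut_hi, torch.zeros_like(x_max))
    e = torch.exp(x - shift)                 # exp inputs now within LUT range
    return e / e.sum(dim=-1, keepdim=True)
```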

The techniques aim to maximize performance on mobile hardware accelerators like ANE while ensuring numerical accuracy despite low-precision constraints.

Experiments and Results

This section describes the experimental setup and results for evaluating the different subsampling architectures for speech recognition on mobile devices. Key points:

Setup:

  • The training data contains 17,000 hours of audio-transcript pairs from virtual assistant queries.
  • Two model architectures (conv2d6 and dws2d6) were trained with different subsampling strategies.
  • Two additional models (conv2d6x22 and dws2d6x22) used a scaling factor from the transformer work.
  • Experiments were conducted on iPhone XR and Apple Watch Series 7 devices.

Performance:

  • Models running on CPUs did not meet the real-time factor (RTF, processing time divided by audio duration) target of 0.5.
  • Using hardware accelerators brought the RTF down by an order of magnitude, achieving the 0.5 target.
  • On the Apple Watch, the accelerated models were 5.26 times faster.

Energy:

  • Hardware accelerators also reduced energy consumption by an order of magnitude compared to running on CPUs.

Numeric Stability:

  • Depthwise separable convolutions (DWS) had a smaller dynamic range than vanilla 2D convolutions.
  • Removing the scaling factor further improved numeric stability.
  • Overflow statistics confirmed the need for the stabilization techniques described above.

Quality:

  • There was negligible difference in word error rate (WER) between FP16 and FP32 precision.
  • DWS and vanilla convolutions yielded similar WER.
  • The scaling factor did not improve WER and caused overflow issues.

Conclusions

The paper discusses optimizations made to Conformer CTC (Connectionist Temporal Classification) automatic speech recognition models to enable their deployment on resource-constrained devices like mobile phones and wearables. Through architectural, model-level, and numerical adjustments, the authors demonstrate that these models can achieve real-time or faster performance while consuming less energy, all without sacrificing recognition accuracy. The paper's findings on numerical stabilization techniques have broader applicability beyond just speech models, extending to various deep learning architectures and computing tasks.

If you enjoyed this summary, consider subscribing to the AImodels.fyi newsletter or following me on Twitter for more AI and machine learning content.
