Abstract
Speech recognition technologies have evolved significantly from early rule-based systems to modern deep learning models; however, conventional audio-only approaches remain constrained by noise interference, diverse accents, and speech impairments, which limit their robustness in real-world applications. Recent research highlights the value of multimodal systems that combine auditory and visual cues, with lipreading offering complementary information where audio signals alone may fail. This study proposes the Wearable Audio-Visual Enhanced Speech-recognition System (WAVESS), a conceptual model realized as smart glasses equipped with a microphone array and a miniature camera that captures lip movements. The system integrates audio and video inputs through dedicated preprocessing pipelines: noise reduction and Mel-Frequency Cepstral Coefficient (MFCC) extraction for audio, and lip-region detection and feature extraction for video, before fusing them in a real-time multimodal recognition engine. The fused representation improves recognition accuracy, adaptability, and resilience in challenging conditions such as noisy environments, hearing-impairment contexts, and human–machine interaction scenarios. The model also incorporates connectivity features for wireless or edge-based computation and provides multimodal feedback through augmented-reality overlays, audio, or haptic signals. WAVESS demonstrates the comparative advantage of wearable multimodal systems in accessibility, communication, education, and security applications while addressing scalability and ethical considerations. The conceptual framework establishes a foundation for future prototyping, dataset expansion, and real-world deployment, advancing robust speech recognition research.
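To make the fusion pipeline described above concrete, the following minimal sketch (not from the paper) illustrates feature-level audio-visual fusion, assuming librosa for MFCC extraction; lip_features is a hypothetical stub standing in for the video branch, and simple concatenation stands in for the recognition engine's fusion step.

import numpy as np
import librosa

def audio_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load audio and compute MFCCs, averaged over time into one vector."""
    y, sr = librosa.load(wav_path, sr=16000)          # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # (n_mfcc,) summary vector

def lip_features(frame_stack: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for lip-region feature extraction from
    cropped video frames of shape (T, H, W); a real system would run
    landmark detection and a visual encoder here."""
    return frame_stack.reshape(len(frame_stack), -1).mean(axis=0)

def fuse(audio_vec: np.ndarray, visual_vec: np.ndarray) -> np.ndarray:
    """Early (feature-level) fusion by concatenating the two modalities."""
    return np.concatenate([audio_vec, visual_vec])

The fused vector would then feed a downstream classifier or sequence model; concatenation is chosen here only as the simplest fusion strategy consistent with the abstract's description.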

