AI for Noise Classification

By Paul Sinclair BSc, Application Manager, NTi Audio AG

Artificial Intelligence (AI), Machine Learning (ML), and Pattern Matching are buzzwords that have become part of our everyday language. What do they mean to the world of noise control and acoustics? 

AI refers to the ability of machines to perform tasks that typically require human intelligence, such as identifying the content of an audio sample, i.e. sound classification. Having a textual description of the cause of a noise reduces the need to listen to the actual audio file.

In acoustics, AI can be used to identify unwanted environmental noise in real time, to assist in adhering to noise limit standards, and to improve quality of life. Two significant applications of AI-powered noise control are improving urban planning and detecting harmful noise levels in workplaces.

Fortunately, the set of noise types that typically cause environmental disturbances is small enough to be processed on most CPUs. Pattern Matching thus has a natural application in AI-based noise recognition. Such machine learning-based noise classification employs supervised learning, where the model is trained with a set of known noise samples labelled by sound category.

The Support Vector Machine (SVM) is a common model for classifying noise types. An SVM is a supervised machine learning algorithm used for classification, regression, and outlier detection, and it is especially effective for binary classification problems. It works by finding the optimal hyperplane that best separates the data points of different classes. The hyperplane is a decision boundary that separates the classes in an N-dimensional space, and the support vectors are the data points closest to the hyperplane that influence its position. A score is assigned to each classification based on the margin, i.e. the distance between the hyperplane and the nearest data points. The score is a measure of the accuracy and precision of the best-match classification, aggregated across the whole audio sample; scores are higher when there is less background noise. The text description and associated score are then displayed.
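As a minimal sketch of this idea, the following Python example trains a linear SVM on simple audio features with scikit-learn. The feature extraction (mean log-Mel energies via librosa), the file names, and the category labels are illustrative assumptions, not the exact pipeline of any particular product:

```python
# Illustrative sketch of supervised noise classification with a linear SVM.
# Feature extraction (mean log-Mel energies) and file names are assumptions.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def extract_features(path):
    """Summarize an audio file as a 64-dimensional vector of mean log-Mel energies."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return np.log(mel + 1e-6).mean(axis=1)

# Hypothetical labelled training set: (file path, sound category) pairs.
samples = [
    ("traffic_01.wav", "traffic"),
    ("siren_01.wav", "siren"),
    # ... more labelled noise samples
]
X = np.array([extract_features(path) for path, _ in samples])
y = np.array([label for _, label in samples])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="linear")  # finds the optimal separating hyperplane
clf.fit(X_train, y_train)

# The signed distance to the hyperplane can serve as the basis of a score.
print(clf.predict(X_test))
print(clf.decision_function(X_test))
```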

If we try to classify different types of noise (e.g. speech vs. music, or sirens vs. background noise), their raw waveforms may not have a clear linear boundary. For such non-linear sounds, which may share similar frequency components yet belong to different classes, or which have complex structures where noise patterns vary non-uniformly over time, a mathematical kernel function can transform the audio features into a higher-dimensional space where separation becomes easier.
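In scikit-learn terms, applying the kernel trick amounts to swapping the kernel of the classifier sketched above, for example to a radial basis function (RBF); the hyperparameter values here are assumptions:

```python
# Non-linear separation via the kernel trick: the RBF kernel implicitly maps
# the features into a higher-dimensional space before fitting the hyperplane.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)
```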

A lightweight Convolutional Neural Network (CNN), optimized for real-time applications, is used to extract patterns from spectrogram images of the audio sample. Since a spectrogram represents an audio signal visually, CNNs can learn patterns in sound frequency over time, making them ideal for audio classification.
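A minimal Keras sketch of such a lightweight CNN operating on fixed-size spectrogram images might look as follows; the input shape (64 Mel bands by 96 time frames), layer sizes, and class count are illustrative assumptions:

```python
# Minimal illustrative CNN for classifying spectrogram "images" (Keras).
import tensorflow as tf

NUM_CLASSES = 10  # hypothetical number of noise categories

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 96, 1)),          # Mel bands x time frames
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # local time-frequency patterns
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),           # keeps the model lightweight
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```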

YAMNet Scoring

YAMNet (Yet Another Mobile Network) is a pre-trained deep learning model developed by Google that classifies environmental sounds into 521 audio categories. YAMNet is built with TensorFlow (TF), an open-source machine learning framework, also developed by Google, that allows developers to build, train, and deploy AI models efficiently across different platforms, and that performs well compared with other current ML frameworks. YAMNet uses Mel spectrograms to transform sound into an image-like representation.

The spectrogram is processed through CNN layers that detect frequency and temporal patterns, and the model returns a probability distribution over the 521 audio classes.
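Running YAMNet is straightforward with TensorFlow Hub. The model expects mono audio at 16 kHz as a float waveform in [-1.0, 1.0]; the silent waveform below is only a stand-in for a real recording:

```python
# Classify a waveform with the pre-trained YAMNet model from TensorFlow Hub.
import csv
import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects mono 16 kHz audio as float32 in [-1.0, 1.0];
# one second of silence serves as a stand-in here.
waveform = np.zeros(16000, dtype=np.float32)
scores, embeddings, spectrogram = model(waveform)  # scores: [frames, 521]

# Map class indices to readable names via the class map bundled with the model.
with open(model.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

mean_scores = scores.numpy().mean(axis=0)  # simple average over frames
print(class_names[int(mean_scores.argmax())])
```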

While the scores for distinct sounds are likely to be higher than those for complex sounds, as acousticians we are more interested in the loudest passages of sound, be they convoluted or not. Therefore, within a sound sample, it is appropriate to weight the TF results with the actual sound pressure levels within the sample.
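One possible way to realize such weighting, continuing the snippet above, is to weight YAMNet's per-frame scores by a per-frame level estimate before aggregating. The frame alignment and the RMS-based weighting scheme here are assumptions for illustration, not a published method:

```python
# Hypothetical level-weighted aggregation of YAMNet's per-frame scores.
# YAMNet emits one score vector roughly every 0.48 s; each frame is weighted
# by the RMS level of the (approximately) corresponding stretch of audio.
import numpy as np

def level_weighted_scores(frame_scores, waveform, sr=16000, hop_s=0.48):
    hop = int(sr * hop_s)
    n_frames = frame_scores.shape[0]
    rms = np.array([
        np.sqrt(np.mean(waveform[i * hop:(i + 1) * hop] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    weights = rms / rms.sum()  # the loudest passages dominate the result
    return (frame_scores * weights[:, None]).sum(axis=0)

weighted = level_weighted_scores(scores.numpy(), waveform)
```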

The scores can also be greatly improved by providing context about the location of the sound source. Distinguishing among rotary engines, such as those of light aircraft and boats, or between heavy-goods vehicles and trains, can be a challenge, especially with significant background noise. Certain categories can be excluded given context, e.g. when there is no lake or railway line close by.
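A simple, hypothetical way to encode such context is to zero out the scores of excluded categories before picking the best match; the class names below are illustrative entries from the YAMNet class map:

```python
# Hypothetical context filter: zero out categories that are impossible at
# this site (here assuming no lake or railway line nearby).
excluded = {"Boat, Water vehicle", "Rail transport", "Train"}
mask = np.array([0.0 if name in excluded else 1.0 for name in class_names])
filtered = weighted * mask
print(class_names[int(filtered.argmax())])
```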

The categories in YAMNet are often subsets of each other. For example, a typical noise classification with scores near a railway line is:

Vehicle: 65
Rail transport: 17
Train: 16

We interpret this by saying that the system is certain the noise came from a vehicle of some sort. We are also confident that it was a form of rail transport, most probably a train.
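Building on the snippets above, such a top-three readout can be produced directly from the aggregated scores; the scaling of the scores to integers is illustrative only:

```python
# Read out the top three categories, mirroring the listing above.
top3 = np.argsort(filtered)[::-1][:3]
for i in top3:
    print(f"{class_names[i]}: {100 * filtered[i]:.0f}")
```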

A successful noise classification system will identify different types of environmental and background noise (e.g. traffic, machinery, wind), as well as detect audio events such as alarms, sirens, and gunshots.

A Working Application 

Since 2022, NoiseScout has demonstrated a direct application of AI to noise identification. Audio samples from intervals where levels were high are compared against a library of classified sound samples by a pretrained audio event classifier that predicts audio events based on a dataset held on NTi Audio internal servers.

See the AI in action in the Alarm List of our demo website at https://www.noisescout.com/demo1