Company News Industry Information

Voiceprint recognition AI algorithm box can recognize no less than 50 natural sound subcategories

2024-08-20 07:24:41

The emergence of human language is a complex physiological and physical process between the human language center and the vocal organs. The vocal organs used by humans during speech - the tongue, teeth, throat, lungs, and nasal cavity - vary greatly in size and shape from person to person, so the voiceprint patterns of any two individuals differ. The acoustic characteristics of each person?s speech have both relative stability and variability, and are not absolute or unchanging. This variation can come from physiology, pathology, psychology, simulation, disguise, and is also related to environmental interference. However, due to the fact that everyone?s vocal organs are not the same, in general, people can still distinguish the voices of different people or determine whether they are the same person?s voice, even if it were other organisms or objects. The speech signals of the same type of sound also carry unique sound wave spectra. Extract and classify and recognize. This is voiceprint recognition technology.

The main tasks of voiceprint recognition include speech signal processing, voiceprint feature extraction, voiceprint modeling, voiceprint comparison, and discriminative decision-making.

Technical features
1. Noise sound type recognition refers to the classification of noise in the environment through machine learning algorithms to determine its possible sources and types. For example, distinguishing machine noise, human voice noise, traffic noise, etc.
The application of I in noise sound type recognition is mainly reflected in deep learning techniques, especially the application of convolutional neural networks. Firstly, a large amount of sound data needs to be collected and trained using deep learning algorithms to extract useful features and optimize the model. Then, the input sound is compared with a known sound model, and the identity of the input sound is determined by calculating the distance or similarity between the features of the input sound and the model.
In addition, for specific application scenarios such as indoor and outdoor scene recognition, public place and office scene recognition, specialized audio processing front-end parts can also be used.
4. It is worth noting that although AI has broad application prospects in noise sound type recognition, it still faces many challenges in practical applications, such as the complexity of noisy environments, the diversity of speech signals, and the optimization of models. Therefore, how to improve the accuracy and robustness of noise sound type recognition remains an important direction for future research.
technology roadmap 
1. Establish an audio sample library with wide coverage, and classify sounds into five major categories based on different noise supervision units, with no less than 50 sound subcategories;
By using deep learning AI technology to analyze and process noisy samples, extract voiceprint features from them, and construct a voiceprint recognition model;
2. Continuously testing and optimizing to improve the accuracy and robustness of the voiceprint recognition model, enabling it to accurately recognize voiceprint types in various environments and conditions;

3. Use deep convolutional neural network algorithm to achieve recognition and classification of audio events. Extract time-domain and logmel frequency-domain features from audio through convolution operation, and combine the time-domain and frequency-domain features of the waveform as effective features of the audio. Further obtain feature maps through convolution sampling, and finally achieve feature classification using a fully connected network classifier.

technical parameter
Voiceprint recognition model based on Pytorch: The model is a deep learning based speaker recognition system that incorporates channel attention mechanism, information propagation, and aggregation operations into its structure. The key components of this model include multiple frame level TDNN layers, a statistical pooling layer, and two sentence level fully connected layers. In addition, it is equipped with a softmax layer and a loss function of cross entropy.
Feature Extraction: Pre emphasis ->Split addition window ->Discrete Fourier Transform ->Mel filter bank ->Inverse Discrete Fourier Transform ->Image
Model training set:>15000 training samples
Sound types: Sound types are mainly divided into five categories: domestic noise, construction noise, industrial noise, traffic noise, and natural noise, including no less than 50 subcategories such as thunder, wind, knocking, insect and bird chirping
Voiceprint recognition accuracy: ≥ 85%
Recognition response rate:>3s
Calling method: Supports cloud calling or local terminal calling
Technical agreement: Supports HTTP protocol


Keyword:

Recommend