Research Article
A Systematic Approach of Advanced Dilated Convolution Network for Speaker Identification
Author(s): Hema Kumar Pentapati1 and Sridevi K2
Published In: International Journal of Electrical and Electronics Research (IJEER), Volume 11, Issue 1
Publisher: FOREX Publication
Published: 05 February 2023
e-ISSN: 2347-470X
Page(s): 25-30
Abstract
Over the years, the speaker recognition field has faced persistent challenges in identifying speakers accurately. The advent of deep learning algorithms brought remarkable changes and has strongly influenced speaker recognition approaches. This paper introduces a simple, novel architecture based on an advanced dilated convolution network. The key idea is to feed a well-structured log-Mel spectrum into the proposed dilated convolutional neural network while reducing the number of layers to 11. The network uses global average pooling to accumulate the outputs of all layers into the feature-vector representation used for classification. Only 13 coefficients are extracted per frame of each speech sample. The proposed dilated convolutional neural network achieves an accuracy of 90.97%, an Equal Error Rate (EER) of 3.75%, and a training time of 207 seconds, outperforming existing systems on the LibriSpeech corpus.
Keywords: Log-Mel Spectrum, MFCC, Dilated Convolutional Neural Networks, Speaker Identification, Deep Learning.
Hema Kumar Pentapati*, Research Scholar, Department of EECE, GITAM School of Technology, Visakhapatnam, India; Email: hpentapa@gitam.in
Sridevi K, Associate Professor, Department of EECE, GITAM School of Technology, Visakhapatnam, India; Email: skataman@gitam.edu
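
The pipeline the abstract describes, 13 log-Mel/MFCC coefficients per frame fed into an 11-layer dilated convolutional network whose layer outputs are collapsed by global average pooling into a classification feature vector, can be illustrated with a minimal PyTorch sketch. The abstract does not give the paper's exact layer widths or dilation schedule, so the channel counts, dilation values, and speaker count below are assumptions for illustration only; the 11-layer depth, the 13-coefficient input, and the global-average-pooling head come from the abstract.

```python
import torch
import torch.nn as nn

class DilatedSpeakerNet(nn.Module):
    """Illustrative dilated CNN for speaker identification.

    A sketch of the abstract's architecture, not the paper's exact
    specification: 11 dilated conv layers, global average pooling,
    and a linear classifier over speaker labels.
    """
    def __init__(self, n_speakers, n_mfcc=13):
        super().__init__()
        layers = []
        in_ch = 1
        # 11 conv layers; the doubling dilation schedule and channel
        # widths are assumptions, only the depth of 11 is from the paper.
        dilations = [1, 1, 2, 2, 4, 4, 8, 8, 16, 16, 32]
        for i, d in enumerate(dilations):
            out_ch = 32 if i < 4 else 64
            layers += [
                # padding=d keeps the time-frequency map size constant
                nn.Conv2d(in_ch, out_ch, kernel_size=3,
                          dilation=d, padding=d),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.classifier = nn.Linear(in_ch, n_speakers)

    def forward(self, x):
        # x: (batch, 1, n_mfcc, frames), e.g. 13 coefficients per frame
        h = self.features(x)
        v = self.gap(h).flatten(1)                # fixed-length feature vector
        return self.classifier(v)

# Usage: a batch of 8 utterances, 13 coefficients x 300 frames each,
# classified over a hypothetical set of 40 speakers.
model = DilatedSpeakerNet(n_speakers=40)
logits = model(torch.randn(8, 1, 13, 300))        # -> shape (8, 40)
```

Because the padded dilated convolutions preserve the input resolution, global average pooling yields a fixed-length embedding regardless of utterance length, which is what allows variable-duration speech samples to share one classifier.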