Wavelet-Based MFCC and CNN Framework for Automatic Detection of Cleft Speech Disorders
DOI:
https://doi.org/10.35746/jtim.v7i3.780

Keywords:
cleft lip and palate, speech recognition, wavelet-mfcc, Convolutional Neural Network

Abstract
Cleft Lip and Palate (CLP) is a congenital condition that often results in atypical speech articulation, making automatic recognition of CLP speech a challenging task. This study proposes a deep learning-based classification system that uses Convolutional Neural Networks (CNN) and Wavelet-MFCC features to distinguish speech patterns produced by CLP individuals. Specifically, we investigate two wavelet families, Reverse Biorthogonal (rbio1.1) and Biorthogonal (bior1.1), with three decomposition strategies: single-level (L1), two-level (L2), and a combined level (L1+2). Speech data were collected from 10 CLP patients, each pronouncing nine selected Indonesian words ten times, resulting in 900 utterances. The audio signals were processed with wavelet-based decomposition followed by Mel-Frequency Cepstral Coefficient (MFCC) extraction to generate time-frequency representations of speech. The resulting features were fed into a CNN model and evaluated using 5-fold cross-validation. Experimental results show that the combined L1+2 decomposition yields the highest classification accuracy (92.73%), sensitivity (92.97%), and specificity (99.04%). Certain words, such as “selam”, “kapak”, “baju”, “muka”, and “abu”, consistently achieved recall scores above 0.94, whereas “lampu” and “lembab” proved more difficult to classify. The findings demonstrate that integrating multi-level wavelet decomposition with a CNN substantially improves the recognition of pathological speech and offers promising potential for clinical diagnostic support.
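As a rough illustration of the feature-extraction pipeline summarized above, the sketch below computes MFCCs on level-1 and level-2 wavelet approximation coefficients and stacks them for the combined L1+2 case. It assumes the PyWavelets and librosa libraries; the function name, the use of approximation sub-bands, the 13-coefficient setting, the 16 kHz sampling rate, the example file name, and the trim-and-stack L1+2 fusion are illustrative assumptions, not the authors' published implementation.

```python
# Hedged sketch of a wavelet-MFCC front end (assumptions: PyWavelets + librosa,
# approximation sub-bands, 13 MFCCs, trim-and-stack fusion for L1+2).
import numpy as np
import pywt
import librosa


def wavelet_mfcc(y, sr, wavelet="bior1.1", n_mfcc=13):
    """Return MFCC feature maps computed on wavelet approximation coefficients.

    L1    -> MFCCs of the level-1 approximation
    L2    -> MFCCs of the level-2 approximation
    L1+2  -> the two maps trimmed to a common frame count and stacked (assumed fusion)
    """
    # Level-1 DWT: (approximation, detail); keep only the approximation band.
    cA1, _ = pywt.dwt(y, wavelet)
    # Two-level decomposition: wavedec returns [cA2, cD2, cD1].
    cA2 = pywt.wavedec(y, wavelet, level=2)[0]

    # Each decomposition level roughly halves the effective sampling rate.
    mfcc_l1 = librosa.feature.mfcc(y=cA1.astype(np.float32), sr=sr // 2, n_mfcc=n_mfcc)
    mfcc_l2 = librosa.feature.mfcc(y=cA2.astype(np.float32), sr=sr // 4, n_mfcc=n_mfcc)

    # Assumed L1+2 fusion: trim both maps to the shorter frame count and stack
    # along the coefficient axis so a CNN sees both resolutions at once.
    n_frames = min(mfcc_l1.shape[1], mfcc_l2.shape[1])
    combined = np.concatenate(
        [mfcc_l1[:, :n_frames], mfcc_l2[:, :n_frames]], axis=0
    )
    return {"L1": mfcc_l1, "L2": mfcc_l2, "L1+2": combined}


if __name__ == "__main__":
    # Hypothetical utterance file; any mono recording of one of the nine words.
    y, sr = librosa.load("selam_01.wav", sr=16000)
    feats = wavelet_mfcc(y, sr, wavelet="rbio1.1")
    print({name: f.shape for name, f in feats.items()})
```

The resulting 2-D feature maps can then be treated as single-channel images for a CNN classifier and evaluated with 5-fold cross-validation, as described in the abstract.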
License
Copyright (c) 2025 Muhammad Hilmy Herdiansyah, Syahroni Hidayat, Nur Iksan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.




