Indian Classical Music Recognition using Deep Convolution Neural Network



░ 1. INTRODUCTION
Music is one of the most vital components of India's intangible cultural heritage and holds significant importance worldwide. Indian classical music consists of two major subgenres: Hindustani and Carnatic. Hindustani is practised in the northern region of India, while Carnatic is prevalent in the southern region [1]. The distinction between the two subgenres dates back to the 16th century. Hindustani music encompasses vocal styles such as Thumri, Tarana, Dhrupad, Khayal, Dadra and Ghazals, whereas Carnatic music includes Kalpanaswaram, Niraval, Ragam Thanam Pallavi and Alapana. Carnatic music employs instruments such as the veena, mandolin and mridangam to perform 72 ragas, while Hindustani music uses instruments like the sarangi, tabla, santoor and sitar and focuses on six major ragas [2]. Carnatic music follows a single, unified singing style, whereas Hindustani boasts various sub-styles. In Hindustani music, the vocal aspect takes precedence, whereas vocal and instrumental elements are equally emphasized in Carnatic music [3].
A recent development in music information retrieval involves the categorization of raga music within Indian classical music. The abundance of musical data available on the internet enables this research; however, audio processing, particularly for speech and music, is still in an early stage of development. This study discusses speech processing as a potential foundation for classifying classical music using tools such as MFCCs, spectrograms and scalograms. The research analyzes and extracts sound and music characteristics to categorize musical genres, addressing the early stages of musical signal analysis, including pitch class profiles and acoustic trait-based statistical measurements. Encouraging findings and performance comparisons are provided, and future research will apply various computational techniques to diverse musical genres. The analysis of music not only helps us understand society's history and the cultures from which it evolved but also contributes to the creation of scientific models. While most research focuses on Western music, there are also studies examining the sound of Indian classical music. Carnatic and Hindustani music, the two primary subgenres of Indian traditional music, each have a significant following; Carnatic music is notably more sophisticated in the manner in which notes are presented and structured [4]. Raga and Talam are the foundations of Indian classical music, and ragas are more intricate than Western music in terms of melody and scale. Ragas are composed of notes organized to evoke specific moods. Swara is the musical term for a note in Carnatic music, with each note associated with a specific frequency [5]. In Carnatic music, a song's rhythm is based on Talam, which denotes the order of syllables and the pace of the composition; Talam is indicated through hand gestures. This study uses raga patterns to distinguish between Hindustani and Carnatic classical music, considering the musical note as the fundamental building block of Indian traditional music. The intervals between notes, known as swaras, are characterized by the ratios of their fundamental frequencies. There are seven melodic notes: Sa, Ri, Ga, Ma, Pa, Dha and Ni, with frequencies that can be further divided into semitones or microtones. Hindustani and Carnatic music use different scale types, namely a 12-note scale and a 16-note scale, respectively [1][4][8].
Hindustani music employs a 12-note scale, while Carnatic music uses a 16-note scale; Carnatic music has identified 12 different frequency components [9][10]. Various deep learning frameworks have been proposed for music recognition in the past. For instance, Dipti Joshi's work with two classifiers, KNN and SVM, on the ragas Yaman and Bhairavi achieved a 90% result on the database [11]. Choi et al. proposed a music labeling scheme based on CNNs [12]. Abdul et al. combined a 2-D DCNN with a Mel-spectrogram of the music signal to obtain latent features and a stronger feature representation capability [13]. Other researchers have utilized techniques such as CNN, CRNN, LSTM and hybrid classification approaches to study and classify music genres [14]-[21]. Their studies offer valuable insights into understanding and categorizing Indian classical music. Indian classical music is widely recognized and streamed on social media and community platforms. Indian raga classification is challenging due to the variability in the languages and corpora of the songs, yet very few researchers have focused on it. Thus, there is a need to analyze the different Indian ragas that play an imperative role in the music industry. Musical speech exhibits a wide variety of intonation, pitch, timbre and prosody, which is essential to capture when describing the distinctiveness of the signal, and its representation is challenging because of the variety of ragas in Indian classical music. This work therefore presents multiple acoustic features that combine spectral, time-domain and voice quality features for describing the musical signal. Previous systems have used heavy and complicated DL frameworks, increasing computational complexity; thus, there is a need for a lightweight DL architecture with fewer trainable parameters and lower computational cost. This paper offers Indian raga classification using a DCNN. The contributions of the proposed work are
summarized as follows:
• Representation of the Indian raga music signal using a set of spectral, temporal and voice quality features to characterize the impact of raga on the voice signal.
• Representation of musical speech using multiple acoustic features that encompass spectral, time-domain and voice quality features to characterize the distinctiveness of the Indian ragas.
• Implementation of a five-layered DCNN to improve the distinctiveness of the traditional multiple acoustic features for enhanced raga classification.
The rest of the article is organized as follows: Section 2 details the methodology, focusing on the spectral, temporal and voice quality features and the DCNN model used for the proposed raga classification. Section 3 presents the experimental results and a discussion of them. Finally, Section 4 gives concise conclusions and the future scope for potential improvement of the proposed system.

░ 2. PROPOSED METHODOLOGY

2.1 MTMFCC Features
In the generalized MFCC, a Hamming window with higher variance is utilized, but it may fail to capture subtle variations in the speech signal frames. The signal is filtered during the pre-emphasis stage to reduce noise. In the multi-taper windowing technique, the entire signal is divided into frames of 40 ms. Each frame is converted to the frequency domain using the DFT. The linearly scaled spectrum is then transformed to the Mel frequency scale, which corresponds to human hearing perception. The transformed signal is converted back to the time domain using the DCT to reduce signal redundancies. For feature extraction, 13 cepstral coefficients are chosen, computed after the log filter-bank power has been calculated over the frames. Figure 2 illustrates the multi-taper MFCC technique. In contrast to the single Hamming window, the multi-taper MFCC applies several tapers with varying characteristics when windowing the signal, which captures minute variations in the signal [22][23].
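The pipeline above can be sketched for a single 40 ms frame. This is a minimal illustration, not the paper's implementation: it uses DPSS (Slepian) tapers with equal weighting rather than the SWCE weights of equation 2, and the filter-bank size and `NW` parameter are assumed values.

```python
import numpy as np
from scipy.signal.windows import dpss
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def multitaper_mfcc(frame, sr, n_tapers=6, n_mfcc=13, n_mels=26):
    """Multi-taper MFCC for one frame: average the periodograms from K
    orthogonal DPSS tapers, mel-warp, take the log, then apply the DCT."""
    n = len(frame)
    tapers = dpss(n, NW=4, Kmax=n_tapers)                # (K, n) tapers
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    psd = spectra.mean(axis=0)                           # multi-taper spectrum
    logmel = np.log(mel_filterbank(n_mels, n, sr) @ psd + 1e-10)
    return dct(logmel, norm='ortho')[:n_mfcc]            # keep 13 coefficients

sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(int(0.04 * sr)) / sr)  # 40 ms frame
coeffs = multitaper_mfcc(frame, sr)
print(coeffs.shape)  # (13,)
```

Averaging several orthogonal tapers lowers the variance of the spectral estimate compared with a single Hamming window, which is the motivation given above.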
The weights of SWCE tapers are computed using equation 2 [31].

LPCC
Different Indian ragas depict different emotions. To capture information related to the emotions expressed through vocal tract characteristics, this study employs Linear Predictive Cepstral Coefficients (LPCCs). A frame shift of 10 ms and a 10th-order LP analysis of the speech signal provide thirteen LPCCs for each 20 ms speech frame [24][25].
The LPC uses knowledge of the previous samples to estimate the nth sample as a linear combination of the past p samples, using equation 4:

x̂(n) = Σ_{i=1}^{p} a_i x(n − i)    (4)

where a_1, a_2, …, a_p are constants over the music signal. The error between the actual sample x(n) and the predicted sample x̂(n) is computed using equation 5:

e(n) = x(n) − x̂(n)    (5)

To find the distinctive predictive constants, the sum of the squared errors e(n) between x̂(n) and x(n) is minimized, as given by equation 6:

E = Σ_{n=1}^{m} [x(n) − x̂(n)]²    (6)

Here, m denotes the total number of samples in the music frame.
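The minimization above has the standard autocorrelation-method solution, computable with the Levinson-Durbin recursion. The sketch below is a generic LP estimator, not the paper's code; the AR(2) test signal is an illustrative assumption.

```python
import numpy as np

def lpc_coefficients(x, order=10):
    """LP coefficients via the autocorrelation method (Levinson-Durbin).
    Returns a with a[0] = 1; the predictor is
    x_hat(n) = -sum_{j=1..p} a[j] * x(n - j)."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                       # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)                 # prediction-error update
    return a

# Synthetic AR(2) source: x(n) = 0.5 x(n-1) - 0.3 x(n-2) + e(n),
# so the recovered coefficients should approach [1, -0.5, 0.3].
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
x = np.zeros(5000)
for n in range(2, 5000):
    x[n] = 0.5 * x[n - 1] - 0.3 * x[n - 2] + e[n]
a = lpc_coefficients(x, order=2)
print(np.round(a, 2))
```

The cepstral coefficients (LPCCs) are then obtained from these LP coefficients by a standard recursion; only the LP stage is shown here.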

Formants
Formants contribute to defining the resonance produced by the aroha (gradual ascent of the voice) and avaroha (gradual descent of the voice) during raga singing. The study considers three formant frequencies along with the mean and standard deviation of the formants [26][27]. The formants and their mean and standard deviation are given by equations 11-13.
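One common way to estimate formants, sketched below under the assumption of a 10th-order LP model, is to take the angles of the LP polynomial roots; this is a generic illustration, not necessarily the paper's exact procedure, and the noise frame is a stand-in for real voiced speech.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def estimate_formants(frame, sr, order=10, n_formants=3):
    """Rough formant estimates: solve the LP normal equations, then map
    the complex roots of the LP polynomial to frequencies via their angles."""
    frame = frame * np.hamming(len(frame))              # taper the frame
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:][:order + 1]
    a = solve_toeplitz((r[:order], r[:order]), -r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], a)))
    roots = roots[np.imag(roots) > 1e-6]                # one root per pair
    freqs = np.sort(np.angle(roots)) * sr / (2 * np.pi)
    return freqs[:n_formants]

rng = np.random.default_rng(1)
frame = rng.standard_normal(400)                        # stand-in frame
f = estimate_formants(frame, sr=16000)
print(f)
```

Collecting these per-frame formants over an utterance and taking their mean and standard deviation yields the statistics referred to in equations 11-13.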

Pitch Frequency
The fundamental period of the voice signal determines its pitch. Pitch represents the perceived frequency of the sound produced when the vocal cords vibrate, and it reflects the voice texture of the singer while singing ragas [28][29].
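A minimal way to recover this fundamental period is autocorrelation peak-picking; the sketch below is a standard textbook method with an assumed 80-400 Hz search range, not the paper's specific pitch tracker.

```python
import numpy as np

def pitch_autocorr(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency as the lag that maximizes the
    autocorrelation within a plausible singing range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, 'full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for fmin..fmax
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
f0 = pitch_autocorr(np.sin(2 * np.pi * 220.0 * t), sr)   # A3 test tone
print(round(f0, 1))
```

The integer-lag resolution limits accuracy at a 16 kHz sampling rate (a 220 Hz tone resolves to roughly 219-222 Hz); parabolic interpolation around the peak is a common refinement.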

ZCR
The ZCR allows the assessment of both voiced and unvoiced data, providing insight into the number of times the waveform switches polarity in a given time period [30][31].
As given in equation 14, the sign function returns one for positive amplitudes and zero for negative amplitudes.
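With that 0/1 sign convention, the ZCR reduces to counting changes in the sign sequence. A minimal sketch:

```python
import numpy as np

def zcr(frame):
    """Zero-crossing rate with the 0/1 sign convention of equation 14:
    sign is 1 for non-negative amplitude and 0 otherwise; the rate is the
    mean absolute change of the sign sequence."""
    s = (np.asarray(frame) >= 0).astype(int)
    return np.mean(np.abs(np.diff(s)))

print(zcr([1.0, -1.0, 1.0, -1.0]))  # 1.0: polarity flips at every sample
print(zcr([0.2, 0.5, 0.9]))         # 0.0: no crossings
```

Unvoiced (noise-like) segments typically show a much higher ZCR than voiced, pitched singing, which is what makes the feature useful here.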

Jitter and Shimmer
Jitter and shimmer describe variations in the frequency and amplitude, respectively, of the emotional signal caused by the periodic vibrations of the vocal cords. They represent the breathiness, hoarseness and roughness of the emotional voice [32]. The jitter (Jt) and shimmer (Sh) are computed using equations 15 and 16, where A, T and N represent the peak-to-peak amplitude, the time period and the number of periods, respectively.
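One common relative formulation of these measures (assumed here, since equations 15-16 are not reproduced verbatim) averages the absolute difference between consecutive periods or amplitudes and normalizes by the mean:

```python
import numpy as np

def jitter(periods):
    """Relative jitter: mean absolute difference between consecutive pitch
    periods T, divided by the mean period (a sketch of equation 15)."""
    p = np.asarray(periods, float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer(amplitudes):
    """Relative shimmer: the same form applied to peak-to-peak amplitudes A
    (a sketch of equation 16)."""
    a = np.asarray(amplitudes, float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)

print(jitter([10.0, 10.0, 10.0]))   # 0.0 for perfectly regular periods
print(jitter([9.0, 11.0, 9.0, 11.0]))  # 0.2: 2-sample swings about mean 10
```

Both quantities are zero for a perfectly periodic voice and grow with the roughness and breathiness described above.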

Spectral Kurtosis
The spectral kurtosis (SK) shows transients and their positions in the spectral domain. It illustrates how arousal and emotional valence affect the non-Gaussianity, or flatness, of the speech spectrum around its centroid. Equation 17 estimates the spectral kurtosis of the voice signal.
Here, μ1 and μ2 symbolize the spectral centroid and spectral spread, respectively, s_k is the spectral value over the k bins, and b1 and b2 are the lower and upper limits of the bins where the SK of the speech is estimated.
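Using those definitions of μ1 and μ2, the computation can be sketched as below (over all bins rather than an explicit b1..b2 window, which is an assumption of this sketch):

```python
import numpy as np

def spectral_kurtosis(mag, freqs):
    """Spectral kurtosis in the form of equation 17: the fourth moment of
    bin frequencies about the centroid (mu1), weighted by the spectral
    values s_k and normalized by the spectral spread (mu2) to the fourth
    power."""
    s = mag / mag.sum()                             # normalized spectrum
    mu1 = np.sum(freqs * s)                         # spectral centroid
    mu2 = np.sqrt(np.sum((freqs - mu1) ** 2 * s))   # spectral spread
    return np.sum((freqs - mu1) ** 4 * s) / mu2 ** 4

freqs = np.linspace(0.0, 8000.0, 2001)
mag = np.exp(-((freqs - 4000.0) ** 2) / (2 * 400.0 ** 2))  # Gaussian bump
print(round(spectral_kurtosis(mag, freqs), 2))  # close to 3 for a Gaussian
```

A Gaussian-shaped spectrum gives a kurtosis near 3, so values well above or below 3 flag the non-Gaussian, transient-rich spectra the text describes.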

DCNN Model
The proposed DCNN accepts an acoustic feature vector with dimensions of 318×1, which encompasses the various features of an Indian raga. The convolution operation of the DCNN provides the discriminant features of the input acoustic features. It involves sliding the convolution filter f_k over the acoustic feature vector, one sample at a time. The convolution operation for the feature vector feat(n), with N features, and the convolution kernel f_k with dimensions of 1×3 is given by equation 18. To maintain the dimensions of the convolution feature map, the input feature vector is zero-padded with two zero values. The ReLU layer improves the non-linearity of the signal by eliminating negative values, as described in equation 19. The MaxPool layer reduces the feature dimensions and addresses the problem of overfitting, as mentioned in equation 20 [11].

R_conv(k) = Σ_n feat(n) · f_k(n − k)    (18)

ReLU(feat) = max(0, feat)    (19)

The Softmax layer functions as the final classification layer in the DCNN. It takes the entire input vector and converts it into an output vector in which each value represents the probability of the input sample belonging to a particular class. The Softmax function ensures that the probabilities are normalized, meaning that the sum of all probabilities in the output vector equals one. By using the Softmax function (as described in equation 21), the DCNN can provide a probability distribution over the classes, allowing it to make confident predictions about the class membership of the input samples.
The Softmax function is similar to the sigmoid function; the only difference is that Softmax takes a vector as input, whereas the sigmoid takes a scalar value. The sigmoid function is given by equation 22.
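The layer operations of equations 18-21 can be sketched as a toy forward pass. This is an illustrative numpy sketch: the single kernel, the stride-2 pooling window and the dense output layer weights are placeholders, not the paper's trained five-layer configuration.

```python
import numpy as np

def conv1d_same(feat, kernel):
    """'Same'-size convolution (equation 18): the input is zero-padded so
    the feature map keeps the input length (kernel length 3 -> pad 1+1)."""
    pad = len(kernel) // 2
    x = np.pad(feat, pad)
    flipped = kernel[::-1]
    return np.array([x[k:k + len(kernel)] @ flipped
                     for k in range(len(feat))])

def relu(x):                      # equation 19: drop negative activations
    return np.maximum(0.0, x)

def maxpool(x, size=2):           # equation 20: window-maximum downsampling
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def softmax(z):                   # equation 21: probabilities summing to one
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy forward pass over a 318x1 feature vector with 8 output classes;
# all weights are random placeholders.
rng = np.random.default_rng(0)
feat = rng.standard_normal(318)
fmap = maxpool(relu(conv1d_same(feat, rng.standard_normal(3))))
W = rng.standard_normal((8, len(fmap)))         # stand-in dense layer
probs = softmax(W @ fmap / len(fmap))
print(len(fmap), round(probs.sum(), 6))         # 159 1.0
```

The max-subtraction inside `softmax` is the usual numerical-stability trick and does not change the resulting probabilities.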

░ 3. EXPERIMENTAL RESULTS AND DISCUSSIONS
The proposed scheme's results are evaluated using various quantitative and qualitative metrics, which include recall, precision, F1-score and accuracy [11]:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

The DCNN demonstrates an overall accuracy of 89.38%, outperforming the SGDM (78.68%) and RMSProp (61.88%) learning algorithms. The ADAM optimization algorithm combines the strengths of SGDM (good performance on sparse-gradient problems such as natural language processing) and RMSProp (better behavior on noisy and non-stationary signals). The DCNN with ADAM optimization shows significant improvements in accuracy, achieving 2.17% and 29.91% higher performance than the DCNN with SGDM and RMSProp, respectively, for the eight-class raga classification. Furthermore, the DCNN with ADAM achieves the highest accuracy of 100% for the Asawari, Bhairavi, Malkans and Yaman ragas; however, it achieves the lowest accuracy, 41.41%, for the Bageshwari raga.
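For a multi-class problem such as the eight-raga task, these metrics are computed per class from a confusion matrix. A minimal sketch (the 2×2 matrix is illustrative, not the paper's results):

```python
import numpy as np

def metrics(cm):
    """Per-class precision/recall/F1 and overall accuracy from a confusion
    matrix (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm, float)
    tp = np.diag(cm)
    precision = tp / (cm.sum(axis=0) + 1e-12)   # TP / (TP + FP) per column
    recall = tp / (cm.sum(axis=1) + 1e-12)      # TP / (TP + FN) per row
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

cm = [[50, 5],
      [10, 35]]                                 # toy 2-class example
p, r, f1, acc = metrics(cm)
print(round(acc, 3))  # 0.85
```

The overall accuracies quoted above correspond to `accuracy` computed on the eight-class confusion matrix, while the per-raga figures (e.g. 41.41% for Bageshwari) correspond to per-class recall.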

░ 4. CONCLUSION AND FUTURE SCOPE
This paper presents an Indian raga classification system that utilizes various acoustic features, including spectral, time-domain and voice quality features, along with a deep convolutional neural network (DCNN). The proposed DCNN-ADAM achieves an overall accuracy of 89.38% in classifying eight ragas, namely Asawari, Bageshwari, Bhairavi, Bhoopali, Darbari, Malkans, Sarang and Yaman. The lightweight DCNN architecture helps capture the intonation changes, pitch variations and discriminative attributes in classical music, contributing to the improved distinctiveness of the low-level acoustic features. In the future, the proposed model can be generalized to multiple Indian languages, and the issue of data scarcity can be tackled using data augmentation techniques.

Figure 1: Block diagram of the proposed work, encompassing the DCNN framework for Indian raga classification.

Figure 2: The multi-taper MFCC technique.

Figure 3: Experimental results: (a) multi-taper windows, (b) Mel filter bank, (c) Mel log filter-bank energy, (d) Mel frequency cepstrum.

Figure 5: Training loss for the proposed DCNN.

░ Table 3: Performance of the proposed DCNN for raga classification

The previous methods have considered only two or four ragas in their experimental evaluations. The proposed system's performance is also compared with conventional machine learning classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest (RF), to analyze the effect of the feature set on the classifiers; the proposed system provides superior accuracy for the eight-class classification of ragas.

Table 1: Configurations of the DCNN (columns: Layer, Number of Filters, Stride, Activation Maps, Padding)
After passing through the Softmax layer, the precise raga is recognized from the index of the greatest value among the neurons of the Softmax output layer; the index with the highest probability corresponds to the predicted raga class. Furthermore, the DCNN is trained using three different learning algorithms: ADAM, SGDM and RMSProp. These learning algorithms adjust the model's parameters during training, enabling it to learn from the data and make accurate predictions. The DCNN configurations are provided in Table 1, and the training configuration and hyperparameters of the DCNN are provided in Table 2.

Table 4: Comparison of the proposed DCNN with conventional methods for raga classification (columns: Author and Year, Feature Extraction, Classifier, Accuracy (%), Ragas)
The output of the proposed system is compared with the conventional state-of-the-art methods used for Indian classical music recognition, as given in Table 4. The suggested lightweight DCNN architecture provides 89.38% accuracy for the eight ragas, which is superior to the RF (84.50%), SVM (84.20%) and KNN (78%).