A New Hybrid Approach for Efficient Emotion Recognition using Deep Learning

░ ABSTRACT- Facial emotion recognition has been a popular research area for the last few decades, and it remains a challenging and complex task because of large intra-class variation. Existing frameworks for this problem depend mostly on techniques such as Gabor filters, principal component analysis (PCA), and independent component analysis (ICA), followed by classifiers trained on the given videos and images. Most of these frameworks work well on image databases acquired under controlled conditions but do not perform well on dynamic images in which faces and imaging conditions vary. In recent years, various researchers have introduced frameworks for facial emotion recognition using deep learning methods. Although these work well, gaps remain in this research. In this work, we introduce a hybrid approach based on RNN and CNN that is able to focus on the important parts of the images in a given database and achieves very good results on databases such as EMOTIC, FER-13, and FERG. We also show that our hybrid framework achieves promising accuracy on these datasets.


░ 1. INTRODUCTION
Facial emotions are an unavoidable part of communication among human beings. They can appear in various forms that are not easily perceived by the naked eye. For this reason, with suitable tools, any sign preceding or following them can be subject to detection and recognition. The need to identify human emotions has increased over the past decades, which has raised interest in facial emotion recognition in different fields such as medicine [1], animation [2], security [3], human-computer interaction [4,5], urban sound perception [6], and the diagnosis of autism disorders in children [7].
Emotion recognition can be performed using various cues such as facial expressions [8], EEG [9], and text [10]. Facial expressions are the most popular of these because they carry many features and are directly visible, which supports efficient emotion recognition. Collections of faces are also easy to gather [11].
In recent years, deep learning has significantly improved recognition results [12]. Various important features have been extracted to build facial emotion recognition systems [13,14]. However, facial emotion recognition depends mainly on certain face regions such as the eyes, nose, and lips, while other regions such as the hair and forehead play little part in identifying emotions [15]. Therefore, most facial emotion recognition systems effectively rely on only certain parts of the face and not on the other regions.
In this research, we introduce a hybrid method for efficient recognition of emotions. The method uses two deep learning models, CNN and RNN, for feature extraction and an SVM for classification. The main contributions of our research are: (1) We introduce a hybrid method for efficient facial emotion recognition. (2) We use a combination of RNN and CNN for feature extraction and an SVM for classification. (3) We use the publicly available EMOTIC, FER-13, and FERG datasets in our research. (4) We show that our facial emotion recognition system performs better than existing systems. (5) We compare our results across all of the given datasets.
The remainder of this paper is structured as follows: related works are discussed in section 2, the proposed methodology is described in section 3, experiments and results are summarized in section 4, and section 5 concludes the paper.

░ 2. RELATED WORKS
The first major contribution to facial expression recognition was made by Paul Ekman [16]. His framework identified six basic facial expressions: surprise, fear, joy, sadness, disgust, and anger. Later, his framework based on the Facial Action Coding System (FACS) also became a benchmark in this area [17]. Many datasets additionally include a neutral expression, giving seven facial expressions.
Earlier research on facial emotion recognition mostly focused on a traditional two-step approach using machine learning [34]. The first step extracts important features using Gabor filters, LBP, LMSP, Zernike moments, etc., while the second step uses a classifier such as random forest, SVM, or KNN to identify the emotion in the image [35]. These techniques work on small datasets, but they do not perform well as larger datasets are added. The difficulties arise in newer images that contain sunglasses, partial faces, dynamic backgrounds, and occlusions.
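As an illustration of the first step of this two-step approach, the following is a minimal sketch of a basic 3×3 local binary pattern (LBP) descriptor written with NumPy. It is not taken from any of the cited works: the function name `lbp_histogram`, the neighbour ordering, and the 256-bin histogram are illustrative assumptions. The resulting histogram is the kind of feature vector that would then be passed to a classifier such as an SVM in the second step.

```python
import numpy as np


def lbp_histogram(image: np.ndarray) -> np.ndarray:
    """Return a normalized 256-bin histogram of basic 3x3 LBP codes.

    `image` is a 2-D grayscale array; border pixels are skipped because
    they do not have a full 3x3 neighbourhood.
    """
    img = image.astype(np.int32)
    height, width = img.shape
    center = img[1:-1, 1:-1]

    # Offsets (dy, dx) of the 8 neighbours, clockwise from the top-left.
    # The ordering is an arbitrary but fixed convention (assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]

    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:height - 1 + dy, 1 + dx:width - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the center.
        codes |= (neighbour >= center).astype(np.int32) << bit

    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()


# Example on a random 48x48 grayscale patch (the FER-13 image size).
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)
features = lbp_histogram(patch)  # 256-dimensional feature vector
```

A hand-crafted descriptor like this is fixed in advance, which is one reason such pipelines struggle when occlusions or changing backgrounds alter the local texture.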
Following the great success of deep learning, especially CNNs, in image classification and other computer vision problems, various research groups have used deep learning to recognize facial emotions [18]. They trained network models on human faces. Models were also trained on animated faces, and others were trained on human faces with respect to animated faces [2]. Mollahosseini et al. proposed a neural network framework for facial emotion recognition using one pooling layer, four inception layers, and two convolutional layers [8]. Liu et al. introduced a hybrid system that combines classification and feature extraction in a single looped network, in which the two parts exchange feedback. The authors applied a boosted deep belief network (BDBN) to JAFFE and CK+ and obtained the best accuracy reported at that time [20].
Barsoum et al. proposed a deep learning framework that learns from noisy labels acquired by crowd-sourcing the ground-truth images [21]. The authors used 10 taggers to relabel every image in the given dataset and applied different cost functions to a DCNN, achieving the best result. Han et al. introduced an incremental boosting CNN, called IB-CNN, to increase the accuracy on spontaneous image datasets by boosting discriminative neurons [22]. This method showed the best results at that time. Meng et al. introduced an identity-aware CNN (IA-CNN) that uses identity-sensitive and emotion-sensitive methods to reduce the effect of variations in identity-related and emotion-related information [23].
Fernandez et al. introduced an end-to-end network framework for emotion recognition using an attention-based model [24]. Wang et al. introduced a self-cure network that handles uncertainty efficiently and prevents the model from over-fitting uncertain facial emotion images [25]. The self-cure network suppresses uncertainty from two sides: (1) a self-attention mechanism over each mini-batch that weights every training sample using ranking regularization, and (2) a relabeling mechanism that updates the labels of samples in the lowest-ranked group. Wang et al. also introduced an approach for facial emotion recognition that is robust to real-world pose and occlusion changes. They introduced the Region Attention Network (RAN) to capture the importance of face regions under pose and occlusion variation in FER. Some of the latest research in facial emotion recognition includes multiple attention networks for FER [26], a deep learning based self-attention network for FER [27], and a recent literature review on FER [28].
All of the research discussed above achieved good accuracy relative to state-of-the-art work on facial emotion recognition, but these techniques do not specifically target the facial regions that matter most for expression detection. In this research, we address this drawback by introducing a system based on a hybridization of RNN and CNN, which is used to focus on important features, with an SVM used to classify the facial emotions.

░ 3. PROPOSED METHODOLOGY
We propose a system based on a hybrid technique to recognize the emotions in facial image datasets. Improvements in many hybrid systems come from adding neurons and smoothing the flow of information through the network. Such systems are applicable to classification on a large number of real-world datasets. In the area of facial emotion recognition, we show that a network with few layers can work well even on small datasets. We have also compared our results with existing results using different publicly available datasets.
Not all regions of a facial image are equally useful for efficient recognition of facial emotions, and in most cases it is enough to focus on particular regions to get a relevant sense of the basic emotion. To exploit this, we propose a system that uses a combination of CNN and RNN to attend to selected facial regions in the given datasets.

Figure 1: Facial Emotion Recognition System
Figure 1 shows the framework of the introduced system. It contains four steps. The first step is to acquire the image from the face datasets. Next, the feature extraction step extracts the important features from the image using CNN and RNN. The feature extraction part consists of six layers, with every three followed by a rectified activation and a max-pooling layer. These are followed by fully connected layers and a dropout layer. The localization network consists of three convolutional layers, each followed by a pooling layer and an activation unit, and three fully connected layers. The localization network mainly focuses on the important parts of the face image. The output of the CNN is the input to the RNN. The LSTM (Long Short-Term Memory) is a kind of RNN that can transform a sequence of inputs into a sequence of outputs. We use the LSTM as used by Donahue et al. [30]. After feature extraction, classification is performed using an SVM. The accuracy of the SVM for classifying facial images is significantly good.
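To make the hand-off from the CNN to the LSTM concrete, the following is a minimal NumPy sketch of a single LSTM layer run over a sequence of feature vectors. It is only an illustration under stated assumptions and not the trained model: the text does not say how the CNN output is arranged into a sequence, so the sketch assumes a feature map whose rows are treated as time steps, and the sizes `T`, `D`, and `H`, the gate ordering, and the choice of the last hidden state as the SVM input are all hypothetical.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(sequence: np.ndarray, W: np.ndarray, U: np.ndarray,
                 b: np.ndarray) -> np.ndarray:
    """Run one LSTM layer over `sequence` of shape (T, D).

    W has shape (4H, D), U has shape (4H, H), b has shape (4H,).
    Returns the hidden states, shape (T, H): one output per input step.
    """
    hidden_size = U.shape[1]
    h = np.zeros(hidden_size)  # hidden state
    c = np.zeros(hidden_size)  # cell state
    outputs = []
    for x_t in sequence:
        gates = W @ x_t + U @ h + b
        # Assumed gate order: input, forget, output, candidate.
        i, f, o, g = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g          # update the cell memory
        h = o * np.tanh(c)         # expose part of the memory as output
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
T, D, H = 6, 16, 8                              # hypothetical sizes
cnn_features = rng.normal(size=(T, D))          # stand-in for the CNN output
W = rng.normal(scale=0.07, size=(4 * H, D))
U = rng.normal(scale=0.07, size=(4 * H, H))
b = np.zeros(4 * H)

states = lstm_forward(cnn_features, W, U, b)    # shape (T, H)
svm_input = states[-1]                          # feature vector for the SVM
```

In this sketch the final hidden state summarizes the whole sequence, so a fixed-length vector reaches the classifier regardless of how many steps the CNN output is split into.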

░ 4. EXPERIMENTS AND RESULTS
We report results on three publicly available datasets: EMOTIC [31], FER-13 [32], and FERG [33]. EMOTIC contains 18,313 images with 23,788 annotated people. FER-13 (FER2013) consists of approximately 30,000 facial images of distinct expressions with a size of 48×48, divided into seven classes: 0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral. The Disgust expression has the smallest number of images, about 600, while the other classes have around 5,000 samples each. FERG is a dataset of cartoon characters containing 55,769 annotated facial images of 6 characters. The images of every character are categorized into 7 cardinal emotions, viz. surprise, sadness, neutral, joy, anger, disgust, and fear.
For each dataset, we trained the model on a subset of the dataset, validated it on the validation set, and calculated the accuracy on the test set. The training procedure is described first, followed by the performance analysis on the various datasets. We trained a separate model on each dataset, but the variables and parameters are identical across these models. We initialized the weights from a Gaussian distribution with a standard deviation of 0.07. We also used L2 regularization with a decay value of 0.0018. Training took roughly 3 to 5 hours. The EMOTIC and FER-13 datasets are of comparable size, while FERG contains more images. We used oversampling to overcome this imbalance. Data augmentation is used when training the model on the larger dataset, i.e. FERG.
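The training settings above can be sketched in a few lines of NumPy. This is a hedged illustration and not our training code: the standard deviation of 0.07 and the decay value of 0.0018 come from the text, while the helper names, the exact form of the L2 term (no factor of 1/2), and the choice to oversample per class by repeating minority-class indices are assumptions made for the example.

```python
import numpy as np

WEIGHT_STD = 0.07    # Gaussian initialization, from the text
L2_DECAY = 0.0018    # L2 regularization decay, from the text


def init_weights(shape, rng):
    """Draw weights from a zero-mean Gaussian with the stated std."""
    return rng.normal(loc=0.0, scale=WEIGHT_STD, size=shape)


def l2_penalty(weights):
    """L2 term added to the loss: decay * sum of squared weights."""
    return L2_DECAY * sum(float(np.sum(w ** 2)) for w in weights)


def oversample(labels, rng):
    """Return sample indices in which every class is as frequent as the
    largest one, by re-drawing minority-class samples with replacement."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)


rng = np.random.default_rng(0)
layer = init_weights((1000, 50), rng)
penalty = l2_penalty([layer])

# Toy label vector with a rare class, in the spirit of FER-13's "Disgust".
toy_labels = np.array([0] * 5 + [1] * 2)
balanced_idx = oversample(toy_labels, rng)
```

Oversampling should be applied to the training split only, so that repeated samples do not leak into the validation or test sets.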

░ 5. CONCLUSION AND FUTURE WORKS
In this paper, a method is introduced to recognize emotion from facial images with varying pose, occlusion, and illumination. To our knowledge, no previous research on facial emotion recognition has been based on this hybrid method. Although training is done on datasets with still head poses and fixed illumination, our model is able to adapt to variations in illumination, color, contrast, and head pose. That is, our hybrid model gives better results than traditional machine learning models. Our hybrid model also produces good results with less training data on publicly available datasets such as EMOTIC, FER-13, and FERG. Our model is able to recognize emotions with high accuracy and to label each of them. The performance of our model on the FER-13 dataset is the best compared with the FERG and EMOTIC datasets. In the future, we will incorporate more deep learning methods to improve the results and will conduct further experiments on other available datasets.