Facial Expression Analysis and Estimation Based on Facial Salient Points and Action Unit (AUs)

░ ABSTRACT: Humans use facial expressions as one of the most effective, quick, and natural ways to convey their feelings and intentions to others. This research presents an analysis of human facial structure and its components using Facial Action Units (AUs) and geometric structures to identify human facial expressions. The approach considers facial components such as the nose, mouth, eyes and eyebrows for facial expression recognition (FER). Nostril contours such as the left lower tip, right lower tip, and centre tip are considered the salient points of the nose. The salient points of the mouth are extracted from the left and right end points and the upper and lower lip mid points, along with the lip curve. These salient points are extracted for all facial expressions of the same subject, with the neutral face as reference. The geometric structure of the neutral face is mapped against those of the other facial expressions, and the deformation is estimated using the Euclidean distance. The classification algorithms LibSVM, MLP and RF achieved an average classification accuracy of 86.56%. The experimental findings show that this extraction of image features is computationally efficient and gives promising results.


░ 1. INTRODUCTION
Building an automatic FER system is a complex task, since the human face varies from person to person due to factors such as age, culture and ethnicity. Other factors, such as facial hair, glasses, translation, scaling, rotation and illumination, make the task more complex still. Thus, there is a need for an effective approach that can handle these factors. Further, accurate localization of facial points is a key step in extracting features to categorize facial expressions. However, in many real-time applications, the performance of the feature extraction approach also depends on environmental factors such as illumination and brightness. If the illumination is non-uniform, the extracted features will be inaccurate, resulting in a poor recognition rate. Hence, an effective FER approach is needed, which may apply methods such as DCT Normalization (Chen and Er, 2006 [1]) or Histogram Equalization (Gangolli et al., 2019 [2]) before facial feature extraction. The suggested FER system is shown in Figure 1. In this paper, the proposed approach analyses and estimates facial expressions using facial landmarks and their salient points. The proposed approach uses the JAFFE dataset for analysis. The dataset comprises 256 x 256-pixel grayscale photos of ten Japanese women. There are 213 photos in total, showing seven facial expressions: anger, disgust, happiness, fear, sadness, surprise and neutral. There are 30 images of anger, 29 of disgust, 31 of happiness, 32 of fear, 31 of sadness, 30 of surprise, and 30 neutral images. The dataset is split into training and testing sets by cross-validation. The proposed approach detects the face using Viola Jones (Jain and Choudhary, 2018 [3]). The detected face is cropped using background subtraction and rescaled to 132x132 pixels. Normalization and filtering techniques are used to handle illumination variation, occlusions, noise, etc., and edges are detected using morphological Sobel edge detection (Singh and Singh, 2016 [4]).
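As an illustration of the histogram equalization step mentioned above, the following is a minimal sketch in plain Python. It assumes an 8-bit grayscale image stored as a list of rows; the function name `equalize_histogram` is hypothetical and the code is not the implementation used in this work.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    flat = [p for row in image for p in row]

    # Count how often each gray level occurs.
    hist = [0] * levels
    for p in flat:
        hist[p] += 1

    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: there is no contrast to spread out.
        return [row[:] for row in image]

    # Map each gray level so that the cumulative distribution becomes roughly linear.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]
```

For a low-contrast input such as `[[50, 50], [100, 150]]`, the darkest level is mapped to 0 and the brightest to 255, which is the contrast-spreading effect that makes later feature extraction less sensitive to illumination.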

Figure 1. Schematic Diagram of the Proposed FER System
The next phase is feature extraction. The approach considers the eyes, eyebrows, nose and mouth as facial components. The rescaled sample images are processed to detect facial landmarks using a Cascaded CNN algorithm.

░ 2. LITERATURE SURVEY
A person's emotional and mental condition is reflected in their facial expressions, which makes them very important. According to psychological surveys, language conveys 7% of the overall information, language auxiliaries such as speech rhythm and tone convey 38%, and facial expressions convey 55%. Hence, FER is considered an important research topic in terms of both theoretical and practical research and real-life applications. Emotional feelings such as happiness, sadness and fear are expressed through facial expressions and are caused by various situations. Ekman (1992 [5]) identified six basic emotions that are frequently expressed through the human face: anger, happiness, surprise, sadness, disgust and fear. Automatic FER has attracted a large number of researchers in the last few decades due to the rise of numerous applications in computer vision, robotics, medicine, HCI, etc. However, FER systems have certain limitations; for example, traditional approaches may not handle large volumes of real-time data. Apart from this, lighting conditions, noise, occlusion and head pose variations affect FER systems, resulting in uncertainty and ambiguity. Many solutions employ handcrafted features for FER and hence need additional effort in terms of programming and computational expense.
In this paper, an architecture specification of an FER system is proposed to handle the above-mentioned research issues. The proposed approach considers the six basic emotions described by Ekman, 1992 [5]. The approach builds the FER model from the features of facial components such as the eyes, eyebrows, nose and mouth. It proposes geometric feature extraction for these facial components, with the neutral face as the reference image. The geometric structure and its properties are exploited for each facial component. For instance, the lips are mapped to a parabola and its properties, the nose is mapped to a tetrahedron/pyramid and its properties, etc. The similarity between the geometric properties of the neutral face and those of the other expression faces is analyzed, and the degree of deformation of each facial expression is calculated.

Reviews on Pre-Processing Techniques in FER System
The image pre-processing step improves the FER system's performance. This step operates on the input sample data and is carried out before feature extraction. Image preprocessing includes various techniques for handling the raw data, such as face detection, cropping, distortion removal, image scaling, normalization, segmentation and additional enhancement processes, to improve the image frames. The sections below review the literature on various pre-processing techniques.

Reviews on Face Detection and Tracking
Face detection is a technique for extracting the face from a picture. This step is important since each picture has a distinct size and orientation. Due to the observed person's motions, the pictures captured by cameras will be of varying sizes and angles. As a result, searching for a certain pattern in the picture becomes time-consuming. Detecting faces in photos with noise and occlusion may be challenging. Face tracking is difficult if the background and lighting conditions of the facial picture are complicated. If the test photograph is taken in a different lighting environment, facial expression recognition is more likely to fail, because facial landmark points cannot be detected accurately under varying illumination. To avoid these problems, accurate face detection is required in preprocessing. This section discusses various approaches to detecting, cropping and scaling facial images. Li and Lyu (2016) [6] suggested a depth image feature-based face detection approach. The HOG feature is used as the main feature to train the classifier. The detected image is divided into cells and the gradient is calculated pixel by pixel to produce the Histogram of Oriented Gradients. The gradient feature vector is normalized block by block. All the HOG feature vectors are collected using a zigzag-scan sliding window. The detection rate is further improved by combining the LBP feature with the HOG feature, and the features are classified using an Advanced SVM classifier. SumaLakshmi and Vasuki (2020) [7] classified the models into edge, dimensional and middle for facial detection. A black-and-white picture is utilised for the design of the Viola and Jones models. When the model is overlaid on the image, the pixel sums under the black and white portions of the rectangle are subtracted to detect the facial features. Viola Jones is used to identify the face, then the LBP histogram from three sections of the face is analysed and the final histogram from the three regions is returned. Narayan and Deshpande (2017) [8] detected the face using the Viola Jones algorithm and then used a fusion of PCA and FFNN to identify images. Wagh et al. (2015) [9] captured the image with a high-definition camera and preprocessed it using histogram equalisation. Face detection is accomplished using the Viola-Jones algorithm, and skin classification is applied to improve the face detection algorithm.
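The cell-wise gradient histogram and block normalization that underlie the HOG feature described above can be sketched as follows. This is a simplified illustration under stated assumptions: central-difference gradients, unsigned orientations in nine bins, hard bin assignment without interpolation, and hypothetical function names. It is not the implementation of [6].

```python
import math


def cell_orientation_histogram(cell, bins=9):
    """Unsigned-gradient orientation histogram for one cell (list of rows)."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Border pixels are skipped so that central differences stay inside the cell.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


def l2_normalize(vector, eps=1e-6):
    """L2-normalize a block descriptor (concatenated cell histograms)."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]
```

A cell containing a purely horizontal intensity ramp puts all of its gradient magnitude into the first bin, which is the kind of edge-direction signature that the classifier consumes.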

Reviews on Handling Head posture, Illumination, Occlusion, Scale and Translation Factors
Image cropping is an essential step in the FER system, as the relevant section must be cropped from the image for feature extraction. Images are generally dominated by noisy data, occlusions, disturbing backgrounds and illumination conditions. Cropping removes the image background and retains the part of the image that is useful for feature extraction and classification. Kumar et al. (2012) [10] proposed an approach where an image is given as an input to detect some faces. A face detection algorithm is used in conjunction with an occlusion detection method to identify all of the occluded faces. The experiment was conducted in a classroom with thirty students, and the proposed algorithm detected the faces without occlusions well. Kavinmathi et al. (2016) [11] proposed an approach to detect faces for automatic online attendance. The authors first apply a background subtraction method to the input image; the face is then cropped and detected, and the Eigen value method is used to recognize the faces.
Zeng and Veldhuis (2020) [12] explored the occlusion problem and presented various face recognition techniques that deal with occlusion-robust feature extraction and occlusion recovery. Kahatapitiya et al. (2019) [13] proposed a method to remove occlusions from an image automatically. This method identifies the occlusions in an image by separating the foreground and background classes using vector embedding and discards them through inpainting. Face restoration is an important process for locating occluded regions in the image. Srinivasan and Balamurugan (2015) [14] used the Gappy Principal Component Analysis (GPCA) method for 3D image occlusion, in which the image is first masked by computing the Distance from Feature Space (DFFS). The output image is refined to remove the regions detected by the masking.
Ma and Mohamed (2015) [15] proposed an image processing pipeline to reduce the effects of illumination in the image. BU-4DFE dataset images are used as the training dataset, and 3D images of different facial expressions are tested under various added lighting conditions. The proposed pipeline includes gamma correction, selective filtering and contrast equalization. A SIFT keypoint detector (Guo et al., 2018) [16] is used to compare the training and testing illuminations of the images. A Fisherface classifier is used to evaluate the effect of the preprocessing pipeline on classification.
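To make the first stage of such a pipeline concrete, the sketch below applies gamma correction to an 8-bit grayscale image. The second function is a plain min-max contrast stretch, used here only as a simple stand-in for the contrast equalization stage; the selective filtering stage is omitted. Function names and the default gamma of 0.5 are assumptions for illustration and are not taken from [15].

```python
def gamma_correct(image, gamma=0.5, max_value=255):
    """Apply the power law out = max * (in / max) ** gamma to every pixel."""
    return [[max_value * (p / max_value) ** gamma for p in row] for row in image]


def stretch_contrast(image, max_value=255):
    """Min-max stretch: a simplified stand-in for contrast equalization."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image has no contrast to stretch.
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) * max_value / (hi - lo) for p in row] for row in image]
```

With gamma below 1, dark regions are brightened more than bright ones, which compresses the dynamic range of shadows cast by uneven lighting before the contrast is rescaled.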
Pal and Porwal (2015) [17] proposed a Local Brightness Normalization (LBN) algorithm to correct global and local vertical stripes in images. To accomplish this, they used noise fraction transformation-based filtering to remove noise from the image. The global and local stripes are then corrected based on the output values. It is observed that, although detecting human facial emotions using geometric structure has become a popular technique in recent times, its complexity lies in the geometric variability of both the emotion expression and the neutral face, which directly affects geometric FER methods.

Reviews on Facial Segmentation and Landmark Detection
Image segmentation is a critical stage in FER preprocessing. Here, the facial image is divided into multiple segments in order to extract data from the attributes of the facial image. This helps to locate facial components and boundaries in the facial image. In FER, the ROI segmentation technique is generally used, as it segments the face regions of interest that actively participate in expression recognition. Interface's 49-point facial landmark detector has been used to localize the points of the eyebrows, lips and nose. The edges of the upper lip and eyebrows are detected using edge detection algorithms such as Sobel, Canny, Prewitt and Roberts (Zhang and Yan, 2009) [18]. After edge detection, false detections are removed using Otsu's thresholding method (Feng and Zhao, 2017) [19]. Finally, emotion classification is done using an MLP classifier. By extracting the principal curvatures and shape index of the eyes and nose, facial emotions can be recognised more rapidly from the characteristics of the face (Vezzetti et al., 2017) [20].
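Otsu's thresholding, mentioned above, selects the gray level that maximizes the between-class variance of the two resulting pixel groups. A minimal sketch follows, assuming 8-bit gray levels passed as a flat list; the function name is hypothetical and the code is not drawn from [19].

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with value <= t form the background class, the rest the foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(levels):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t
```

For a bimodal input such as five pixels of value 10 and five of value 200, the function returns 10, the first level that cleanly separates the two groups; weak edge responses below the threshold can then be discarded.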

Conventional approaches
In addition to the aforementioned distinction between static FER and FER on sequences, approaches to FER can also be categorized by the features used for classification. Conventional FER approaches use handcrafted features derived from the face in the feature extraction step of the FER process. Deep-learning approaches often use a CNN to extract features directly from images during training. Approaches in the conventional category usually adhere to the FER process schema depicted in Figure 2.

Figure 2. Conventional FER process
Facial images are first collected and preprocessed (histogram equalization, noise reduction, etc.). FER is usually performed on grayscale images, as color does not carry significant information about the expression. The next step is face region detection. It is important to locate the face region in the image before attempting to localize facial landmarks, in order to avoid false positives. Multiple approaches to face region detection have been proposed over the past few decades. The Haar cascade classifier is one of the more popular approaches. Localization is performed via the AdaBoost method using Haar-like features (descriptors of contrast change between adjacent rectangular groups of pixels). The detected face region is then used as a region of interest for face landmark estimation (face alignment). Many face alignment approaches use a cascade of regressors, in which each regressor improves on the landmark position estimate based on image features relative to the previous estimate. In [25], the authors use an ensemble of regression trees learned by gradient boosting to achieve super-real-time performance while maintaining state-of-the-art accuracy on the face alignment problem. The feature extraction step uses face landmarks to produce a feature vector for training. Temporal and appearance features are also often extracted in addition to geometric landmark features. SVMs are the dominant classification method in conventional FER approaches. A Radial Basis Function (RBF) kernel SVM usually seems to outperform a linear SVM in FER.
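The Haar-like features mentioned above are typically evaluated in constant time with an integral image (summed-area table). The sketch below shows this for a two-rectangle horizontal feature; it illustrates the idea only and is not the OpenCV cascade implementation. All function names are hypothetical.

```python
def integral_image(image):
    """Summed-area table with one extra row and column of zeros."""
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left corner is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_horizontal(ii, x, y, w, h):
    """Left half minus right half of a w-by-h window (w should be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

On the image `[[1, 1, 9, 9], [1, 1, 9, 9]]`, the feature over the full window evaluates to 4 - 36 = -32, a strong response to the dark-to-bright vertical edge. AdaBoost then selects and weights many such features to form the cascade stages.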

In [26], the authors present a real-time mobile application for FER using a set of SVMs to recognize 7 basic emotions. An Active Shape Model (ASM) [14] is used to locate 77 face landmarks, which are then used to generate 13 high-level distance features. The model performs classification based on displacements relative to the neutral feature set. During classification, each frame is first classified by a binary classifier that detects the neutral emotion state. Features extracted from neutral frames are then used to update the current neutral feature set. The CK+ dataset was used for training, and the reported accuracy on CK+ is 87.9%.
In [27], the authors used an Elastic Bunch Graph (EBG) [16] to initialize 52 landmark positions, which are then tracked through the rest of the frames in the sequence using Gabor jets. Classification is performed by an SVM using two types of features. The first type is the x and y displacement of the 52 landmarks relative to the neutral features. The second type is the change in Euclidean distance and angle between all pairs of landmarks relative to the distances and angles in the neutral features. Neutral frames are not recognized in-process; instead, the neutral frame is assumed to always be the first frame in the sequence. The final feature vector is selected from a feature pool consisting of these two types of features using AdaBoost with Dynamic Time Warping (DTW) similarity. The CK+ dataset was used for training, and the reported accuracy on this dataset is 97.2%. In [28], the authors present a real-time FER system using multi-block Local Binary Pattern (LBP) appearance features and Principal Component Analysis (PCA) to classify 6 basic emotions (the neutral emotion is not classified). In the proposed model, a Haar cascade is first used to detect the face region in the source image. The face region is then divided into small blocks and the LBP histogram is calculated for each block. The resulting feature vector is formed by concatenating the separate LBP histograms. Classification is done using PCA Eigen values for each emotion. The reported accuracy on a custom dataset is 97%. In contrast to [29], which extracted appearance features from the global face area, the author derived region-specific appearance LBP features by partitioning the face region into 29 local regions. An incremental search approach was employed to localize the important local regions in order to reduce dimensionality. In addition to the appearance LBP features, geometric landmark features were also extracted using an implementation of [12]. The final feature vector is presented to a linear SVM classifier. The model was validated against the CK+ dataset with a reported accuracy of 91.8% when classifying 7 basic emotions. These results are summarized in Table 1.
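The multi-block LBP feature described above can be illustrated with the basic 3x3 LBP operator. The sketch below is a simplified example under stated assumptions: the bit order starts at the top-left neighbour and runs clockwise, blocks are square and non-overlapping, and all function names are hypothetical. It does not reproduce the system in [28].

```python
# Neighbour offsets (dy, dx), clockwise from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, y, x):
    """8-bit LBP code: one bit per neighbour that is >= the centre pixel."""
    center = image[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(block):
    """256-bin histogram of LBP codes over the interior pixels of a block."""
    hist = [0] * 256
    for y in range(1, len(block) - 1):
        for x in range(1, len(block[0]) - 1):
            hist[lbp_code(block, y, x)] += 1
    return hist


def multiblock_lbp(image, block_size):
    """Concatenate the LBP histograms of non-overlapping square blocks."""
    features = []
    for top in range(0, len(image) - block_size + 1, block_size):
        for left in range(0, len(image[0]) - block_size + 1, block_size):
            block = [row[left:left + block_size]
                     for row in image[top:top + block_size]]
            features.extend(lbp_histogram(block))
    return features
```

Keeping one histogram per block preserves coarse spatial layout (for example, mouth texture versus eye texture), which a single global histogram would discard.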

Deep-learning approaches
Deep-learning approaches to FER often use a CNN either to perform classification directly or to extract latent features. To capture the temporal aspect of expressions, Recurrent Neural Networks (RNNs) are sometimes used as well.
In their submission to the 2015 Emotion Recognition in the Wild contest, the authors of [30] examined the effectiveness of a transfer-learned CNN on the FER problem when only a small dataset is available. They used a pre-trained CNN model of the VGG-CNN-M-2048 [20] architecture, which had been trained on a generic image recognition task using images from ImageNet. This base model was then transfer-learned in two fine-tuning phases using the EmotiW and FER-2013 datasets. The resulting model achieved 55.6% accuracy on the test set. In [31], the authors introduced a model that combines a deep temporal appearance convolutional network (DTAN) and a deep temporal geometry network (DTGN). The outputs of these two networks are combined by element-wise addition, and softmax is applied to produce the final output. DTAN is a 3D convolutional network in which the convolutional filters are shared along the time axis. This network captures temporal differences in the appearance of the input images. A sequence of facial landmarks is used as input to the DTGN. Each landmark point is centered on a nose point and normalized by dividing by the standard deviation of the corresponding dimension. Horizontal flipping and rotation were applied to the input image sequences to increase the amount of data available for training. The model was trained using the MMI dataset. An accuracy of 97.25% on the CK+ dataset is reported.
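The landmark normalization and the output fusion described for the DTGN/DTAN model can be sketched as follows. This is a minimal illustration that follows the wording above literally: landmarks are centred on a nose point and divided by the per-axis standard deviation, and the two network outputs are added element-wise before a softmax. The function names and the choice of the population standard deviation are assumptions, not details confirmed by [31].

```python
import math


def normalize_landmarks(points, nose_index):
    """Centre (x, y) landmarks on the nose point and scale each axis by its std."""
    nose_x, nose_y = points[nose_index]
    xs = [x - nose_x for x, _ in points]
    ys = [y - nose_y for _, y in points]

    def std(values):
        mean = sum(values) / len(values)
        return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

    # Fall back to 1.0 to avoid division by zero for degenerate inputs.
    std_x = std(xs) or 1.0
    std_y = std(ys) or 1.0
    return [(x / std_x, y / std_y) for x, y in zip(xs, ys)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_outputs(appearance_out, geometry_out):
    """Element-wise addition of the two network outputs followed by softmax."""
    return softmax([a + g for a, g in zip(appearance_out, geometry_out)])
```

Centring on the nose removes translation, and dividing by the standard deviation removes scale, so the geometry network sees only the shape change of the face over time.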
Breuer and Kimmel employed deep CNN visualization methods to examine the relation between CNN-learned features and AUs in [22]. They used an architecture of three convolutional blocks (each consisting of a convolutional layer of 5x5 filters, ReLU activation and a max pooling layer with a 2x2 window) and two fully-connected layers to perform emotion classification. This architecture achieved 98.5% accuracy on the CK+ dataset, measured by 10-fold cross validation. After examining the neuron activations in individual layers, they found a high correlation between the learned features and FACS AUs. They then performed transfer learning on the same architecture to detect individual AUs and reported accuracies of 97.5% in AU presence detection and 96.1% in AU intensity prediction. This work demonstrates the viability of deep CNNs in FER-related tasks.
The submission to the 2015 Emotion Recognition in the Wild challenge (EmotiW) by the authors of [32] proposes a hybrid CNN-RNN network for video classification. The CNN is used to extract a high-level representation of the input frames. Multiple CNN architectures of various depths were tried. Since the data provided as part of the challenge contained only videos labelled with a single emotion per video, other static datasets were used to train the CNN. It was observed that deeper architectures tend to overfit on the static datasets, and therefore a network of 3 convolutional blocks (each consisting of a convolutional layer of 9x9 filters, ReLU activation and max pooling) was chosen as the best contender. The features extracted by the CNN were used as input to an IRNN (an RNN of ReLU units using the initialization trick described in [24]). In addition to the appearance features extracted by the CNN, the authors also used geometric landmark features and audio features to enhance the performance of the final model. To combat differing lighting conditions between datasets, histogram equalization was applied to the images. The best reported accuracy on the test dataset provided as part of the challenge was 52.875%, an improvement over pure-CNN approaches, as seen in the

Gaps in the Existing Research
In this research, issues related to pre-processing, feature extraction and classification are considered. These include face detection, illumination, noise, occlusion, facial landmarks, segmentation, localization of facial regions, geometric mapping, feature vector construction and classification. The proposed approach uses a distance metric, the degree of deformation, the degree of direction and geometric properties, along with a suitable distance measure, to address these issues. It does so by superimposing a grid over the features, extracting feature vectors using the degree of distortion, the direction and the magnitude, and choosing the classifier that classifies emotions with the highest accuracy.

Universality of Emotion Expression
The question of the universality of expression is an important one in establishing whether emotion recognition based on facial expression is a viable general method. Until the second half of the 20th century, most academics believed that expressions of emotion are culturally bound and that only members of the same or a similar culture express emotions in the same way. Charles Darwin, however, thought otherwise.
In [33], he argued that expressions of emotion are universal because they are a product of evolution. To support this claim he proposed three principles.
The principle of serviceable habits describes some expression habits as helpful and therefore reinforced by natural selection. An example would be raising the eyebrows to increase the field of view in the event of danger (correlated with the emotion of fear). The antithesis principle states that some expressions, such as shoulder shrugging, exist merely because they are the opposite of a serviceable habit. Some expressions, as proposed by the expressive habit principle, are the result of a discharge of excitement in the nervous system. A vocal roar of anger would be an example of such an expression [34]. In the mid-1960s Paul Ekman took an interest in this issue. Based on hundreds of hours of film capturing isolated cultures in the New Guinea highlands, taken by Carleton Gajdusek and Richard Sorenson, Ekman found that the facial expressions observed in response to given stimuli were in accordance with his expectations. No culturally unique expressions were observed either. Even though he had leaned towards the culturally relativistic viewpoint at first, this experience persuaded him that Darwin might be right and inspired him to travel to the New Guinea highlands.
After conducting his own experiments and collecting supporting evidence for the universality of expression, Ekman came up with the idea of "display rules" (a set of socially learned, culturally unique behaviors that are used to mask, exaggerate, diminish or exhibit expressions in specific cultural contexts) that would explain culture-based differences in expressions. In the late 1960s he gathered evidence supporting this explanation by conducting a study of students in Tokyo, Japan and Berkeley, California. He found that both Japanese and American students reacted the same way to emotion-inducing clips as long as they were filmed alone by a hidden camera. However, when a scientist entered the room, the Japanese students masked negative expressions with positive ones [27]. Some examples of AUs are presented in Figure 3.

Expression universality has been widely accepted, as a number of cross-cultural studies yielded supporting results. In [18], the authors argue that only a few of these studies were truly cross-cultural. They claim that cultures that have been exposed to western culture have adapted their emotion expressions and concepts. Furthermore, in the studies that were truly cross-cultural (such as Ekman's experiments in New Guinea), an emotion conceptual context was included in the experimental method by asking the subjects to assign a facial expression to a word or description. Their free-label experiment with participants from the Himba ethnic group and from America did not find supporting evidence for expression universality, and the authors suggest that emotion expressions are culture-based to some degree. Whether or not emotion expression is truly universal, the finding of universality among cultures exposed to western culture is sufficient for the vast majority of potential FER applications.

Expression Measurement
In order to measure and describe facial expressions, Dr. Paul Ekman and Dr. Wallace Friesen developed an anatomically based system for measuring human facial movements, called the Facial Action Coding System (FACS). Action Units (AUs) are used to characterize face movements that result in changes in appearance. An action unit is a numeric code that represents the activity of certain facial muscles or muscle groups. FACS distinguishes 46 different AUs (e.g. AU1: Inner brow raiser, AU23: Lip tightener).
The resulting FACS code is a string of the AUs that are present. The presence of an emotion is decided by rules based on the presence of certain AUs. Even though FACS was primarily developed to help describe facial expressions when studying emotion, it is a robust system that can be used in other areas as well [19].
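The rule-based mapping from a FACS code to an emotion can be sketched as a set lookup. The AU combinations below are commonly cited prototype combinations and are included purely as an illustrative assumption; they are not taken from this paper, and published rule sets differ in detail. The "+"-separated code format and the function names are likewise assumptions.

```python
# Commonly cited prototype AU combinations (illustrative assumption only).
PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
    "fear": {1, 2, 4, 5, 7, 20, 26},
    "anger": {4, 5, 7, 23},
    "disgust": {9, 15, 16},
}


def parse_facs_code(code):
    """Parse a code such as '6+12' into the set of active AU numbers."""
    return {int(token) for token in code.split("+") if token.strip()}


def match_emotion(present_aus):
    """Return the most specific prototype fully contained in the active AUs."""
    candidates = [
        (len(aus), name)
        for name, aus in PROTOTYPES.items()
        if aus <= present_aus
    ]
    # Prefer the prototype with the most AUs; None if nothing matches.
    return max(candidates)[1] if candidates else None
```

For example, `match_emotion(parse_facs_code("6+12"))` yields `"happiness"`, while a code containing only AU23 matches no prototype. Preferring the largest matching prototype resolves overlaps such as the surprise AUs being a subset of the fear AUs.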

░ 4. PROPOSED METHOD
The proposed approach is divided into three phases: pre-processing, feature extraction and classification. These are discussed below.

Pre-processing
The Viola Jones algorithm (Jain and Choudhary, 2018) [21] is used to detect faces in pictures during the pre-processing phase. The method is applied to the greyscale picture. Using Haar-like features, it defines a box and searches for a face inside it. The M*N image is taken as input, and a window scale multiplier is used as a parameter to compute the mean and standard deviation of the windowed images. A cascade of false negative and false positive rates is validated. The group of windows declared positive by the cascade is obtained as output, and the face is detected. The detected face is cropped using the background subtraction method, in which the foreground image is subtracted from the background image so that the facial part is cropped effectively, and the image is rescaled to 132x132 pixels. Rescaling is a necessary pre-processing step, as images with higher pixel values appear fuzzier and pixelated. Rescaling is also important for bringing the image files in the dataset to a common scale. Scaling down the image makes the feature extraction process easier; hence, the face images are rescaled to 132x132 pixels. Further, the approach uses normalization and filtering techniques to handle illumination variation, translation and occlusions. To detect edges, the proposed FER system uses the Sobel edge detection technique [22]. The Sobel operator is a discrete differential operator, used here to compute an approximation of the gradient of the image brightness function. It is a typical edge detection operator based on the first derivative of a function. Because the operator incorporates a local averaging operation, it has a smoothing effect and can effectively limit the influence of noise.
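The Sobel operator described above can be sketched directly from its two 3x3 kernels. This is a minimal plain-Python illustration that assumes a grayscale image stored as a list of rows and leaves border pixels at zero; it is not the implementation used in the proposed system.

```python
import math

# Standard Sobel kernels for the horizontal and vertical derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a grayscale image; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * pixel
                    gy += SOBEL_Y[ky][kx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out
```

The weights 1, 2, 1 across each kernel are the local averaging mentioned in the text: each derivative estimate is smoothed over three rows or columns, which is what dampens pixel-level noise. On three rows of `[0, 0, 10, 10]`, the interior pixels next to the step both respond with magnitude 40.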

Feature Extraction
Feature extraction is the second phase of the proposed FER system. The proposed approach considers the six basic emotions, with the neutral expression as reference. Figure 4 shows these basic emotions. The approach considers the eyes, eyebrows, nose and mouth as facial components. In this phase, the pre-processed training set images are processed to extract features. The approach begins by detecting facial landmarks using the ensemble of regression trees approach (Kazemi & Sullivan, 2014) [23]. A set of images annotated with face landmarks is used for training. This data is used to train the regression trees in an ensemble to predict face landmark locations directly from the pixel intensities. Thus, the facial landmarks of the eye, eyebrow, nose and mouth components are detected accurately for further processing. The image is segmented based on Regions of Interest (RoI): following (Kohler, 1981), a number of segmentation thresholds are determined, and the image is separated into several target regions and backgrounds by these thresholds [24]. The salient points are identified and extracted from the segmented RoI using the facial landmarks. The salient points extracted from the RoI cover the facial components (eyes, eyebrows, nose and mouth) and are then used to analyse the facial expressions. Figure 5 (a), (b) and (c) depicts the salient points of the facial components. For instance, the mouth component has five salient points, S17, S18, S19, S20 and S21. The nose component has two salient points, S15 and S16. The left and right eyes together have eight salient points, S7, S8, S9, S10 and S11, S12, S13, S14. The left and right eyebrows together have six salient points, S1, S2, S3 and S4, S5, S6. The movement of the eyebrows is represented in terms of the angles α, β, γ and θ, and the expressions are analysed based on geometric structures, as explained in detail in the sub-section below.

Geometric Structure Analysis
This sub-section presents the salient points and geometric structure of the facial components for the neutral and other facial expressions. This is depicted in Figure 6. The proposed FER system considers all facial expressions, along with the neutral expression, for geometric structure analysis. For reasons of space and clarity, this sub-section shows only the neutral and happy facial expressions. Likewise, although the proposed FER framework analyses the geometric structure of all facial components, only the nose and mouth are presented for simplicity.

Geometric Structure Analysis of Neutral Face
The proposed approach takes the neutral face as the reference face, so its geometric structure is analysed and captured first. Based on the facial landmarks, the salient points that locate the two nostrils are extracted. These include the sharp edge of the left and right nostrils along with the lower tip of both nostrils. For the mouth, the salient points of the lips are extracted: the left end point, the right end point, and the upper and lower lip mid points along with the curve. The salient points extracted from the nose and mouth are used to build the geometric structure of these components in the neutral face. Figure 7 shows the salient points and the geometric structure of the nose component in the neutral face. The distances between the salient points are computed for each component. Similarly, the salient points extracted for the eyes and eyebrows are linearly combined and treated as a geometric structure.

Figure 7. Salient Points and Geometric Structure for Nose and Mouth components in Neutral face

Table 3 lists the feature vectors of the nose and mouth components and their salient points as pixel coordinates (x, y). These coordinates are used to compute the distances. The proposed approach analyses facial expressions in terms of distances and angles measured between the salient points.

Euclidean Distance Measure
The Euclidean distance is the ordinary straight-line distance between two points and follows from the Pythagorean formula. The approach uses it to compute the distances between the salient points of the neutral expression, both within a component and across components. If P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) are two points in Euclidean n-space, the length of the line segment PQ that connects them is the Euclidean distance between P and Q, given by Eq. (1):

$$ d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (1) $$

For salient points in the image plane, n = 2, and P = (p1, p2) and Q = (q1, q2) are the pixel coordinates of the two salient points. For instance, d1, d2 and d3 denote three Euclidean distances (ED) between salient points within a component, and d4, d5 and d6 denote three Euclidean distances between salient points across components. With this distance, Euclidean space becomes a metric space, as shown in the diagram.

In the FER system, these distances are used to compare the neutral face with other facial expressions such as happy or sad. In total, 160 Euclidean distances were calculated between the pairs formed from the 20 salient points, giving a vector of 160 elements. This is represented mathematically in Eq. (2) and depicted in Table 4.

$$ V = [d_1, d_2, \ldots, d_{160}] \qquad (2) $$

For the automated facial expression recognition system, this vector represents the neutral expression of the image sample. The same procedure is carried out for the other facial expressions, as presented in the following sub-section.
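The construction of the distance vector can be sketched in a few lines. Which pairs make up the 160-element vector is not detailed in the text, so this sketch simply enumerates every unordered pair of a small hypothetical point set; n points then give n(n - 1)/2 distances. The function name `distance_vector` and the coordinates are assumptions made for illustration and are not taken from the tables.

```python
import math
from itertools import combinations

def distance_vector(points):
    """Euclidean distances for every unordered pair of salient points.

    `points` maps a salient-point label to its (x, y) pixel coordinates.
    Pairs are visited in sorted-label order so the vector layout is stable
    across images of the same subject.
    """
    labels = sorted(points)
    return [math.dist(points[a], points[b]) for a, b in combinations(labels, 2)]

# Hypothetical coordinates for two nose and two mouth salient points
neutral = {"S15": (60, 70), "S16": (72, 70), "S17": (50, 95), "S18": (82, 95)}
vec = distance_vector(neutral)
print(len(vec))  # 4 points give 4 * 3 / 2 = 6 distances
```

Keeping the pair order fixed matters because the classifier compares the same vector position across expressions.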

Geometric Structure Analysis of Happy Face
The proposed FER system then considers the other facial expressions, such as the happy expression, of the same subject (person) and repeats the same procedure. The salient points of the happy expression are extracted and their geometric structure is analysed. Figure 8 shows the salient points of the mouth and nose components and their geometric structure for the happy face.
The distances and angles between the salient points are then computed within and across the components. Table 5 lists the nose and mouth feature vectors of the happy face as salient points with pixel coordinates (x, y), which are used to compute the distances.
The approach uses the Euclidean distance metric to compute the distances between the salient points within and across components, together with the corresponding angles. This is depicted in Table 6. Similarly, salient points are extracted and geometric structures are mapped for the other facial expressions such as sad, happy, surprise, disgust, angry and fear. The distances and angles are computed in the same way. The vector that represents each basic facial expression of the same subject is given as input to the classification algorithm. The classifiers learn each facial expression and compare it with the neutral expression (reference). The classification procedure is discussed in detail in the following sub-section.
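The comparison with the neutral reference can be illustrated as a per-distance change. This is a minimal sketch that assumes the deformation is the signed difference between corresponding distances; the text does not state whether a difference, a ratio or the raw vectors are used. The function name `deformation` and the values are hypothetical.

```python
def deformation(neutral_vec, expression_vec):
    """Element-wise change of each distance relative to the neutral face."""
    if len(neutral_vec) != len(expression_vec):
        raise ValueError("distance vectors must have the same length")
    return [e - n for n, e in zip(neutral_vec, expression_vec)]

# Hypothetical distances in pixels: mouth width, mouth height, nose-to-mouth gap
neutral_d = [32.0, 10.0, 25.0]
happy_d = [40.0, 12.0, 22.0]
delta = deformation(neutral_d, happy_d)
print(delta)  # [8.0, 2.0, -3.0]
```

In this toy case the mouth widens and the gap between nose and mouth shrinks, which is the kind of pattern a classifier can associate with a smile.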

Facial Expression Classification
After computing the distances and angles between the salient points for the anger, disgust, fear, sad, surprise and happy expressions, the proposed FER system compares these measures with those of the neutral face. The comparisons are given to the classification algorithms as hypothetical rules, as depicted in Eq. ( ). For experimentation, the JAFFE Dataset is divided into a training set and a testing set using n-fold cross validation to train the SVM. The facial images are segregated into different known classes. Based on the hypothesis, the SVM classifier is trained to predict the facial expressions. Facial expressions were recognised using a multi-class binary SVM classifier.

Figure 8. Salient Points and Geometric Structure of Happy face
The SVM is a supervised classification approach that places no restrictions on how the data are distributed. Each input (Si, Sj) comprises k features that describe the class 'Emotional Expressions', and the model is trained using the linear kernel function. Once a classifier has been learned for one kind of expression, the same procedure is applied to the other expressions. In this way, all six emotional expressions are discriminated.
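The training step can be sketched with scikit-learn, whose `SVC` class is built on LIBSVM. This is an illustration under stated assumptions and not the authors' setup: the data are synthetic stand-ins for the distance features, the number of folds (5) is chosen arbitrarily because the text only says "n-fold", and the class names are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the distance features: 3 classes, 20 samples each,
# 6 features per sample. Real inputs would be the 160-element vectors.
centers = np.array([[0.0] * 6, [5.0] * 6, [10.0] * 6])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 6)) for c in centers])
y = np.repeat(["neutral", "happy", "sad"], 20)

# SVC handles multi-class problems with one-vs-one binary SVMs.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5)  # stratified 5-fold cross-validation
mean_accuracy = scores.mean()
```

Switching `kernel` to `"poly"`, `"rbf"` or `"sigmoid"` selects the other kernels named in the experiments.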

░ 5. RESULTS AND DISCUSSION
A number of experiments on the JAFFE image database were carried out to test the proposed approach and assess its performance. The proposed system considers the eyes, eyebrows, nose and mouth of the input images and extracts salient points using facial landmarks. The geometric structures are mapped for each of these components, and the distances and angles between their salient points are computed. These features were used as input to the classifiers, which were trained to learn and categorise facial expressions and obtained an accuracy of about 86 percent on average. Besides the SVM classifier, the proposed FER system uses multilayer perceptron (MLP) and random forest (RF) classifiers. For the SVM, the LibSVM implementation is used with linear, polynomial, Radial Basis Function (RBF) and sigmoid kernels. The RF uses resampling to choose a random subset of the data from the Dataset. Table 7 shows the classification accuracy on the JAFFE Dataset using LibSVM with the RBF kernel. The confusion matrix shows how the samples of each facial expression are distributed over the predicted classes. For example, in Table 7, out of 30 surprise expressions, 22 are correctly labelled as surprise, whereas 8 are misclassified as anger, happy or disgust. Overall, 176 of 213 images were correctly classified by the proposed method, an overall accuracy of 82.65%. Table 7 also gives the classification accuracy for each facial expression. The accuracy of LibSVM (RBF) on the JAFFE Dataset is presented in Figure 9. Table 8 shows the classification accuracy on JAFFE using MLP. The accuracy is highest for happy expressions and lowest for fear expressions. The overall classification accuracy is 88.31%. The accuracy of MLP on the JAFFE Dataset is presented in Figure 10.
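The per-expression accuracy follows directly from a row of the confusion matrix. The sketch below uses the surprise example, where 22 of 30 images are labelled correctly. Only those two numbers come from the text; the way the 8 errors are split over the other classes is hypothetical, as is the function name `per_class_accuracy`.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class that was predicted correctly.

    `confusion[true][pred]` holds the number of samples of class `true`
    that the classifier labelled as `pred`.
    """
    return {
        true: row.get(true, 0) / sum(row.values())
        for true, row in confusion.items()
        if sum(row.values()) > 0
    }

# 22 of 30 is from the text; the split of the 8 errors is hypothetical.
confusion = {"surprise": {"surprise": 22, "anger": 4, "happy": 2, "disgust": 2}}
acc = per_class_accuracy(confusion)
print(round(acc["surprise"] * 100, 2))  # 73.33
```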
Table 9 shows the classification accuracy of the RF method. The accuracy increases marginally to 88.72 percent, with improvements for the happy, neutral and surprise expressions. The accuracy of RF on the JAFFE Dataset is presented in Figure 11. These results show that the recognition accuracy is high for the happy and neutral expressions and lower for certain other expressions. Thus, the proposed FER system, using the Euclidean distance between all pairs of salient points within and across facial components, gives encouraging results on the JAFFE Dataset. A comparison of classification accuracy is presented in Figure 12.
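As a quick check on the figures above, the three overall accuracies can be averaged. The arithmetic reproduces the 86.56% average reported in the abstract. The dictionary keys are labels chosen for this sketch.

```python
# Overall accuracies (percent) reported for the three classifiers
accuracies = {"LibSVM (RBF)": 82.65, "MLP": 88.31, "RF": 88.72}

mean_accuracy = sum(accuracies.values()) / len(accuracies)
best = max(accuracies, key=accuracies.get)

print(round(mean_accuracy, 2))  # 86.56
print(best)                     # RF
```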