Engineers have long been trying to teach computers to measure and understand emotions from text.

Seeking answers of their own, a group of researchers at the Centre for Visual Information Technology (CVIT) of the International Institute of Information Technology, Hyderabad (IIIT-H) recently conducted an experiment in which they used films to teach machines to understand emotions.

According to Dhruv Srivastava, one of the researchers, earlier emotion analysis was text-based and relied on dialogue to understand a character's mental state.

Dhruv co-authored the study ‘How you feelin'? Learning Emotions and Mental States in Movie Scenes’ with Aditya Kumar Singh and Makarand Tapaswi. It has been accepted for presentation at the upcoming ‘Conference on Computer Vision and Pattern Recognition’ (CVPR) in Vancouver, scheduled from June 18 to 23.

Research details

“The researchers introduced a machine learning model that relies on a transformer-based architecture to understand and label emotions not only for each movie character in the scene but also for the overall scene itself,” an executive of IIIT-H said.

“With cinema possessing a vast amount of emotional data mirroring the complexities that exist in everyday life, the research group used movies for their study. Unlike static images, movies are extremely complex for machines to interpret.”

The emotions of a character in a scene cannot be summarised with a single label, so estimating multiple emotions and mental states is important. “A character can go through a range of emotions in a single scene, from surprise and happiness to anger and even sadness,” Dhruv said.
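
To illustrate what assigning several emotions at once can look like, here is a minimal Python sketch of multi-label prediction. It is not the researchers' model: the label names, the raw scores, and the 0.5 threshold are all assumptions made for illustration. The point is only that each emotion gets its own independent yes/no decision, so a scene can carry more than one label.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_emotions(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every label whose independent probability meets the threshold.

    Unlike a softmax, which forces the labels to compete for a single
    winner, each label is scored on its own, so several can be active.
    """
    return [label for label, score in logits.items() if sigmoid(score) >= threshold]


# Hypothetical raw scores for one character in one scene (made-up values).
scene_logits = {"surprise": 2.1, "happiness": 0.4, "anger": -0.3, "sadness": -1.8}
print(predict_emotions(scene_logits))
```

With these made-up scores, both "surprise" and "happiness" clear the threshold, while "anger" and "sadness" do not.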

Used an existing dataset

The IIIT-H executive said, “To train their model, the team of researchers used MovieGraphs, an existing dataset of movie clips collected by Tapaswi for his previous work, which provides detailed graph-based annotations of social situations depicted in movie scenes.”

The model was trained to accurately label the emotions and mental states of characters in each scene through a three-pronged process: analysing the full video, examining individual facial features, and reading the subtitles.
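
The idea of combining the three sources can be sketched in a few lines of Python. This is a simplified stand-in, not the team's method: the study uses a transformer-based model, whereas the sketch below just takes a weighted average of per-label scores. The function name `fuse_modalities`, the scores, and the weights are all hypothetical.

```python
from typing import Dict, Sequence

Scores = Dict[str, float]


def fuse_modalities(per_modality: Sequence[Scores], weights: Sequence[float]) -> Scores:
    """Combine per-label scores from several sources by weighted average.

    Assumes every source scores the same set of labels.
    """
    if not per_modality or len(per_modality) != len(weights):
        raise ValueError("need one weight per modality")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    fused: Scores = {}
    for label in per_modality[0]:
        weighted = sum(w * scores[label] for w, scores in zip(weights, per_modality))
        fused[label] = weighted / total
    return fused


# Hypothetical scores from the three sources for one character (made-up values).
video = {"happy": 0.6, "angry": 0.2}
face = {"happy": 0.9, "angry": 0.1}
subtitles = {"happy": 0.3, "angry": 0.6}

fused = fuse_modalities([video, face, subtitles], weights=[1.0, 1.0, 1.0])
print(fused)
```

In this made-up case the subtitles alone would suggest anger, but the video and the face pull the combined score towards happiness, which is the kind of cross-checking that a single source cannot provide.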

“We realised that combining multimodal information is important to predict multiple emotions. We were able to predict the corresponding mental states of the characters which are not explicit in the scenes,” said Aditya.