End-to-End Joint Semantic Segmentation of Actors and Actions in Video
Venue: European Conference on Computer Vision (ECCV)
Publication type: Conference (non-DCC)
Abstract
Traditional video understanding tasks include human action recognition and actor-object semantic segmentation. However, jointly segmenting different actor classes and labeling each with its action class remains challenging, yet it is necessary for many applications. In this work, we propose a new end-to-end architecture for tackling this joint task in videos. Our model leverages multiple input modalities, contextual information, and joint multitask learning to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the large-scale Actor-Action Dataset (A2D) for joint actor-action semantic segmentation and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying that our joint approach improves performance for zero-shot understanding, indicating the generalizability of our jointly learned feature space.