Long Short-Term Transformer for
Online Action Detection

Mingze Xu     Yuanjun Xiong     Hao Chen     Xinyu Li    
Wei Xia     Zhuowen Tu     Stefano Soatto    

Amazon/AWS AI

NeurIPS 2021, Spotlight    

[Paper]   [Code]

Abstract

We present Long Short-term TRansformer (LSTR), a temporal modeling algorithm for online action detection, which employs a long- and short-term memory mechanism to model prolonged sequence data. It consists of an LSTR encoder that dynamically leverages coarse-scale historical information from an extended temporal window (e.g., 2048 frames spanning up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 frames spanning 8 seconds) to model the fine-scale characteristics of the data. Compared to prior work, LSTR provides an effective and efficient method to model long videos with fewer heuristics, which is validated by extensive empirical analysis. LSTR achieves state-of-the-art performance on three standard online action detection benchmarks: THUMOS'14, TVSeries, and HACS Segment.
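The two-stage memory mechanism described above can be illustrated with a minimal NumPy sketch: a small set of latent queries compresses the long-term memory via cross-attention (the encoder side), and the short-term tokens then attend to that compressed representation (the decoder side). This is a simplified illustration, not the actual LSTR implementation; the dimensions, the number of latent queries, and the projection-free attention are hypothetical choices for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned
    # projections -- illustration only, not the paper's full module.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

rng = np.random.default_rng(0)
d = 64                                          # feature dim (hypothetical)
long_memory = rng.standard_normal((2048, d))    # ~8 min of frame features
short_memory = rng.standard_normal((32, d))     # ~8 s of recent frames

# Encoder side: a fixed number of latent queries cross-attend to the
# long-term memory, compressing 2048 frames into 16 tokens.
latent_queries = rng.standard_normal((16, d))
compressed_long = cross_attention(latent_queries, long_memory)  # (16, d)

# Decoder side: short-term tokens attend to the compressed long memory,
# fusing fine-scale recent context with coarse-scale history.
fused = cross_attention(short_memory, compressed_long)          # (32, d)
```

The key efficiency point is that the short-term tokens never attend to all 2048 long-memory frames directly; they only see the fixed-size compressed representation, so the per-step cost no longer scales with the length of the long-term window.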

Qualitative Results

Citation

@inproceedings{xu2021long,
    title={Long Short-Term Transformer for Online Action Detection},
    author={Xu, Mingze and Xiong, Yuanjun and Chen, Hao and Li, Xinyu and Xia, Wei and Tu, Zhuowen and Soatto, Stefano},
    booktitle={Conference on Neural Information Processing Systems (NeurIPS)},
    year={2021}
}

Website template borrowed from Jon Barron.