Feature Representations for Visual and Language: Towards Deeper Video Understanding


Prof. YANG Zekun, Assistant Professor, Department of Information and Computer Technology, Tokyo University of Science

Abstract:

This research enhances video understanding by leveraging Transformer-based models such as BERT for feature representation in two tasks: video question answering and humor prediction. For video QA, representing visual and subtitle semantics with BERT improved accuracy on the TVQA and Pororo datasets. A comparative study of Transformer models linked their performance differences to their pre-training methods. For humor prediction, a novel multimodal method that combines pose, face, and subtitle features within a sliding window outperformed previous approaches on a new comedy dataset. The work highlights the importance of selecting appropriate features and models for deeper video analysis.
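To make the BERT-based representation concrete, the sketch below encodes subtitle text and visual concept labels with a pre-trained BERT model and scores answer candidates by similarity. This is a minimal illustration only: the model name, mean pooling, and cosine scoring are assumptions for exposition, not the speaker's reported pipeline.

```python
# Minimal sketch (assumed setup, not the presented system): subtitles and
# detected visual concepts are both expressed as text, so one BERT encoder
# can represent both modalities for video QA.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool BERT token embeddings into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

# Hypothetical example: subtitle plus visual concept labels as context.
context = embed("He picks up the umbrella. [VIS] umbrella, rain, street")
question = embed("Why does he take the umbrella?")
answers = ["Because it is raining.", "Because it is sunny."]

# Score each candidate answer against the fused context+question vector.
scores = [torch.cosine_similarity(context + question, embed(a), dim=0)
          for a in answers]
print(answers[int(torch.stack(scores).argmax())])
```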
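The sliding-window idea for humor prediction can likewise be sketched: pose, face, and subtitle features from each window are concatenated into one vector and fed to a classifier that predicts whether the window is humorous. All dimensions, the window length, and the logistic-regression classifier below are illustrative assumptions, not the reported configuration.

```python
# Minimal sketch (assumed setup): concatenate pose, face, and subtitle
# features over a sliding window, then train a binary humor classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

WINDOW = 5  # frames per sliding window (assumed)

def window_features(pose, face, subtitle):
    """Average per-frame pose/face features over each window and
    concatenate them with the window's subtitle embedding."""
    feats = []
    for start in range(len(pose) - WINDOW + 1):
        p = pose[start:start + WINDOW].mean(axis=0)   # body joint coords
        f = face[start:start + WINDOW].mean(axis=0)   # facial features
        s = subtitle[start]                           # text embedding
        feats.append(np.concatenate([p, f, s]))
    return np.stack(feats)

# Toy data with hypothetical feature sizes: 100 frames of video.
rng = np.random.default_rng(0)
pose = rng.normal(size=(100, 34))       # 17 joints x (x, y)
face = rng.normal(size=(100, 17))       # facial feature vector
subtitle = rng.normal(size=(100, 768))  # BERT-sized text vectors
labels = rng.integers(0, 2, size=96)    # humorous / not, per window

X = window_features(pose, face, subtitle)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("windows:", X.shape, "predicted:", clf.predict(X[:3]))
```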

Event Details:
Date: 17 Oct 2025
Time: 4:30 pm - 6:00 pm
