Abstract: Video question answering (Video-QA) has emerged as a core task in the vision-language domain, requiring models to understand a given video and answer textual questions related to ...