Multimodal Language Modelling across Languages and Cultures: Grounding Strategies for Concepts and Events
This research was supported, in part, by an award from Gonville & Caius College, University of Cambridge.
Recent advances in multimodal language modelling have seen state-of-the-art performance in downstream vision-language tasks achieved by models that employ contrastive semantic pre-training. While grounding linguistic embeddings is typically assumed to improve the quality of natural language representations, we undertake an intrinsic semantic evaluation of the multimodal representations learned through contrastive visual pre-training in CLIP (Radford et al., 2021) and its video-text counterpart VideoCLIP (Xu et al., 2021). The effects of image and video grounding on concrete and abstract nominal concepts and on verbal events are compared against unimodal BERT (Devlin et al., 2019) and Mirror-BERT (Liu et al., 2021) baselines. We subsequently probe the typological generalisability of our monolingual results by evaluating Italian CLIP (Bianchi et al., 2021) and multilingual CLIP (Carlsson et al., 2022). Our findings are interpreted in the context of psycholinguistic and semantic research on verbal embodiment, and suggest that current grounding techniques confer a uniform advantage on the processing of nouns over verbs in both image-text and video-text pre-training.
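To make the evaluation setup concrete, the following is a minimal sketch of the kind of intrinsic semantic evaluation the abstract describes: scoring similarities between CLIP text embeddings against human word-similarity judgements and reporting a Spearman correlation. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the word pairs and gold ratings are illustrative placeholders, not the paper's actual benchmark data.

import torch
from scipy.stats import spearmanr
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical evaluation items: (word1, word2, human similarity rating).
pairs = [("dog", "cat", 7.5), ("run", "jog", 8.1), ("idea", "banana", 0.4)]

def text_embedding(word: str) -> torch.Tensor:
    # Embed a single word with CLIP's text encoder.
    inputs = tokenizer([word], padding=True, return_tensors="pt")
    with torch.no_grad():
        return model.get_text_features(**inputs)[0]

model_scores, gold_scores = [], []
for w1, w2, gold in pairs:
    # Cosine similarity between the two word embeddings.
    sim = torch.nn.functional.cosine_similarity(
        text_embedding(w1), text_embedding(w2), dim=0
    ).item()
    model_scores.append(sim)
    gold_scores.append(gold)

# Spearman correlation between model similarities and human judgements.
rho, _ = spearmanr(model_scores, gold_scores)
print(f"Spearman rho: {rho:.3f}")

Comparing this correlation for noun pairs versus verb pairs, and across grounded (CLIP, VideoCLIP) and ungrounded (BERT, Mirror-BERT) encoders, is one straightforward way to operationalise the noun/verb contrast the abstract reports.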