This study addresses these gaps by applying deep learning and language models to multimodal music genre classification of Sotho-Tswana music videos. The ...