Abstract: The popular CLIP model has enabled a variety of zero-shot learning tasks by unifying them under a vision-language alignment framework. However, due to the dynamic subject-object interactions ...
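To make the vision-language alignment framework mentioned above concrete, the following is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face `transformers` CLIP implementation. The checkpoint name, image path, and candidate prompts are illustrative assumptions, not taken from the paper; the paper's own method for handling subject-object interactions is not reproduced here.

```python
# Minimal sketch of zero-shot classification with CLIP (assumed setup, not the paper's method).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any public CLIP checkpoint would work similarly.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical input image and candidate text prompts describing possible labels.
image = Image.open("example.jpg")
prompts = [
    "a photo of a person riding a horse",
    "a photo of a person feeding a horse",
]

# Encode image and text jointly; CLIP scores each prompt against the image
# by cosine similarity in the shared embedding space.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity gives zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```

The key idea illustrated here is that classification reduces to matching an image embedding against text embeddings of candidate labels, which is what lets new labels be handled without retraining.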