Abstract: CLIP, a foundational vision-language model, has emerged as a powerful tool for open-vocabulary semantic segmentation. While freezing the text encoder preserves its powerful embeddings, ...