128. BEYOND VISUAL SEMANTICS: USING CROSS-MODAL CONTEXT FOR IMAGE CLASSIFICATION

Department: Electrical & Computer Engineering
Faculty Advisor(s): Nuno Vasconcelos
Award(s): Honorable Mention

Primary Student
Name: Mandar Dilip Dixit
Email: mdixit@ucsd.edu
Phone: 858-534-4538
Grad Year: 2015

Abstract
A method is proposed to incorporate knowledge of language into image classification. It is based on the principle of cross-modal regularization, where data from an auxiliary modality (text) is used to regularize the statistical models used for image classification. Two forms of cross-modal regularization are studied. In the first, denoted feature space regularization, text information is used to regularize the feature space of the image classifier, by joint projection onto maximally correlated image-text subspaces. In the second, denoted boundary regularization, text information is used to regularize the image classification boundary. This involves a combination of a semantic representation for images and text, a cross-modal similarity function that maps semantic text labels into semantic image labels, and a discriminant classifier. The two forms of regularization are complementary, trading off classification performance for labeled training data. Both regularizers are data-driven and support rich text (not just labels or tags). In both cases, text information is used only to improve a classifier that then processes images alone. Experiments show that both regularizers lead to substantial gains in bag-of-words image classification, both for stand-alone classifiers and for classifiers used as building blocks of larger vision systems. The gains of the two regularizers are shown to be cumulative, and enable classification with very small training sets.
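To make the feature space regularization idea concrete, below is a minimal sketch of one way it could be realized, using canonical correlation analysis (CCA) to find maximally correlated image-text subspaces from paired training data. This is an illustrative assumption, not the poster's exact implementation; the arrays X_img, X_txt, and y are hypothetical placeholders for paired bag-of-words image features, text features, and class labels.

# Sketch: feature-space regularization via joint image-text projection.
# Assumes paired image/text features per training example; CCA is used
# here as one way to compute maximally correlated subspaces.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d_img, d_txt, k = 200, 512, 300, 32

# Hypothetical paired training data: image BoW features, text features, labels.
X_img = rng.random((n, d_img))
X_txt = rng.random((n, d_txt))
y = rng.integers(0, 5, size=n)

# Learn maximally correlated image-text subspaces from the paired data.
cca = CCA(n_components=k)
cca.fit(X_img, X_txt)

# Project image features onto the shared subspace; text shapes this
# projection at training time but is not needed afterwards.
Z_train = cca.transform(X_img)
clf = LinearSVC().fit(Z_train, y)

# At test time the classifier processes images alone.
X_test = rng.random((10, d_img))
print(clf.predict(cca.transform(X_test)))

Note how the sketch reflects the abstract's key point: text enters only through the learned projection, so the deployed classifier requires no text input.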
