Dataset Distillation via Vision-Language Category Prototype

¹University of Toyama, ²Hokkaido University, ³Tsinghua University, ⁴Niigata University

Overview of the proposed framework. The framework first generates image-text pairs with the LLaVA model and then trains a diffusion model. To build image prototypes, image features are compressed with an autoencoder, outliers are removed, and K-means clustering is applied. To build text prototypes, frequent words are extracted from the descriptions and the most representative sentence is selected. Finally, both kinds of prototypes guide the diffusion model to synthesize diverse and representative images.
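The prototype-construction steps in the overview can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the 2-D feature vectors stand in for autoencoder-compressed features, the captions are hand-written stand-ins for LLaVA descriptions, and the outlier rule (drop the points farthest from the mean, controlled by a hypothetical `keep_ratio`), the farthest-point initialization, the `STOPWORDS` list, and the sentence score (overlap with the `top_n` most frequent words) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is"}  # assumed list

def remove_outliers(samples, keep_ratio=0.9):
    """Keep the samples closest to the feature mean (assumed outlier rule)."""
    feats = [f for f, _ in samples]
    mean = [sum(col) / len(feats) for col in zip(*feats)]
    order = sorted(range(len(samples)), key=lambda i: math.dist(feats[i], mean))
    kept = sorted(order[:max(1, int(len(samples) * keep_ratio))])
    return [samples[i] for i in kept]  # original order is preserved

def kmeans(samples, k, iters=20):
    """Lloyd's K-means; the final centroids play the role of image prototypes."""
    feats = [f for f, _ in samples]
    centroids = [feats[0]]
    while len(centroids) < k:  # deterministic farthest-point initialization
        centroids.append(
            max(feats, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for sample in samples:
            nearest = min(range(k), key=lambda i: math.dist(sample[0], centroids[i]))
            clusters[nearest].append(sample)
        centroids = [
            [sum(col) / len(members) for col in zip(*(f for f, _ in members))]
            if members else centroids[i]  # keep the old centroid for an empty cluster
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

def text_prototype(captions, top_n=3):
    """Pick the caption that covers the most of the top_n frequent words."""
    tokenized = [re.findall(r"[a-z]+", c.lower()) for c in captions]
    counts = Counter(w for toks in tokenized for w in toks if w not in STOPWORDS)
    frequent = {w for w, _ in counts.most_common(top_n)}
    best = max(range(len(captions)), key=lambda i: len(frequent & set(tokenized[i])))
    return captions[best]

# Toy (feature, caption) pairs for a single category; the last one is an outlier.
samples = [
    ((0.0, 0.0), "a dog on grass"),
    ((0.1, 0.0), "a brown dog running on grass"),
    ((0.0, 0.1), "a dog with a ball"),
    ((5.0, 5.0), "a puppy in snow"),
    ((5.1, 5.0), "a small puppy playing in snow"),
    ((5.0, 5.1), "a puppy on ice"),
    ((50.0, 50.0), "a blurry photo"),
]

kept = remove_outliers(samples)
centroids, clusters = kmeans(kept, k=2)
for centroid, members in zip(centroids, clusters):
    caption = text_prototype([c for _, c in members])
    print([round(x, 2) for x in centroid], "->", caption)
```

Each printed line pairs one image prototype (a cluster centroid) with the text prototype chosen from the captions in the same cluster. In the framework described above, such a pair would then condition the diffusion model; that generation step is outside the scope of this sketch.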

Abstract

Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission, and computational costs. However, previous DD methods mainly focus on distilling information from images and often overlook the semantic information inherent in the data. This disregard for context hinders the model's generalization ability, particularly on complex datasets, and may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes that distill language information and, together with image prototypes, collaboratively synthesize data, thereby enhancing dataset distillation performance. Notably, the text prototypes are derived from descriptive text generated by an open-source vision-language model, so the framework is applicable to datasets without pre-existing text descriptions and extends dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing the target objects, achieving state-of-the-art validation performance and demonstrating robust generalization.

BibTeX


@inproceedings{zou2025vlcp,
  title={Dataset Distillation via Vision-Language Category Prototype},
  author={Zou, Yawen and Li, Guang and Su, Duo and Wang, Zi and Yu, Jun and Zhang, Chao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}