Biting Off More Than You Can Detect: Retrieval-Augmented Multimodal Experts for Short Video Hate Detection
Authors: Jian Lang, Rongpei Hong, Jin Xu, Yili Li, Xovee Xu, Fan Zhou
Year: 2025
Journal / Conference: WWW '25: Proceedings of the ACM on Web Conference 2025
Paper Link: https://doi.org/10.1145/3696410.371456
Abstract:
Short Video Hate Detection (SVHD) is increasingly vital as hateful content, such as racial and gender-based discrimination, spreads rapidly across platforms like TikTok, YouTube Shorts, and Instagram Reels. Existing approaches face significant challenges: hate expressions continuously evolve, hateful signals are dispersed across multiple modalities (audio, text, and vision), and the contribution of each modality varies across different hate content. To address these issues, we introduce MoRE (Mixture of Retrieval-augmented multimodal Experts), a novel framework designed to enhance SVHD. MoRE employs specialized multimodal experts for each modality, leveraging their unique strengths to identify hateful content effectively. To keep the model adaptable to rapidly evolving hate content, MoRE leverages contextual knowledge extracted from relevant instances retrieved by a powerful joint multimodal video retriever for each target short video. Moreover, a dynamic sample-sensitive integration network adaptively adjusts the importance of each modality on a per-sample basis, optimizing detection by prioritizing the most informative modalities for each instance. MoRE adopts an end-to-end training strategy that jointly optimizes both the expert networks and the overall framework, yielding nearly a twofold improvement in training efficiency and, in turn, greater applicability to real-world scenarios. Extensive experiments on three benchmarks demonstrate that MoRE surpasses state-of-the-art baselines, achieving an average improvement of 6.91% in macro-F1 score across all datasets.
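To make the abstract's fusion idea concrete, here is a minimal PyTorch sketch of per-sample, modality-sensitive expert fusion: one lightweight expert head per modality (audio, text, vision) and a gating network that weights each expert's prediction per instance. All class names, dimensions, and the specific gating design are illustrative assumptions; the paper's actual MoRE architecture (including its retrieval-augmented context) is not reproduced here.

```python
# Hypothetical sketch of sample-sensitive modality fusion; not the authors' code.
import torch
import torch.nn as nn

class SampleSensitiveFusion(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        # One lightweight expert head per modality (audio, text, vision).
        self.experts = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(3)]
        )
        # Gating network: scores each modality from the concatenated features.
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, audio: torch.Tensor, text: torch.Tensor,
                vision: torch.Tensor) -> torch.Tensor:
        feats = [audio, text, vision]  # each: (B, dim)
        # Per-sample modality weights, summing to 1 for each instance.
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        # Expert logits stacked along a modality axis: (B, 3, num_classes).
        logits = torch.stack(
            [expert(f) for expert, f in zip(self.experts, feats)], dim=1
        )
        # Weighted combination prioritizes the most informative modalities.
        return (weights.unsqueeze(-1) * logits).sum(dim=1)  # (B, num_classes)

# Usage: fuse per-modality features for a batch of 4 short videos.
fusion = SampleSensitiveFusion()
a, t, v = (torch.randn(4, 256) for _ in range(3))
print(fusion(a, t, v).shape)  # torch.Size([4, 2])
```

Because the softmax gate is conditioned on each sample's own features, two videos can receive very different modality weightings, which mirrors the abstract's claim that the contribution of each modality varies across hate content.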