Benchmark statistics: 16 DeepFake types/tasks, 51 generators, ~5K high-quality sample pairs, 76K questions.
(a) Data generation: Construct ~5K high-quality real-fake sample pairs using 16 DeepFake techniques across image, video, and audio modalities.
(b) Fine-grained artifact taxonomy: A hierarchical artifact taxonomy decomposes detection into analyses of specific fine-grained artifacts.
(c) Evaluation: Benchmark models on three pillars: Perception (spotting artifacts), Detection (real vs. fake), and Hallucination (explanation reliability); an illustrative sample schema is sketched below.
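The following is a minimal sketch of how a single benchmark entry and its three evaluation pillars could be represented; the schema and field names are illustrative assumptions rather than TriDF's actual data format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one benchmark entry; field names are assumptions,
# not TriDF's released data format.
@dataclass
class TriDFSample:
    sample_id: str
    modality: str                      # "image", "video", or "audio"
    forgery_type: str                  # one of the 16 DeepFake types
    generator: str                     # one of the 51 generators
    is_fake: bool                      # Detection ground truth (real vs. fake)
    artifacts: list = field(default_factory=list)   # Perception: human-annotated artifact evidence
    questions: list = field(default_factory=list)   # TFQ / MCQ / OEQ items attached to this sample

# Illustrative entry (all values are made up for demonstration)
example = TriDFSample(
    sample_id="img_00042",
    modality="image",
    forgery_type="face swap",
    generator="hypothetical_generator",
    is_fake=True,
    artifacts=["blending seam along the hairline", "inconsistent ear geometry"],
    questions=[{"type": "TFQ",
                "text": "Is there a blending artifact along the hairline?",
                "answer": "True"}],
)
```

Keeping the artifact annotations and the question set on the same record is what lets Perception, Detection, and Hallucination be scored against one shared ground truth.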
Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.
(a) Generation & Annotation: We first collect open-source human-related datasets across three modalities. We generate real-fake data pairs using 16 DeepFake techniques and apply quality control with authenticity and consistency metrics to retain only high-quality data. We then construct quality- and semantic-artifact questions and perform human annotation, yielding reliable ground truth.
(b) Evaluation: We design three types of questions, i.e., True-False, Multiple-Choice, and Open-Ended. These questions are combined with the high-quality data and fed into MLLMs for evaluation; the model responses are then assessed using our proposed metrics to evaluate their perception ability, interpretable detection performance, and tendency toward hallucination. Hedged sketches of both stages follow below.
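First, a hedged sketch of the generation-and-annotation stage in (a). The function, scorer, and threshold names are hypothetical placeholders; the actual techniques and quality-control metrics are those described in the paper and are not reproduced here.

```python
# Hypothetical sketch of stage (a): generate real-fake pairs and filter by quality.
# Scorer callables and thresholds are assumptions, not TriDF's actual implementation.

def build_pairs(real_samples, techniques, score_authenticity, score_consistency,
                auth_threshold=0.8, cons_threshold=0.8):
    """Synthesize a fake for each (real sample, technique) pair and keep only high-quality pairs."""
    kept = []
    for real in real_samples:
        for generate_fake in techniques:           # e.g., the 16 DeepFake techniques
            fake = generate_fake(real)             # forged counterpart of the real sample
            # Quality control: authenticity and consistency metrics gate what enters the benchmark
            if (score_authenticity(fake) >= auth_threshold
                    and score_consistency(real, fake) >= cons_threshold):
                kept.append({"real": real, "fake": fake})
    return kept

# Toy usage with placeholder callables (real scorers would be model- or signal-based)
pairs = build_pairs(
    real_samples=["photo_001.png"],
    techniques=[lambda x: f"face_swap({x})"],
    score_authenticity=lambda fake: 0.9,
    score_consistency=lambda real, fake: 0.95,
)
```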
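And a correspondingly hedged sketch of the evaluation stage in (b): question items are fed to an MLLM and the responses are scored per pillar. `query_model` and the per-pillar scorers are placeholders, not the metrics proposed in the paper.

```python
# Hypothetical sketch of stage (b): query an MLLM with TFQ/MCQ/OEQ items and score the responses.
# The query function and per-pillar scorers below are placeholders, not TriDF's proposed metrics.

def evaluate(model, samples, query_model, score_perception, score_detection, score_hallucination):
    records = []
    for sample in samples:
        for question in sample["questions"]:                    # True-False, Multiple-Choice, or Open-Ended
            response = query_model(model, sample["media"], question["text"])
            records.append({
                "perception": score_perception(response, question),          # were the annotated artifacts spotted?
                "detection": score_detection(response, sample["is_fake"]),   # real-vs-fake decision quality
                "hallucination": score_hallucination(response, question),    # unsupported claims in the explanation
            })
    # Simple mean per pillar as a stand-in for the paper's aggregation
    return {
        pillar: sum(r[pillar] for r in records) / max(len(records), 1)
        for pillar in ("perception", "detection", "hallucination")
    }
```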
On TriDF, we benchmark 26 MLLMs (23 open-source; 3 proprietary) zero-shot on image, video, and audio DeepFakes—testing perception, detection, and hallucination. Our key findings are summarized as follows:
We compare our proposed TriDF with existing benchmarks for DeepFake detection across several key dimensions. Symbols denote: ♠ Accuracy (e.g., F1-score, AUC), ♥ Similarity-based (e.g., ROUGE-L, CSS), ♦ LLM-as-a-judge (e.g., GPTScore), and ♣ Cover.
Comparison with related settings.
The figure outlines prompt templates for benchmark construction across three formats (illustrative template strings are sketched after the list):
<TFQ>: Verifies the presence, background, or location of specific artifacts.
<MCQ>: Identifies artifacts from a list; includes a “none of the above” option and allows multiple selections.
<OEQ>: Defines a DeepFake forensics analyst persona (Type A & B) with guidelines for analysis and output formatting.
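Below are illustrative template strings in the spirit of these three formats; the wording is an assumption and does not reproduce TriDF's actual prompts.

```python
# Illustrative prompt templates for the three question formats.
# Wording is assumed for demonstration; it does not reproduce TriDF's actual prompts.
PROMPT_TEMPLATES = {
    "TFQ": "Answer True or False: a {artifact} is present in the {region} of this {modality}.",
    "MCQ": ("Which of the following artifacts appear in this {modality}? Select all that apply.\n"
            "{options}\n"
            "If no listed artifact is present, answer 'None of the above'."),
    "OEQ": ("You are a DeepFake forensics analyst. Examine the {modality}, describe any manipulation "
            "artifacts you observe, and conclude with a real/fake judgment in the required output format."),
}

# Example instantiation of the True-False template
print(PROMPT_TEMPLATES["TFQ"].format(artifact="blending seam", region="hairline", modality="image"))
```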
@article{jiang2025tridf,
title={TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection},
author={Jiang-Lin, Jian-Yu and Huang, Kang-Yang and Zou, Ling and Lo, Ling and Yang, Sheng-Ping and Tseng, Yu-Wen and Lin, Kun-Hsiang and Chen, Chia-Ling and Ta, Yu-Ting and Wang, Yan-Tsung and others},
journal={arXiv preprint arXiv:2512.10652},
year={2025}
}