TriDF: Evaluating Perception, Detection, and Hallucination
for Interpretable DeepFake Detection

1National Taiwan University 2National Yang Ming Chiao Tung University 3Jilin University

Benchmark statistics:

  • DeepFake types / tasks: 16
  • Generators: 51
  • High-quality samples (pairs): ~5K
  • Questions: 76K

Overview of TriDF

(a) Data generation:
Construct ~5K high-quality samples using 16 DeepFake techniques across image, video, and audio modalities.

(b) Fine-grained artifact taxonomy:
A hierarchical taxonomy of fine-grained artifacts decomposes detection into specific artifact analyses.

(c) Evaluation:
Benchmark models on three pillars: Perception (spotting artifacts), Detection (real vs. fake), and Hallucination (explanation reliability).

Abstract

Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

Pipeline

Pipeline of TriDF

(a) Generation & Annotation: We first collect open-source human-related datasets across three modalities. We generate real-fake data pairs using 16 DeepFake techniques and perform quality control using authenticity and consistency metrics to obtain high-quality data. We then construct quality and semantic artifact questions and perform human annotation, resulting in reliable ground truth.
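
As a concrete illustration, here is a minimal Python sketch of the pair-generation and quality-control step. The names (build_pairs, authenticity, consistency, auth_thr, cons_thr) are hypothetical and do not come from the TriDF release; the sketch only assumes that each generator maps a real sample to a fake counterpart and that both quality metrics return higher-is-better scores.

from typing import Callable, Iterable

# Hypothetical sketch of pair generation with quality control; all names are illustrative.
def build_pairs(
    real_samples: Iterable,
    generators: Iterable[Callable],   # the 16 DeepFake techniques
    authenticity: Callable,           # authenticity metric (higher = more realistic)
    consistency: Callable,            # real-fake consistency metric (higher = better aligned)
    auth_thr: float = 0.8,
    cons_thr: float = 0.8,
):
    """Generate real-fake pairs and keep only those passing both quality checks."""
    kept = []
    for real in real_samples:
        for gen in generators:
            fake = gen(real)          # synthesize the fake counterpart of the real sample
            # Quality control: retain pairs that score high on both metrics.
            if authenticity(fake) >= auth_thr and consistency(real, fake) >= cons_thr:
                kept.append({"real": real, "fake": fake,
                             "generator": getattr(gen, "__name__", str(gen))})
    return kept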

(b) Evaluation: We design three types of questions, namely True-False, Multiple-Choice, and Open-Ended. These questions are combined with the high-quality data and fed into MLLMs for evaluation; the model responses are then assessed with our proposed metrics to evaluate perception ability, interpretable detection performance, and tendency toward hallucination.
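
To make the evaluation protocol concrete, the following minimal sketch assumes a hypothetical query_mllm callable and per-aspect scoring functions (none of which are part of the benchmark's actual interface). It shows how the three question formats could be run through a model and aggregated into perception, detection, and hallucination scores.

# Hypothetical evaluation loop; query_mllm and score_fns are placeholders, not the benchmark's API.
def evaluate(questions, query_mllm, score_fns):
    """questions: dicts with 'type' in {'TFQ', 'MCQ', 'OEQ'}, plus media, prompt, and ground truth.
    score_fns: mapping such as {'perception': ..., 'detection': ..., 'hallucination': ...}."""
    results = {aspect: [] for aspect in score_fns}
    for q in questions:
        response = query_mllm(media=q["media"], prompt=q["prompt"])   # zero-shot query
        for aspect, score_fn in score_fns.items():
            results[aspect].append(score_fn(response, q))             # compare against ground truth
    return {aspect: sum(scores) / len(scores) for aspect, scores in results.items() if scores}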

Findings

On TriDF, we benchmark 26 MLLMs (23 open-source; 3 proprietary) zero-shot on image, video, and audio DeepFakes—testing perception, detection, and hallucination. Our key findings are summarized as follows:

  • Hallucination can break the perception → detection link: even with high artifact coverage, severe hallucination drives DeepFake detection accuracy down to near-chance; reliable detection requires both accurate perception and low hallucination.
  • Semantic artifacts are the main bottleneck: models detect several local “quality” artifacts with relatively high accuracy, but artifacts requiring physical/social reasoning (anatomy, abnormal motion, background–subject incoherence) are consistently much harder and remain the key weakness.
  • Video DeepFakes are substantially harder than images: detection accuracy drops and explanation coverage is roughly halved on video, suggesting current MLLMs do not sufficiently capture temporal cues and need better temporal representations or mechanisms.

Comparison

We compare our proposed TriDF with existing benchmarks for DeepFake detection across several key dimensions. Symbols denote the evaluation metric family used by each benchmark: ♠ Accuracy (e.g., F1-score, AUC), ♥ Similarity-based (e.g., ROUGE-L, CSS), ♦ LLM-as-a-judge (e.g., GPTScore), and ♣ Coverage.

Comparison with related settings.

Templates and Examples

The figure outlines prompt templates for benchmark construction across three formats (an illustrative sketch of these templates follows the list):

  • True-False <TFQ>: Verifies the presence, background, or location of specific artifacts.
  • Multiple-Choice <MCQ>: Identifies artifacts from a list; includes a “none of the above” option and allows multiple selections.
  • Open-Ended <OEQ>: Defines a DeepFake forensics analyst persona (Type A & B) with guidelines for analysis and output formatting.
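
The sketch below illustrates what such templates might look like as format strings; the wording and the {modality}/{artifact}/{region}/{options} placeholders are assumptions for illustration, not the benchmark's actual prompt text.

# Illustrative prompt skeletons only; TriDF's exact template wording is not reproduced here.
PROMPT_TEMPLATES = {
    "TFQ": ("Look at the given {modality}. True or False: "
            "the {artifact} artifact is present in the {region}."),
    "MCQ": ("Which of the following artifacts appear in the given {modality}? "
            "Select all that apply.\n{options}\n(Z) None of the above."),
    "OEQ": ("You are a DeepFake forensics analyst. Examine the given {modality}, "
            "decide whether it is real or fake, and list the artifacts that "
            "support your decision in the required output format."),
}

# Example instantiation of the True-False template:
example_tfq = PROMPT_TEMPLATES["TFQ"].format(
    modality="image", artifact="blending boundary", region="face"
)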

Evaluation

Expert evaluation visualization
Radar plot for evaluation results

Citation

@article{jiang2025tridf,
  title={TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection},
  author={Jiang-Lin, Jian-Yu and Huang, Kang-Yang and Zou, Ling and Lo, Ling and Yang, Sheng-Ping and Tseng, Yu-Wen and Lin, Kun-Hsiang and Chen, Chia-Ling and Ta, Yu-Ting and Wang, Yan-Tsung and others},
  journal={arXiv preprint arXiv:2512.10652},
  year={2025}
}