Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored—despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples.
Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in the anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods.
The dataset is constructed with the pipeline described above. For each input image, we apply three inpainting methods and three text-to-image methods, producing three partially manipulated images and three entirely fake images per original.
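A minimal sketch of this per-image generation loop is shown below. The inpainting and text-to-image backends are placeholders rather than the specific models used to build AnimeDL-2M, and the function names are illustrative assumptions.

```python
# Sketch of the per-image generation loop described above.
# Concrete inpainting / text-to-image backends are placeholders.
from pathlib import Path
from typing import Callable, Dict, List

InpaintFn = Callable[[Path, Path, str], Path]   # (image, mask, prompt) -> edited image
T2IFn = Callable[[str], Path]                   # (prompt) -> fully generated image

def build_group(image: Path, mask: Path, caption: str,
                inpainters: Dict[str, InpaintFn],
                generators: Dict[str, T2IFn]) -> Dict[str, List[Path]]:
    """For one source image, produce 3 partially manipulated and 3 fully fake images."""
    partial = [fn(image, mask, caption) for fn in inpainters.values()]
    full = [fn(caption) for fn in generators.values()]
    return {"real": [image], "partial_fake": partial, "full_fake": full}
```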
For each group of original, edited, or generated images, AnimeDL-2M not only provides segmentation masks as in traditional datasets, but also includes additional annotations such as image captions, object descriptions, mask labels, and editing methods. These enriched annotations enable a broader range of tasks to be conducted on this dataset and are intended to facilitate future research in related domains.
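To make the annotation structure concrete, the record below lists one plausible layout for a single image; the field names are assumptions based on the annotation types enumerated above, not the dataset's actual schema.

```python
# Illustrative annotation record for one image; field names are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnimeDLAnnotation:
    image_path: str                  # real, edited, or generated image
    label: str                       # "real" | "partial_fake" | "full_fake"
    mask_path: Optional[str] = None  # segmentation mask of manipulated regions
    caption: str = ""                # whole-image caption
    object_desc: str = ""            # description of the manipulated object
    mask_label: str = ""             # semantic label of the masked region
    edit_method: str = ""            # inpainting / text-to-image method used
```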
AnimeDL-2M contains 639,268 real images, 779,502 partially fake images and 884,129 fully AI-generated images. The CivitAI subset contains images generated by 14 different base models.
Images in AnimeDL-2M achieve high aesthetic scores as measured by MPS, a recent evaluation model.
The distribution of manipulated subjects is fairly diverse, which contributes to model generalization and enables a more comprehensive evaluation of model performance.
Anime images exhibit distinctive visual characteristics that distinguish them from everyday natural images, such as unrealistic lighting conditions, geometric abstractions, and the absence of sensor noise. These properties underscore the necessity of specialized methods tailored to IMDL tasks in the anime domain.
While it is commonly assumed that anime images contain fewer high-frequency components such as complex textures or stochastic noise, an overlooked yet crucial aspect is their retention of edge information in mid-to-high frequencies, especially the line contours. As anime images typically have clean and uncluttered scenes, line work in these images is generally sharp and well-defined. Furthermore, as these lines are manually drawn, they tend to exhibit a consistent artistic style across the image. Consequently, localized inconsistencies in stroke thickness, color, or drawing style may serve as effective cues for identifying image manipulations.
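As a minimal illustration of this cue, the sketch below exposes line contours with a simple Laplacian high-pass filter; it is not AniXplore's actual frequency feature extractor, only an example of how mid-to-high-frequency line work can be made explicit.

```python
# Minimal sketch: expose line contours (mid-to-high-frequency content) with a
# Laplacian high-pass filter. Illustration only, not AniXplore's extractor.
import numpy as np
from scipy.ndimage import convolve

LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float32)

def line_response(gray: np.ndarray) -> np.ndarray:
    """Return the normalized absolute high-pass response of a grayscale anime image."""
    resp = np.abs(convolve(gray.astype(np.float32), LAPLACIAN, mode="reflect"))
    return resp / (resp.max() + 1e-8)
```

Local inconsistencies in such a response map, for example abruptly thicker or blurrier strokes inside an inpainted region, are the kind of cue a frequency-aware branch is intended to pick up.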
Additionally, prior studies have demonstrated that image manipulations frequently occur at the object level. Anime images, which typically comprise a limited number of semantically salient objects with well-defined boundaries, are especially amenable to object-level semantic reasoning. Motivated by these insights, we propose AniXplore, a dual-branch model that integrates semantic representations with frequency-aware features to improve IMDL on AI-generated anime images.
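The sketch below captures the dual-branch idea in the spirit of AniXplore: one branch encodes semantic content from the RGB image, the other encodes a high-pass (frequency) view, and the fused features drive pixel-level localization plus image-level detection. The specific encoders, fusion, and heads are assumptions for illustration, not the published architecture.

```python
# Hedged sketch of a dual-branch IMDL model: semantic branch + frequency branch,
# fused for a localization mask and an image-level detection score.
import torch
import torch.nn as nn

class DualBranchIMDL(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.semantic = nn.Sequential(   # placeholder for a semantic encoder backbone
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.frequency = nn.Sequential(  # placeholder encoder on a high-pass view
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.loc_head = nn.Conv2d(channels, 1, 1)  # per-pixel manipulation mask
        self.det_head = nn.Linear(channels, 1)     # image-level real/fake score

    def forward(self, rgb: torch.Tensor, highpass: torch.Tensor):
        f = self.fuse(torch.cat([self.semantic(rgb), self.frequency(highpass)], dim=1))
        mask_logits = self.loc_head(f)
        det_logits = self.det_head(f.mean(dim=(2, 3)))
        return mask_logits, det_logits
```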
Models trained on conventional IMDL datasets do not generalize to the distribution of AI-edited anime images. Models pre-trained on the GRE dataset perform relatively better, implying that certain features of AI-generated manipulations can be learned and partially transferred. However, even these models still perform poorly in absolute terms.
After fine-tuning on AnimeDL-2M, all models show significant improvements, confirming both the high training value and the annotation quality of the dataset. Some models achieve surprisingly high F1 scores on the detection task. This suggests that even when models cannot precisely locate manipulated regions, they can still capture global statistical cues, such as unnatural frequency artifacts or noise distributions, that distinguish fake images from real ones at a coarse level.
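The gap between the two tasks becomes clear when the metrics are written out: detection F1 is computed over one label per image, while localization F1 is computed per pixel against the ground-truth mask. The sketch below uses the standard definitions; the 0.5 thresholds are assumptions for illustration.

```python
# Image-level detection F1 vs. pixel-level localization F1 (standard definitions).
import numpy as np
from sklearn.metrics import f1_score

def detection_f1(scores: np.ndarray, labels: np.ndarray, thr: float = 0.5) -> float:
    """F1 over image-level real(0)/fake(1) predictions."""
    return f1_score(labels, (scores >= thr).astype(int))

def localization_f1(pred_mask: np.ndarray, gt_mask: np.ndarray, thr: float = 0.5) -> float:
    """F1 over all pixels of one image's predicted vs. ground-truth mask."""
    return f1_score(gt_mask.reshape(-1).astype(int),
                    (pred_mask.reshape(-1) >= thr).astype(int))
```

A model can score highly on the first metric while failing the second, which is exactly the behavior observed here.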
These findings collectively validate the presence of substantial domain gaps across manipulation methods and image styles, especially for localization tasks. Therefore, AnimeDL-2M serves as a necessary contribution to bridge this gap, offering a dedicated benchmark for AI-edited anime image forensics.
All models exhibit generally poor performance on the localization task under cross-dataset settings. Together with the gains observed after fine-tuning, this suggests that training or fine-tuning on the target domain is necessary to achieve strong localization performance.
Generalization ability on the detection task does not strongly correlate with localization performance. Some models achieve high detection accuracy across domains despite limited localization ability. This implies that detection generalization may depend more on the robustness of the model architecture than on the ability to capture specific forgery artifacts.
PSCC, the only model that uses RGB images as its sole input modality, demonstrates the weakest generalization, highlighting the importance of multi-modal or multi-channel feature inputs for generalizability.
Meanwhile, both TruFor and MMFusion incorporate noise-based features, yet their performance differs significantly. This suggests that not all handcrafted features are equally effective, and the design of the feature extractor plays a critical role in mitigating overfitting. Therefore, careful selection and design of input features is essential for building more generalizable forensic models.
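To make "noise-based features" concrete, the sketch below shows one common handcrafted noise residual, an SRM-style 5x5 high-pass filter. TruFor and MMFusion rely on their own (learned or richer) extractors, so this only illustrates the class of features being discussed, not their actual designs.

```python
# One common handcrafted noise residual: an SRM-style 5x5 high-pass filter.
import numpy as np
from scipy.ndimage import convolve

SRM_KV = np.array([[-1,  2,  -2,  2, -1],
                   [ 2, -6,   8, -6,  2],
                   [-2,  8, -12,  8, -2],
                   [ 2, -6,   8, -6,  2],
                   [-1,  2,  -2,  2, -1]], dtype=np.float32) / 12.0

def noise_residual(gray: np.ndarray) -> np.ndarray:
    """High-pass residual that suppresses image content and keeps noise-like traces."""
    return convolve(gray.astype(np.float32), SRM_KV, mode="reflect")
```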