Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset contains approximately 6,400 videos spanning 22 real-world object categories interacting with robot arms and motors, and covers 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes.
Overview of Phys-AD dataset
We apply multiple interactions with a UR5 robot arm and a motor/servo according to each object's function, such as sliding, moving, squeezing, stretching, and rotating.
To assess the ability of VLMs to understand videos, we propose a novel evaluation metric: PAEval. By comparing the output of a VLM against the reference description of a video and the explanation of its abnormality separately, we obtain the PAEval scores for that video.
Each label is designed by humans and enhanced by GPT-4o, and all enhanced labels are manually checked for accuracy. With this evaluation, we expect VLMs to focus on the essential parts of a video and understand it like an experienced human.
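As a rough illustration of this two-part comparison, the sketch below scores a VLM output against the reference description and the anomaly explanation separately using sentence-embedding cosine similarity. This is only an assumed scoring scheme for illustration, not the official PAEval implementation; the model choice, function names, and example texts are hypothetical.

```python
# Illustrative PAEval-style scoring: compare a VLM's output with the reference
# video description and with the anomaly explanation, separately.
# ASSUMPTION: cosine similarity of sentence embeddings stands in for the
# actual PAEval scoring function, which is not specified on this page.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedding model

def paeval_scores(vlm_output: str, reference_description: str,
                  reference_explanation: str) -> tuple[float, float]:
    """Return (description score, explanation score) for one video."""
    emb = model.encode(
        [vlm_output, reference_description, reference_explanation],
        convert_to_tensor=True,
    )
    description_score = util.cos_sim(emb[0], emb[1]).item()
    explanation_score = util.cos_sim(emb[0], emb[2]).item()
    return description_score, explanation_score

# Hypothetical example annotations:
desc = "A robot arm rotates the cap of a bottle; the cap spins freely."
expl = "The cap never tightens because its thread is stripped."
print(paeval_scores("The bottle cap keeps spinning without tightening.", desc, expl))
```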
We evaluate state-of-the-art and classic methods on our dataset using AUROC and PAEval. These methods can be broadly divided into unsupervised methods, weakly supervised methods, and video-understanding methods.
Benchmark results on the Phys-AD dataset. (a)(b) Video-level AUROC results for unsupervised methods, weakly supervised methods, and video-understanding methods. (c) PAEval results for video-understanding methods.
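For reference, video-level AUROC treats anomaly detection as ranking videos by a per-video anomaly score. A minimal sketch with scikit-learn follows; the labels and scores are illustrative placeholders, not benchmark numbers.

```python
# Video-level AUROC from per-video anomaly scores.
# ASSUMPTION: each detector emits one scalar score per video; the values
# below are made-up placeholders for demonstration only.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1, 0, 1]              # 1 = anomalous video, 0 = normal
scores = [0.1, 0.4, 0.8, 0.7, 0.2, 0.9]  # detector's per-video anomaly scores

print(f"Video-level AUROC: {roc_auc_score(labels, scores):.3f}")
```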
@article{li2025towards,
title={Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection},
author={Li, Wenqiao and Gu, Yao and Chen, Xintao and Xu, Xiaohao and Hu, Ming and Huang, Xiaonan and Wu, Yingna},
journal={arXiv preprint arXiv:2503.03562},
year={2025}
}