Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection

CVPR'25
¹ShanghaiTech University  ²University of Michigan, Ann Arbor  ³Monash University

Abstract

Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes about 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes.

Phys-AD Dataset

Overview of Phys-AD dataset

  • The Phys-AD dataset contains 22 object classes, with 6,359 videos and 796,320 frames in total.
  • Videos are recorded at a resolution of 1920×1080 to capture detailed object appearance.
  • A frame rate of 60 FPS is used to better capture fast object motion.
  • Each video lasts from 1 s to 4 s, depending on the class (a minimal loading sketch follows this list).
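As a quick illustration of working with the data, here is a minimal Python loading sketch. The directory layout (`phys_ad/<class>/<video>.mp4`), the root path `PHYS_AD_ROOT`, and the helper names are assumptions for illustration, not the released structure:

```python
import cv2  # pip install opencv-python
from pathlib import Path

PHYS_AD_ROOT = Path("phys_ad")  # hypothetical root; the released layout may differ

def iter_videos(root: Path):
    """Yield (class_name, video_path) for every .mp4 under each class directory."""
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for video_path in sorted(class_dir.rglob("*.mp4")):
            yield class_dir.name, video_path

def load_frames(video_path: Path):
    """Decode a video into a list of BGR uint8 frames with OpenCV."""
    cap = cv2.VideoCapture(str(video_path))
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)  # 1080x1920x3 per the dataset spec
    cap.release()
    return frames

for class_name, path in iter_videos(PHYS_AD_ROOT):
    frames = load_frames(path)
    print(f"{class_name}/{path.name}: {len(frames)} frames")  # ~60-240 frames at 60 FPS
    break
```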

Depending on each object's function, we apply multiple interactions with a UR5 robot arm and a motor/servo, such as sliding, moving, squeezing, stretching, and rotating.


PAEval

To assess the ability of VLMs to understand videos, we propose a novel evaluation metric: PAEval. By comparing the outputs of a VLM with the reference description of a video and the explanation of its anomalies separately, we obtain the PAEval scores for the video; a sketch of this two-part scoring is shown below.
Each label is written by a human annotator and enhanced with GPT-4o, and all enhanced labels are then verified by humans for accuracy. With this evaluation, we expect VLMs to focus on the essential parts and understand the video as an experienced human would.
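The exact PAEval formulation is defined in the paper; as one plausible sketch, the snippet below scores a VLM output against the reference description and the reference anomaly explanation separately, using sentence-embedding cosine similarity (the embedding model, function names, and example strings are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model, not the paper's choice

def paeval_scores(vlm_output: str, gt_description: str, gt_explanation: str):
    """Return (description_score, explanation_score) as cosine similarities in [-1, 1]."""
    emb = model.encode([vlm_output, gt_description, gt_explanation], convert_to_tensor=True)
    desc_score = util.cos_sim(emb[0], emb[1]).item()  # how well the video content is described
    expl_score = util.cos_sim(emb[0], emb[2]).item()  # how well the physical cause is explained
    return desc_score, expl_score

# Toy example with made-up strings
desc, expl = paeval_scores(
    vlm_output="The fan blade wobbles and stalls because one screw is loose.",
    gt_description="A desk fan mounted on a motor spins in front of the camera.",
    gt_explanation="One blade is loosely attached, so the rotation is unbalanced and stalls.",
)
print(f"description: {desc:.3f}, explanation: {expl:.3f}")
```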

Benchmark

We evaluate the dataset with state-of-the-art and classic methods using AUROC and PAEval. The methods fall into three categories: unsupervised methods, weakly supervised methods, and video-understanding methods.

    Benchmark results on the Phys-AD dataset. (a)(b) Video-level AUROC results for unsupervised, weakly supervised, and video-understanding methods. (c) PAEval results for video-understanding methods.
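For the AUROC side of the benchmark, here is a minimal sketch of video-level scoring, assuming per-frame anomaly scores are max-pooled into one score per video (aggregation choices vary across methods):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def video_level_auroc(frame_scores_per_video, video_labels):
    """Max-pool per-frame anomaly scores to one score per video, then compute AUROC."""
    video_scores = np.array([np.max(s) for s in frame_scores_per_video])
    return roc_auc_score(video_labels, video_scores)

# Toy example: three videos, label 1 = anomalous
rng = np.random.default_rng(0)
scores = [rng.random(120), rng.random(240) + 0.5, rng.random(60)]
labels = [0, 1, 0]
print(f"video-level AUROC: {video_level_auroc(scores, labels):.3f}")
```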

Citation

    @article{li2025towards,
      title={Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection},
      author={Li, Wenqiao and Gu, Yao and Chen, Xintao and Xu, Xiaohao and Hu, Ming and Huang, Xiaonan and Wu, Yingna},
      journal={arXiv preprint arXiv:2503.03562},
      year={2025}
    }