Di Wen

Ph.D. Student · Computer Vision · Multimodal Learning

I study how visual AI can understand people, actions, objects, and environments in open, changing worlds. My research connects computer vision, multimodal perception, and embodied intelligence, with a long-term interest in models that reason, generalize, and learn beyond closed datasets.

I am a Ph.D. student in the CV:HCI Lab at Karlsruhe Institute of Technology, supervised by Prof. Rainer Stiefelhagen. Since April 2024, my work has spanned fine-grained action and scene understanding, human-object interaction, open-set and domain-generalized learning, noisy-label learning, and benchmarks for industrial, wearable, and microgravity settings. My work has been published at ECCV, NeurIPS, ICLR, IROS, and ICRA, and in IJCV, accompanied by public datasets, benchmarks, and code. I am broadly open to research collaborations, student projects, and thesis supervision across computer vision and AI, including directions beyond my current publication topics.

News

  1. IMPACT is online with its project page, codebase, and dataset release links.
  2. Go Beyond Earth is accepted to ICLR 2026, and MICA is accepted to ICRA 2026.
  3. New work on unified video human-object interaction detection and anticipation is available on arXiv.
  4. RoHOI, a robustness benchmark for human-object interaction detection, and Snap, Segment, Deploy, an industrial assistant resource, are now publicly available.

Research Agenda

Visual Intelligence Beyond Closed Benchmarks

Open-World Visual Understanding

Understanding actions, objects, scenes, and interactions when categories, contexts, temporal structure, and goals are not fixed in advance.

Multimodal and Embodied Perception

Connecting video, language, spatial cues, and human context so visual models can reason about activity, affordance, and future actions in the physical world.

Reliable Learning and Evaluation

Training and evaluating models under noisy labels, scarce annotations, domain shifts, uncertain evidence, and deployment pressure.

Selected Work

Selected Research Resources

A selection of public datasets, benchmarks, and code releases connected to individual papers; it is not a complete representation of my research. My full publication record is available through the publication list and Google Scholar.

Public Dataset

IMPACT

Multi-view procedural action understanding dataset for industrial assembly.

Robustness Benchmark

RoHOI

Robustness benchmark for human-object interaction detection.

Research Code

MICA

Multi-agent industrial coordination assistant for recognition and interactive reasoning.

Contact

I welcome research conversations, collaborations, student projects, and thesis supervision across computer vision and AI. I am especially interested in ideas that connect perception with reasoning, multimodal learning, embodied intelligence, scientific or industrial applications, and rigorous evaluation, and I am happy to discuss directions beyond the topics listed here.