Di Wen

Ph.D. Student · Computer Vision · Multimodal Learning

I study how visual AI can understand people, actions, objects, and environments in open, changing worlds. My research connects computer vision, multimodal perception, and embodied intelligence, with a long-term interest in models that reason, generalize, and learn beyond closed datasets.

I am a Ph.D. student in the CV:HCI Lab at Karlsruhe Institute of Technology, supervised by Prof. Rainer Stiefelhagen. Since April 2024, my work has spanned fine-grained action and scene understanding, human-object interaction, open-set and domain-generalized learning, noisy-label learning, and benchmarks for industrial, wearable, and microgravity settings. My work has been published at ECCV, NeurIPS, ICLR, IROS, and ICRA, and in IJCV, accompanied by public datasets, benchmarks, and code. I am broadly open to research collaborations, student projects, and thesis supervision across computer vision and AI, including directions beyond my current publication topics.

News

  1. IMPACT is online with its project page, codebase, and dataset release links.
  2. Go Beyond Earth is accepted to ICLR 2026, and MICA is accepted to ICRA 2026.
  3. New work on unified video human-object interaction detection and anticipation is available on arXiv.
  4. RoHOI, a robustness benchmark for human-object interaction detection, and Snap, Segment, Deploy, an industrial assistant resource, are now publicly available.

Research Agenda

Visual Intelligence Beyond Closed Benchmarks

Open-World Visual Understanding

Understanding actions, objects, scenes, and interactions when categories, contexts, temporal structure, and goals are not fixed in advance.

Multimodal and Embodied Perception

Connecting video, language, spatial cues, and human context so visual models can reason about activity, affordance, and future actions in the physical world.

Reliable Learning and Evaluation

Training and evaluating models under noisy labels, scarce annotations, domain shifts, uncertain evidence, and deployment pressure.

Selected Work

Selected Research Resources

A selection of public datasets, benchmarks, and code releases connected to individual papers; it is not a complete representation of my research. My full publication record is available through the publication list and Google Scholar.

Public Dataset

IMPACT

Multi-view procedural action understanding dataset for industrial assembly.

Robustness Benchmark

RoHOI

Robustness benchmark for human-object interaction detection.

Research Code

MICA

Multi-agent industrial coordination assistant for recognition and interactive reasoning.

Contact

I welcome research conversations, collaborations, student projects, and thesis supervision across computer vision and AI. I am especially interested in ideas that connect perception with reasoning, multimodal learning, embodied intelligence, scientific or industrial applications, and rigorous evaluation, and I am happy to discuss directions beyond the topics listed here.