Afford-Motion

Move as You Say, Interact as You Can:
Language-guided Human Motion Generation with Scene Affordance

CVPR 2024, Highlight

Zan Wang¹,², Yixin Chen², Baoxiong Jia², Puhao Li²,³, Jinlu Zhang²,⁴,
Jingze Zhang²,³, Tengyu Liu², Yixin Zhu⁵✉️, Wei Liang¹,⁶✉️, Siyuan Huang²✉️

¹School of Computer Science & Technology, Beijing Institute of Technology
²National Key Laboratory of General Artificial Intelligence, BIGAI
³Dept. of Automation, Tsinghua University
⁴CFCS, School of Computer Science, Peking University
⁵Institute for AI, Peking University
⁶Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing
✉️ indicates corresponding author

arXiv Paper Supplementary Video Code

🏃‍♀️ We introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation.

🥰 Abstract

Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty of generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
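The two-stage data flow described in the abstract can be outlined roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the one-score-per-scene-point affordance format, the default frame and joint counts, and the placeholder logic inside each stage are all assumptions made for this example. Both stages are stubbed with random values where the real system would run a diffusion model (ADM, then AMDM).

```python
import random
from typing import List, Tuple

Point = Tuple[float, float, float]


def predict_affordance_map(scene_points: List[Point], text: str,
                           rng: random.Random) -> List[float]:
    """Stage 1 stand-in (hypothetical): one affordance score per scene point.

    A real model would condition on `text`; this stub ignores it and
    returns random placeholder scores in [0, 1).
    """
    return [rng.random() for _ in scene_points]


def generate_motion(scene_points: List[Point], affordance_map: List[float],
                    text: str, num_frames: int, num_joints: int,
                    rng: random.Random) -> List[List[Point]]:
    """Stage 2 stand-in (hypothetical): motion conditioned on the affordance map.

    Placeholder logic: every joint hovers near the affordance-weighted
    centroid of the scene, with small Gaussian jitter per frame.
    """
    total = sum(affordance_map)
    if total == 0.0:
        # Fall back to uniform weights if every score happens to be zero.
        weights = [1.0 / len(scene_points)] * len(scene_points)
    else:
        weights = [a / total for a in affordance_map]
    target = tuple(
        sum(w * p[k] for w, p in zip(weights, scene_points)) for k in range(3)
    )
    return [
        [tuple(target[k] + rng.gauss(0.0, 0.01) for k in range(3))
         for _ in range(num_joints)]
        for _ in range(num_frames)
    ]


def run_pipeline(scene_points: List[Point], text: str, num_frames: int = 60,
                 num_joints: int = 22, seed: int = 0):
    """Chain the two stages: the affordance map is the only link between them."""
    rng = random.Random(seed)
    affordance_map = predict_affordance_map(scene_points, text, rng)
    motion = generate_motion(scene_points, affordance_map, text,
                             num_frames, num_joints, rng)
    return affordance_map, motion
```

The point of the sketch is the interface: the second stage never sees the scene and the language in isolation, only together with the intermediate affordance map produced by the first stage.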

🥳 Results on HumanML3D

The person walks in a clockwise circle

The person is walking forward and then back the other direction

A person jogs forward and semi circles around to the left and then to the right

A man squats deeply three times while raising both arms in the air as if holding a dumbbell

A person jumps from side to side right to left

A person waves with his left hand

🤩 Results on HUMANISE

Lie down on the table

Stand up from the table

Sit on the toilet

Sit on the table

Walk to the refrigerator

Walk to the chair

😋 Results on Our Novel Evaluation Set

A person wanders in the room around the table.

A man dances on the bed happily.

Someone stretches his arms overhead.

A person puts something on the table.

A person lies down on the floor.

A person picks an object up off the floor with his left hand.

A person is moving his arms around.

🥸 BibTeX

@inproceedings{wang2024move,
  title={Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance},
  author={Wang, Zan and Chen, Yixin and Jia, Baoxiong and Li, Puhao and Zhang, Jinlu and Zhang, Jingze and Liu, Tengyu and Zhu, Yixin and Liang, Wei and Huang, Siyuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
