Xiaoxuan Ma (马霄璇)

I'm a second-year Ph.D. student in Computer Science at CFCS, Peking University. I'm a member of the Computer Vision and Digital Art group, advised by Prof. Yizhou Wang. I received my Bachelor's and Master's degrees in Computer Science from Peking University in 2018 and 2021, respectively.

Email / Google Scholar / LinkedIn / GitHub

Research

I'm interested in computer vision and machine learning, especially 3D human pose estimation and reconstruction.

3D Human Mesh Estimation from Virtual Markers
Xiaoxuan Ma, Jiajun Su, Chunyu Wang, Wentao Zhu, Yizhou Wang
CVPR, 2023

abstract / bibtex / paper / code

Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which, the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. The advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows to extract realistic meshes from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface based on the large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes.

TBD

We introduce a novel representation named virtual markers, mimicking the effects of physical markers, which can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation.

GFPose: Learning 3D Human Pose Prior with Gradient Fields
Hai Ci, Mingdong Wu, Wentao Zhu, Xiaoxuan Ma, Hao Dong, Fangwei Zhong, Yizhou Wang
CVPR, 2023

abstract / bibtex / paper / code / project page

Learning 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite the simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on Human3.6M dataset. 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone. 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion and generation tasks.

@article{ci2022gfpose,
  title = {GFPose: Learning 3D Human Pose Prior with Gradient Fields},
  author = {Ci, Hai and Wu, Mingdong and Zhu, Wentao and Ma, Xiaoxuan and Dong, Hao and Zhong, Fangwei and Wang, Yizhou},
  journal = {arXiv preprint arXiv:2212.08641},
  year = {2023},}

We present GFPose, a versatile and elegant framework that models plausible 3D human poses for various applications by implicitly incorporating pose priors in gradient fields.

Virtual Pose: Learning Generalizable 3D Human Pose Models from Virtual Data
Jiajun Su, Chunyu Wang, Xiaoxuan Ma, Wenjun Zeng, Yizhou Wang
ECCV, 2022

abstract / bibtex / paper / code

While monocular 3D pose estimation seems to have achieved very accurate results on the public datasets, their generalization ability is largely overlooked. In this work, we perform a systematic evaluation of the existing methods and find that they get notably larger errors when tested on different cameras, human poses and appearance. To address the problem, we introduce VirtualPose, a two-stage learning framework to exploit the hidden “free lunch” specific to this task, i.e., generating an infinite number of poses and cameras for training models at no cost. To that end, the first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses. It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses. It outperforms the SOTA methods without using any paired images and 3D poses from the benchmarks, which paves the way for practical applications.

@inproceedings{su2022virtualpose,
  title={VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data},
  author={Su, Jiajun and Wang, Chunyu and Ma, Xiaoxuan and Zeng, Wenjun and Wang, Yizhou},
  booktitle={European Conference on Computer Vision},
  pages={55--71},
  year={2022},
  organization={Springer}}

We address the generalization issues in monocular 3D absolute pose estimation by introducing an intermediate representation termed Abstract Geometry Representation (AGR).

Context Modeling in 3D Human Pose Estimation: A Unified Perspective
Xiaoxuan Ma*, Jiajun Su*, Chunyu Wang, Hai Ci, Yizhou Wang
CVPR, 2021

abstract / bibtex / paper / code

Estimating 3D human pose from a single image suffers from severe ambiguity since multiple 3D joint configurations may have the same 2D projection. The state-of-the-art methods often rely on context modeling methods such as pictorial structure model (PSM) or graph neural network (GNN) to reduce ambiguity. However, there is no study that rigorously compares them side by side. So we first present a general formula for context modeling in which both PSM and GNN are its special cases. By comparing the two methods, we found that the end-to-end training scheme in GNN and the limb length constraints in PSM are two complementary factors to improve results. To combine their advantages, we propose ContextPose based on attention mechanism that allows enforcing soft limb length constraints in a deep network. The approach effectively reduces the chance of getting absurd 3D pose estimates with incorrect limb lengths and achieves state-of-the-art results on two benchmark datasets. More importantly, the introduction of limb length constraints into deep networks enables the approach to achieve much better generalization performance.

@InProceedings{Ma_2021_CVPR,
author= {Ma, Xiaoxuan and Su, Jiajun and Wang, Chunyu and Ci, Hai and Wang, Yizhou},
title= {Context Modeling in 3D Human Pose Estimation: A Unified Perspective},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month= {June},
year= {2021},
pages= {6238-6247}}

We propose a general formula for context modeling in the monocular 3D human pose estimation task.

Locally Connected Network for Monocular 3D Human Pose Estimation
Hai Ci*, Xiaoxuan Ma*, Chunyu Wang, Yizhou Wang
T-PAMI, 2020

abstract / bibtex / paper / code

We present an approach for 3D human pose estimation from monocular images. The approach consists of two steps: it first estimates a 2D pose from an image and then estimates the corresponding 3D pose. This paper focuses on the second step. Graph convolutional network (GCN) has recently become the de facto standard for human pose related tasks such as action recognition. However, in this work, we show that GCN has critical limitations when it is used for 3D pose estimation due to the inherent weight sharing scheme. The limitations are clearly exposed through a novel reformulation of GCN, in which both GCN and Fully Connected Network (FCN) are its special cases. In addition, on top of the formulation, we present locally connected network (LCN) to overcome the limitations of GCN by allocating dedicated rather than shared filters for different joints. We jointly train the LCN network with a 2D pose estimator such that it can handle inaccurate 2D poses. We evaluate our approach on two benchmark datasets and observe that LCN outperforms GCN, FCN, and the state-of-the-art methods by a large margin. More importantly, it demonstrates strong cross-dataset generalization ability because of sparse connections among body joints.

@ARTICLE{9174911,
author={Ci, Hai and Ma, Xiaoxuan and Wang, Chunyu and Wang, Yizhou},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
title={Locally Connected Network for Monocular 3D Human Pose Estimation}, 
year={2022},
volume={44},
number={3},
pages={1429-1442},
doi={10.1109/TPAMI.2020.3019139}}

We present an end-to-end approach that uses a Locally Connected Network (LCN) to estimate 3D human pose from a monocular image.

Optimizing Network Structure for 3D Human Pose Estimation
Hai Ci, Chunyu Wang, Xiaoxuan Ma, Yizhou Wang
ICCV, 2019

abstract / bibtex / paper / code

A human pose is naturally represented as a graph where the joints are the nodes and the bones are the edges. So it is natural to apply Graph Convolutional Network (GCN) to estimate 3D poses from 2D poses. In this work, we propose a generic formulation where both GCN and Fully Connected Network (FCN) are its special cases. From this formulation, we discover that GCN has limited representation power when used for estimating 3D poses. We overcome the limitation by introducing Locally Connected Network (LCN) which is naturally implemented by this generic formulation. It notably improves the representation capability over GCN. In addition, since every joint is only connected to a few joints in its neighborhood, it has strong generalization power. The experiments on public datasets show it: (1) outperforms the state-of-the-arts; (2) is less data hungry than alternative models; (3) generalizes well to unseen actions and datasets.

@InProceedings{Ci_2019_ICCV,
author = {Ci, Hai and Wang, Chunyu and Ma, Xiaoxuan and Wang, Yizhou},
title = {Optimizing Network Structure for 3D Human Pose Estimation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}}

We present the Locally Connected Network (LCN) to overcome the limitations of GCN in 3D human pose estimation.

Education

Ph.D. candidate
CVDA Lab, CFCS, Peking University, Beijing
Sep. 2021 ~ Now
Supervisor: Prof. Yizhou Wang

Master's degree
CVDA Lab, CFCS, Peking University, Beijing
Sep. 2018 ~ Jun. 2021
Supervisor: Prof. Yizhou Wang

Bachelor's degree
Department of Computer Science, Peking University, Beijing, China
Sep. 2014 ~ Jun. 2018

Website template from Jon Barron.