Preliminary
Although text-to-motion generation research has recently become popular, approaches to applying it in various domains remain underexplored.
One direct application is to retarget the generated motion to existing agents. MoMask's codebase provides a tool to convert the generated motion's keypoints to the Biovision Hierarchy (BVH) format. This converts the generated motion representation into a format widely adopted in industry, which lets us load the result into popular 3D modeling software such as Blender.
However, artifacts appear in the converted motion, such as a twisted foot after turning around, a wrongly oriented head, and temporally inconsistent motion. Addressing these issues requires closely inspecting the conversion code and fixing it to produce a visually more pleasing result.
Analysis
MoMask implements the conversion with the following steps for each frame. First, it extracts keypoints from the generated motion representation, which is composed of the body joints' position, velocity, and rotation information. Note that only the position information is extracted, because the imprecise predicted rotations would otherwise interfere with the conversion results.
positions = positions[:, self.re_order]
new_anim = self.template.copy()
new_anim.rotations = Quaternions.id(positions.shape[:-1])
new_anim.positions = new_anim.positions[0:1].repeat(positions.shape[0], axis=0)
new_anim.positions[:, 0] = positions[:, 0]

Second, they solve inverse kinematics on the extracted keypoints to match the pre-defined bone structure.
def __call__(self):
    children = AnimationStructure.children_list(self.animation.parents)
    for i in range(self.iterations):
        for j in AnimationStructure.joints(self.animation.parents):
            c = np.array(children[j])
            if len(c) == 0:
                continue
            anim_transforms = Animation.transforms_global(self.animation)
            anim_positions = anim_transforms[:, :, :3, 3]
            anim_rotations = Quaternions.from_transforms(anim_transforms)
            # Direction from joint j to its children, current vs. target
            jdirs = anim_positions[:, c] - anim_positions[:, np.newaxis, j]
            ddirs = self.positions[:, c] - anim_positions[:, np.newaxis, j]
            jsums = np.sqrt(np.sum(jdirs ** 2.0, axis=-1)) + 1e-10
            dsums = np.sqrt(np.sum(ddirs ** 2.0, axis=-1)) + 1e-10
            jdirs = jdirs / jsums[:, :, np.newaxis]
            ddirs = ddirs / dsums[:, :, np.newaxis]
            angles = np.arccos(np.sum(jdirs * ddirs, axis=2).clip(-1, 1))
            axises = np.cross(jdirs, ddirs)
            axises = -anim_rotations[:, j, np.newaxis] * axises
            rotations = Quaternions.from_angle_axis(angles, axises)
            if rotations.shape[1] == 1:
                averages = rotations[:, 0]
            else:
                averages = Quaternions.exp(rotations.log().mean(axis=-2))
            self.animation.rotations[:, j] = self.animation.rotations[:, j] * averages
        if not self.silent:
            anim_positions = Animation.positions_global(self.animation)
            error = np.mean(np.sum((anim_positions - self.positions) ** 2.0, axis=-1) ** 0.5)
            print('[BasicInverseKinematics] Iteration %i Error: %f' % (i + 1, error))
    return self.animation

However, the keypoints are ill-defined along the roll axis: positions alone cannot determine how much each bone rotates about its own length. As a result, the converted BVH file, which stores each joint's rotation, exhibits egregious artifacts.
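To see why positions alone leave the roll undetermined, consider the following minimal sketch (the quaternion helpers are my own, not MoMask's Quaternions class): any extra rotation about the bone's own axis leaves the child joint's position exactly unchanged, so no IK on keypoints can recover it.

```python
import numpy as np

# Quaternions stored as (w, x, y, z).
def qmul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def qrot(q, v):
    # Rotate vector v by unit quaternion q: q * (0, v) * q^-1
    qv = np.concatenate(([0.0], v))
    qc = q * np.array([1.0, -1.0, -1.0, -1.0])
    return qmul(qmul(q, qv), qc)[1:]

def from_angle_axis(angle, axis):
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

bone = np.array([0.0, 1.0, 0.0])               # child offset along the bone's Y-axis
swing = from_angle_axis(0.3, [1.0, 0.0, 0.0])  # some rotation of the joint
roll = from_angle_axis(1.2, bone)              # arbitrary roll about the bone itself

p_without_roll = qrot(swing, bone)
p_with_roll = qrot(qmul(swing, roll), bone)
print(np.allclose(p_without_roll, p_with_roll))  # True: the keypoint cannot see the roll
```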
Solution
To address this issue, stabilizing the roll axis of each joint is necessary. Although it may seem that one could simply zero out all the roll angles, the chain of joint transformations from local joint space to global world space makes things complicated. Since rotation order matters, modifying a joint's roll axis in local space is not equivalent to modifying it in global space.
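A quick way to convince oneself that the order matters: rotation composition is not commutative, so applying a roll correction before the parent chain's rotation (global space) differs from applying it after (local space). A minimal check, using a hand-rolled quaternion product as a stand-in:

```python
import numpy as np

# Quaternions stored as (w, x, y, z).
def qmul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def from_angle_axis(angle, axis):
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

pitch = from_angle_axis(0.5, [1.0, 0.0, 0.0])  # stand-in for the parent chain's rotation
roll = from_angle_axis(0.5, [0.0, 1.0, 0.0])   # a roll correction about the local Y-axis

# Roll applied in local space (after) vs. global space (before) gives different results:
print(np.allclose(qmul(pitch, roll), qmul(roll, pitch)))  # False
```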
After searching online for how others handle such a problem, I found a keyword: swing-twist decomposition. Swing-twist decomposition is widely adopted in the gaming industry for controlling character animation in a more intuitive way. It decomposes a quaternion into a swing component and a twist component about a given axis, so animators can smoothly interpolate character animation using either component independently. It is useful in our situation, since fixing the global roll axis reduces to stabilizing the twist component of each joint.
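A minimal swing-twist decomposition can be sketched as follows (pure NumPy, quaternions as (w, x, y, z); the helper names are my own illustration, not MoMask's API). The twist is the normalized projection of the quaternion's vector part onto the chosen axis, and the swing is whatever remains:

```python
import numpy as np

def qmul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def qconj(q):
    # Conjugate equals inverse for a unit quaternion.
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def swing_twist(q, axis):
    """Split unit quaternion q into (swing, twist) with q = swing * twist,
    where twist is a rotation purely about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    proj = np.dot(q[1:], axis) * axis          # vector part projected onto the axis
    twist = np.concatenate(([q[0]], proj))
    norm = np.linalg.norm(twist)
    if norm < 1e-10:                           # 180-degree swing: twist is undefined
        twist = np.array([1.0, 0.0, 0.0, 0.0])
    else:
        twist = twist / norm
    swing = qmul(q, qconj(twist))              # q * twist^-1
    return swing, twist

# Sanity check: recomposing gives the original quaternion back.
q = np.array([0.7, 0.2, 0.6, 0.1])
q = q / np.linalg.norm(q)
swing, twist = swing_twist(q, [0.0, 1.0, 0.0])
print(np.allclose(qmul(swing, twist), q))  # True
```

With axis = Y, the resulting twist has zero x and z components, i.e. it is a pure rotation about the bone's roll axis, which is exactly the part we want to stabilize.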
The pseudo code is provided below; the implementation will be released after my submission to SIGGRAPH Asia has good results.
for each joint:
    twist_axis = [0, 1, 0]  # a bone's roll is defined about its local Y-axis
    twist_rotation = local_joint_rotation.twist(twist_axis)
    swing_rotation = local_joint_rotation * twist_rotation.inverse()
    smoothed_twist_rotation = exponential moving average of twist_rotation over frames
    local_joint_rotation = swing_rotation * smoothed_twist_rotation
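The pseudo code above can be sketched as runnable NumPy, assuming one joint's local rotations are given as an (n_frames, 4) array of (w, x, y, z) unit quaternions; the quaternion EMA is approximated with a normalized lerp. All names here are my own illustration, not the released implementation:

```python
import numpy as np

def qmul(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def swing_twist(q, axis):
    # axis is assumed unit-length; see the decomposition sketch above.
    proj = np.dot(q[1:], axis) * axis
    twist = np.concatenate(([q[0]], proj))
    n = np.linalg.norm(twist)
    twist = np.array([1.0, 0.0, 0.0, 0.0]) if n < 1e-10 else twist / n
    swing = qmul(q, twist * np.array([1.0, -1.0, -1.0, -1.0]))  # q * twist^-1
    return swing, twist

def nlerp(a, b, t):
    if np.dot(a, b) < 0.0:
        b = -b                          # take the shorter arc on the quaternion sphere
    q = (1.0 - t) * a + t * b
    return q / np.linalg.norm(q)

def stabilize_roll(local_rotations, axis=np.array([0.0, 1.0, 0.0]), alpha=0.1):
    """Replace each frame's twist with an exponentially smoothed twist."""
    out = np.empty_like(local_rotations)
    smoothed = None
    for f, q in enumerate(local_rotations):
        swing, twist = swing_twist(q, axis)
        smoothed = twist if smoothed is None else nlerp(smoothed, twist, alpha)
        out[f] = qmul(swing, smoothed)
    return out

# Toy usage: a joint whose roll about Y jitters randomly from frame to frame.
rng = np.random.default_rng(0)
frames = []
for t in range(60):
    a = 0.2 * rng.standard_normal()     # noisy roll angle per frame
    frames.append([np.cos(a / 2), 0.0, np.sin(a / 2), 0.0])
frames = np.array(frames)
smoothed = stabilize_roll(frames)

raw_roll = 2 * np.arctan2(frames[:, 2], frames[:, 0])
out_roll = 2 * np.arctan2(smoothed[:, 2], smoothed[:, 0])
print(np.std(out_roll) < np.std(raw_roll))  # smoothing reduces the roll jitter
```

Because the swing component is left untouched, the joint still tracks the IK-fitted keypoints; only the unobservable roll is damped across frames.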