Inferring 3D Articulated Models for Box Packaging Robot, Paul Heran Yang, Tiffany Low, Matthew Cong, Ashutosh Saxena. In R:SS workshop on mobile manipulation, 2011. [pdf]


Project Description

Given a point cloud of a scene, we present a method for extracting an articulated 3D model that represents the kinematic structure of an object. We apply this method to enable a robot to autonomously close boxes of several shapes and sizes. Such an ability is of interest for personal assistant robots as well as for commercial robots in applications such as packaging and shipping.

Figure 1. The kinematic structure of a linear chain of joints compared with that of a box.

Previous work on articulated structures represented planar kinematic models as linear structures, i.e., chains of rigid bodies connected by joints. Linear structures greatly simplify inference because they allow parts of the joint inference problem to be treated as a set of independent sub-problems.

However, complex objects cannot easily be represented by linear chains. A box, for example, is an oriented arrangement of highly interconnected segments. In a standard box (see Figure 1), the sides and the base are each connected to four other faces, while each flap extends outwards from only one face. A linear model cannot express these relations and constraints on the object’s structure. Our proposed model for 3D kinematic objects increases expressivity while maintaining computational tractability.

Figure 2. The graphical representation of a box model and its kinematic structure.

We first segment the point cloud into clusters, each of which is then decomposed into segments. Our goal is to match the segments to a desired kinematic model. In our application, the model is a box consisting of four sides, a base, and four flaps. To identify a box in the given point cloud data, we search over all segments, checking them against a list of relations and features. A feature describes a segment’s own properties or its pose relative to a ground reference. A relation, in contrast, encodes the relative values of a set of properties between at least two segments within the model. The relations among the segments of a box model are shown graphically in Figure 2. Note that once the four sides have been determined, matching segments to the flaps and the bottom face becomes a series of problems, each linear in n, the number of segments. The state space then becomes tractable for our application.

Following the conditional independence assumption encoded by the graphical model in Figure 2, we only have to consider a tractable number of box assignments for inferring the optimal configuration. We evaluate all possible segment configurations in our box model against the list of features and relations and select the optimal one.
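The decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `best_box`, the slot names in `SLOTS`, and the scoring callbacks `side_score` and `attach_score` are hypothetical stand-ins for the features and relations of the box model. Once a candidate set of four sides is fixed, each flap and the base is chosen by its own linear scan over the remaining segments.

```python
from itertools import permutations

# Hypothetical slot names: four flaps plus the base.
SLOTS = ("flap0", "flap1", "flap2", "flap3", "base")


def best_box(segments, side_score, attach_score):
    """Return (total_score, sides, attachments) for the best assignment found.

    side_score(sides) scores an ordered 4-tuple of side segments.
    attach_score(slot, sides, seg) scores putting `seg` into `slot`.
    Both callbacks are assumptions made for this sketch.
    """
    best_total, best_sides, best_attach = float("-inf"), None, None
    for sides in permutations(segments, 4):
        rest = [s for s in segments if s not in sides]
        total = side_score(sides)
        attach = {}
        for slot in SLOTS:
            # Given the sides, each slot is scored independently of the
            # other slots, so it costs one linear scan over `rest`.
            # Simplification: this sketch does not stop two slots from
            # picking the same segment.
            cand = max(rest, key=lambda seg: attach_score(slot, sides, seg),
                       default=None)
            gain = attach_score(slot, sides, cand) if cand is not None else 0.0
            if gain <= 0.0:
                cand, gain = None, 0.0  # leave the slot unmatched
            attach[slot] = cand
            total += gain
        if total > best_total:
            best_total, best_sides, best_attach = total, sides, attach
    return best_total, best_sides, best_attach
```

With n segments, the work per candidate set of sides is proportional to the number of slots times n, rather than growing with the number of joint flap and base assignments.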

Given the RGB-D data, we first identify and extract segments. We apply the RANSAC algorithm [5] to extract planar segments from the point cloud. However, the planes returned by RANSAC are not bounded by rectangles and may contain outliers, so we filter the extracted planes using Euclidean clustering. To obtain the corners of each plane from its points, we construct the convex hull of the points and find its minimum bounding rectangle.
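The last step, going from the points of a plane to four corners, can be illustrated with a short sketch. It assumes the plane's inlier points have already been projected into 2D in-plane coordinates, and it uses one standard construction (Andrew's monotone chain for the hull, then testing the rectangle aligned with each hull edge), which is not necessarily the one used in the original system. The function names are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_bounding_rectangle(points):
    """Return (area, corners) of the minimum-area enclosing rectangle.

    Relies on the fact that a minimum-area enclosing rectangle has one
    side collinear with an edge of the convex hull.
    """
    hull = convex_hull(points)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear points")
    best = None
    for i in range(len(hull)):
        x0, y0 = hull[i]
        x1, y1 = hull[(i + 1) % len(hull)]
        length = math.hypot(x1 - x0, y1 - y0)
        ux, uy = (x1 - x0) / length, (y1 - y0) / length  # edge direction
        vx, vy = -uy, ux                                 # perpendicular
        us = [px * ux + py * uy for px, py in hull]
        vs = [px * vx + py * vy for px, py in hull]
        area = (max(us) - min(us)) * (max(vs) - min(vs))
        if best is None or area < best[0]:
            extents = ((min(us), min(vs)), (max(us), min(vs)),
                       (max(us), max(vs)), (min(us), max(vs)))
            # Map the rectangle corners back from (u, v) to (x, y).
            corners = [(u * ux + v * vx, u * uy + v * vy) for u, v in extents]
            best = (area, corners)
    return best
```

The returned corners are in the same 2D plane coordinates as the input and would still need to be mapped back into 3D using the plane's pose.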

From the given set of segments, we apply our inference algorithm to find the model that best describes a box. The algorithm relies on a set of learned weights to score a model against the list of features and relations. To speed up inference further, we maintain a priority queue and search the higher-scoring models first.
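One way such a weighted, priority-ordered search could look is sketched below. The source does not specify the scoring function or the stopping rule, so everything here is an assumption: the score is taken to be a dot product of learned weights with a feature vector, candidates are ordered by a cheap partial score, and the hypothetical `complete` callback returns the remaining (more expensive) part of the score, bounded above by `max_completion`.

```python
import heapq


def linear_score(weights, features):
    """Weighted sum of feature values (assumed form of the model score)."""
    return sum(w * f for w, f in zip(weights, features))


def best_first_search(candidates, weights, complete, max_completion):
    """Expand candidates in order of decreasing partial score.

    candidates: list of (name, feature_vector) pairs.
    complete(name): extra score for finishing the model (hypothetical).
    max_completion: upper bound on complete(name), used to stop early.
    Returns (best_name, best_score, number_of_candidates_expanded).
    """
    # heapq is a min-heap, so store negated scores to pop the highest first.
    heap = [(-linear_score(weights, feats), name) for name, feats in candidates]
    heapq.heapify(heap)
    best_score, best_name, expanded = float("-inf"), None, 0
    while heap:
        neg_partial, name = heapq.heappop(heap)
        partial = -neg_partial
        if partial + max_completion <= best_score:
            break  # no remaining candidate can beat the current best
        expanded += 1
        total = partial + complete(name)
        if total > best_score:
            best_score, best_name = total, name
    return best_name, best_score, expanded
```

Because candidates leave the queue in order of decreasing partial score, the search can stop as soon as the best possible completion of the next candidate cannot beat the best full score found so far.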

To close a box in the environment, the robot needs to identify and locate the box and then manipulate it into the desired state. Each flap can be closed along any of a set of paths. Given the box model with the location and orientation of each flap, we use OpenRAVE’s [3] underconstrained inverse kinematics solver. The program applies a brute-force search by defining intermediate rotating planes: the planner picks sampled points from each intermediate plane and searches for paths through them.
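The idea of intermediate rotating planes can be illustrated geometrically. The sketch below does not use OpenRAVE and is not the authors' planner; it only shows, under the assumption that a flap rotates rigidly about its hinge edge, how one point on the flap moves through a sequence of intermediate flap orientations. The function names and parameters are hypothetical.

```python
import math


def rotate_about_axis(point, origin, axis, angle):
    """Rotate `point` by `angle` (radians) about the line through `origin`
    with direction `axis`, using Rodrigues' rotation formula."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]
    v = [p - o for p, o in zip(point, origin)]
    c, s = math.cos(angle), math.sin(angle)
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    r = [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1 - c)
         for i in range(3)]
    return tuple(r[i] + origin[i] for i in range(3))


def flap_waypoints(tip, hinge_point, hinge_axis, total_angle, steps):
    """Positions of a flap point on `steps + 1` evenly spaced intermediate
    planes, from the current flap pose to the pose rotated by `total_angle`."""
    return [rotate_about_axis(tip, hinge_point, hinge_axis,
                              total_angle * i / steps)
            for i in range(steps + 1)]
```

A planner could sample several points on each intermediate plane in this way and pass them to an inverse kinematics solver as candidate end-effector targets.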

Our experiments were conducted using an Adept Viper s850 6-DOF arm fitted with a parallel-plate gripper, with a reach of approximately 100 centimeters. To obtain point cloud data, a Microsoft Kinect was mounted between the third and fourth DOF. This position was chosen so that the orientation of the camera can be changed to obtain a variety of viewpoints.

To demonstrate the robustness of our algorithm, we collected point cloud data for several classes of boxes, including standard packaging boxes of various sizes and unusually shaped boxes. We then ran a series of experiments on an extensive dataset to determine the accuracy of the inference algorithm, and verified in simulation that the robot was able to close the box.

Figure 3. Gallery of point cloud data and results of our algorithm.

Figure 3 shows the original image, the point cloud data, the segments observed and finally the matching returned by the algorithm. We see that the robot can identify a diverse set of boxes. The performance remains stable for boxes of various sizes and the algorithm is able to recognize nonstandard boxes.

Our algorithm is able to filter out noise in the data such as objects placed around or inside the box. When more than one box is in the scene, the algorithm is able to recognize the segments as belonging to distinct objects.

The algorithm identifies box models with 76.55% accuracy. For the closing task, however, only the flaps matter, and the algorithm correctly determines box flaps from segments with 86.74% accuracy. The robot therefore performs well at identifying the flaps to be closed, which also indicates that the algorithm tolerates noise and a range of box configurations and types.

[1] D.H. Ballard. Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognition, 1981.
[2] David Crandall, Pedro Felzenszwalb, and Daniel Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR, 2005.
[3] R. Diankov and J. Kuffner. OpenRAVE: A Planning Architecture for Autonomous Robotics. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-08-34, 2008.
[4] Dov Katz, Yuri Pyuro, and Oliver Brock. Learning to Manipulate Articulated Objects in Unstructured Environments Using a Grounded Relational Representation. In RSS, 2008.
[5] M.A. Fischler and R.C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981.
[6] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.
[7] Dov Katz and Oliver Brock. Extracting planar kinematic models using interactive perception. In Unifying Perspectives in Computational and Robot Vision. Springer, 2008.
[8] Xiangyang Lan and Daniel P. Huttenlocher. Beyond trees: Common factor models for 2D human pose recovery. In ICCV, 2005.
[9] R.B. Rusu, Z.C. Marton, N. Blodow, M. Dolha, and M. Beetz. Towards 3D Point Cloud Based Object Maps for Household Environments. In RSS, 2008.
[10] R. Schnabel, R. Wahl, and R. Klein. Efficient RANSAC for Point-Cloud Shape Detection. In Computer Graphics Forum, 2007.
[11] J. Sturm, V. Pradeep, C. Stachniss, C. Plagemann, K. Konolige, and W. Burgard. Learning Kinematic Models for Articulated Objects. In IJCAI, 2009.
[12] Juergen Sturm, Kurt Konolige, Cyrill Stachniss, and Wolfram Burgard. 3D pose estimation, tracking and model learning of articulated objects from dense depth video using projected texture stereo. In RSS, 2010.

Heran (Paul) Yang (hy279)
Tiffany Low (twl46)
Matthew Cong (mdc238)
Prof. Ashutosh Saxena (asaxena at