400 Y. Konishi et al.
The remainder of the paper is organized as follows: Sect. 2 presents related
work on 6D pose estimation, image features for texture-less objects, and search
data structures. Section 3 introduces our proposed PCOF and HPT and the
6D pose estimation algorithm based on them. Section 4 evaluates the proposed
method and compares it with state-of-the-art methods. Section 5 concludes the
paper.
2 Related Work
6D Pose Estimation. 6D pose estimation has been studied extensively since
the 1980s, and in the early days template-based approaches using a monocular
image [3–5] were the mainstream. Since the early 2000s, keypoint detection
and descriptor matching have become popular for detection and pose estimation
of 2D/3D objects owing to their scalability to an increasing search space and
their robustness to changes in object pose. Though these methods can handle
texture-less objects when line features are used as the descriptors for
matching [10,11], they were fragile against cluttered backgrounds because the
line features were so simple that they produced many false correspondences in
the background.
Voting-based approaches, like template-based approaches, have a long
history, and they have also been applied to detection and pose estimation of
2D/3D objects. Various voting-based approaches have been proposed for 6D pose
estimation, such as voting by dense point pair features [12], random ferns [13],
Hough forests [14], and coordinate regression [15]. Though they scale well with
increasing image resolution and number of object classes, the dimensionality
of the search space is too high to estimate a precise object pose (an
excessively fine quantization of the 3D pose space would be required). They
therefore need post-processing for pose refinement, which takes additional time.
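The growth of the search space with quantization resolution can be made concrete with a small calculation. The sketch below is only an illustration under stated assumptions: the function name, the cubic translation range, the Euler-angle parameterization of rotation, and all numeric values are hypothetical and are not taken from the cited methods.

```python
import math


def count_pose_bins(trans_range_mm, trans_step_mm, rot_step_deg):
    """Count cells in a crude 6-DoF pose accumulator.

    Assumption: translation is quantized uniformly over a cube, and rotation
    is quantized with Euler angles covering 360 x 180 x 360 degrees.
    """
    t_bins = math.ceil(trans_range_mm / trans_step_mm)  # bins per translation axis
    yaw_bins = math.ceil(360 / rot_step_deg)
    pitch_bins = math.ceil(180 / rot_step_deg)
    roll_bins = math.ceil(360 / rot_step_deg)
    return t_bins ** 3 * yaw_bins * pitch_bins * roll_bins


# Hypothetical settings: a 500 mm cube, 5 mm and 5 degree steps.
coarse = count_pose_bins(500, 5, 5)
# Halving every step multiplies the number of cells by 2**6 = 64.
fine = count_pose_bins(500, 2.5, 2.5)
print(coarse, fine, fine // coarse)
```

Under these assumptions, halving the step along all six dimensions multiplies the number of cells by 64, which is consistent with coarse voting being followed by a separate refinement step.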
CNN-based approaches [16–18] have recently shown impressive results on 6D
pose estimation. However, they take a few seconds even when using a GPU, so
they are not suitable for robotic applications where near real-time processing is
required with limited computational resources.
Template-based approaches have been shown to be practical in both accuracy
and speed for 6D pose estimation of texture-less objects [6,7,19]. Hinterstoisser
et al. [8,9] showed that their LINE-2D/LINE-MOD, which is based on quantized
orientations and optimally arranged memory, quickly estimates the 6D pose of
texture-less objects against cluttered backgrounds. LINE-2D/LINE-MOD was
further improved by discriminative training [20] and by hashing [21,22]. However,
the discriminative training required additional negative samples, and the hashing
led to suboptimal estimation accuracy.
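The idea of matching on quantized orientations can be illustrated with a minimal sketch. This is not the LINE-2D/LINE-MOD implementation of [8,9]: the function names, the eight-bin quantization, the magnitude threshold, and the similarity score are assumptions chosen for illustration, and the memory arrangement is not modeled at all.

```python
import numpy as np


def quantize_orientations(image, n_bins=8, mag_threshold=10.0):
    """Encode each pixel's gradient orientation as a one-hot bit flag.

    Pixels whose gradient magnitude is below mag_threshold get code 0.
    n_bins must be at most 8 so that the codes fit in uint8.
    """
    img = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(img)  # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # Taking the angle modulo pi ignores gradient polarity.
    theta = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.floor(theta / np.pi * n_bins).astype(np.int64) % n_bins
    codes = (1 << bins).astype(np.uint8)
    codes[magnitude < mag_threshold] = 0
    return codes


def similarity(template_codes, window_codes):
    """Fraction of template feature pixels whose orientation bit matches."""
    valid = template_codes > 0
    hits = (template_codes & window_codes) > 0
    return hits[valid].sum() / max(int(valid.sum()), 1)


# Toy example: a vertical and a horizontal step edge.
vertical = np.zeros((10, 10))
vertical[:, 5:] = 100.0
horizontal = np.zeros((10, 10))
horizontal[5:, :] = 100.0

v_codes = quantize_orientations(vertical)
h_codes = quantize_orientations(horizontal)
print(similarity(v_codes, v_codes), similarity(v_codes, h_codes))
```

Because each orientation is a single bit, comparing a template pixel with an image pixel reduces to a bitwise AND, which is one reason such representations lend themselves to fast matching.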
Image Features for Handling Texture-less Objects. The image features used
in template matching heavily influence the performance of pose estimation from
a monocular image. Though edge-based template matching has been applied
to detection and pose estimation of texture-less objects, it has often required
an additional algorithm such as segmentation [19] or additional hardware such as
a multi-flash camera [6] to suppress cluttered edges in the background.