壓縮包內(nèi)含有CAD圖紙和說(shuō)明書(shū),均可直接下載獲得文件,所見(jiàn)所得,電腦查看更方便。Q 197216396 或 11970985
A Visual-Sensor Model for Mobile Robot Localisation
Matthias Fichtner Axel Gro_mann
Arti_cial Intelligence Institute
Department of Computer Science
Technische Universitat Dresden
Technical Report WV-03-03/CL-2003-02
Abstract
We present a probabilistic sensor model for camera-pose estimation in hallways and cluttered o_ce environments. The model is based on the comparison of features obtained from a given 3D geometrical model of the environment with features present
in the camera image. The techniques involved are simpler than state-of-the-art photogrammetric approaches. This allows the model to be used in probabilistic robot localisation methods. Moreover, it is very well suited for sensor fusion. The sensor model has been used with Monte-Carlo localisation to track the position of a mobile robot in a hallway navigation task. Empirical results are presented for this application.
1 Introduction
The problem of accurate localisation is fundamental to mobile robotics. To solve complex tasks successfully, an autonomous mobile robot has to estimate its current pose correctly and reliably. The choice of the localization method generally depends on the kind and number of sensors, the prior knowledge about the operating environment, and the computing resources available. Recently, vision-based navigation techniques have become increasingly popular [3]. Among the techniques for indoor robots, we can distinguish methods that were developed in the _eld of photogrammetry and computer vision, and methods that have their origin in AI robotics.
An important technical contribution to the development of vision-based navigation
techniques was the work by [10] on the recognition of 3D-objects from unknown viewpoints in single images using scale-invariant features. Later, this technique was extended to global localisation and simultaneous map building [11].
The FINALE system [8] performed position tracking by using a geometrical model of the environment and a statistical model of uncertainty in the robot's pose given the commanded motion. The robot's position is represented by a Gaussian distribution and updated by Kalman _ltering. The search for corresponding features in camera image and world model is optimized by projecting the pose uncertainty into the camera image.
Monte Carlo localisation (MCL) based on the condensation algorithm has been applied successfully to tour-guide robots [1]. This vision-based Bayesian _ltering technique uses a sampling-based density representation. In contrast to FINALE, it can represent multi-modal probability distributions. Given a visual map of the ceiling, it localises the robot globally using a scalar brightness measure. [4] presented a vision-based MCL approach that combines visual distance features and visual landmarks in a RoboCup application. As their approach depends on arti_cial landmarks, it is not applicable in o_ce environments.
The aim of our work is to develop a probabilistic sensor model for camerapose estimation. Given a 3D geometrical map of the environment, we want to find an approximate measure of the probability that the current camera image has been obtained at a certain place in the robot's operating environment. We use this sensor model with MCL to track the position of a mobile robot navigating in a hallway. Possibly, it can be used also for localization in cluttered o_ce environments and for shape-based object detection.
On the one hand, we combine photogrammetric techniques for map-based feature projection with the exibility and robustness of MCL, such as the capability to deal with localisation ambiguities. On the other hand, the feature matching operation should be su_ciently fast to allow sensor fusion. In addition to the visual input, we want to use the distance readings obtained from sonars and laser to improve localisation accuracy.
The paper is organised as follows. In Section 2, we discuss previous work. In Section 3, we describe the components of the visual sensor model. In Section 4, we present experimental results for position tracking using MCL. We conclude in Section 5.
2 Related Work
In classical approaches to model-based pose determination, we can distinguish two interrelated problems. The correspondence problem is concerned with _nding pairs of corresponding model and image features. Before this mapping takes place, the model features are generated from the world model using a given camera pose. Features are said to match if they are located close to each other. Whereas the pose problem consists of _nding the 3D camera coordinates with respect to the origin of the world model given the pairs of corresponding features [2]. Apparently, the one problem requires the other to be solved beforehand, which renders any solution to the coupled
problem very di_cult [6].
The classical solution to the problem above follows a hypothesise-and-test approach:
(1) Given a camera pose estimate, groups of best matching feature pairs provide initial guesses (hypotheses).
(2) For each hypothesis, an estimate of the relative camera pose is computed by minimising a given error function de_ned over the associated feature pairs.
(3) Now as there is a more accurate pose estimate available for each hypothesis, the remaining model features are projected onto the image using the associated camera pose. The quality of the match is evaluated using a suitable error function, yielding a ranking among all hypotheses.
(4) The highest-ranking hypothesis is selected.
Note that the correspondence problem is addressed by steps (1) and (3), and the pose problem by (2) and (4).
The performance of the algorithm will depend on the type of features used, e.g., edges, line segments, or colour, and the choice of the similarity measure between image and model, here referred to as error function. Line segments is the feature type of our choice as they can be detected comparatively reliably under changing illumination conditions. As world model, we use a wire-frame model of the operating environment, represented in VRML. The design of a suitable similarity measure is far more difficult.
In principle, the error function is based on the di_erences in orientation between corresponding line segments in image and model, their distance and difference in length, in order of decreasing importance, in consideration of all feature pairs present. This has been established in the following three common measures [10]. e3D is defined by the sum of distances between model line endpoints and the corresponding plane given by camera origin and image line. This measure strongly depends on the distance to the camera due to back-projection. e2D;1, referred to as in_nite image lines, is the sum over the perpendicular distances of projected model line endpoints to
corresponding, in_nitely extended lines in the image plane. The dual measure, e2D;2, referred to as in_nite model lines, is the sum over all distances of image line endpoints to corresponding, in_nitely extended model lines in the image plane.
To restrict the search space in the matching step, [10] proposed to constrain the number of possible correspondences for a given pose estimate by combining line features into perceptual, quasi-invariant structures beforehand. Since these initial correspondences are evaluated by e2D;1 and e2D;2, high demands are imposed on the accuracy of the initial pose estimate and the image processing operations, including the removal of distortions and noise and the feature extraction. It is assumed to obtain all visible model lines at full length. [12, 9] demonstrated that a few outliers already can severely affect the initial correspondences in Lowe's original approach due to frequent truncation of lines caused by bad contrast, occlusion, or clutter.
3 Sensor Model
Our approach was motivated by the question whether a solution to the correspondence problem can be avoided in the estimation of the camera pose. Instead, we propose to perform a relatively simple, direct matching of image and model features. We want to investigate the level of accuracy and robustness one can achieve this way.
The processing steps involved in our approach are depicted in Figure 1. After removing the distortion from the camera image, we use the Canny operator to extract edges. This operator is relatively tolerant to changing illumination conditions. From the edges, line segments are identi_ed. Each line is represented as a single point (_; _) in the 2D Hough space given by _ = x cos _ + y sin _. The coordinates of the end points are neglected. In this representation, truncated or split lines will have similar coordinates in the Hough space. Likewise, the lines in the 3D map are projected onto the image plane using an estimate of the camera pose and taking into account the
visibility constraints, and are represented as coordinates in the Hough space as well. We have designed several error functions to be used as similarity measure in the matching step. They are described in the following.
Centred match count (CMC)
The first similarity measure is based on the distance of line segments in the Hough space. We consider only those image features as possible matches that lie within a rectangular cell in the Hough space centred around the model feature. The matches are counted and the resulting sum is normalised. The mapping from the expectation (model features) to the measurement (image features) accounts for the fact that the measure should be invariant with respect to objects not modelled in the 3D map or unexpected changes in the operating environment. Invariance of the number of visible features is obtained by normalisation. Speci_cally, the centred match count measure
sCMC is defined by:
where the predicate p de_nes a valid match using the distance parameters (t_; t_) and the operator # counts the number of matches. Generally speaking, this similarity measure computes the proportion of expected model Hough points hei 2 He that are con_rmed by at least one measured image Hough point hmj 2 Hm falling within tolerance (t_; t_). Note that neither endpoint coordinates nor lengths are considered here.
Grid length match (GLM)
The second similarity measure is based on a comparison of the total length values of groupes of lines. Split lines in the image are grouped together using a uniform discretisation of the Hough space. This method is similar to the Hough transform for straight lines. The same is performed for line segments obtained from the 3D model. Let lmi;j be the sum of lengths of measured lines in the image falling into grid cell (i; j), likewise lei;j for expected lines according to the model, then the grid length match measure sGLM is de_ned as:
For all grid cells containing model features, this measure computes the ratio of the total line length of measured and expected lines. Again, the mapping is directional, i.e., the model is used as reference, to obtain invariance of noise, clutter, and dynamic objects.
Nearest neighbour and Hausdorf distance
In addition, we experimented with two generic methods for the comparison of two sets of geometric entities: the nearest neighbour and the Hausdor_ distance. For details see [7]. Both rely on the de_nition of a distance function, which we based on the coordinates in the Hough space, i.e., the line parameter _ and _, and optionally the length, in a linear and exponential manner. See [5] for a complete description.
Common error functions
For comparisons, we also implemented the commonly used error functions e3D, e2D;1, and e2D;2. As they are de_ned in the Cartesian space, we represent lines in the Hessian notation, x sin _ ?? y cos _ = d. Using the generic error function f, we de_ned the similarity measure as:
where M is the set of measured lines and E is the set of expected lines. In case of e2D;1, f is de_ned by the perpendicular distances between both model line endpoints, e1, e2, and the in_nitely extended image line m:
Likewise, the dual similarity measure, using e2D;2, is based on the perpendicular
distances between the image line endpoints and the in_nitely extended model line.
Recalling that the error function e3D is proportional to the distances of model line endpoints to the view plane through an image line and the camera origin, we can instantiate Equation 1 using f3D(m; e) de_ned as:
where ~nm denotes the normal vector of the view plane given by the image endpoints ~mi = [mx;my;w]T in camera coordinates.
Obtaining probabilities
Ideally, we want the similarity measure to return monotonically decreasing values as the pose estimate used for projecting the model features departs from the actual camera pose. As we aim at a generally valid yet simple visual-sensor model, the idea is to abstract from speci_c poses and environmental conditions by averaging over a large number of di_erent, independent situations. For commensurability, we want to express the model in terms of relative robot coordinates instead of absolute world coordinates. In other words, we assume
to hold, i.e., the probability for the measurement m, given the pose lm this image has been taken at, the pose estimate le, and the world model w, is equal to the probability of this measurement given a three-dimensional pose deviation 4l and the world model w.
The probability returned by the visual-sensor model is obtained by simple scaling:
4 Experimental Results
We have evaluated the proposed sensor model and similarity measures in a series of experiments. Starting with arti_cially created images using idealized conditions, we have then added distortions and noise to the tested images. Subsequently, we have used real images from the robot's camera obtained in a hallway. Finally, we have used the sensor model to track the position of the robot while it was travelling through the hallway. In all these cases, a three-dimensional visualisation of the model was obtained, which was then used to assess the solutions.
Simulations using arti_cially created images
As a first kind of evaluation, we generated synthetic image features by generating a view at the model from a certain camera pose. Generally speaking, we duplicated the right-hand branch of Figure 1 onto the left-hand side. By introducing a pose deviation 4l, we can directly demonstrate its inuence on the similarity values. For visualisation purposes, the translational deviations 4x and 4y are combined into a single spatial deviation 4t. Initial experiments have shown only insigni_cant di_erences when they were considered independently.
Fig. 2: Performance of CMC on arti_cially created images.
For each similarity measure given above, at least 15 million random camera poses were coupled with a random pose deviation within the range of 4t < 440cm and 4_ < 90_ yielding a model pose.
The results obtained for the CMC measure are depicted in Figure 2. The surface of the 3D plot was obtained using GNUPLOT's smoothing operator dgrid3d. We notice a unique, distinctive peak at zero deviation with monotonically decreasing similarity values as the error increases. Please note that this simple measure considers neither endpoint coordinates nor lengths of lines. Nevertheless, we obtain already a decent result.
While the resulting curve for the GLM measure resembles that of CMC, the peak is considerably more distinctive. This conforms to our anticipation since taking the length of image and model lines into account is very signi_cant here. In contrast to the CMC measure, incidental false matches are penalised in this method, due to the differing lengths.
The nearest neighbour measure turned out to be not of use. Although linear and exponential weighting schemes were tried, even taking the length of line segments into account, no distinctive peak was obtained, which caused its exclusion from further considerations.
The measure based on the Hausdor_ distance performed not as good as the first two, CMC and GLM, though it behaved in the desired manner. But its moderate performance does not pay off the longest computation time consumed among all presented measures and is subsequently disregarded.
So far, we have shown how our own similarity measures perform. Next, we demonstrate how the commonly used error functions behave in this framework.
The function e2D;1 performed very well in our setting. The resulting curve closely resembles that of the GLM measure. Both methods exhibit a unique, distinctive peak at the correct location of zero pose deviation. Note that the length of line segments has a direct e_ect on the similarity value returned by measure GLM, while this attribute implicitly contributes to the measure e2D;1, though both linearly. Surprisingly, the other two error functions e2D;2 and e3D performed poorly.
Toward more realistic conditions
In order to learn the e_ect of distorted and noisy image data on our sensor model, we conducted another set of experiments described here. To this end, we applied the following error model to all synthetically generated image features before they are matched against model features. Each original line is duplicated with a small probability (p = 0:2) and shifted in space. Any line longer than 30 pixel is split with probability p=0:3. A small distortion is applied to the parameters (_; _; l) of each line according to a random, zeromean Gaussian. Furthermore, features not present in the model and noise are simulated by adding random lines uniformly distributed in the image. Hereof, the orientation is drawn according to the current distribution of angles to yield fairly `typical' features.
The results obtained in these simulations do not di_er significantly from the first set of experiments. While the maximum similarity value at zero deviation decreased, the shape and characteristics of all similarity measures still under consideration remained the same.
Using real images from the hallway
Since the results obtained in the simulations above might be questionable with respect to real-world conditions, we conducted another set of experiments replacing the synthetic feature measurements by real camera images.
To compare the results for various parameter settings, we gathered images with a Pioneer 2 robot in the hallway o_-line and recorded the line features. For two di_erent locations in the hallway exemplifying typical views, the three-dimensional space of the robot poses (x; y; _) was virtually discretized. After placing the robot manually at each vertex (x; y; 0), it performed a full turn on the spot stepwise recording images. This ensures a maximum accuracy of pose coordinates associated with each image. That way, more than 3200 images have been collected from 64 di_erent (x; y) locations. Similarly to the simulations above, pairs of poses (le; lm) were systematically chosen
Fig. 3: Performance of GLM on real images from the hallway.
from with the range covered by the measurements. The values computed by the sensor model referring to the same discretized value of pose deviation 4l were averaged according to the assumption in Equation 2.
The resulting visualisation of the similarity measure over spatial (x and y combined) and rotational deviations from the correct camera pose for the CMC measure exhibits a unique peak at approximately zero deviation. Of course, due to a much smaller number of data samples compared to the simulations using synthetic data, the shape of the curve is much more bumpy, but this is in accordance with our expectation.
The result of employing the GLM measure in this setting is shown in Figure 3. As it reveals a more distinctive peak compared to the curve for the CMC measure, it demonstrates the increased discrimination between more and less similar feature maps when taking the lengths of lines into account.
Monte Carlo Localisation using the visual-sensor model
Recalling that our aim is to devise a probabilistic sensor model