What is Semantic Segmentation?
How can an object be
identified and delineated in an image or a video frame? The
first thing that comes to mind is to store a description of the
object in computer memory and teach the computer to match it against
different portions of the image. This approach is feasible when the
object to be found is known beforehand, but becomes unrealistically
laborious as the number of objects increases. The question
therefore arises whether a frame can be split (segmented) into
objects prior to their recognition, much as we make out an unknown,
odd-colored fish against a strange-looking sea floor.
Segmentation of an image into objects based on their generic
properties (features) is called semantic segmentation. One of
the most important properties of objects is that they can occlude
(screen, hide) other objects. Segmentation without recognition is
essential for content-based video encoding promoted by the MPEG-4
standard.
The goal of semantic segmentation is essentially to find the
occluding edges in the image. One
can safely assume that an object is found when a region is located
in the image such that its boundary is an occluding edge, implying
that 3D scene points projected on its opposite sides are at
different depths from the viewer. The occluding edge can, therefore,
be defined as a chain of points in the image corresponding to sudden
changes in the distance to the viewed object surfaces in the scene.
However, it does not suffice to locate an occluding edge. One also
needs to know which side of it belongs to the occluder and which to
the occluded object.
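
As an illustration of this definition, here is a minimal sketch that assumes a per-pixel depth map is already available. The function name `occluding_edges`, the threshold parameter `jump`, and the toy depth values are hypothetical and chosen only for this example. It marks neighbouring pixel pairs whose depths differ sharply and orders each pair so that the nearer pixel (the occluder side) comes first.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]
# Each edge element is ordered as (occluder-side pixel, occluded-side pixel).
EdgeElement = Tuple[Pixel, Pixel]


def occluding_edges(depth: List[List[float]], jump: float) -> List[EdgeElement]:
    """Return neighbouring pixel pairs whose depths differ by more than `jump`.

    Assumption: smaller depth means closer to the viewer, so the pixel with
    the smaller depth is reported as lying on the occluder side.
    """
    rows, cols = len(depth), len(depth[0])
    edges: List[EdgeElement] = []
    for r in range(rows):
        for c in range(cols):
            # Look only at the right and lower neighbours so that every
            # neighbouring pair is examined exactly once.
            for dr, dc in ((0, 1), (1, 0)):
                r2, c2 = r + dr, c + dc
                if r2 >= rows or c2 >= cols:
                    continue
                if abs(depth[r][c] - depth[r2][c2]) > jump:
                    if depth[r][c] < depth[r2][c2]:
                        edges.append(((r, c), (r2, c2)))
                    else:
                        edges.append(((r2, c2), (r, c)))
    return edges


# Toy scene: a near 2x2 object (depth 2) in front of a far background (depth 5).
depth_map = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 2.0, 2.0, 5.0],
    [5.0, 2.0, 2.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
]
edges = occluding_edges(depth_map, jump=1.0)
for near, far in edges:
    print("occluder side:", near, "occluded side:", far)
```

In practice depth is rarely measured this cleanly, which is why the remaining cues discussed in this section matter.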
The approach is to first locate all edges in the image and then try
to identify those that are occluding boundaries. The initial edges
for our analysis are supplied by the color segmentation procedure.
Because they form closed contours, such edges are especially
suitable. The occurrence of
an occluding edge can be inferred from various measurements and
computations:
(a) direct measurement of the distance to
the surface points for each pixel in the image (3D-imaging);
(b) measuring local motions, including
the optic flow on both sides of the boundary and the motion of three
boundaries meeting in a junction (motion analysis);
(c) measuring global motions. If all
regions belonging to one object can be accurately mapped onto the
next frame by a color- (intensity-) preserving multiparametric
transformation (e.g., an affine one), this can be of great help in
locating occluding boundaries (motion analysis);
(d) junction analysis. Edges in the image
can occur as a result of widely different events in the 3D-scene:
occlusions, abrupt variations of surface color or intensity, abrupt
changes in surface orientation (surface edges) and, therefore,
illumination, etc. The points where three or more contours meet are
commonly called junctions. Several types of junctions can be
identified depending on the edge intersection geometry, for example,
T-junctions and Y-junctions. Junction analysis on a single image and
tracing junction types from junction to junction can significantly
reduce the number of possible ways to assign occlusion labels to
region contours (junction analysis).
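
To make item (d) more concrete, the following is a minimal sketch of junction detection. It assumes the color segmentation has already produced a label map in which every pixel carries the integer id of its region. The function name `find_junctions` and the 2x2-window criterion are assumptions made for this illustration, not the method used by the authors. A junction is reported wherever a 2x2 pixel window touches three or more distinct regions, which is where three or more region contours meet.

```python
from typing import List, Tuple


def find_junctions(labels: List[List[int]]) -> List[Tuple[int, int]]:
    """Return the top-left corners of 2x2 windows that touch >= 3 regions.

    Assumption: `labels` is a rectangular grid of region ids produced by
    some earlier segmentation step.
    """
    junctions: List[Tuple[int, int]] = []
    for r in range(len(labels) - 1):
        for c in range(len(labels[0]) - 1):
            # Collect the distinct region ids seen in this 2x2 window.
            window = {
                labels[r][c],
                labels[r][c + 1],
                labels[r + 1][c],
                labels[r + 1][c + 1],
            }
            if len(window) >= 3:
                junctions.append((r, c))
    return junctions


# Toy label map: the vertical boundary between regions 1 and 2 ends on the
# horizontal boundary of region 3, forming a T-shaped junction.
label_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
]
junctions = find_junctions(label_map)
print(junctions)
```

Classifying each junction as a T- or Y-junction would additionally require the angles between the meeting contours, which this sketch does not compute.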
The knowledge that some
contour segment is an occluding boundary may often reduce or even
eliminate the ambiguity in the interpretation of other boundaries in
the same frame or other frames. Suppose,
for example, that a moving object was reliably identified and
followed over a number of frames. Then, by tracking, the same object
can also be regarded as an occluder in those frames where it is no
longer moving. This type of tracking was implemented in Project GM3.
Boundary tracing within one frame was used in Project SI1.
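
The tracking argument above can be sketched as follows. This is only an illustration under stated assumptions and is not the implementation of Project GM3: the track format, the function name `occluders_per_frame`, and the speed threshold `min_speed` are all hypothetical. An object that was seen moving in at least one frame of its track is labeled as an occluder in every frame of that track, including those where it stands still.

```python
from typing import Dict, List, Set, Tuple

# One track: a list of (frame index, measured speed in pixels per frame).
Track = List[Tuple[int, float]]


def occluders_per_frame(
    tracks: Dict[str, Track], min_speed: float
) -> Dict[int, Set[str]]:
    """Map each frame index to the ids of objects labeled as occluders.

    Assumption: each track is reliable, so evidence of motion in any one
    frame is propagated to all frames in which the object was tracked.
    """
    result: Dict[int, Set[str]] = {}
    for obj_id, track in tracks.items():
        # Skip objects that never showed motion above the threshold.
        if not any(speed > min_speed for _, speed in track):
            continue
        for frame, _ in track:
            result.setdefault(frame, set()).add(obj_id)
    return result


# Toy data: the "fish" moves in frames 0 and 1, then rests; the "rock" never moves.
tracks = {
    "fish": [(0, 3.0), (1, 2.5), (2, 0.0), (3, 0.0)],
    "rock": [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0)],
}
occluder_labels = occluders_per_frame(tracks, min_speed=1.0)
print(occluder_labels)
```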