Accommodating Hybrid Retrieval in a Comprehensive Video Database Management System




*Contact information:

Dr. Qing Li

Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong. Email:  Tel: (852) 2788 9695  Fax: (852) 2788 8292

Accommodating Hybrid Retrieval in a Comprehensive Video Database Management System

Shermann S.M. Chan, Qing Li*
Department of Computer Science
City University of Hong Kong, China

Yi Wu, Yueting Zhuang
Department of Computer Science
Zhejiang University, Hangzhou, China

Abstract

A comprehensive video retrieval system should be able to accommodate and utilize various (complementary) description data to facilitate effective retrieval. In this paper, we advocate a hybrid retrieval approach that integrates a query-based (database) mechanism with content-based retrieval (CBR) functions. We describe the VideoMAP+ architecture, discuss the issues involved in developing such a comprehensive video database management system, and present its specific language mechanism (CAROL/ST with CBR), which provides greater expressive power than pure query-based or CBR methods currently offer. We also describe an experimental prototype developed on top of a commercial object-oriented toolkit using VC++ and Java.

1. Introduction

An important current trend in multimedia information management is towards web-based/enabled multimedia search and management systems. Video is a rich and colorful medium widely used in many daily-life applications such as education, entertainment and news dissemination. Digital videos come from diverse sources such as cassette recorders, tape recorders, home video cameras, VCDs and the Internet. The expressiveness of video documents gives them a dominant position in next-generation multimedia information systems. Unlike traditional/static types of data, digital video can provide more effective dissemination of information owing to its rich content. Collectively, a (digital) video can have several information descriptors: (1) metadata - the actual video frame stream, including its encoding scheme and frame rate; (2) media data - the information about the characteristics of the video content, such as visual features, scene structure and spatio-temporal features; (3) semantic data - the text annotations relevant to the content of the video, obtained by manual or automatic understanding.

Video metadata is created independently of how its contents are described and how its database structure is organized later. It is thus natural to define "video" and other meaningful constructs such as "scene" and "frame" as objects corresponding to their respective inherent semantic and visual contents. Meaningful video scenes are identified and associated with their description data incrementally. But the gap between the user's perception and the video content remains a big problem. Depending on the user's viewpoint, the same video/scene may be given different descriptions. It is extremely difficult (if not impossible) to describe the whole content of a video, especially its visual content.

1.1 Background of Research

Over the last couple of years we have been working on a generic video management and application processing (VideoMAP) framework [CL99a, CL99b, LLS00]. A central component of VideoMAP is a query-based video retrieval mechanism called CAROL/ST, which supports spatio-temporal queries [CL99a, CL99b].

While the original CAROL/ST has contributed to working with video semantic data based on an extended object-oriented approach, little support has been provided for video retrieval using visual features. To come up with a more effective video retrieval system, we have been making extensions to the VideoMAP framework, and particularly to the CAROL/ST mechanism, to furnish a hybrid approach [CWLZ01]. In this paper we thus present VideoMAP+, a successor of VideoMAP, which has an extended capability of supporting the hybrid approach to video retrieval by integrating the query-based (database) approach with the CBR paradigm.

1.2 Paper Contribution and Organization

In order to develop an effective video retrieval system, one should go beyond the traditional query-based or purely content-based retrieval (CBR) paradigm. Our standpoint is that videos are multi-faceted data objects, and an effective retrieval system should be able to accommodate all of the complementary information descriptions for retrieving videos. In this paper, we discuss the main issues involved in developing such a comprehensive video database management system supporting hybrid retrieval.

The rest of the paper is organized as follows. In the next section we review some related work on video processing and database management. Section 3 is devoted to the hybrid approach to video retrieval undertaken by VideoMAP+; the CBR and query-based retrieval methods are elaborated and their integration into a single language framework is presented. In Section 4, we describe an experimental prototype system which we have been building, highlighting the main user interface facilities; sample queries are also given to illustrate the expressive power of this mechanism. Finally, we conclude the paper and offer further research directions in Section 5.

2. Related Work

There has been significant interest and a considerable amount of research in developing management systems for video databases in recent years. Here we review some existing work with an attempt to compare and contrast different approaches to modeling and managing video data. While several research projects on video databases have been initiated, the following are what we regard as representative ones which support either content-based search or annotation/query-based retrieval techniques in their models and systems. Besides, some existing work on hybrid retrieval of image and video is also discussed below.

2.1 Content-based retrieval querying

A lot of existing work deals with content-based retrieval; a good survey on this topic is given in [YI99]. Dimitrova and Golshani [DG94] developed Rx to retrieve video data based on the trajectory of objects. They modeled video with a multi-resolution semantic hierarchy, but the object descriptor in this model is placed at a layer lower than image features. Hence it lacks independence between physical video data and conceptual metadata, and can handle only the semantics of moving trajectories. They have also developed a visual language, VEVA, for content-based video retrieval [GD98], which can formulate queries to access semantic information contained in digital video, as well as motion information. At the IBM Almaden research center, a system called QBIC [Fli+95] was developed to support image/video data access using color, shape, texture and various properties of raw data. However, QBIC does not have a special abstraction model and can only classify raw data into physical data or feature data. Smoliar and Zhang [SZ94] proposed a video indexing and retrieval approach focusing on the content analysis process. They emphasized the need for a good model to represent analyzed semantics efficiently, but did not provide an efficient mechanism to manage and query video data. Swets, Pathak and Weng [SPW98] developed an image database system which supports query-by-image-example, in addition to simple alphanumeric queries which involve basic field name/value matching.

2.2 Video modeling and querying

There has also been some work on video modeling and querying. Chua and Ruan [CR95] presented a two-layered (shot layer and scene layer) conceptual model, in which the shot layer contains a collection of video frames and the scene layer is used to model the domain knowledge and contextual information of the videos. This separation, however, is based on the granularity of concept, not on different properties of data. There are also few facilities provided for declarative and associative search. Gibbs et al. [GBT94] proposed a model for the interpretation, derivation, and temporal composition of media objects. A media object is defined as a timed audio/video/audio-video stream. While this model provides good support for the editing and presentation of audio-video content, it lacks an easy retrieval facility for audio-video streams based on inherent items of interest or metadata. Hjelsvold and Midstraum [HM94] presented a video data model and a query language extension. They adopted the stratification approach and proposed two levels of abstraction: the entire video document and the individual frames. They introduced a logical concept named video stream, which allows users to annotate a description on any part of the video data. However, this annotation scheme depends on the physical structure of a video stream, and hence lacks data independence. Oomoto and Tanaka [OT93] proposed an object-oriented video data model, OVID, in which they introduced the notion of video objects to facilitate the identification and the composition of meaningful features. This model is "type-weak" and offers a flexible framework for organizing a lot of video description data, but it provides neither an efficient structure for the description data nor a clear separation of temporal data from descriptional annotations. Schloss and Wynblatt [SW95] introduced a layered multimedia data model and provided repositories of reusable data shared among various applications. They separated the concept of multimedia data into data definition, presentation and temporal structure, but they do not provide a conceptual structure for efficient query processing. Jiang and Elmagarmid [JE98] presented a video database system called WVTDB which supports semantic modeling and content-based search on the World Wide Web (WWW). Lee et al. [Lee+99] developed an icon-based, graphical query language, GVISUAL, which allows a user to specify temporal queries using iconic/graphical representations, but no support facility for content-based retrieval is provided.

2.3 Hybrid image and video retrieval

In addition, some work on hybrid query retrieval of image and video is surveyed and reviewed in [YM98]. Chabot [OS95] is a picture retrieval system whose object identification is based on semi-automatic color analysis. It uses both keywords and image features for retrieval. In [CH92, GWJ91], the authors gave the idea of building a hierarchy of image representations from raw image data to objects and relations at the user semantic level. Chang and Hsu [CH92] analyzed raw image data in terms of their geometric patterns, scenes with semantics, and some meaningful entities. The overall information can be utilized in spatial reasoning and image retrieval. Gupta, Weymouth and Jain [GWJ91] developed the VIMSYS model for querying a pictorial database. SEMCOG [LCH97] is an object-based image retrieval system which integrates semantic and cognition-based information for retrieval. Besides hybrid image retrieval, there have been some hybrid video querying systems [ZLSW95, ZWLS95, CA96]. Zhang et al. [ZLSW95, ZWLS95] parsed and decomposed raw video data into shots and scenes automatically, while JACOB [CA96] used camera operations to split a video into shots. They both use keywords and low-level image features for video retrieval and browsing.

2.4 Relevance to MPEG-7

The MPEG-7 standard, a means of attaching metadata to multimedia content, is often called the "Multimedia Content Description Interface". It aims at providing a rich set of audiovisual description tools to describe multimedia content data, which will support some degree of interpretation of the information's meaning. It is intended to describe audiovisual information regardless of storage, coding, display, transmission, medium, or technology. Audiovisual data content may include still pictures, graphics, 3D models, audio, speech, video, and composition information about how these elements are combined in a multimedia presentation. Special cases of these general data types may include facial expressions and personal characteristics. MPEG-7 work can be separated into three parts: Descriptors, Description Schemes, and a Description Definition Language. Descriptors are the representations of low-level features. Description Schemes are structured combinations of Descriptors, and they can be used to form a richer expression of a higher-level concept. The Description Definition Language is the language that allows the creation of new Description Schemes and Descriptors; it also allows the extension and modification of existing Description Schemes.


There are many MPEG-7-related projects being undertaken within commercial enterprises, particularly broadcasting and digital imaging companies.ψ [HARMONY] is a three-way International Digital Libraries Initiative project between Cornell University, the Distributed Systems Technology Centre, and the University of Bristol's Institute for Learning and Research Technology. Its objective is to develop a framework to deal with the challenge of describing networked collections of highly complex and mixed-media digital objects. The research draws together work on the RDF, XML, Dublin Core, MPEG-7 and INDECS standards, and focuses on the problem of allowing multiple communities of expertise (e.g., library, education, rights management) to define overlapping descriptive vocabularies for annotating multimedia content.

ψ MPEG-7 is intended to support a broad range of applications and to make the web more searchable for multimedia content. It will also allow fast and cost-effective usage of the underlying data, by enabling semi-automatic multimedia presentation and editing.

DICEMAN (Distributed Internet Content Exchange with MPEG-7 and Agent Negotiations) is an EC-funded project between Teltec Ireland DCU, CSELT (Italy), IBM (Germany), INA (France), IST (Portugal), KPN Research (Netherlands), Riverland (Britain) and UPC (Spain) [DICEMAN]. Its objective is to develop an end-to-end chain for indexing, storage, search and trading of digital audiovisual content. The technical aspects of this project are mainly: MPEG-7 indexing through a COntent Provider's Application (COPA); the use of Foundation for Intelligent Physical Agents (FIPA) technology to search and locate the best content; and support for electronic commerce and rights management.

The A4SM project, which is based at GMD's IPSI (Integrated Publication and Information Systems Institute), is researching the application of IT support to all stages of the video production process [IPSI]. The purpose is to seamlessly integrate an IT support framework into the production process, i.e., pre-production (e.g., script development, story boarding), production (e.g., collection of media data by using an MPEG-2/7 camera), and post-production (support of non-linear editing). In collaboration with TV reporters, cameramen and editors, they have designed an MPEG-7 camera in combination with a mobile annotation device for the reporter, and a mobile editing suite suitable for the generation of news clips. Overall, the MPEG-7 standard and its related projects concentrate on content description and metadata attachment to multimedia data. Few facilities and little support have been provided by them in terms of video query formulation, processing, and retrieval, which are exactly the main theme of this paper.

3. Hybrid Approach to Video Retrieval

In a Video Database Management System (VDBMS), there is an important need for efficient retrieval facilities over the voluminous data, and many approaches have been put forward. Content-based retrieval, which uses visual features such as color, texture and shape, provides a direct and intuitive approach for video data. But because visual information can be very complex and computer vision techniques are limited, CBR by itself is inadequate for video management. This is because many (semantic) features of video data cannot be extracted from the video itself automatically; moreover, video objects may share annotations or descriptions. Consequently, it is necessary and cost-effective to complement content-based retrieval with declarative, efficient query-based retrieval in the video database system.

To address these problems, we have been developing an object-oriented VDBMS to complement content-based access with high-level (declarative) query-based retrieval. As video objects are by nature rich in spatial and temporal semantics, a versatile modeling mechanism is devised to support spatio-temporal reasoning. Associated language facilities are also being developed to accommodate a wide range of video queries and annotation activities. In this section, we first review the main facilities of our original VideoMAP - an object-oriented video database system supporting spatio-temporal reasoning, including its language CAROL/ST. Then we describe our approach to integrating the CBR mechanism into it to facilitate a hybrid approach to video retrieval, and detail the extended language, CAROL/ST with CBR.

3.1 VideoMAP: Basic Features and Functionality

Our video DBMS, VideoMAP, has the architecture shown in Figure 3.1. It shows (a) the process of video classification and segmentation, from (a.1) to (a.3); (b) the process of adding features, both general and spatio-temporal, into CCM/ST and building Scene-order/Feature-index trees, from (b.1) to (b.2); (c) the process of adding spatio-temporal features into CCM/ST by an Event/Action model, from (c.1) to (c.2); and (d) the process of querying, from (d.1) to (d.3). If the user makes an Event/Action query, the query process starts from (d.1), goes to-and-from (e), and then proceeds to (d.2) and (d.3).

[Figure 3.1: Architecture of VideoMAP. Labels visible in the figure include (a.2) Segmented video & Domain knowledge, (a.3) Segmented video tree, (c.2) Features, (d.2) Target scene, and (d.3) View of video scenes.]


3.1.1 CCM/ST: the underlying video data model

In the CCM/ST model, a scene S is defined as a triple:

S = <A, M, X>

where A is a set of scene attributes, M is a set of scene methods, and X is a set of Feature-Video associations:

X = {<Fi : Vi : [STi]> | 1 ≤ i ≤ n}

where Fi is a feature, Vi is a set of video objects that carry the feature within the scene, and STi is an ST-Feature which is optionally included in the Feature-Video association. An ST-Feature is used when an image object can be identified from the video objects; the image object is then represented by an ST-Feature containing a set of "position and frame number" pairs, indicating a movement of the object concerned. A feature Fi can be described as follows:

Fi = <A’, M’>

where A' and M' are a set of attributes and methods defined by the feature Fi; these attributes and methods are applicable to the video players of the feature. In our video DBMS, a feature is the same as a semantic descriptor (a glossary term) that is of interest to end-users.

For video applications, an important concept to support is the Table of Contents (ToC), which describes the structural as well as sequencing relationships of video data within a video program. Within a Scene-Feature-Video association, the video objects linked by the feature should form a subset of the constituents included by the associated scene. In order to support ToC in CCM, the notion of scene is extended by defining a temporal scene St as follows:

St = <A, M, X, T>

where A and M denote the attributes and methods defined by the scene St respectively, X is a set of Feature-Video associations, and T is a template showing the temporal ordering of the video objects in the scene St. In CCM/ST, two kinds of time dimension from temporal databases (valid-time and user-defined time) are chosen to be supported, owing to their applicability to video data. Valid-time concerns the time a fact was true in reality; user-defined time (such as birthdate) provides a new data type (e.g., Date-time) to the user [TCG+93]. In addition, in order to provide more flexibility to users, CCM/ST is designed to support valid-time at both the "Object level" and the "Attribute level" [TCG+93], as shown in Table 1. Therefore, CCM/ST can store the past histories and future plans of scenes and features, and thus can provide different versions of a video program.

Time dimension    | Lifespan        | Associated object | Associated object component
Valid-time        | Object level    | Scene / Feature   | Attribute, Method, Feature-Video association, ToC
Valid-time        | Attribute level | Scene             | Attribute, Method
Valid-time        | Attribute level | Feature           | Attribute
User-defined time | Attribute level | Scene / Feature   | Attribute

Table 1. Time dimensions supported by CCM/ST
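
To make the CCM/ST constructs above concrete, the following minimal Java sketch mirrors the scene and feature definitions (S = <A, M, X>, Fi = <A', M'>, St = <A, M, X, T>). It is an illustration only: the class and field names are ours, not those of the actual VideoMAP implementation.

import java.util.*;

// Illustrative sketch of the CCM/ST constructs; names are our own, not VideoMAP's.
class STFeature {                          // <{Position-array, Start-frame, End-frame}>
    int[][] positions;                     // spatial positions of the image object, frame by frame
    int startFrame, endFrame;              // duration of the feature
}

class Feature {                            // Fi = <A', M'>: a semantic descriptor (glossary term)
    Map<String, Object> attributes = new HashMap<>();               // A'
    // the methods M' would simply be Java methods applicable to the feature's video players
}

class FeatureVideoAssociation {            // <Fi : Vi : [STi]>
    Feature feature;                                                 // Fi
    List<String> videoObjectIds = new ArrayList<>();                 // Vi: video objects carrying the feature
    STFeature stFeature;                                             // optional STi (when an image object is identified)
}

class Scene {                              // S = <A, M, X>
    Map<String, Object> attributes = new HashMap<>();                // A
    List<FeatureVideoAssociation> associations = new ArrayList<>();  // X
}

class TemporalScene extends Scene {        // St = <A, M, X, T>
    List<String> toc = new ArrayList<>();  // T: temporal ordering of the video objects (ToC)
    long validFrom, validTo;               // object-level valid-time (cf. Table 1)
}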

There are several types of temporal functions and operators, e.g. interval specification functions, interval ordering functions, interval ordering operators, and interval comparison operators [SHC98, VTS98]. Some temporal functions adopted in CCM/ST are shown in Table 2 and Table 3. Besides, the interval ordering operators are extended from seven to nine operators, as shown in Table 4. The interval comparison operators are used to compare the temporal relationships between two explicit or implicit time/frame intervals, and the result of the comparison is either true or false. These operators are used on ToC and valid-time, in the same way as the interval ordering operators.

Function      | Returns                                    | Used On
F_INTERVAL()  | Frame interval of video objects            | ToC
VT_INTERVAL() | Valid-time interval of a temporal instance | Valid-time
VALID()       | Current valid element                      | Valid-time
DURATION()    | Duration of frame/valid-time interval      | ToC, Valid-time
DATE()        | Date instant object                        | Valid-time
INTERVAL()    | Date interval object                       | Valid-time

Table 2. Interval specification functions

Function | Returns                     | Used On
FIRST()  | First element               | ToC, Valid-time
LAST()   | Last element                | ToC, Valid-time
NTH()    | The Nth position of element | ToC, Valid-time
PREV()   | Previous element            | ToC, Valid-time
NEXT()   | Next element                | ToC, Valid-time

Table 3. Interval ordering functions
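
As a hedged illustration of how the interval specification functions and the interval ordering/comparison operators might be evaluated over frame intervals, consider the following self-contained sketch; it is a simplified model of ours, not the CCM/ST implementation.

// Simplified frame-interval model illustrating DURATION() and BEFORE/EQUAL-style comparisons.
class FrameInterval {
    final int start, end;                        // inclusive frame numbers
    FrameInterval(int start, int end) { this.start = start; this.end = end; }

    int duration() { return end - start + 1; }   // analogous to DURATION() on a frame interval

    boolean before(FrameInterval other) { return this.end < other.start; }    // interval ordering
    boolean equalTo(FrameInterval other) {                                     // interval comparison
        return this.start == other.start && this.end == other.end;
    }
}

public class IntervalDemo {
    public static void main(String[] args) {
        FrameInterval f1 = new FrameInterval(300, 450);
        FrameInterval f2 = new FrameInterval(300, 450);
        // e.g. a predicate of the form F_INTERVAL(f1) EQUAL F_INTERVAL(f2) AND DURATION(F_INTERVAL(f1)) > 120
        System.out.println(f1.equalTo(f2) && f1.duration() > 120);             // true
        System.out.println(new FrameInterval(0, 100).before(f1));              // true
    }
}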

In order to support the spatial semantics of video data, the following spatial functions and operators have been introduced in CCM/ST.

(A) Spatial functions = {TOP(), BOTTOM(), LEFT(), RIGHT(), MIDDLE()}

(B) Spatial operators for horizontal dimension = {LEFT, H_MIDDLE, RIGHT}

Spatial operators for vertical dimension = {TOP, V_MIDDLE, BOTTOM}


(C) As in [NSN95], the 1-D operators can be used in pairs to form 2-D ones for identifying objects in both horizontal and vertical dimensions. Such 2-D operators can be used to specify queries such as "which object is at TOP_LEFT" within a specified Frame contained in a SCENE, etc.
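
The pairing of 1-D spatial operators into 2-D ones can be sketched as follows; the thresholds and the normalized-coordinate encoding are our own assumptions, since the paper does not spell out the implementation.

// Hypothetical encoding of the 1-D spatial operators and their pairwise 2-D combinations.
enum Horizontal { LEFT, H_MIDDLE, RIGHT }
enum Vertical   { TOP, V_MIDDLE, BOTTOM }

public class SpatialDemo {
    // Classify a normalized (x, y) position (0..1) within a frame into a 2-D region such as TOP_LEFT.
    static String region(double x, double y) {
        Horizontal h = x < 1.0 / 3 ? Horizontal.LEFT : (x < 2.0 / 3 ? Horizontal.H_MIDDLE : Horizontal.RIGHT);
        Vertical   v = y < 1.0 / 3 ? Vertical.TOP   : (y < 2.0 / 3 ? Vertical.V_MIDDLE  : Vertical.BOTTOM);
        return v + "_" + h;                       // e.g. "TOP_LEFT", answering "which object is at TOP_LEFT"
    }

    public static void main(String[] args) {
        System.out.println(region(0.1, 0.2));     // TOP_LEFT
        System.out.println(region(0.5, 0.9));     // BOTTOM_H_MIDDLE
    }
}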

3.1.2 CAROL/ST: a query language supporting spatial and temporal primitives

VideoMAP has a query language, CAROL/ST, which is devised based on the CCM/ST features. Five primitive video retrieval functions are supported, namely:

select scenes/features/videos by their attribute restrictions

select video players (video objects) by scene’s restriction

select scenes by their features’ restrictions

select features by their scenes’ restrictions

select scenes by their video players’ restrictions

In addition, CAROL/ST provides a set of expressive spatial and temporal primitives. These temporal and spatial operators are introduced based on the CCM/ST model; the following query examples demonstrate how they work.

(A) Example: Consider that a user wants to retrieve all the video objects about "SportsNews" before a given date (say, today). The query can be specified by the following CAROL/ST statement (x is a set of scene objects)φ:

SELECT x FROM SCENE y

WHERE y HAS FEATURE SportsNews

AND VT_INTERVAL(y) BEFORE CURRENT_DATE;

All video objects are extracted from the result of the above query, x: each retrieved scene contains the feature "SportsNews" and its valid-time is before the current date. Therefore, "histories" of scenes about SportsNews can be retrieved.

(B) Example: To specify a query which retrieves some video clips showing that Clinton is with Lewinsky for no less than 4 seconds, the following CAROL/ST statement can be formulated (x is a set of scene objects):

SELECT x FROM SCENE y

WHERE y HAS FEATURE Clinton AND y HAS FEATURE Lewinsky

AND F_INTERVAL(Clinton) EQUAL F_INTERVAL(Lewinsky)

AND DURATION(F_INTERVAL(Clinton)) >120;

The query above retrieves the scenes which have both features "Clinton" and "Lewinsky", where the two features have the same interval of the same video object, with more than 120 frames (about 4 seconds).

φ x is selected from scene objects; therefore it is a set of scene objects.


3.1.3 Specification Language

As shown in Figure 3.1, VideoMAP has a specification language (or content/semantic description definition language) which can be used by expert users to annotate user-interested Event/Action semantics into CCM/ST. It is also based on spatio-temporal reasoning, which can be embedded in the Feature Index Hierarchy. In VideoMAP, basic indices are first generated by the Video Classification Component (VCC), and are then grouped and linked by CCM/ST. These indices are normally the key frames of the video objects. In order to extract and compare the image features of the video objects spatially, the ST-Feature is introduced into CCM/ST to accommodate possible space-/time-dependent annotations. The structure of the ST-Feature is as follows:

ST-Feature = <{Position-array, Start-frame, End-frame}>

where Position-array is the spatial representation of the image feature and, together with Start-frame and End-frame, stores the duration of the feature denoted by the ST-Feature. Note that an ST-Feature is used when an image object can be identified from the video objects. Besides static objects, an ST-Feature can also represent a moving object. Thus, if there are several features in the same scene, by using the Activity model shown in Figure 3.2 we can decide whether some events and actions happened in the video object.

Figure 3.2 Activity Model and a “Football” example

In Figure 3.2, the left-hand side is the architecture of the Activity Model and the right-hand side is a "Football" example. The model consists of four levels: Activity, Event, Motion, and Object. The top three levels are mainly used by the query language processor to reason about what activity the user wants to retrieve, so that the processor can retrieve the video scenes from the database according to the objects (i.e., features with spatio-temporal semantics) at the bottom level of the model.

A user may input an activity into the Specification Language Processing component (shown in Figure 3.1 (c.1)) by using terms such as "playing sports", "playing football", "kicking football", or "kicking". More specific terms yield more specific retrieval results. In general, the first word is treated as a verb and the word that follows as a noun. The processor analyzes the Motion level first. After some keywords are matched, the processor searches up to the Event level using the second word. For example, if the term is "kicking football", the processor searches for "kicking" at the Motion level, and then uses "football" to search the Event level. If the term is "playing football" and there is no "playing" at the Motion level, the processor tries to reason about the thesaurus of the word and then searches again. However, if there is no matching word in the model, the processor simply skips the "verb" and searches for the "noun" from the Event level up to the Activity level. After the threshold of the search is met, the processor goes down to the corresponding Object level. It then inputs the objects from the Object level into the Feature Index Tree as features and asks the user to input some spatio-temporal semantics (ST-Feature) into the database (shown in Figure 3.1 (c.2)).

At a later time, the user may want to retrieve video data from the database based on some activities. For example, he may input an activity query like "kicking football". The Query Language Processor first gets some collections of objects from the Activity Model (shown in Figure 3.1 (e)) and then retrieves the result through the original query processing (CAROL/ST), by treating the collections of objects as Features and ST-Features. Therefore, the main purpose of the Activity model is to facilitate annotating all common and significant activities.
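
The verb/noun analysis over the Activity Model levels can be illustrated with the following sketch. The level contents and the thesaurus entries are hypothetical placeholders; this is not the actual VideoMAP specification-language processor, and the threshold handling is omitted.

import java.util.*;

// Illustrative verb/noun analysis against the Motion, Event and Activity levels.
public class ActivityTermAnalyzer {
    static Set<String> motionLevel   = Set.of("kicking", "throwing", "running");
    static Set<String> eventLevel    = Set.of("football", "basketball");
    static Set<String> activityLevel = Set.of("sports");
    static Map<String, String> thesaurus = Map.of("playing", "kicking");   // hypothetical synonym table

    static String analyze(String term) {
        String[] words = term.split("\\s+");
        String verb = words[0];
        String noun = words.length > 1 ? words[1] : null;

        if (!motionLevel.contains(verb)) {
            verb = thesaurus.getOrDefault(verb, null);   // try the thesaurus of the verb, as described above
        }
        if (verb != null && motionLevel.contains(verb) && noun != null && eventLevel.contains(noun)) {
            return "match at Motion + Event: " + verb + " / " + noun;
        }
        // No verb match: skip the verb and search the noun from the Event level up to the Activity level.
        if (noun != null && (eventLevel.contains(noun) || activityLevel.contains(noun))) {
            return "match by noun only: " + noun;
        }
        return "no match";
    }

    public static void main(String[] args) {
        System.out.println(analyze("kicking football"));   // direct match at Motion + Event
        System.out.println(analyze("playing football"));   // verb resolved via the thesaurus
        System.out.println(analyze("watching sports"));    // falls back to the noun at the Activity level
    }
}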

3.2 CBR Extension to CAROL/ST

While CAROL/ST can facilitate effective retrieval based on rich semantics, for multimedia data such as video the visual content is also an inseparable (and can be a more significant) part, which is difficult to describe with text. On the other hand, content-based approaches that automatically extract and index visual features have been a main trend in the areas of computer vision and video processing. To employ the best strengths of both areas, an extended version of VideoMAP, which we term VideoMAP+ [CWLZ01], is developed to support hybrid retrieval of videos through both query-based and content-based access. In the current prototype we adopt visual content only.

Figure 3.3 Architecture of VideoMAP+

The architecture of VideoMAP+ is shown in Figure 3.3 (which is a modified version of Figure 3.1). Here, the Feature Extraction Component (FEC) is newly added. During the procedure of video segmentation (by the VCC), the visual feature vectors of the video and of the objects defined in it are extracted, such as color, texture, shape and so on. The Hybrid Query Language Processing module supports three kinds of retrieval: CAROL/ST retrieval, the original retrieval format which mainly uses the semantic annotations and spatio-temporal relations of video; content-based retrieval, the newly added format which mainly uses the visual information inherent in the video content; and their hybrid combination. CBR query functions are incorporated to form a hybrid query language. Hence the indices are now based on more video objects, and the returned results also include more video object types.
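
As an illustration of the kind of low-level visual feature the Feature Extraction Component might compute for a keyframe, the following sketch builds a coarse global color histogram; it is our own simplification, not the FEC's actual algorithm, and the bin layout is an assumption.

import java.awt.image.BufferedImage;

// Simplified global color histogram (4 x 4 x 4 RGB bins) for one keyframe image.
public class ColorHistogram {
    static double[] histogram(BufferedImage img) {
        double[] bins = new double[4 * 4 * 4];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                bins[(r / 64) * 16 + (g / 64) * 4 + (b / 64)]++;   // quantize each channel to 4 levels
            }
        }
        double total = img.getWidth() * (double) img.getHeight();
        for (int i = 0; i < bins.length; i++) bins[i] /= total;    // normalize to a distribution
        return bins;
    }

    public static void main(String[] args) {
        BufferedImage frame = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);  // stand-in keyframe
        System.out.println("bins = " + histogram(frame).length);                      // 64
    }
}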


3.2.1 Foundation Classes

VideoMAP+ extends a conventional OODB to define video objects through a specific hierarchy (video → scene → segment → keyframe). In addition, it includes the concept of CBR to build indexes on the visual features of these objects. Their class attributes, methods and corresponding relations form a complex network (or, a "map", as shown in Figure 3.4). Below we enumerate the foundation classes of the VideoMAP+ objects at various granularities, namely: Keyframe, Segment, Scene, Video and Visual Object (cf. Figure 3.4).

The bridge between the query-based (semantic) objects and the content-based (visual) objects in VideoMAP+ is at the video segment level. This is not the only bridging level possible, as others (such as the keyframe and/or scene levels) are also meaningful for bridging the two. In VideoMAP+, the segment level is chosen as the direct bridge for simplicity and efficiency reasons, because we regard video segments as the basic unit of retrieval.
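
A minimal sketch of the foundation-class network is given below. The class names follow the granularities enumerated above (Keyframe, Segment, Scene, Video, Visual Object, Image-Feature); the fields and types are our own illustration of the relations in Figure 3.4, not the actual class definitions.

import java.util.*;

// Illustrative foundation classes: video -> scene -> segment -> keyframe,
// with segments bridging the semantic side (features) and the visual side (keyframes, visual objects).
class ImageFeature { double[] colorHistogram; double[] texture; double[] shape; }

class VisualObject {                           // a salient object with a spatio-temporal layout
    String name;
    List<ImageFeature> features = new ArrayList<>();
}

class Keyframe {
    int frameNumber;
    ImageFeature feature;                      // low-level visual features of the frame
    List<VisualObject> visualObjects = new ArrayList<>();
}

class Segment {                                // the basic unit of retrieval and the bridging level
    List<Keyframe> keyframes = new ArrayList<>();
    List<String> semanticFeatureNames = new ArrayList<>();   // links to the semantic Feature objects
}

class SceneObject {                            // a scene's ToC can link segments of several videos
    List<Segment> toc = new ArrayList<>();
}

class Video {
    List<Segment> segments = new ArrayList<>();
}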

3.2.2 Search paths with CBR

After integrating CBR with CAROL/ST, three main groups of objects (i.e. Keyframe, Visual-Object, and Image-Feature) are added to the VideoMAP+ system, as shown in the class diagram (Figure 3.4).

Image-Feature: a visual feature extracted from a video object, such as color, texture, or shape.

Keyframe: a fundamental image frame in a video sequence.

Visual Object: all salient objects captured in a video's physical space, represented visually or textually, are instances of a physical object. Furthermore, every such object has a spatio-temporal layout in the image sequence.

Four new entry points to search for semantic-feature and visual-object are:

(a) Visual-Object,

(b) Image-Feature,

(c) Activity Model, and

(d) Object Level of the Activity Model.

The Object Level of the Activity Model [CL99b] contains annotated objects copied from the Activity Model, and it has links which link the semantic feature objects and the visual objects together.

(1) There are two entry points to search for a semantic feature. Entry point one is simply to search from the root of the semantic feature object collection. The other path is to search from the object level of the activity model, and then to retrieve the semantic features.

Entry point 1: Semantic-Feature

Entry point 2: (d) → Semantic-Feature

(2) To search for visual objects, there are also two entry points. Entry point one is simply to search from the root of the visual object collection. The other path is to search from the object level of the activity model, and then to retrieve the visual objects.

Entry point 1: Visual-Object

Entry point 2: (d) → Visual-Object

(3) It is possible for a user to search for activities that occurred in a video program. Entry point one is to search from (c) to get the conceptual objects that occurred in the activity; these conceptual objects are used as the semantic features to find whether there are some video segments linked with the semantic features. Entry point two is to search from (c) to get the conceptual objects; these are used as the visual objects to find whether there are some video segments linked by the visual objects. Entry point three is to search from (c) to get the conceptual objects; these objects are then used to search from (d) to obtain the related semantic features and visual objects. Since there are some direct links from (d) to the video segments, the last step is to search for the video segments linked by them. (A small traversal sketch of entry point two is given after the path listings below.)

Entry point 1: (c) → Semantic-Feature → Semantic-Feature & Segment List → Semantic-Feature / Segment-List → Segment [Constraints: the semantic features are linked in the same video segments, and they all contain spatio-temporal features]

Entry point 2: (c) → Visual-Object → Visual-Object List → Keyframe → Keyframe List → Segment

Entry point 3: (c) → (d) → Segment List → Segment [Constraints: the semantic features and visual objects are linked together in the same video segments, and the semantic features contain spatio-temporal features]
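
As a concrete (and deliberately simplified) illustration, the following sketch traverses entry point two of the activity search above: from the conceptual objects of the Activity Model, through visual objects and keyframes, to the video segments. All data structures here are hypothetical stand-ins for the object links in VideoMAP+.

import java.util.*;

// Hypothetical traversal of activity-search entry point 2:
// (c) Activity Model -> Visual-Object -> Keyframe -> Segment.
public class ActivityPathDemo {
    // conceptual objects returned by the Activity Model for an activity term
    static Map<String, List<String>> activityToObjects =
            Map.of("kicking football", List.of("football", "player"));
    // visual object -> keyframes in which it appears
    static Map<String, List<String>> objectToKeyframes =
            Map.of("football", List.of("kf_12", "kf_40"), "player", List.of("kf_12"));
    // keyframe -> owning video segment
    static Map<String, String> keyframeToSegment =
            Map.of("kf_12", "seg_3", "kf_40", "seg_9");

    static Set<String> segmentsForActivity(String activity) {
        Set<String> segments = new LinkedHashSet<>();
        for (String obj : activityToObjects.getOrDefault(activity, List.of()))
            for (String kf : objectToKeyframes.getOrDefault(obj, List.of()))
                segments.add(keyframeToSegment.get(kf));
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segmentsForActivity("kicking football"));   // [seg_3, seg_9]
    }
}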


There are two paths to search for the low-level image features. Entry point one is simply to search from the image feature object collection. Entry point two is to search from (d) to get the visual objects; then, using the link between the visual object and the image feature object, the low-level image feature can be obtained.

Entry point 1: Image-Feature

Entry point 2: (d) → Visual-Object → Image-Feature

(1) To search for the low-level image feature of a semantic feature object, there are two search paths. Entry point one is to search for the image feature from the collection of image feature objects, then to get the visual object pointed to by the image feature object; using (d), the semantic feature object can be retrieved. Entry point two is similar to entry point one, but it starts from the opposite end of the path.

Entry point 1: Image-Feature → Visual-Object → (d) → Semantic-Feature

Entry point 2: Semantic-Feature → (d) → Visual-Object → Image-Feature

(2) To search for the low-level image feature of a visual object, the search paths are shown as follows.

Entry point 1: Image-Feature → Visual-Object

Entry point 2: Visual-Object → Image-Feature

(3) To search for activities that occurred in a video program, the low-level image features are first used to search for the semantic feature objects and the visual objects (refer to points (1) and (2) of the hybrid search above). Then follow the entry point: Semantic-Feature & Segment List → Semantic-Feature / Segment-List → Segment.

3.2.3 Additional retrieval methods

With respect to CBR, some additional video object comparison operators are defined and introduced into our query language, as summarized below.

(1) Keyframe similar_to Keyframe

By color, shape, texture, or any combination;

By visual objects (i.e. the number of similar visual objects).

(2) Segment similar_to Segment

By the number of Keyframes deemed as similar (e.g. >50%);

By the temporal ordering of Keyframes.

(3) Scene similar_to Scene

By the number of Segments;

By the temporal ordering of Segments;

By the number of Activity/Event/Motion/Object from the Activity Model [CL99b].

(4) Video similar_to Video

By the number of Scenes;

By the temporal ordering of Scenes;


These methods are based on a content-based video similarity model, whose implementation details can be found in [WZ00b]. A simple illustration of rule (2) above is sketched below.
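
The following sketch illustrates rule (2) above (two segments are deemed similar when more than 50% of the keyframes of one have a similar keyframe in the other). The keyframe comparison here uses a plain L1 histogram distance with an arbitrary threshold; it is only a rough stand-in for the similarity model of [WZ00b].

import java.util.List;

// Illustrative 'Segment similar_to Segment' check based on the fraction of matching keyframes.
public class SegmentSimilarity {
    static double l1Distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }

    static boolean keyframesSimilar(double[] a, double[] b, double threshold) {
        return l1Distance(a, b) < threshold;
    }

    static boolean segmentsSimilar(List<double[]> segA, List<double[]> segB, double threshold) {
        int matched = 0;
        for (double[] kfA : segA)
            for (double[] kfB : segB)
                if (keyframesSimilar(kfA, kfB, threshold)) { matched++; break; }
        return matched > segA.size() / 2.0;         // the "more than 50% of keyframes" rule
    }

    public static void main(String[] args) {
        List<double[]> a = List.of(new double[]{1.0, 0.0}, new double[]{0.0, 1.0});   // keyframe histograms of segment A
        List<double[]> b = List.of(new double[]{0.9, 0.1}, new double[]{0.1, 0.9});   // keyframe histograms of segment B
        System.out.println(segmentsSimilar(a, b, 0.3));                               // true: both keyframes of A match
    }
}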

3.3 Language Syntax and Query Refinement

3.3.1 Syntax of CAROL/ST with CBR

The complete syntax of the VideoMAP+ query language (i.e. CAROL/ST with CBR) is summarized below, with the shaded clauses being the ones not yet supported in the current version but planned to be devised:

SELECT <object | attribute of object> [{,<object | attribute of object>}]
FROM <object class> [{,<object class>}]
[ WHERE <search condition> [{AND <search condition>}] ]
[GROUP BY <object | attribute of an object> [HAVING <search condition>]]
[ORDER BY <object | attribute of an object> [ASC | DESC] ] ;

For the WHERE clause, there are four kinds of search conditions:

1. Composite condition:

<object variable 1> HAS <object class> <object variable 2> [BY OID=<object id> | BY ONAME=<object name>]

If object variable 1 is linked with another object (either through inheritance or composite relationships), object variable 1 can be retrieved by the composite condition.

2. Comparative condition:

<attribute of object> <comparison operator> <value>

If an object contains an attribute of a simple data type, the attribute can be compared with a value of the same simple data type.

3. Spatio-Temporal condition:

<object variable> AT <location operator> [FOR <comparison operator> <frame number>]

If an object is annotated with some spatio-temporal features, it can be retrieved by specifying its location and time duration.

4. Similarity condition:

<object variable 1> SIMILAR_TO <object class> <object variable 2> [BY OID=<object id> | BY ONAME=<object name>] [BY [COLOR | TEXTURE]]

If an object is associated with some visual feature objects, the object can be processed with the similarity measurement of images. Therefore, the <object class> above should be restricted to Keyframe, Segment, Scene, and Video objects. A hybrid query combining these kinds of conditions is illustrated below.
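
Putting these conditions together, a hybrid query following the syntax above could, for instance, retrieve the video segments that carry the semantic feature Clinton and are visually similar, by color, to a given example segment. The statement below only illustrates the combined use of a composite condition and a similarity condition; the object identifier 1001 is a hypothetical value.

SELECT x FROM SEGMENT y
WHERE y HAS FEATURE Clinton
AND y SIMILAR_TO SEGMENT z BY OID=1001 BY COLOR;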


3.3.2 Query refinement and feedback

Supported by CAROL/ST with CBR, queries in VideoMAP+ are flexible and able to accommodate various requirements. In particular,

a single feature query is simple to handle, since it uses a separate search scheme: text-based retrieval or content-based retrieval. Text-based retrieval requires an exact match, whereas content-based retrieval adopts inexact ("fuzzy") matching.

a range query is a query that explicitly specifies a range of values for the feature weights, for example: 50% color, 30% texture, 20% shape. The weight for a feature vector can be specified too.

a heterogeneous feature query such as "find video segments similar to the sample segment1, and in addition satisfying the annotation restrictions" is more involved. Since text-based and content-based retrieval are both considered, different search paths and match methods should be incorporated. This problem can be tackled as follows:

1) Intersection operation: the query can be any video object, be it a Keyframe, Segment, Scene, Video or Feature. The text-based semantic description can be used to narrow down the scope of the search: only video objects containing the specified semantic or spatio-temporal feature will be considered in the retrieval process.

2) Join operation: given a hybrid query, different sets of video objects can be selected based on the individual features. The final ranked set of video objects similar to the query is derived by a join operation.

Since content-based retrieval is essentially fuzzy retrieval, it involves the problem of setting a similarity threshold; a k-means classification algorithm is used to dynamically find similar results, so in most situations we get a set of similar video objects, not just a single one. Since not all the video objects in the result set may fit the user's requirement, and new query ideas from the user can emerge at any time, further adjustment such as query refinement and/or feedback is needed in VideoMAP+. Several ways are thus provided to support such query refinement.

(1) Users can assign an arbitrary video object from the result set as the starting point for a new round of query iteration.

Under most situations, the query example used in the first round of iteration is not the best model of what the user actually wants to find; it may just come from a coarse concept (a vague idea) that the user originally had. But after the result set is returned, the user will have a clearer idea of what s/he wants and be able to pick a closer query example as the starting point for the new query iteration.

(2) Users can give feedback advice such as "Relevant", "No opinion", or "Not relevant" for the video objects in the result set.

Through the feedback advice, the user's preference and intention regarding the different features of the query can be analyzed. For example, if s/he prefers the red color more, then more weight can be given to red color so as to find more red objects. Feature weights can be adjusted in this way, and a new query starts again. The feature weight adjustment algorithm we have been using is described in [WZ00b].
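
To illustrate the general idea of feedback-driven weight adjustment, the sketch below raises the weight of features on which the results marked "Relevant" agree with the query, and renormalizes the weights. This is a generic variant for illustration only, not the algorithm of [WZ00b]; the learning rate and the assumption that feature values lie in [0, 1] are ours.

// Generic relevance-feedback weight update (illustrative only).
public class FeedbackWeights {
    static double[] updateWeights(double[] weights, double[] query, double[][] relevant, double rate) {
        double[] updated = weights.clone();
        for (int f = 0; f < weights.length; f++) {
            double avgDiff = 0;
            for (double[] r : relevant) avgDiff += Math.abs(r[f] - query[f]);
            avgDiff /= relevant.length;
            updated[f] = weights[f] + rate * (1.0 - avgDiff);    // small difference -> weight goes up
        }
        double sum = 0;                                           // renormalize so the weights sum to 1
        for (double w : updated) sum += w;
        for (int f = 0; f < updated.length; f++) updated[f] /= sum;
        return updated;
    }

    public static void main(String[] args) {
        double[] weights  = {0.5, 0.3, 0.2};                      // e.g. color, texture, shape
        double[] query    = {0.8, 0.2, 0.5};
        double[][] relevant = {{0.82, 0.60, 0.10}, {0.79, 0.70, 0.90}};
        double[] w = updateWeights(weights, query, relevant, 0.2);
        System.out.printf("color=%.2f texture=%.2f shape=%.2f%n", w[0], w[1], w[2]);
    }
}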


3.4 Summary

So far, we have described a hybrid approach to video retrieval in a comprehensive object-oriented video database system, based on the spatio-temporal semantics and visual features of video data. In addition, a query language supporting heterogeneous queries (spatio-temporal reasoning queries, semantic queries and content-based queries), namely CAROL/ST with CBR, has also been introduced. It not only overcomes the drawback of CBR caused by the limitation of current computer vision and image processing techniques in extracting high-level (such as motion) semantics, but also bypasses the difficulty of describing visual content purely by text.

4. An Experimental Prototype

As part of our research, we have been building an experimental prototype system on the PC platform. In this section, we briefly describe our prototype work in terms of the implementation environment, the actual user interface and language facilities, and sample results.

4.1 Basic Implementation Environment & Functionalities

Starting from the first prototype of VideoMAP, our experimental prototyping has been conducted based on Microsoft Visual C++ (as the host programming language) and the NeoAccess OODB toolkit (as the underlying database engine). Here we briefly introduce the main user facilities supported by VideoMAP+, the kernel system on which our subsequent (web-based) prototype [LCWZ01] is being developed (using IONA's developer-focused CORBA product, ORBacus, with necessary extensions). VideoMAP+ currently runs on Windows and offers a user-friendly graphical user interface supporting two main kinds of activities: video editing and video retrieval.

4.1.1 Video Editing

When a user invokes the video editing function, a screen comes up for uploading a new video into the system and annotating its semantics (cf. Figure 4.1). For example, s/he can name the video object and assign some basic descriptions to get started. A video segmentation sub-module is devised to help decompose the whole video stream into segments and to identify keyframes. Further, the Feature Extraction module calculates the visual features of the media object. By reviewing the video abstraction structure composed of the segments and keyframes, the user can annotate the semantics according to his/her understanding and preference (see Figure 4.1). As a result, a video object is created and linked with a number of video segment objects. Each video segment object contains keyframe objects and the low-level image features.

After creating and importing several video objects into the database, the user can create scene objects in a tree-like structure, and semantic feature objects in another tree. The user can also create attribute and method objects associated with the scenes and semantic features. A scene object (e.g. White_House_News) may contain a ToC object that can dynamically link up some video segment objects defined in several video objects. Semantic feature objects (e.g. Clinton, Lewinsky) can act as indexes which are dynamically attached to the video segment objects in the ToC of a scene object. (For instance, Clinton may be attached to several different video segment objects in the ToCs of different scene objects, whereas Lewinsky may be attached to others.) Users can specify a query to retrieve all scene objects from different video sources by imposing the condition that the scene objects should contain both semantic feature objects.

An activity hierarchy is created in the Specification Language component (see Figure 4.2) in order to facilitate the creation of visual objects and query retrieval. The Specification Language model can be used to annotate the spatio-temporal information of the semantic feature objects (cf. Figure 4.3), with a visual object being created and attached to the activity hierarchy meanwhile.

Figure 4.1 Annotating the segments after video segmentation

Figure 4.2 Creating an Activity Hierarchy in the Specification Language dialog


Figure 4.3 Annotating spatio-temporal information and creating a visual object by using the Activity Hierarchy

4.1.2 Video Retrieval

VideoMAP+ also provides an interface for the user to issue queries using its query language (i.e. CAROL/ST with CBR). All kinds of video objects, such as "Scene", "Segment" and "Keyframe", can be retrieved by specifying their semantics or visual information. Figure 4.4 shows a sample query issued by the user, which is validated by a compiler sub-module before execution. The retrieved video objects are then returned to the user in the form of a tree (cf. Figure 4.4), whose nodes not only can be played out, but can also be used subsequently for formulating new queries in an iterative manner.

Figure 4.4 Query facilities

