Blog for work on my master's thesis - a survey of methods for evaluating media understanding, object detection, and pattern matching algorithms. Mostly, it is related to ViPER, the Video Performance Evaluation Resource. If you find a good reference, or would like to comment, e-mail viper at cfar.umd.edu.
Media Processing Evaluation Weblog
Tuesday, March 15, 2005
Combining Text Lines
AI and some others have been marking up data both at the word level and at the line level for some time. For the next phase of the VACE text evaluation, we have developed a custom data type - the textline - for use instead. This should cut down on the duplication of work. Hopefully, it will be quicker to annotate video with the textline than it would be to annotate at the word level, and the quality of the data will be about the same.
The textline shape is essentially a single box for a unit of text, with markers indicating word breaks and occlusions. It also links to another attribute that holds the text content. This should allow edit distance computations as well as region-based ones.
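To make the shape concrete, here is a rough, hypothetical snippet showing how a textline value might be built up by hand; the constructor arguments and the setText/addWordOffset calls are the same ones the conversion script below uses, but the numbers are made up.

import edu.umd.cfar.lamp.viper.examples.textline.TextlineModel

// One oriented box for the whole line: x, y, width, height, rotation.
def line = new TextlineModel(120, 80, 200, 24, 0)
line.setText("the cat sat")
// Word-break markers, as offsets along the line (illustrative values).
line.addWordOffset(60)
line.addWordOffset(130)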
However, one step is to convert data between the old word+line markup and the new textline markup. There will be some data loss going from the old format to the new - words in a line will have to be aligned, adjacent, and all the same height - but hopefully not going back. I'm not going to worry about the case of inter-line occlusion. If it becomes an issue, we may have to address it.
The script assumes the existence of two descriptors in the current file: 'Word' and 'Line'. Both need 'Location' and 'Content' attributes. To get the data into this format, first open and modify the Word.xgtf, then import the Line.xgtf and modify its data accordingly. When you have finished formatting the data using the 'File > Import' and 'Window > Schema Editor' tools, you can run the script.
The script combines the two descriptors into one descriptor called 'Text Line'. It doesn't delete the old values. To do that, you can just delete the descriptor types with the schema editor. First, you should check over the data for errors. I recommend locking the old data types by clicking on their tabs, then playing the video back.
import edu.umd.cfar.lamp.apploader.AppLoaderScript
import edu.umd.cfar.lamp.apploader.AppLoader

import java.util.Iterator

import viper.api.Config
import viper.api.Descriptor
import viper.api.time.Span

import edu.umd.cfar.lamp.viper.examples.textline.TextlineModel

class CombineLinesAndWords implements AppLoaderScript {
    static final String myIdentifier = "http://viper-toolkit.sf.net/samples#combineWordsIntoLines"

    static final String TEXTLINE_TYPE = "http://lamp.cfar.umd.edu/viperdata#textline"

    static final String WORD_DESCRIPTOR_NAME = "Word";
    static final String WORD_LOCATION_ATTRIBUTE_NAME = "location";
    static final String WORD_CONTENT_ATTRIBUTE_NAME = "Content";

    static final String LINE_DESCRIPTOR_NAME = "Line";
    static final String LINE_LOCATION_ATTRIBUTE_NAME = "location";
    static final String LINE_CONTENT_ATTRIBUTE_NAME = "Content";

    static final String COMBINED_DESCRIPTOR_NAME = "Text Line";
    static final String COMBINED_ATTRIBUTE_NAME = "Value";

    void run(AppLoader application) {
        // For each descriptor, clean each dynamic attribute
        mediator = application.getBean("#mediator")
        V = mediator.viperData
        trans = V.begin(myIdentifier);
        success = false;
        try {
            wordConfig = V.getConfig(Config.OBJECT, WORD_DESCRIPTOR_NAME)
            if (wordConfig == null) {
                throw new RuntimeException("Cannot find descriptor named ${WORD_DESCRIPTOR_NAME}")
            } else if (!wordConfig.hasAttrConfig(WORD_LOCATION_ATTRIBUTE_NAME)) {
                throw new RuntimeException("Cannot find attribute named ${WORD_LOCATION_ATTRIBUTE_NAME}")
            } else if (!wordConfig.hasAttrConfig(WORD_CONTENT_ATTRIBUTE_NAME)) {
                throw new RuntimeException("Cannot find attribute named ${WORD_CONTENT_ATTRIBUTE_NAME}")
            }
            lineConfig = V.getConfig(Config.OBJECT, LINE_DESCRIPTOR_NAME)
            if (lineConfig == null) {
                throw new RuntimeException("Cannot find descriptor named ${LINE_DESCRIPTOR_NAME}")
            } else if (!lineConfig.hasAttrConfig(LINE_LOCATION_ATTRIBUTE_NAME)) {
                throw new RuntimeException("Cannot find attribute named ${LINE_LOCATION_ATTRIBUTE_NAME}")
            } else if (!lineConfig.hasAttrConfig(LINE_CONTENT_ATTRIBUTE_NAME)) {
                throw new RuntimeException("Cannot find attribute named ${LINE_CONTENT_ATTRIBUTE_NAME}")
            }
            combinedConfig = insertCombinedDescriptorConfig(V)
            sf = mediator.currFile
            if (sf != null) {
                oldLines = new java.util.ArrayList()
                for( d in sf.getDescriptorsBy(lineConfig) ) {
                    oldLines.add(d)
                }
                for( d in oldLines ) {
                    combined = sf.createDescriptor(combinedConfig)
                    combined.setValidRange(d.validRange.clone())
                    for (a in d.children) {
                        if (a.attrName == LINE_LOCATION_ATTRIBUTE_NAME) {
                            newAttr = combined.getAttribute(COMBINED_ATTRIBUTE_NAME)
                            assert newAttr != null : "Cannot find attribute named ${COMBINED_ATTRIBUTE_NAME}"
                            copyInto = {box, line | line.set(box.x, box.y, box.width, box.height, box.rotation)}
                            create = {box | return new TextlineModel(box.x, box.y, box.width, box.height, box.rotation)}
                            transformInto(newAttr, a, copyInto, create)
                        } else if (a.attrName == LINE_CONTENT_ATTRIBUTE_NAME) {
                            newAttr = combined.getAttribute(COMBINED_ATTRIBUTE_NAME)
                            assert newAttr != null : "Cannot find attribute named ${COMBINED_ATTRIBUTE_NAME}"
                            copyInto = {content, line | line.setText(content)}
                            create = {content | return new TextlineModel(0,0,0,0,0, content)}
                            transformInto(newAttr, a, copyInto, create)
                        } else {
                            newAttr = combined.getAttribute(a.attrConfig.attrName)
                            assert newAttr != null : "Cannot find attribute named ${a.attrConfig.attrName}"
                            if (a.attrConfig.dynamic) {
                                for(oldVal in a.attrValuesOverWholeRange) {
                                    assert oldVal != null
                                    newAttr.setAttrValueAtSpan(oldVal.value, oldVal)
                                }
                            } else {
                                newAttr.attrValue = a.attrValue
                            }
                        }
                    }
                    // now we have a text line that is unbroken.
                    // we can use the 'word' data to split the
                    // lines into word segments. Yeah, I know - wouldn't
                    // it be nice if viper supported relations? And maybe
                    // queries?
                }
                for (w in sf.getDescriptorsBy(wordConfig)) {
                    // the basic idea is to put lines at the ends of each word
                    // then go through and remove the first and last line,
                    // and average the lines inside
                    wordAttr = w.getAttribute(WORD_LOCATION_ATTRIBUTE_NAME)
                    for (l in sf.getDescriptorsBy(combinedConfig, w.validRange.extrema)) {
                        lineAttr = l.getAttribute(COMBINED_ATTRIBUTE_NAME)
                        create = {word | return null}
                        transformInto(lineAttr, wordAttr, projectBoxIntoAnother, create)
                    }
                }
                for (tl in sf.getDescriptorsBy(combinedConfig)) {
                    lineAttr = tl.getAttribute(COMBINED_ATTRIBUTE_NAME)
                    transformDynamicAttr(lineAttr, cleanLine)
                }
            }
            success = true;
        } finally {
            if (trans != null) {
                if (success) {
                    trans.commit();
                } else {
                    trans.rollback();
                }
            }
        }
    }

    cleanLine = {textline |
        if (textline.obox.area().doubleValue() <= 0) {
            return textline
        }
        textline = textline.clone()
        oo = textline.wordOffsets
        sz = oo.size()
        oo = oo.clone()
        if (sz > 2) {
            textline.wordOffsets.clear()
            java.util.Collections.sort(oo)
            i = 1
            while( i < sz-1 ) {
                textline.addWordOffset((int) ((oo.get(i) + oo.get(i+1)) / 2))
                i += 2
            }
            //str = "Collapsed offsets ${oo} into ${textline.wordOffsets} for ${textline.text}"
            //java.lang.System.out.println(str)
        } else {
            textline.wordOffsets.clear()
        }
        return textline
    }

    projectBoxIntoAnother = {obox, textline |
        wordArea = obox.area().doubleValue()
        wordAndLineArea = obox.getIntersection(textline.obox).area().doubleValue()
        if (wordAndLineArea < wordArea * .75) {
            return
        }
        s = textline.width * textline.width
        e = 0
        R = java.awt.geom.AffineTransform.getRotateInstance(java.lang.Math.toRadians(textline.rotation))
        P = new java.awt.geom.Point2D.Double(0,1)
        R.transform(P,P)
        sline = new java.awt.geom.Line2D.Double(textline.x, textline.y, textline.x + P.x, textline.y + P.y)
        for (v in obox.verteces) {
            t = sline.ptLineDistSq(v.x.doubleValue(), v.y.doubleValue())
            if (t < s) { s = t }
            if (t > e) { e = t }
        }
        if (s < e) {
            s = java.lang.Math.sqrt(s)
            e = java.lang.Math.sqrt(e)
            textline.addWordOffset((int) s)
            textline.addWordOffset((int) e)
            //str = "Projecting ${obox} into ${textline} gave offsets ${s} and ${e}"
            //java.lang.System.out.println(str)
        }
    }

    transformDynamicAttr(attr, Closure trans) {
        if (attr.range == null) {
            return
        }
        copy = attr.range.clone()
        for( val in copy.iterator() ) {
            newV = trans.call(val.value)
            attr.setAttrValueAtSpan(newV, val)
        }
    }

    transformInto(newAttr, oldAttr, Closure copyInto, Closure create) {
        // Utility method that transforms the values of newAttr to reflect
        // information in oldAttr, using the given closures.
        // If a value exists at the newAttr, then copyInto is invoked.
        // When no value exists, create is invoked.
        if (oldAttr.range == null) {
            return
        }
        modify = newAttr.range != null
        copy = modify ? newAttr.range.clone() : null
        for (val in oldAttr.attrValuesOverWholeRange) {
            // To fill in null values, first set the whole range
            partial = create.call(val.value)
            assert val != null
            if (partial != null) {
                newAttr.setAttrValueAtSpan(partial, val)
            }
            if (!modify) {
                continue
            }
            // After the creation, then modify to reflect
            // partial changes already there
            already = copy.iterator(val)
            if (already.hasNext()) {
                for (partialSpan in already) {
                    partial = partialSpan.value.clone()
                    copyInto.call(val.value, partial)
                    newAttr.setAttrValueAtSpan(partial, partialSpan)
                }
            }
        }
    }

    /**
     * Creates a new type of descriptor, called 'TextLines',
     * if it doesn't already exist. Otherwise, locates it.
     * @return the text lines descriptor config information
     */
    Config insertCombinedDescriptorConfig(viperdata) {
        c = viperdata.getConfig(Config.OBJECT, COMBINED_DESCRIPTOR_NAME)
        if (c == null) {
            c = viperdata.createConfig(Config.OBJECT, COMBINED_DESCRIPTOR_NAME)
            c.createAttrConfig(COMBINED_ATTRIBUTE_NAME, TEXTLINE_TYPE, true, null, new edu.umd.cfar.lamp.viper.examples.textline.AttributeWrapperTextline())
            cOld = viperdata.getConfig(Config.OBJECT, LINE_DESCRIPTOR_NAME)
            for (a in cOld.children) {
                if (a.attrName != LINE_LOCATION_ATTRIBUTE_NAME && a.attrName != LINE_CONTENT_ATTRIBUTE_NAME) {
                    c.createAttrConfig(a.attrName, a.attrType, a.dynamic, a.defaultVal, a.params)
                }
            }
        }
        return c
    }

    String getScriptName() {
        "Combine 'Word' and 'Line' objects into 'Textlines'"
    }
}
- posted by David @ 3:30 PM
Monday, March 14, 2005
Dynamic Queries in the ViPER-GT Timeline
Okay, so it is becoming painfully clear that the current timeline is inefficient and ugly. It is often difficult to associate a line with its descriptor visually, especially as display resolution increases and the lines grow farther from their labels. If we wish to support relations and other direct manipulations of the timeline well, the timeline has to let the user make sense of it. This includes improving the current display, but, more importantly, getting rid of parts of the display that are unnecessary. A good first step would be hiding descriptors that aren't valid on the current frame, or aren't valid within the current time selection, but other, possibly more explicit, dynamic queries will likely be required.
The interface might be shared with other toolkits. The basic idea is to support dynamic reordering and selection. This will likely involve some kind of query panel that pulls out (like a tray in Mac OS X) or slides over (like a status message in MS Windows).
Another key feature that will be required for improved performance is the ability of nodes to render simpler versions of themselves when zoomed out. This will actually have to be built into each node type explicitly, as this is currently how piccolo handles things.
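For what it's worth, the usual Piccolo idiom for that looks something like the sketch below. This is just an illustration of the pattern, not the actual timeline node code, and the class name and 0.5 scale cutoff are arbitrary assumptions.

import edu.umd.cs.piccolo.PNode
import edu.umd.cs.piccolo.util.PPaintContext
import java.awt.Color

class DescriptorLineNode extends PNode {
    protected void paint(PPaintContext paintContext) {
        def g2 = paintContext.getGraphics()
        if (paintContext.getScale() < 0.5) {
            // Zoomed out: just a flat bar filling the node's bounds.
            g2.setPaint(Color.gray)
            g2.fill(getBoundsReference())
        } else {
            // Zoomed in: draw the full representation (labels, markers, etc.).
            g2.setPaint(Color.darkGray)
            g2.fill(getBoundsReference())
            // ... per-frame detail would go here ...
        }
    }
}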
- posted by David @ 5:38 PM
Relations in ViPER-GT
Relations can be simply viewed as foreign keys, with their value the id of an existing descriptor. The relation attribute will have an extended definition, accepting the name of a descriptor type. Enhancements to this would include: multiple relations at a time (allow an event to have any number of participants), linking to something other than an id attribute (specifying some other kind of query, e.g. on name. Would rely on previous requirement for multiple targets), enforcing bidirectional links, enforcing correct links, allowing links to more than one type of descriptors, and allowing links to descriptors defined for other source files. Really, if I were a good xml person, the relation type would be an xpointer of some kind, and descriptor ids would be xml:ids. If I were a good semantic web person, viper's native data format would be OWL. But this is not the way of things.
Like all other data types in viper, the first step is to create a parser object. Like the lvalue, this will have to support extended attribute configurations. Then, I'll have to add an editor. Unlike the other attributes, modifications to one descriptor will have to check the entire model to maintain consistency. For example, deleting one descriptor will invalidate any relation attributes that link to it. Those attributes should either be set to point to null, or they must be deleted first (or automatically) if the attribute is specified to be non-null. I wonder how this sort of thing is done in relational databases. Anyway, right now, I'll probably just implement that sort of stuff in the UI layer, allowing the model to become incoherent if the UI fails in its task.
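To pin down what that UI-layer cleanup might look like, here is a rough sketch. The relation attribute doesn't exist yet, so the method name, the map of relation-bearing descriptor configs, and the integer-id convention are all assumptions.

// Hypothetical: after the descriptor with id 'deletedId' is removed, null out
// any relation attributes that still point at it. 'relationAttrsByConfig' maps
// each descriptor Config that declares a relation attribute to that attribute's name.
def clearDanglingRelations(sf, deletedId, relationAttrsByConfig) {
    for (entry in relationAttrsByConfig.entrySet()) {
        def config = entry.key
        def attrName = entry.value
        for (d in sf.getDescriptorsBy(config)) {
            def attr = d.getAttribute(attrName)
            if (attr != null && attr.attrValue == deletedId) {
                attr.attrValue = null  // or delete d first, if the relation is declared non-null
            }
        }
    }
}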
Multiple relations (sets), or ordered relations (lists), would be very useful for text, where zones or regions of interest are often nested or ordered. However, with careful design, single typed relations are often enough: child and next links can build most of the structures you would want. So, for now, the only extended attribute for a relation will be 'descriptor', indicating which descriptor type we can link to. The table cell editor will be a combo box, allowing the user to type an id or pick from a drop-down of available descriptors. Editing in the chronicle or video frame would be preferable, however.
Editing static relations in the timeline could be as simple as adding more emblems to the descriptor label - one for each relation - and allowing the user to drag from the emblem to the target of the relation. As the user drags, an arrow would follow the cursor, and a tooltip would display relevant information about the dragged-over line's descriptor. For dynamic relations, the problem is more complex. I can see each contiguous segment (referring to the same target) rendering in the same color. To create a new relation, you could drag out a region to create a new segment, then drag from that segment to the target, getting an effect similar to the drag from the static relation emblem. Dragging along the line will modify the values along the line, while dragging orthogonal to the line will change the target of the relation.
- posted by David @ 4:28 PM
two heads are better than one
What is the problem with a video repository? Data currently can be found only by scrubbing through massive amounts of video, so a day of video from a large surveillance installation takes many person-days to search properly. For the most part, these videos are uneventful, with innumerable quotidian noise events distracting from the few important events. Current research into improving these systems focuses on two parts: improving the extraction routines, and improving the ability to display many cameras at once. We will present a plan for work on putting the user in charge of the system, creating a set of tools to browse video faster, to find things better, and to teach a muddled vision system how to see. The system will be measured by how quickly and correctly a user can find certain prototypical bits of information in a video repository, similar to Video TREC but with an emphasis on a PETS-style world.
[Right now, we have picked up two ten- to twelve-minute sequences from Vinay, and we're going to try to get a simple system going with them. I would still like to get work going on my planar, piccolo-powered dotworld, but that will have to wait for a couple of months until this is working well enough. I will refer to the two-camera Vinay system as the alpha system.]
Our first, alpha system will be viewed, from the user's perspective, as a set of tools to mine a repository of video for information. From another perspective, it is a set of computer vision and other AI tools that happen to have a user that can provide additional, per query or per data set information. The system should be presented as the automatic processes acting on behalf of the user, and never in contradiction to her. This paper will introduce a few basic ideas for keeping the user central to the system, while trying to avoid unnecessary constraints on the artificial systems doing the menial tasks.
To do this, we can see, from a systems perspective, that the query interface and data browser represent a dialogue between the two agents — human and machine — about the contents of the recorded video. Each agent is attempting to access and organize notes, bits and pieces of data, into an organized set of summaries, stories, clips and ideas.
Our preliminary data model is divided into two sets of entities: actors and clips. Actors - currently people, but possibly any object type - are represented to the user as a series of bounding boxes around all images of a person in the video repository. A clip will be a set of video frames, probably contiguous and from a single view, which can represent an event.
A user may not localize the actor in every frame, but may give the actor several key bounding boxes; for example, when the actor walks in and out of the frame, when there is a good shot of the actor's face, or when the actor is viewed in profile. The automatic labeling will likely completely localize the actor's views on screen. An actor may have additional information: a generated ID, a name, notes and comments, and some kind of model, like a histogram or some 3D structure. Each frame may have additional information, such as orientation and some kind of certainty and importance values.
A clip will likewise have extended information. The most obvious pieces will be links to included actors and some sort of classification, but there may also be additional information associated with them — the most useful one I can think of being a representational key frame or two. For some of the extended information, like the key frames, a user's mind may be required. For others, such as the actor histogram, a user need not know the information is there.
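Just to have something concrete to point at, a minimal sketch of that gallery model might look like the following; every class and field name here is invented for illustration, not an existing API.

class KeyBox {
    int frame
    java.awt.Rectangle box   // where the actor is in this frame
    double certainty         // how sure the labeler (human or automatic) is
}

class Actor {
    String id                // generated id
    String name              // optional, user supplied
    String notes
    List keyBoxes = []       // sparse for hand labels, dense for automatic ones
    def appearanceModel      // e.g. a histogram or some 3D structure
}

class Clip {
    String view              // which camera
    int startFrame, endFrame // probably contiguous frames from a single view
    String classification    // what kind of event this is
    List actors = []         // links to included Actors
    List keyFrames = []      // a representational key frame or two
}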
For the Level 1 system, the user will create the gallery of clips and actors by hand. For the Level 2 system, the user has access to a set of automated routines to handle most of the repetitive work, hopefully massively reducing the time to populate the gallery. The gallery needs only to be full enough and accurate enough to answer a few questions about the content of the video correctly. It should be noted that the populated gallery is not the end result of the system, and the experiments will not usually explicitly check the quality of its contents. To emphasize this, the Level 0 bootstrap control system will have no gallery at all, and the subjects will answer the questions without access to it (although we may allow them paper and pencil).
For the alpha system, we will first focus on statements of the form "person X is at location Y." These can be interpreted as putting regions of single images into equivalence classes, where each region can represent some actor or overlapping troupe of actors. With a complete set of statements of this form, we may ask such interesting questions as "How often do actors A and B meet?", "Which actors meet most often?", and "Which actor meets the most people?"
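One way to read that - a sketch of the bookkeeping only, not a committed design - is as union-find over those statements: each region is an element, and every assertion that two regions show the same actor merges their classes. All names below are illustrative.

class IdentityClasses {
    Map parent = [:]   // region -> parent region; a region is just some (view, frame, box) key

    def find(region) {
        if (!parent.containsKey(region)) { parent[region] = region }
        while (parent[region] != region) {
            parent[region] = parent[parent[region]]   // path halving
            region = parent[region]
        }
        return region
    }

    // "these two regions show the same actor" - asserted by the user or the tracker
    def union(a, b) { parent[find(a)] = find(b) }

    // "how often do actors A and B meet?" reduces to counting frames where a
    // region from A's class and a region from B's class co-occur
    def sameActor(a, b) { return find(a) == find(b) }
}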
More importantly, it provides a simple language that both agents can comprehend. The user can modify the gallery, indicating that two actor objects are actually images of the same person, or that one actor should be modified somehow, and this information can be recorded internally as statements in this format. The AI can then generate new information that does not negate anything the user said, but instead takes it into account. For example, it can be used to learn better actor models, or to modify the weights or structure of the track graph. If nothing else, it can be applied as a post-processing step to the output of the tracking results. It should be noted that each of these three techniques has a different level of improvement and a different level of consistency: using the user annotation for learning may change the gallery the most, while using it only for post-processing will modify it the least. The trade-off between user expectation and the benefit of automatically performing gallery modification should be measured carefully.
While the user does not have to be aware of the underlying system, there are times when it may become useful to do so. To correct errors, to provide more insight to how the computer vision system works, and to make up for inefficiencies in the learning routines, we may wish to give the user access to more explicit layers of the system, to allow the user to grind the lens placed in front of the computer vision system more accurately.
The goal of the interface is to present the user with as much relevant information as possible, organized by both agents to maximize efficiency. This includes the ability to directly manipulate the browser and the gallery, which will be tied together to maximize the content displayed on the screen and the ability of the user to interpret it. Utilities such as HCIL's Piccolo and INRIA's InfoViz toolkits should help. [Unfortunately, there is the difficulty of actually displaying live video in java. I blame the lack of a good cross-platform media/gui integrated toolkit. ] Two techniques developed at the University's HCIL come to mind: dynamic queries and snap-together visualization.
Dynamic queries are founded on the idea that feedback while composing queries leads to faster, more accurate queries and allows greater exploration and understanding of a data set. Dynamic queries can be as simple as type-ahead searching, as in emacs or Apple's search widgets, or as complex as the visual query system in Spotfire. Our alpha data set should be simple enough that even complex queries in the gallery should give snappy feedback. [This might not be the case for the browser.] The first gallery will likely present several sorts of views: a view of all people, a view of all clips, and summary views of a selected object. The all-people view will present a list of each person's main view - probably user selected; the clip view will show a key frame from each clip. The summary view of a person will present a list of different views, organized by time or by some other characteristic; the clip summary view will present more thumbnails. An obvious query tool to use would be a range slider.
Snap-together visualization is really a modification of the dynamic query idea to include more implicit query types, usually related to selection. In the snap-together visualization paper, for example, clicking on a state field in a spreadsheet highlights the state on a map. The trick is allowing the user to connect different controls to the same selection or query model. This technique can be seen in Eclipse plug-ins that support 'link with...' functionality. The snap-together part is unimportant for the alpha system; the links will be fixed. The important part is the live selection; clicking on a person should play a quick summary of that person in the browser, and highlight clips that contain the person in the clips view. [Extensions include: highlighting other people the person comes into contact with in another color; expanding that person in place to display not just a single thumbnail but more thumbnails - a sort of ThinkMap, semantic-zoom type interface.] The interface will therefore support the idea of 'selected item(s)', with each component having its own reaction to it.
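The mechanism underneath is small; something like the sketch below - an assumption about how I might wire it, not existing code - where a single selection model is observed by every view.

class SelectionModel {
    def selected = []        // the currently selected actors/clips
    List listeners = []

    void addListener(l) { listeners.add(l) }

    void setSelection(items) {
        selected = items
        for (l in listeners) {
            // e.g. the browser queues a summary playback; the clip view
            // highlights clips containing the selected person
            l.selectionChanged(this)
        }
    }
}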
While querying the gallery is important for determining the quality of the information - and actually using it - this leaves the question of creation open. To create a new person, a user can drag a box around the image of a person in a video frame. The box will get added to whatever section of the gallery is selected: meaning, if a person is selected, it will be added as a key frame for that person, but if no person is selected, a new actor object will be created with that bounding box as its primary key frame. Clip creation will require some sort of time line control, where a selection can be dragged to the clip view. For the alpha system, this could be a range slider with some thumbnails attached.
[I'm currently working on the browser. Right now, it is a simple thing that supports playback of several (preferably QuickTime) movies, somewhat synchronized in time. Nagia is the docent, whereas I am the projectionist.]
- posted by David @ 4:27 PM
Thursday, March 03, 2005
Changes in ViPER 4.0b10
- Support for scripts written in Groovy, in addition to the existing script support. This allows much tighter integration with the code, and let me port what were previously plug-ins, like 'insert i-frame descriptor', to scripts, which are currently much easier for the end user to add and edit. To add a script, put it in your ~/.viper/scripts directory. I'll work on making a utility to install them for you. There are a few at the bottom of the viper home page now; a minimal example is also sketched below, after this list.
- Added a 'relink' button, so you can change which media file your metadata refers to.
- A bunch of little UI fixes. A massive redo of some of the timeline internals and the toolbars may result in some UI glitches, but I'm working on it. As always, error reports are appreciated, especially if you send screenshots or your ~/.viper/error.txt and ~/.viper/log.xml files along with them.
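For anyone who wants to try writing a script: the shape is the same as the textline-combining example above. A minimal, untested sketch (the class name and println are just placeholders) would look like this, saved as a .groovy file in ~/.viper/scripts:

import edu.umd.cfar.lamp.apploader.AppLoaderScript
import edu.umd.cfar.lamp.apploader.AppLoader

class HelloViper implements AppLoaderScript {
    void run(AppLoader application) {
        // "#mediator" is the same bean the textline script above uses to reach the open file
        def mediator = application.getBean("#mediator")
        java.lang.System.out.println("Currently editing: " + mediator.currFile)
    }

    String getScriptName() {
        "Say hello from a script"   // a human-readable name, like the example above uses
    }
}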