
Blog for work on my Master's thesis - a survey of methods for evaluating media understanding, object detection, and pattern matching algorithms. Mostly, it is related to ViPER, the Video Performance Evaluation Resource. If you find a good reference, or would like to comment, e-mail viper at cfar.umd.edu.

Media Processing Evaluation Weblog

Wednesday, March 31, 2004

Scholarly Paper?

Since it looks like I'm not going to get a thesis done in time to get my MS this semester, I am going to do a scholarly paper instead. Now that I've got no reason not to go on to get a PhD, that is what I'm going to do. I'll probably turn a couple of the things I'm working on into TRs for LAMP (notably the stuff about limn3 and, of course, the scholarly paper). This new course of action relies on getting a few professors' okay for old exams, and on begging a few people for help, as I'm past some sort of deadline.

Improvements to the File Format

As I mentioned recently, there is a need to update the file format. As I work with my RDF N3 based Java application launcher, I see a need, if not to convert the file format to RDF or (better) OWL outright, at least to make it somewhat model-compatible - i.e. I need to make all of it URI friendly. This includes adding id attributes all over the place, and making them of the proper (read: qname) form.

This will allow me to keep user-space information about the files, and perhaps even to have a CC-style RDF comment that includes additional information. I still like the idea of using OWL directly; I'm just worried about it taking too long to implement. (I'd almost certainly have to use OWL Full.) It mostly depends on me finding a good OWL API for Java. [1] [2] [3]

Anyway, the reason I bring this up is that I'd like to allow the user to specify which attributes to display on the video frame next to the boxes (the name), and possibly to specify a thumbnail for display in the table or the timeline. It would be easy enough to add these new things to the RDF, and, since I'm already using RDF everywhere, easy enough to get the values. So, I'll probably add them to the user.n3 file for now, but think about adding them to the data file in the future. But first, URI-addressable instance data! It would make more sense for the URIs to be relative to the sourcefile, but that would mean using rdf:about instead of ids. I'm thinking the id will take the form metadata#sfname-desctype-id, where metadata is the URI of the metadata file, sfname is the sourcefile name (encoded or cropped), desctype is in the form FILE-Information, and id is the standard text id. This means the input parser would only have to be modified to use id = id.substring(id.lastIndexOf('-') + 1);, the XML serializer would have to be similarly modified, and everything else would just work itself out.
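
For concreteness, here is a rough sketch of what the id construction and the one-line parsing change could look like. The class and method names are invented for illustration; they are not the actual ViPER parser API.

    // Hypothetical sketch of the proposed id scheme; class and method names
    // are illustrative, not the actual ViPER parser classes.
    public class DescriptorIds {
        /** Build an id of the form metadata#sfname-desctype-id. */
        static String buildId(String metadataUri, String sourcefileName,
                              String descType, String localId) {
            // Encode or crop the sourcefile name so the fragment stays URI-friendly.
            String sfname = sourcefileName.replaceAll("[^A-Za-z0-9_.]", "_");
            return metadataUri + "#" + sfname + "-" + descType + "-" + localId;
        }

        /** Recover the plain text id the existing XML parser expects. */
        static String localId(String id) {
            return id.substring(id.lastIndexOf('-') + 1);
        }

        public static void main(String[] args) {
            String id = buildId("http://example.org/meta.xml", "clip one.mpg",
                                "FILE-Information", "17");
            System.out.println(id);          // http://example.org/meta.xml#clip_one.mpg-FILE-Information-17
            System.out.println(localId(id)); // 17
        }
    }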

Friday, March 26, 2004

Some Notes to a User

To get started, download the sample metadata and video files. When you open the metadata file, it will search for the video file and load it. (The first time, it will also pop up a dialog asking you to locate the file; it saves this information in a user preferences file, ~user/.viper/user.n3, that you'll have to delete if it accidentally links to the wrong file. Hmmmm... I'll have to add a UI to fix this.)

ViPER has an object-oriented view of metadata (really, more like a relational view), with each row in the table/spreadsheet view corresponding to a single instance object (called a descriptor throughout the documentation). In the sample data, there are two types - the PERSON type and the Information type. If you scrub to one of the later frames, you can see some people marked up with torso and body bounding boxes. To modify a box, you can just drag it around. To create a box, you can click the field in the table you want to modify and type it in (x, y, width, height), or draw it in the canvas (click and drag). The selection model works at the field level - right now, you can only select one field at a time. When you click on a box in the video frame view, the appropriate row is selected in the spreadsheet view. Likewise, when you click on a field in the table view, the appropriate spatial item is selected (for the spatial attribute types: polygons, boxes, etc.). If the value is NULL, it goes into create mode. I think there is a hotkey to select the next attribute - something like ctrl+].

If you are entering simple existence data, you can drag the bars in the timeline view to quickly set where the descriptors are valid (this is the 'v' checkbox in the spreadsheet). I'm working on adding support for directly editing nominal data values in the timeline - most useful would be enumerations (person 1 is in state B, that sort of thing). Some of the other more useful features include interpolation of spatial values (I think the only way you can do this now is by right/option-clicking the descriptor you want to interpolate and selecting 'interpolate to mark'; a rough sketch of the idea appears below) and moving spatial items while playing the video back (drag a box around, then press 'p' to toggle between play and pause - you will be able to drag the box while the video is playing, and the descriptor will automatically be made valid while you are doing so). Also, there is an undo history and a schema editor under the 'windows' tab - the schema editor will let you add new descriptor types or modify existing ones.
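
To make concrete what interpolation does to a box between two keyframes, here is a minimal sketch of linearly blending each component; this only illustrates the general idea, not necessarily the exact arithmetic ViPER uses.

    // Minimal sketch of interpolating a bounding box between two keyframes by
    // linearly blending each component; illustrative only, not ViPER's own code.
    public class BoxInterpolation {
        static int[] interpolate(int[] startBox, int[] endBox, int frame,
                                 int startFrame, int endFrame) {
            double t = (double) (frame - startFrame) / (endFrame - startFrame);
            int[] box = new int[4];   // x, y, width, height
            for (int i = 0; i < 4; i++) {
                box[i] = (int) Math.round(startBox[i] + t * (endBox[i] - startBox[i]));
            }
            return box;
        }

        public static void main(String[] args) {
            int[] atFrame10 = { 100, 50, 40, 80 };
            int[] atFrame20 = { 140, 60, 40, 90 };
            // Halfway between the keyframes: {120, 55, 40, 85}
            int[] mid = interpolate(atFrame10, atFrame20, 15, 10, 20);
            System.out.println(java.util.Arrays.toString(mid));
        }
    }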

Oh, and the 'p' checkbox means 'propagate'. Every time you go to a new frame - for example by scrubbing, playing the video, or clicking the 'next' button - the value the descriptor had will be copied to all frames between the old frame and the new frame, inclusive. This was a little more useful before most of the kinks were worked out of the drag-while-playing feature.

Thursday, March 25, 2004

Alpha 5 or beta 1?

So, I'm about to post a new version, which includes MPEG-2 decoding for Windows users (it relies on a SWIGed VirtualDub) and a few other new features and bugfixes, including a CVS version of Piccolo. If you want the source code for the VirtualDub modifications, send me e-mail or post a request on sf.net. I don't think this qualifies as a beta, as there are still a few features I'd like to add (musical score notation, for example) that I have yet to implement. I also need to finish updating the manual for the new version, and I really should add a tutorial section at the beginning of the manual - perhaps I need something like Flash's Flash movies showing how to make Flash movies.

Wednesday, March 24, 2004

ROC Analysis

I may have pointed out that ViPER doesn't come with any tools for ROC analysis, leaving people to punch numbers into some web site or another. After I get the beta of viper-gt 4 into the pipeline, I'll be able to spend more time actually implementing such evaluation features. Until then, I will only be writing about them.

So, should ViPER include another evaluation mode, after object, framewise, and tracking? I mean, tracking doesn't get much use, so it would not be unprecedented. In this case, it might be called ROC, event, or signal evaluation. I think signal is probably the best name, and it can apply to both spatial and nominal attributes. I haven't thought of how, though.

Another possibility would be to modify the existing modes to support ROC-type analysis - for example, by using a variety of thresholds or filters in one pass, instead of a single threshold.

Event Detection Evaluation

So, as far as I can determine, the best kind of event evaluation is based on simple existence; it makes the most sense to map the problem to a signal detection problem, and use the standard evaluation techniques from there. But, as people have pointed out, sometimes this is impossible. It is not unusual to have test data where some events, quite likely the most salient or important events, do not occur. We aren't going to have much test data where banks are really getting robbed, people are really getting shot, or complicated transactions occur.

So, we need another way - a method for analytically estimating performance on events not found in the data. I think this is one of the aims of VERL. Like other logic-based approaches to defining events, it lends itself, if not easily, to the idea of such evaluation - by determining the probability of correct detection of the atoms, it should be possible to determine the probability of correct detection of the event. So, we should only need video data that properly exercises the detection of the event atoms, and a set of data that exercises the higher-level processes of the event detector based on purely synthetic event descriptions.
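
As a toy illustration of the atoms-to-event idea, under a strong (and probably unrealistic) independence assumption between atoms; the atom names and detection rates here are invented:

    // Toy sketch: estimating event detection probability from atom detection
    // probabilities, assuming atoms are detected independently (a strong and
    // probably unrealistic assumption). Names and numbers are made up.
    public class EventEstimate {
        public static void main(String[] args) {
            String[] atoms = { "person-enters-bank", "person-draws-weapon", "teller-hands-over-bag" };
            double[] pAtom = { 0.92, 0.70, 0.65 };   // hypothetical per-atom detection rates

            // A purely conjunctive event is detected only if every atom is detected.
            double pEvent = 1.0;
            for (double p : pAtom) {
                pEvent *= p;
            }
            System.out.println("Estimated P(detect event) = " + pEvent);
        }
    }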

Of course, this will only work for event detectors that actually work in the waterfall, black-box image-processing/object-detection way, which is not required and likely not the best way to proceed. In order to provide good and general evaluation of event detectors, we really need video of the events occurring (and, of course, miles of footage of the events not occurring).

The Need for Tracking Truth

Okay, so one of the reasons we need detailed ground truth at all is that we don't have adequate data. The reason we develop the truth is to perform experiments - usually, in this case, on how well a tracker recognizes and follows objects moving about in videos. The simplest data records when (on which frames) in the video each item appears. This is great; it lends itself to nice ROC curves and confusion matrices, and, perhaps most importantly, it is cheap and accurate. Bounding boxes are open to interpretation (you really should put a few pixels around the text; you should or should not put boxes around parts that are occluded/disconnected; etc.). With the exception of what to do when tracking through total occlusion (most trackers apply a filter that will connect across such occlusions - and this is most likely a good thing), there is little debate about where to put the start and stop marks on the screen.

Of course, the more general 'existence' tracking truth may be derived automatically from bounding-box or centroid truth (and, with bbox truth, you can filter by the size of the box, for example, and still do the ROC evaluation). In ViPER terms, you would mark boxes with certain constraints (area less than x, minimum length of a side less than y) as output-filtered, i.e. mark them as don't-care regions.
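
A minimal sketch of that derivation, with made-up thresholds; this only illustrates the idea and is not viper-pe's actual filter syntax or API:

    // Sketch of deriving frame-level existence truth from bounding-box truth,
    // while marking boxes below a size threshold as "don't care". Illustrative
    // only; not viper-pe's actual filters.
    public class ExistenceFromBoxes {
        enum Existence { PRESENT, ABSENT, DONT_CARE }

        static Existence classify(int width, int height, int minArea, int minSide) {
            if (width <= 0 || height <= 0) {
                return Existence.ABSENT;        // no box on this frame
            }
            if (width * height < minArea || Math.min(width, height) < minSide) {
                return Existence.DONT_CARE;     // output-filtered: too small to count either way
            }
            return Existence.PRESENT;
        }

        public static void main(String[] args) {
            System.out.println(classify(40, 60, 500, 10)); // PRESENT
            System.out.println(classify(8, 30, 500, 10));  // DONT_CARE (side below minimum)
            System.out.println(classify(0, 0, 500, 10));   // ABSENT
        }
    }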

The point is that we need better control in a world without data to give it. The PETS 2003 paper I recently mentioned includes the idea of synthesizing data from existing data. This goes some way toward addressing the problem. The problem is that we need to detect how well the tracker tracked person x, so we need two streams, the same except for the existence of person x. Better still, we could have the control stream, and then a variety of test streams with person x moving in different ways or under different conditions. If we have enough data, and the algorithm correctly detects the frames on which person x appears and correctly identifies the frames without person x, it can be assumed that the algorithm is detecting well. Furthermore, it can be assumed that a tracking algorithm is tracking well if the difference between the output with x and the output without x shows the same regions. If the tracker has confidence intervals, it may be possible to directly derive an ROC curve; otherwise, it might be possible to generate an ROC cloud to estimate the true curve (basically, by trying different parameter sets, getting points on the precision/recall tradeoff graph, and taking the max curve; a sketch of that last step appears below). (See also: some dude's work on detecting modified images.)
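
Here is a small sketch of the 'max curve' step, assuming you already have one (recall, precision) point per parameter set; the points are invented:

    // Sketch of the "ROC cloud" idea: run the tracker with many parameter sets,
    // collect one (recall, precision) point per run, and keep only the points
    // on the upper envelope (the max curve). Data and names are hypothetical.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class RocCloud {
        static class Point {
            final double recall, precision;
            Point(double r, double p) { recall = r; precision = p; }
        }

        /** Keep only points no other point beats in both recall and precision. */
        static List<Point> maxCurve(List<Point> cloud) {
            List<Point> sorted = new ArrayList<Point>(cloud);
            Collections.sort(sorted, new Comparator<Point>() {
                public int compare(Point a, Point b) {
                    int c = Double.compare(b.recall, a.recall);
                    return c != 0 ? c : Double.compare(b.precision, a.precision);
                }
            });
            List<Point> envelope = new ArrayList<Point>();
            double bestPrecision = -1.0;
            for (Point p : sorted) {           // scan from high recall to low
                if (p.precision > bestPrecision) {
                    envelope.add(p);
                    bestPrecision = p.precision;
                }
            }
            Collections.reverse(envelope);     // return in increasing recall order
            return envelope;
        }

        public static void main(String[] args) {
            List<Point> cloud = new ArrayList<Point>();
            cloud.add(new Point(0.9, 0.40));   // one point per parameter set (made-up numbers)
            cloud.add(new Point(0.7, 0.60));
            cloud.add(new Point(0.7, 0.55));   // dominated, so dropped from the envelope
            cloud.add(new Point(0.5, 0.80));
            for (Point p : maxCurve(cloud)) {
                System.out.println(p.recall + "\t" + p.precision);
            }
        }
    }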

If you could control for it, the algorithm must be picking up on the only change between the controls and the tests - the existence of the person. If the ROC curve is perfect, it must be picking up the track, and, if the algorithm is coded correctly, it should be able to localize those differences within the frame; so it can be assumed, with a large amount of well-controlled (at the frame level) ground truth, that the algorithm is tracking as well as it is detecting. However, since it is often impossible to get such ground truth (I could imagine it being generated for CNN overlay-text tracking, for example, or even fixed-camera surveillance, as is done in Black, Ellis, and Rosin), control must be exerted in another way - by identifying the parts of the video where the change occurs, through the use of centroids or (better) bounding boxes or (better still) more complicated shapes or bitmaps. In this case, the control is the rest of the image.

For example, one of the main worries about tracking is confusion between objects in the frame. In the synthetic case, we could add other people to the video at different times and develop a confusion matrix at the frame level. In real data, this is often impossible. Instead, ViPER takes the approach of first finding the best track-to-track matching, then making it possible to compute confusion on those numbers. This could be extended to allow the best match to be found at each frame; then the confusion matrix itself may be optimized to find the true matches. (If recognition exists, then the optimization step is redundant.) Still, it would be nice to have an ROC curve somehow.
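
As a rough illustration of the first step only, here is a greedy matching of truth tracks to candidate tracks by shared frames. ViPER's actual matching is an optimal assignment rather than this greedy pass, and the data is made up:

    // Greedy sketch: match each truth track to the candidate track it overlaps
    // most (by shared frames). Not ViPER's optimal assignment; data is invented.
    import java.util.BitSet;

    public class TrackMatch {
        /** Number of frames on which both tracks are present. */
        static int overlap(BitSet a, BitSet b) {
            BitSet both = (BitSet) a.clone();
            both.and(b);
            return both.cardinality();
        }

        static BitSet frames(int first, int last) {
            BitSet b = new BitSet();
            b.set(first, last + 1);   // frames first..last, inclusive
            return b;
        }

        public static void main(String[] args) {
            BitSet[] truth  = { frames(10, 50), frames(60, 90) };
            BitSet[] output = { frames(12, 48), frames(55, 80), frames(0, 5) };

            for (int t = 0; t < truth.length; t++) {
                int best = -1, bestOverlap = 0;
                for (int o = 0; o < output.length; o++) {
                    int ov = overlap(truth[t], output[o]);
                    if (ov > bestOverlap) { bestOverlap = ov; best = o; }
                }
                System.out.println("truth track " + t + " -> output track " + best
                        + " (" + bestOverlap + " shared frames)");
            }
        }
    }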

Getting an ROC curve implies that you have a signal that you want to detect, and its shape describes the trade-off between precision and recall (or sensitivity and specificity). The area under it is often useful as a metric, with values approaching one better, and those approaching one half worse. In the case of tracking, it is possible to say that any two pixels from anywhere in the volume are or are not part of the same track. This can be the signal - the set of all pixel pairs. Other possibilities include the set of all possible tracks, the set of all pixels, etc. The set of all pixels suffers from not including any element of tracking, while the set of all tracks includes far too much, likely yielding a meaningless ROC curve due to the massive number of possible tracks and the comparatively minuscule number of good tracks. The pixel-pair metric is similar to the topic-tracking metric discussed in [TODO: find the topic tracking paper]. Unfortunately, I have no idea how to compute this; on the surface, it appears ridiculously intractable. The naive way is clearly not doable - the complexity must be on the order of the number of frames, not the number of possible pairs. Not to mention the fact that it is impossible to represent the pairings explicitly.
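
To make the pixel-pair definition concrete, here is a deliberately naive toy sketch on a tiny flattened volume; enumerating every pair like this is exactly the part that becomes intractable on real video, and the labels are invented:

    // Naive toy sketch of the pixel-pair idea: label every pixel in a tiny
    // volume with a track id (0 = background), call a pair "positive" when both
    // pixels carry the same non-zero id, and compute precision/recall of the
    // output pairs against the truth pairs. Illustrative only.
    public class PixelPairMetric {
        static boolean sameTrack(int[] labels, int i, int j) {
            return labels[i] != 0 && labels[i] == labels[j];
        }

        public static void main(String[] args) {
            // One label per pixel of a flattened toy volume.
            int[] truth  = { 1, 1, 0, 2, 2, 2, 0, 0 };
            int[] output = { 1, 1, 1, 2, 2, 0, 0, 0 };

            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < truth.length; i++) {
                for (int j = i + 1; j < truth.length; j++) {
                    boolean inTruth = sameTrack(truth, i, j);
                    boolean inOutput = sameTrack(output, i, j);
                    if (inTruth && inOutput) tp++;
                    else if (!inTruth && inOutput) fp++;
                    else if (inTruth && !inOutput) fn++;
                }
            }
            System.out.println("precision = " + (double) tp / (tp + fp));
            System.out.println("recall    = " + (double) tp / (tp + fn));
        }
    }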

Another possibility is to redefine the metric space of the ViPER object evaluation so that it can give a result similar to an ROC curve. For example, the algorithm could give several result sets, each with one track, defined continuously from most conservative to most permissive (e.g. from no track to a single box taking up the entire frame), with the evaluation giving sensitivity and specificity for the best match for the track at each segment. This tracking evaluation could use Hungarian matching; always returning the best match for each truth track will be useful for per-track ROC curves, but the Hungarian approach may give more discriminating curves for the whole set of tracks, although it may result in questionable curves for individual tracks.

Friday, March 19, 2004

A Novel Method for Video Tracking Performance Evaluation

This paper, from PETS 2003, presents a mechanism for evaluating surveillance video by compositing existing ground truth to generate new synthetic, more complex ground truth. It also includes some metrics for measuring truth complexity (these are very specific to tracking, especially tracking as it is currently implemented). I like the idea of a closed system, and I am certainly all for the automation of GT generation. More, cheaper test data makes for better experimental results.

I like papers that reference my work. I need to write more papers so this happens more often. Anyway, this paper brings ODViS to my attention, and reminds me once more of the need to integrate vision systems with ViPER at a more fundamental level.

Reference Link
@inproceedings{Black2003,
   author    = {James Black and Tim Ellis and Paul Rosin},
   title     = {A Novel Method for Video Tracking Performance Evaluation},
   booktitle = {PETS 2003},
   year      = {2003}
}

Wednesday, March 17, 2004

Possible Names for AppLoader

  • LAMP AppLoader
  • billingsgate
  • redaction
  • quidnunc
  • plinth
  • limn3
  • n3penthe
  • death adder

Personally, I'm leaning toward limn3, which I would pronounce 'lime', as it probably has the best connotation. It could be confused with the GNU fontutil limn, but, as that program has been deprecated in favor of autotrace by the GNU organization, the confusion should be minimal. Billingsgate is probably my second favorite, but it is already well established as a place in England and in New England. Except for the first one (which is what it is called in most of the documentation), the others are mostly jokes. Death adder links in with the name viper, and n3penthe links in with its use of RDF, but neither is really descriptive of the framework. Limn3 at least implies what it is trying to do: describe an application so well that the description can be used both to run the application and to generate documentation for it. To really live up to the name, I'd have to write a GUI front-end that takes a list of modules and lets the user style them, but that would distract me from my own work on ViPER.

AppLoader Paper

I've been writing a short paper describing the application loader I wrote for ViPER 4. It uses RDF descriptions to launch Java applications. In a sense, it is like a limited version of Haystack or Chandler, or a version of BML on Semantic Web steroids. I recently stumbled across another tool, MindSwap's Dynamic Java Class Loader using OWL, which provides similar functionality and is like a subset of my tool and BML. Reading its documentation reminded me that I need to add support for constructors that take arguments, something both it and BML support.
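
In plain reflection terms, supporting constructors with arguments amounts to something like the following sketch; this is not limn3's actual API, just the underlying java.lang.reflect calls a loader would need to make.

    // Sketch of instantiating a class through a constructor that takes
    // arguments, via reflection; illustrative only, not limn3's real loader.
    import java.lang.reflect.Constructor;

    public class ConstructorLoad {
        /** Instantiate className using the constructor matching the argument types. */
        static Object instantiate(String className, Object[] args, Class<?>[] argTypes)
                throws Exception {
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getConstructor(argTypes);
            return ctor.newInstance(args);
        }

        public static void main(String[] args) throws Exception {
            // A no-arg-only loader can do little more than Class.forName(...).newInstance();
            // this also handles, e.g., new StringBuilder("hello").
            Object sb = instantiate("java.lang.StringBuilder",
                    new Object[] { "hello" }, new Class<?>[] { String.class });
            System.out.println(sb);
        }
    }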

Improving Event Editing

So, in order to improve event editing, I'm adding a 'nillable' attribute to the data. This way, the timeline will know that it is all right to pull a single lvalue up to the top level, as editing the nominal attribute will directly change the validity range of the enclosing descriptor.

This is part of an attempt to add a few needed features to the annotation schema. The other features I need to add are: multiple-stream support (with the perStream attribute of the attribute config, and of the descriptor config for the validity bit), really unique ids (for RDF addressing of descriptors - this will, theoretically, allow one descriptor to span multiple source media files), and finally implementing sequences properly. I don't like changing the file format, as it means editing both the annotation tool and the evaluation tool.

The only one that should affect viper-pe's evaluation is the support for multiple streams. I will likely treat each stream as a different span of time on the same video. I will have to add an 'in_stream' filter (a companion to the 'contains' filter) and the ability to split a file into separate files (demultiplexing), which will be useful for transferring to the old, text-based GTF file format.

Monday, March 08, 2004

Video Analysis and Content Extraction

I'm back from the March 2004 VACE workshop in LA. It was hosted by USC at the Marina del Rey Marriott. It was an interesting meeting, with most of the talks concerning the representation of video data (VEML, ViPER's markup, MPEG-7, etc.) and event descriptions in the form of VERL.

I hope to put out the first beta of ViPER-GT version 4.0 soon; the demo of it went well, although I did have problems with both the 'move while playing' and 'display with respect to' features. I've also got to fix a few things with the schema editor and the enhanced table view before I will feel comfortable calling it a release candidate.

In the meantime, Tony has been working on a .NET version, likely to be called ViPER-GT 5. Given the performance improvements it brings and its narrower platform support, we will likely have two versions of the annotation tool for some time, with somewhat disparate codebases (hopefully, J# will allow us to share some of the code).

