LAMP: The Language and Media Processing Laboratory

ViPER Use Cases

These describe some of the uses for which ViPER was, or is being, designed. The use cases inform the requirements, but certainly do not constrain them; there will be requirements that these cases do not cover.

Detection Use Cases

These cases describe the situation where ViPER is used to evaluate object or item detection on individual frames (or pages) without a need to evaluate tracking.

Text Detection

This is the problem of finding text (characters, either overlaid on the video or present in the scene) in a noisy environment (video, photographs, and so on). At the highest level this means finding frames or video clips that contain text, while at the lowest level it means detecting the pixels that belong to characters. For the purposes of this use case, we assume that the result of the text detection will be used to feed OCR software. We used a set of keyframes extracted from fair-use and self-recorded video.

Preparing the Data Set

The text may be evaluated at the block, line, word or character level. Given time and resource constraints, we chose to evaluate at the line level. This may result in improper matches, as current text detection systems have simple models for text which often combine lines or split them at spaces. This means that the Hungarian one-to-one object matching method is not appropriate on its own, or is only useful in conjunction with the aggregate match heuristic.

To better support differentiation of the algorithms, the ground truth is rated by quality from 0 (illegible) to 5 (well defined and clear). It is difficult to define what an OCR algorithm will find legible, so running multiple evaluations accepting [5], [4-5], ..., [0-5] as input will give a reasonable curve displaying how a system breaks down; comparisons between programs at any one level of text quality would not be prudent. The value of any legible string is recorded; illegible strings are given a null value. This allows goal-based evaluation, something which is outside of the evaluation scope defined above, but may prove useful later.
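The sweep over legibility bands can be sketched as a loop over filter settings. This is an illustrative script only: the "READABILITY: == a || == b" syntax is copied from the filter lines in the RunEvaluation scripts in this document, and the helper function is hypothetical.

```python
# Sketch: generate the READABILITY filter strings for the sweep of
# legibility bands [5], [4-5], ..., [0-5] described above.
# The "READABILITY: == a || == b" syntax follows the <filter> lines
# used in this document's RunEvaluation scripts; the helper itself
# is hypothetical.

def readability_filter(lo, hi=5):
    """Build a filter accepting quality ratings lo..hi (inclusive)."""
    clauses = " || ".join(f"== {q}" for q in range(lo, hi + 1))
    return f"READABILITY: {clauses}"

# One evaluation run per band; plotting a score against the band
# shows where a detector breaks down as text quality degrades.
bands = [readability_filter(lo) for lo in range(5, -1, -1)]
print(bands[0])  # READABILITY: == 5
print(bands[1])  # READABILITY: == 4 || == 5
```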

In addition to specifying legibility of lines of text and each frame's genre, it is necessary to divide scene text (text on objects in the scene) from graphic text (text in overlay graphics, like pop-up text on CNN). This classification is subjective; text in animations may be considered graphic, and animations may appear over a scene.


There are two major types of evaluation that ViPER provides that are useful for this type of data: objectwise and framewise. The object evaluation attempts to match target and candidate objects together, and counts objects as matched when the 'metric distance,' defined by the evaluator, is less than a given threshold. The frame-by-frame evaluation looks at each frame, and each pixel on the frame, and calculates a standard set of metrics for each frame. We created templates for the two evaluation types and runeval-data files for several different metrics and subsets of data, as well as a few runeval scripts for generating charts comparing the different algorithms with different data sets.

Object Evaluation

The objectwise evaluation, called OBJECT_EVALUATION, has three levels. For a complete description of the evaluation, see the viper-pe user guide. The template is of the form:

OBJECT Text [- -]
    LOCATION : [<boxMetric> <boxThreshold>]

#INCLUDE "../equivalencies.txt"


The first block is the evaluation block, required for any evaluation. Note: newer versions of the viper-pe command line tool allow multiple evaluation blocks in a single run, but the current graphing scripts can only handle raw files with output from one run.

$PR = textdetect.pr
$GTF = all.target.xml
$RDF = all.candidate.xml

$NAME = object-highQ
* <boxMetric> = dice
* <boxThreshold> = .99
* <filter> = READABILITY: == 4 || == 5

#NAME_GRAPH object-highQuality


This small RunEvaluation script, when used with the object template, will run a single object evaluation and create the associated graphs, with the dice distance threshold set to the very permissive value of .99 (this should catch most boxes that overlap at all), making the sorted distance curve more meaningful. For these evaluations, we used the aggregate match, specified by setting the target_match parameter to MULTIPLE. However, this does cause some problems. The aggregate match tends to taint all other objects, resulting in each overlap-clique counting as a single match. This alters the sorted-by-distance match graph, but gives better precision/recall statistics and supports better case-based evaluation for debugging.
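To see why a threshold of .99 catches almost any overlap, consider the dice score s = 2|A intersect B| / (|A| + |B|) turned into a distance 1 - s, with a pair matching when the distance falls below the threshold. That distance interpretation is an assumption consistent with the description above, not a statement of viper-pe internals; the box format here is illustrative.

```python
# Sketch: a dice distance threshold of .99 admits even tiny overlaps.
# Assumes the dice score s = 2*|A&B| / (|A| + |B|) is reported as the
# distance 1 - s, and a pair matches when distance < threshold.
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (illustrative).

def dice_distance(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return 1.0 - 2.0 * inter / (area(a) + area(b))

target = (0, 0, 100, 20)
candidate = (90, 0, 200, 20)       # only a 10-pixel-wide sliver overlaps
assert dice_distance(target, candidate) < 0.99     # still counts as a match
assert dice_distance((0, 0, 10, 10), (50, 50, 60, 60)) >= 0.99  # disjoint
```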

Framewise Evaluation

For the framewise evaluation, we focused on the pixel counts, as described in the viper-pe user guide.

    LOCATION : <frameMetrics>

#INCLUDE "../equivalencies.txt"

$PR = textdetect.pr
$GTF = all.target.xml
$RDF = all.candidate.xml

$NAME = frame-highQ
* <frameMetrics> = matchedpixels missedpixels falsepixels fragmentation arearecall areaprecision [arearecall .7] [areaprecision .7]
* <filter> = READABILITY: == 4 || == 5

#NAME_GRAPH frame-highQuality


The current implementation of makegraph requires the first three frame metrics exactly as shown here; the others can be modified, but only slightly. These are the same metrics recorded in Rangachar Kasturi's paper on the subject, which was used for the VACE evaluation of text detection.
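The first three counts in the metric list above have standard definitions: matched pixels appear in both target and candidate, missed pixels in the target only, and false pixels in the candidate only, with area recall and precision derived from them. The sketch below uses those usual definitions; the exact viper-pe formulas may differ in edge cases.

```python
# Sketch of the per-frame pixel counts named in <frameMetrics>, under
# the usual definitions (matched = in both target and candidate,
# missed = target only, false = candidate only). Pixel-set input is
# an illustrative representation, not the viper-pe data model.

def frame_pixel_metrics(target, candidate):
    """target, candidate: sets of (x, y) pixels on one frame."""
    matched = len(target & candidate)
    missed = len(target - candidate)
    false = len(candidate - target)
    recall = matched / (matched + missed) if target else 1.0
    precision = matched / (matched + false) if candidate else 1.0
    return matched, missed, false, recall, precision

gt = {(x, y) for x in range(10) for y in range(10)}       # 100 target pixels
det = {(x, y) for x in range(5, 15) for y in range(10)}   # shifted detection
m, mi, f, r, p = frame_pixel_metrics(gt, det)
assert (m, mi, f) == (50, 50, 50)
assert r == 0.5 and p == 0.5
```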

Interpreting Results

The evaluations return several charts, as well as some text-based output.

Tracking Use Cases

These are cases where objects are moving around on the screen, and the algorithm is trying to record where the different objects are going.

Text Tracking

In order to do good OCR in video, it is often necessary to fuse multiple frames of video together to provide what is known as superresolution enhancement. The text can then be interpreted with greater accuracy. Also, if text scrolls off or onto screen, or is occluded, it may become necessary to combine multiple runs of OCR into a single text entity. Key to these steps, in current systems, is text tracking.

Like text detection, text tracking evaluation is focused on testing a module in a larger system, presumably a system for extracting text from video. The evaluation should identify how well the system finds text that is the same (meaning the same string) from frame to frame.

Preparing the Data

The data preparation is similar to that of text detection, although the boxes will be dynamic attributes. The text will be static, and changes in the text value will result in new objects. This makes the idea of switching to word-level evaluation more compelling, especially for things like marquee tickers on CNN.


The framewise evaluation will run in similar fashion to the method described for text detection. The object evaluation may benefit from both a SINGLE match and the multiple matching, to see how much fragmentation of tracks there is. Another possibility is to use the keyed tracking evaluation, using the boxes and string values as keys, or explicitly marking the keys, to show how well objects keep their track. This is less important for text tracking; goal-based evaluation may be more appropriate.

Person Tracking

There are a variety of systems that require person tracking. For our purposes, we will focus on surveillance situations (maybe sporting events would be an interesting case, too?). This means not only tracking people as they move around, but tracking individuals in groups and correctly identifying the same individual at different times or from different angles or in different lighting conditions.

Preparing the Data

As discussed above, there will be one object for each person that a human editor can recognize as the same person in the video. It would be even better to have a script that performs better than humans, as there are techniques computers may apply that offer better results for certain kinds of data where average humans fail. If the evaluator decides not to test for identification, the convert script can be used to cut objects into separate objects for each set of contiguous frames.

The most difficult part is selecting a visual representation for a person.

One possibility, and the one we have chosen in the past, is to place a box around the torso and another around all visible extremities. This allows the area within the torso box to be regarded as important, and the rest of the area to be regarded as good to have, but not required.


For evaluation, it is possible to perform the metrics described for text tracking, and these are useful. However, there are a few results beyond those that would be appropriate for this data. One approach is to use the region between the torso and body boxes as a don't-care region during object evaluation.
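The don't-care scheme can be sketched as follows: pixels inside the torso box are required, pixels in the ring between the torso box and the outer extremities box are ignored, and detected pixels outside the outer box count as false. The box format and pixel-set representation are assumptions for illustration.

```python
# Sketch of the torso/extremity scheme above: the torso box is required,
# the ring between the torso and outer boxes is "don't care", and
# detections outside the outer box count as false.
# Box format (x1, y1, x2, y2) is an illustrative assumption.

def inside(p, box):
    x, y = p
    return box[0] <= x < box[2] and box[1] <= y < box[3]

def score_person(det_pixels, torso, outer):
    """Return (matched, missed, false) with the ring treated as don't-care."""
    torso_px = {(x, y) for x in range(torso[0], torso[2])
                       for y in range(torso[1], torso[3])}
    matched = len(torso_px & det_pixels)
    missed = len(torso_px - det_pixels)
    false = sum(1 for p in det_pixels if not inside(p, outer))
    return matched, missed, false

torso = (2, 2, 4, 4)                  # required region: 4 pixels
outer = (0, 0, 6, 6)                  # everything inside is at worst don't-care
det = {(x, y) for x in range(1, 5) for y in range(1, 5)} | {(10, 10)}
assert score_person(det, torso, outer) == (4, 0, 1)
```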

Activity Use Cases

These are cases where a large collection of video is checked for higher level semantic content, such as thefts, handoffs, and other interesting events.

Warehouse Monitoring

Groups Alpha and Beta are trying to develop software that monitors warehouses using digital video cameras. Group Alpha is focused on person detection and uses a rule-based system to turn person tracks and object tracks into sets of activities. Group Beta uses a more statistical approach, skipping object detection and trying to turn segment trace information into activity information. As such, the two groups have different goals, and very different ontologies.

Group Alpha's software generates a very rich set of data, with information regarding person tracks and which person is standing where. The rule system then attempts to use colorimetric data to equate the different person tracks into one person. After that, it can detect if a person is getting something, if a person is sleeping on the job, when a person transfers an object, and so forth. The major activities detected include idle time, transfer events, sleeping, running, and the alarm state.

Group Beta's software is much more opaque, generating only probabilities for each of five different states: idle, active, unusual, theft, and fire.

There are two major reasons to use ViPER here: to benchmark and improve an individual system, or to determine which system best suits the customer. Groups Alpha and Beta could each develop ground truth that is similar to their output, and use ViPER to mark performance improvements as they tweak parameters. From the customer's perspective, what is important is that the software works well on her problems.

Group Alpha may find it useful to test the individual modules in their system with ViPER as described in the detection and tracking use case scenarios. Group Beta might not have as much use for this, although they can follow a scenario similar to the one described below, tailored to their ontology. Since Group Beta has probability data, they may wish to produce more standard ROC curves.

The customer has two different uses for the software: real-time enhancement of existing monitoring systems, and collection of aggregate data. It makes the most sense to develop a set of ground truth that accurately reflects her situation, preferably using existing surveillance footage. She may also stage some scenarios.

For real-time enhancement, the main goal is to bring interesting events to the eyes of night watchmen and floor supervisors. This means there are certain key events - small segments of videos with a certain label, from the point of view of the system - that should be detected. This includes: unauthorized pick-ups, breakage, and work stoppages, as well as entrance and exit of unauthorized people.

For aggregate data collection and mining, the customer wishes to catalogue times when employees are not active, the number of accidents, and other activities whose automated recording feels vaguely Orwellian.

Data Collection

As mentioned above, a selection of video from the actual warehouse is important. This can be acquired from surveillance video archives or generated. One possibility is generating fake data using actors; this may be the only way to capture fire, theft, unauthorized entry, and accident footage.

The simplest method for markup should use a well defined ontology for activities. It might be nice to have a hierarchical ontology, but the evaluation does not require this (it is possible to specify equivalence classes for activity types using the equivalency file). This ontology would divide activities into interesting and uninteresting. Interesting events may have 'critical sections,' e.g. the moment a package is lifted during a 'theft' activity, or when a person leaves or enters during a motion activity.


There are two distinct evaluation types here: the critical event type and the aggregate type. The critical event type places an emphasis on detecting critical sections, while the aggregate type emphasizes tabulating the appropriate seconds over time. Either one could be viewed as a special case of the other; for our purposes, they will be treated separately, although the implementations may be very similar. One thing that will be the same is the treatment of equivalencies. For the event type, we need thefts, which will be defined as 'transfer events at night' for Group Alpha. Accidents will be considered together with fires and unusual events.
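The treatment of equivalencies can be sketched as a plain mapping from each group's output labels to shared evaluation classes. The activity names for Alpha and Beta come from this document; the canonical class names on the right, and the mapping itself, are illustrative.

```python
# Sketch of the label equivalencies described above: each group's
# ontology is mapped to a shared set of evaluation classes.
# Alpha's and Beta's labels come from this document; the canonical
# class names are illustrative assumptions.

ALPHA_TO_EVAL = {
    "transfer": "theft",   # per the text, only transfer events at night
                           # count as thefts; the time condition is not
                           # captured in this simple mapping
    "sleeping": "inactive",
    "idle": "inactive",
    "running": "unusual",
    "alarm": "unusual",
}

BETA_TO_EVAL = {
    "idle": "inactive",
    "active": "active",
    "unusual": "unusual",  # accidents are grouped with fires and
    "fire": "unusual",     # unusual events for evaluation
    "theft": "theft",
}

def canonical(group_map, label):
    """Map a group's output label to its evaluation class."""
    return group_map.get(label, "other")

assert canonical(ALPHA_TO_EVAL, "sleeping") == "inactive"
assert canonical(BETA_TO_EVAL, "fire") == "unusual"
```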

Critical Section Evaluation

The ground truth for these events will have the critical sections marked up, as described above. This is similar in premise to the way the torso box may be treated in the person tracking evaluation. Simply put, activity detections of the appropriate type that strictly include the whole critical section will count as a match. Activity detections that extend outside the activity's non-critical region will be penalized.

Say each activity must be a contiguous section of time, and may contain n intervals that are considered 'critical'. An activity detection may be regarded as perfect if it includes all the critical sections of the matching activity and it does not extend outside the activity. These 'perfect matches' could be ranked by recall of the area, but we'll leave that aside for now.

This means that the non-critical sections of activity may be regarded as don't-care regions, and essentially ignored. The critical sections may be viewed as adjacent. This reduces the problem to that of a time-based segmentation evaluation. Since we require all segments to be matched, if we do not allow splits, we can perform a simple evaluation using the 'extent' metric. This metric penalizes difference in extent. We can add an additional constraint that the target be within the candidate.
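The perfect-match rule above can be sketched directly: a detection of the right type matches when it covers every critical interval of the target activity and does not extend outside the activity itself. The inclusive (start, end) frame-interval representation is an assumption for illustration.

```python
# Sketch of the matching rule above: a detection is a perfect match
# when it covers all critical sections of the target activity and
# stays within the activity's own extent.
# Intervals are (start_frame, end_frame), inclusive (illustrative).

def covers(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def is_perfect_match(detection, activity, critical_sections):
    return (covers(activity, detection) and
            all(covers(detection, c) for c in critical_sections))

activity = (100, 300)                # a 'theft' activity
critical = [(150, 160), (220, 230)]  # e.g. the moment the package is lifted
assert is_perfect_match((120, 280), activity, critical)
assert not is_perfect_match((120, 200), activity, critical)  # misses 2nd section
assert not is_perfect_match((90, 310), activity, critical)   # extends outside
```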

To handle merges, it is possible to activate aggregation. Since it is a requirement that target activities be contiguous, only adjacent target activities may be merged into one. However, candidate activities may be merged if they are adjacent in 'evaluation space,' meaning they are adjacent or separated only by don't-care frames.

Something that might be important is 'slightly missed' critical sections. It may be beneficial to reward algorithms that return activities near the true time of the activity. This may be necessary in the event of lossy ground truth (not uncommon, especially for real-time or super-real-time ground truth authoring). This can be achieved by running low-pass filters over the data, or, rather, over the metric function.
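One way to realize this low-pass idea is to replace the hard 0/1 per-frame match with a smoothed indicator, so that detections a few frames off the true critical section still earn partial credit. The kernel width, shape, and max-based smoothing below are illustrative choices, not the method ViPER uses.

```python
# Sketch: rewarding 'slightly missed' critical sections by smoothing
# the target's per-frame indicator signal, so credit decays with
# distance from the true frames instead of dropping to zero.
# Kernel shape and width are illustrative assumptions.

def smooth(signal, kernel=(0.25, 0.5, 1.0, 0.5, 0.25)):
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        v = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                v = max(v, w * signal[j])  # max: credit decays with distance
        out.append(min(v, 1.0))
    return out

target = [0, 0, 0, 1, 1, 0, 0, 0]             # critical frames 3-4
credit = smooth(target)
assert credit[3] == 1.0                       # exact hit: full credit
assert credit[2] == 0.5 and credit[5] == 0.5  # one frame off: half credit
assert credit[0] == 0.0                       # far away: none
```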

Continuous Activity Evaluation

There are several techniques for continuous activity evaluation. The simplest is precision/recall metrics for each frame. An improved method is ROC analysis, but this is not a viable option in the case of Group Alpha. Since we have the information, a possible solution is to use segment analysis to determine object matching through the OBJECT evaluation as described above. Another is to use a one-dimensional segmentation evaluation.

For this evaluation, we should use two different metrics: the modified Lafferty metric described in Allen et al., and a simple aggregate, goal-based metric at the end, to see how well the sums work out. The first method can get results with a little more reliability than the other methods; the second method more concretely answers the question, but is less likely to return significant results for an experiment.

Conference Room

There are a variety of activities that take place in a conference room. For most of its existence, a conference room lies dormant, with all the lights off. When the lights are on, there can either be people inside or there can be crews managing and cleaning the space. What are most interesting, from the perspective of video processing, are the conferences themselves, when the overhead lights or the projectors are switched on, and what goes on in them.

Most of the footage we currently have of conferences consists of talking heads before a whiteboard or screen. There may be some interaction with the audience. From a high level perspective, the conference is a series of seminars, lectures and presentations given by individuals or small groups.

At the coarsest level, it would be useful to get an idea of the content of the whole conference. At a slightly finer grain, it would be useful to segment the conference video into individual presentations, and then classify the presentations. The presentations can be divided by style (lecture v. Q & A; PowerPoint v. whiteboard) and by content (thesis, keywords, etc.). It may be necessary to further segment talks into different sections, based on which slide the presenter is on, who is speaking, or some other information.

Finally, at the finest grain we will consider here, it may be useful to do things like word/utterance segmentation, object tracking, person identification, and character recognition. These may all be used in support of the higher levels of activity detection, or may be used directly for things like categorization. However, the quality is likely to be poor enough to make these unusable as transcripts, and these also get away from our goal of activity detection.

What we would like is an 'unsupervised TiVo': a system that watches a conference and logically breaks it down into hierarchical sections. The system classifies the sections using some sort of conference ontology. The ontology may include speaker information, topic, style of presentation, or another method of labeling.

Outdoor Surveillance

Most of the outdoor surveillance footage we have is collected from a set of cameras located around the building. Most of it includes people walking around, getting in and out of cars, and moving packages. There is footage of some thefts and some phone calls. The idea of activity detection in an outdoor setting is very open ended; we could easily capture footage of cricket players and picnickers in the courtyard, for example. Since we only have cameras on some of the entrances, we can't track the comings and goings of everyone, and probably won't try, although we may attempt to catalogue arrivals and departures of some individuals who give consent. To avoid such problems, we are going to focus, for this use case, on thefts, running, package delivery, and otherwise suspicious or noteworthy behaviour.

We want to detect these activities, and determine how well we've detected them. Usually, such activities are pseudo-hierarchical: we detect a person going in, and that person is running, and that person is named Jerry. What does it mean when the processor says that Jerry is skipping? The detection of an 'entering the building' event is still correct, but the finer grained activity was a false detection. Should this count as three separate judgments: one correct detection, one false detection, and one missed? An alternative would be to first check at the highest level, assign a score, then apply evaluations at finer and finer grains.
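The coarse-to-fine alternative can be sketched by scoring a detection at each level of the activity hierarchy, so 'Jerry is skipping' still earns credit for the correct 'entering the building' parent. The hierarchy, the labels, and the scoring rule here are all illustrative assumptions.

```python
# Sketch of the coarse-to-fine idea above: partial credit per shared
# ancestor level in an activity hierarchy, so a wrong fine-grained
# label with the right parent still scores. Hierarchy and scoring
# rule are illustrative assumptions.

HIERARCHY = {
    "entering": None,        # top-level activity
    "running": "entering",
    "skipping": "entering",
}

def ancestors(label):
    chain = []
    while label is not None:
        chain.append(label)
        label = HIERARCHY.get(label)
    return chain  # finest first, coarsest last

def hierarchical_score(truth, detected):
    """1.0 for an exact label; partial credit for shared ancestors."""
    t, d = ancestors(truth), ancestors(detected)
    shared = len(set(t) & set(d))
    return shared / max(len(t), len(d))

assert hierarchical_score("running", "running") == 1.0
assert hierarchical_score("running", "skipping") == 0.5  # shared parent only
assert hierarchical_score("running", "entering") == 0.5  # correct but coarse
```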

Unlike the other use cases, here it is important to know how many people are walking around, how many planes are taking off, and so forth. More information is required than 'an activity is happening at this frame;' it is necessary to know that 'an activity of this type occurs during this interval or set of frames.' Multiple activities of any type may be occurring at the same time.