
Video Performance Evaluation Resource

Performance Evaluation Manual

Introduction

The goal of the Performance Evaluation component of the Video Performance Evaluation Resource, or ViPER-PE, is to take the result data from an analysis, compare it to ground truth, and produce data describing the success or failure of the analysis, in a way that keeps results repeatable, valid and comparable. This document describes the goals of the ViPER-PE software, how to use it, and how to interpret its results. The three major types of analysis are object, framewise and tracking.

ViPER-PE's Performance Evaluation Methods

The Problem of Performance Evaluation

Performance evaluation is required for empirical research into video analysis. Without performance evaluation, researchers and users cannot measure improvement over older methods or determine the best algorithm for a task. Good performance evaluation can provide baselines, deliverables, and other artifacts useful not only for the good of science, but also for securing grants and funding.

Unfortunately, performance evaluation of video is inherently difficult. It is labor intensive, often requiring the creation of custom tools and the generation of ground truth. The custom tools may contain flaws, as they are often not given as much attention as required. The ground truth itself is more likely to contain flaws, especially for video, where generating ground truth is thankless and repetitive. It is often subjective, with metrics and data sets tailored for the tested algorithms. The published results may not be statistically valid or correct, and the reader is rarely given the tools to verify results.

How ViPER Helps

Essential components of an evaluation system include:

1) A standard method of representing analysis results and ground truth. This is accomplished with the Ground Truth file formats described in the `ViPER-GT Manual`_. Publishing ground truth and algorithm output in this format allows interested parties to perform evaluation using their own tools, or to compare a different algorithm directly to one included in a paper.

2) A system for configuring an evaluation system. ViPER uses the Evaluation Parameters File (EPF) in combination with a set of properties to configure a specific evaluation. The EPF tells the system how to perform an evaluation of candidate data against a target, and is readable enough to let a reader know what an evaluation means while providing a method for duplicating the experiment.

3) A tool to perform the evaluation. This is accomplished with viper-pe, a Java-based command line tool.

4) A meaningful way to present results. ViPER-PE provides two output formats: a human readable text file, and a space delimited one for machine reading. The makeGraph tool can convert sets of the machine-readable output format into graphs, while the RunEvaluation tool can compare several sets of experiments. For information about display, see the `Scripting ViPER Manual`_.

With the above four items, ViPER provides repeatability, comparability and the ability for an experimenter to determine validity. Users and researchers in a given field of video analysis can decide on a ViPER-PE configuration; this will allow duplication of the experiments. Used properly, ViPER-PE can track progress during development of a video understanding system, compare the features of existing systems and otherwise help evaluate the state of the art. For detailed examples of these scenarios, refer to the `ViPER Case Studies Manual`_.

Performance evaluation involves comparing generated data, called the candidate or result data, with the ground truth, or target, data. They are called this since any given algorithm is trying to generate the target data as accurately as possible, producing a set of candidate objects, or descriptors. Given sets of candidate and target data, with appropriate configuration options, the Performance Evaluation tool generates a description of how well the candidates (results) match the targets (truth) given the parameters.

Given that the field of video understanding is wide open, little agreement exists upon a proper set of performance metrics. As such, ViPER-PE includes several different types of analysis, each of which is configurable. The three analysis types included in ViPER are object-matching, framewise comparison and track comparison.

Object Analysis: Matching Candidates to Targets

Object analysis attempts to match candidate objects to target objects. It attempts to answer the question, "How close are two descriptors?" Using this idea, it determines which targets and candidates are close together, and then reports precision and recall based on the number of candidates matching targets. This involves defining some distance space for descriptors, preferably a metric one. For ViPER-PE, all distances are normalized between zero and one. A complete listing of the available metrics is included in the appendix.

Object Analysis is divided into multiple phases: detection, localization, statistical comparison and target matching. Each phase is a finer level of analysis. Detection determines that some object was found in a set of frames for each target. Localization and statistical comparison make sure that a possible match has similar attributes. Finally, the target matching phase checks many-to-many candidate-target pairings, dealing with split and merged objects.

Note: It is possible to short-circuit the evaluation and apply target matching without doing the statistical comparison or even the localization phase. You may set this with the level property, as described in the appendix.

==================== ===========================================================
Metric               Definition
==================== ===========================================================
Equality             Two attributes are close if they are equivalent.
Dice coefficient     Twice the shared area over the sum. This avoids the
                     asymmetry of the Overlap metric.
Target Overlap       The fraction of the target that overlaps the candidate is
                     used as the distance.
Maximum Deviation    The maximum of either the target or candidate overlap
                     distance.
String Edit Distance Strings that require more atomic edit operations (insert
                     or delete character) are farther apart.
==================== ===========================================================

Detection

In the Detection phase, ViPER-PE looks at descriptor types and the frames in which the objects occur. If a candidate overlaps a sufficient number of frames of any target descriptor, then both the candidate and the target are counted as detected. The chosen distance metric and tolerance define how much frame overlap is required for a pair to count as a detection. The metric, which defaults to the dice coefficient, determines how to compute the distance between two frame spans. If the computed distance is less than or equal to the tolerance, the pair is detected.

Detection could very well be enough analysis for an application. For example, a program that counts the number of faces on a screen does not need to check where they are. Recognition, frame detection and object counts all require no more analysis than this. By setting the level of evaluation to '1', the frame distance from the detection phase is used directly for the target matching phase.
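A properties file fragment for a detection-only run might look like the following sketch; the property names are those listed in the appendix table, while the metric name and tolerance values are only illustrative.

    # Detection-only object analysis: stop after the frame-overlap test.
    level = 1
    range_metric = dice
    range_tol = 0.5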

Localization

The second phase, localization, works much like detection. Instead of looking at all frames of the objects, it compares only those frames where the attributes are similar. Attribute similarity, like frame range similarity, is user-defined. For a given candidate-target pair, frames with dissimilar attributes are counted as missed or false in the frame range distance calculation. The distance metric for each attribute is specified separately, through the properties and the evaluation parameters file (EPF).

For example, when localizing faces, it is common to use a dice metric on the face bounding box with a tolerance of about 0.5. If we also wanted to check the identity of the face, stored in a string attribute "Name", we would add an equality metric for that attribute. For a possible match, only frames that have the appropriate face name and meet the dice threshold count as similar. The frame range metric is then computed as if the dissimilar frames did not overlap, and the value of this localized frame metric becomes the distance between the target and candidate descriptor.
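The corresponding object evaluation block might look like the following sketch; the descriptor name Face and the attribute names BBOX and Name are hypothetical, the dashes leave the frame span metric and some thresholds at their defaults, and E is the equality metric listed in the appendix.

    OBJECT Face [dice -]
        BBOX : [dice .5]
        Name : [E -]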

Statistical Comparison

The third level is statistical comparison. From the localized matches, it computes the average, median, minimum, or maximum attribute distance, and then thresholds that value against another tolerance, specified with the level3_tol property. This is useful for descriptors that span several frames. The distance between the descriptors is then set to this statistic.
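A properties fragment for this level might look like the following sketch; level3_metric and level3_tol are the property names listed in the appendix, but the spelling of the statistic value is an assumption and the tolerance is only illustrative.

    # Statistical comparison: threshold the chosen statistic of the
    # per-frame attribute distances. (The statistic name is a guess.)
    level = 3
    level3_metric = median
    level3_tol = 0.8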

Target Matching

After these phases, ViPER-PE has a list of candidates that match targets. Many candidates may match a single target, and vice versa. While this may be perfectly acceptable, it is often not enough to leave it at this: there are many situations where having two objects match one does not make sense. Allowing only one candidate for each target is one possible solution. For many cases, such as text detection, a scheme of splitting or merging descriptors, so that only appropriate multiple groupings are made, may be preferable. Since both have their uses, ViPER-PE provides both methods.

It should be noted that either method requires an ability to rank target-candidate matches. While the previous steps only generated a distance measure for individual attributes, target matching requires a single distance measure for a candidate-target pair. For detected and localized pairs, the distance is the frame range distance. If ViPER-PE was set to use statistical comparison, the match distance is calculated as the average of the statistic over each evaluated attribute, with missed and falsely detected frames counted as having a distance of one.

For one-to-one target matching, it would be best to minimize the sum of the distances over all targets and candidates. Set the target_match property to SINGLE-OPTIMUM for this optimal search. However, sometimes minimizing the sum of distances doesn't make sense. A simpler option, SINGLE, just takes the best matches in order.

Setting the target_match property to MULTIPLE will aggregate descriptors together. The aggregation algorithm is iterative: it examines targets that match the same candidate and aggregates them if aggregation would improve the distance, then it looks at candidates that match the same target and repeats. This continues until no further improvement in the distance is gained. Aggregation is only defined for some of the attribute types. Frame spans and circles are aggregated by taking their union. Boxes, oriented boxes, and (in the near future) polygons can also be combined with each other. To make the output prettier, strings are concatenated, but this is unlikely to be the appropriate aggregation; a better technique would perhaps use a bag of words.
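A properties fragment selecting the matching behavior might look like the following sketch; the value names are those listed in the appendix table and in the text above.

    # How to resolve many-to-many pairings after the analysis phases:
    # ALL keeps every pairing, SINGLE and SINGLE-BEST keep one candidate
    # per target, and MULTIPLE merges split or fragmented descriptors.
    target_match = MULTIPLE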

Framewise: Evaluation on the Frame Level

It may be apparent by now that object analysis is not a panacea. It is complex, and often does not produce readily usable data. Framewise analysis is a direct response to these issues. Instead of looking at objects across frames, it examines them frame by frame, generating a list of metrics for each frame, including pixel precision and recall. It supports most of the metrics defined in Mariano, Park, Min and Kasturi's paper on performance evaluation; the tracking evaluation provides those that it does not. The framewise evaluation may also include pixel match counts.

For each descriptor evaluation, the framewise evaluation always includes object count metrics. First, don't-care regions of the image are carved out. Descriptors that fall completely within these regions are ignored. Don't-care objects are selected using the 'Output Filter' section of the EPF file. Candidates that match an individual don't-care descriptor under the angle-bracket measure, if one is specified, are also ignored. Then, the metrics are computed.

Object accuracy is twice the minimum of the candidate and target counts, divided by the sum of the candidate and target counts. Object recall and precision measure how well the numbers of objects match in each frame. Note that the totals over an entire video count each occurrence of an object in a frame as a separate object.

[TODO: Describe localizers and other metrics]

Tracking: Evaluation Given the Starting Frame

Tracking Analysis assumes that there already exists some mapping between candidates and targets such that no candidate or target is mapped to more than one descriptor. The simplest way to achieve this is to associate a unique attribute with each descriptor in the ground truth, and use that to identify the candidate descriptors. An experiment here will usually involve distributing first-frame data sets, generated using gtf2gtf's -clip option.

The viper-pe Tool

Installation

For information about how to install ViPER-PE, refer to the Quick Start Guide.

Using the viper-pe Tool

The viper-pe command takes a series of properties to configure the system. These can be specified either on the command line, using their short names or property names, or in a properties file. The command needs at least two properties, the target file and the candidate file, specified with the -g and -r options respectively, although the analysis will likely be garbage without an evaluation parameters file and at least the target_match property set.

Frequently Used Properties (for the complete list, see the appendix):

=================== ============= =====
Command Line Switch Property Name Value
=================== ============= =====
-g                  gt_file       The file name of the truth file (the file 
                                  containing the target data set).
-r                  results_file  The file name of the results file (the file 
                                  containing the candidate data set).
-epf                epf_file      The evaluation parameters file, including 
                                  equivalency, evaluation, and filter information.
-o                  output_file   Where to print the human readable output data. 
                                  Defaults to standard output. Set to `-` for 
                                  standard output, and the empty string for none.
-raw                raw_file      The file to receive the raw data output. 
                                  Defaults to none. Set to `-` for standard 
                                  output, and the empty string for none.
-P<propertyname>                  Specify any property by its long name.
-pr                               The file name of the properties file.
                    target_match  If using an object evaluation in your EPF, 
                                  this parameter specifies the methods of 
                                  object analysis: ALL, SINGLE, SINGLE-BEST, 
                                  or MULTIPLE.
=================== ============= =====

Given a valid set of properties, viper-pe will generate readable output in the file specified with the -o option and machine readable data in the file given by the -raw option. It will display error messages on the system's error stream, and informative messages on the stream given by the -l option.

From a Unix install, you should be able to invoke the viper-pe command from any location, assuming you have set the PATH variable by sourcing the viper.config file from csh, or dotting the viper-config.sh script from sh. From Windows, you will have to write a batch file, or invoke the slightly more complex command java -jar viper-pe.jar. For example, a simple command would be:

viper-pe -pr textdetect.pr -epf dice-graphic.epf -g all.gtf.xml \
         -r UMD/RDF/all.rdf.xml -o dice-graphic.out -raw dice-graphic.raw

Setting the Evaluation Parameters

The evaluation parameters file is given as a list of sections, delimited with #BEGIN_<section> and #END_<section> lines. The sections are EQUIVALENCE, EVALUATION, and the four FILTER sections. The Equivalence section matches names in the target data to names in the candidate data; the Evaluation section specifies both the type of evaluation and how the results will be computed; and the Filters specify which descriptors to evaluate and which ones to ignore.

Equivalence

One common issue during evaluation is a disagreement between the descriptor and attribute names of the candidates and targets. Even worse, several different candidate descriptor types may match one target type. The Equivalence section addresses the problem with a simple list of matches. It is possible to map a single target name to multiple candidate names on one line; simply place all of the candidate names in a space delimited list on the right side of the colon. If you want to map multiple target names to a single candidate name, or to the same list of candidate names, repeat the line for each target name.
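For example, an equivalence section mapping one hypothetical target type to two hypothetical candidate types might look like the following sketch.

    #BEGIN_EQUIVALENCIES
    // The target type TextBlock may appear as either candidate type.
    TextBlock : TextRegion CaptionRegion
    #END_EQUIVALENCIES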

Evaluation

There are three possible evaluation sections; there is one for each type of evaluation. While they look similar, each has a slightly different layout. All three operate on the ground truth names for the descriptors and attributes, and work by listing the descriptors to evaluate, with a list of metrics following each line. Attributes and descriptors that are not listed in the EPF file are ignored.

==================== ==============================================================
Evaluation Type      Sample Evaluation Block
==================== ==============================================================
Object Evaluation    OBJECT Descriptor1 [dice .99]
                         AttributeAlpha : [maxdev -]
Framewise Evaluation OBJECT Descriptor
                         BBOX : dice [arearecall .6] [areaprecision .6] \
                              <areaprecision .6>
Tracking Evaluation  OBJECT PERSON
                         BBOX : dice extents [maxdev -]
                         * NAME
==================== ==============================================================

Object Evaluation

The Object Evaluation section specifies which descriptors and attributes to evaluate. It is a list of the descriptors to evaluate, and information about how to evaluate them. After each attribute name is a colon and a square-bracketed metric-threshold pair. The metrics available for each attribute type are listed in the appendix. It should also be noted that the frame span gets a similar treatment, with its metric-threshold pair placed on the first line, after the name of the descriptor. The frame span metric and threshold are used for the matching and localization steps, with the attribute metrics and thresholds used for the statistical localization and target matching steps. Dashes may be used to indicate the default metric or threshold.

The evaluation is performed as specified in the Object Analysis section above, using the level and target_match properties to determine the level of analysis, such as detection or statistical comparison, and the matching heuristic, such as MULTIPLE or SINGLE-BEST.

For example, the object evaluation listed in the table above is designed to match target descriptors to candidates that overlap somewhat in time and whose maximum deviation distance for the AttributeAlpha attribute is less than the default threshold for that data type. The evaluation first matches all Descriptor1 candidate-target pairs that overlap in time with a dice distance of less than or equal to .99; this will be most pairs that share a few frames. ViPER-PE then checks each frame of each match; for each frame where any attribute (in this case AttributeAlpha) does not fall within its threshold for the given metric, the frame counts as a miss, and the dice metric is evaluated again on the two frame spans to determine the validity of the match. The third level, statistical comparison, takes the statistic set in the stat_metric property over all frames of each remaining pair and compares the average of this statistic across all attributes to the stat_tol property. The matching is then performed as set in the target_match property: if set to SINGLE-GREEDY or SINGLE-BEST, some pairings are dropped, and if set to MULTIPLE, some pairings are merged and pairings deemed unnecessary are dropped. Precision and recall are then calculated using the remaining pairings.

Framewise Evaluation

In a framewise evaluation, the metrics follow each attribute line of the descriptor you wish to evaluate. Unlike Tracking or Object evaluations, there are no metrics for the frame span. A FRAMEWISE_EVALUATION takes three kinds of metrics: normal metrics, which just return a distance; localizers, in the form [metric threshold], which return a precision or recall; and don't-care matchers, in the form <metric threshold>, which mark a candidate as don't care if it matches any target that the ground truth output filters have marked as don't care.

The distance metrics vary for each attribute type. These include dice, areaprecision, and arearecall for shape attributes, as well as edit distances for string attributes. For each frame, one number is returned for each of these metrics. This number is either an average or a sum over the descriptors in the frame, depending on the definition of the metric. For example, the "matched" metric returns the sum of all matched pixels in a frame, while "fragmentation" compares each target shape to all candidate shapes and returns an average over all targets in the frame.

Localizers are given as [metric threshold] pairs in square brackets. These work by taking a distance from each candidate to the targets, or vice versa, and counting the match as successful if the distance is above or beneath the threshold, depending on whether the measure is actually a distance, like dice, or a similarity measure, like areaprecision or arearecall. The number returned is the ratio of correctly matched objects to the number of possible matches.

The third possibility, in angle brackets, is used for ignore filtering, which is described more fully in a later section. The basic idea is that certain target descriptors may be marked as "don't care". This is fine for distances, but for localizers and the three constant metrics (object count accuracy, precision and recall) there needs to be some way to determine that the whole descriptor, and not just some or all of its value, should be ignored. Viper-pe evaluates the metric for all candidates against each ignored target; candidates whose distances or similarity measures fall below or above the threshold for some specific ignored target are themselves ignored. Note: currently only one filter metric is accepted per line.

The first thing that occurs is the marking of regions as don't care. All target and candidate regions that are marked as ignored by the OUTPUT_FILTERs are marked as don't care, as are all candidates that return an area precision of greater than .6 for a target that is to be ignored. Object count accuracy, precision, and recall are determined from the count of all descriptors that are not ignored. Then a dice coefficient is computed between the set of all candidate pixels and the set of all target pixels, less those marked as ignored. Each non-ignored target counts as localized by the first localizer if more than 60% of its region is recalled by the union of candidate boxes; the ratio of these targets to all possible targets is the localized area recall. Each non-ignored candidate counts as localized by the second localizer if it is precise enough, that is, if more than sixty percent of it (not including ignored regions) is covered by target boxes; the ratio of localized boxes to all boxes is the result of the third metric in the sample block.

Tracking Evaluation

Tracking Evaluation has a complicated format as well, as it must take into account which attributes to mark as the key attributes, as well as how to handle the fallback case when no key is available. It contains all the elements of the object evaluation section; in this case, the object evaluation is used in the fallback situation. In addition, it accepts lists of metrics, which are used for the tracking evaluation itself. Finally, the key attribute is marked with an asterisk before it; it does not need any metric information after it.

Filtering

Often when using a given set of ground truth, it is preferable to only examine a subset of the data. For example, it is often instructive to compare how well an algorithm performs on text above a certain size, or objects that are not occluded. The Evaluation section specifies the metrics to use and which descriptor types to evaluate, and the Filter sections provide precise control of which descriptors to evaluate.

The Filter sections are divided two ways: into input and output filters, and into candidate and target filters. Input filters prevent some descriptors from being read into ViPER-PE; as far as ViPER-PE is concerned, descriptors that do not match an input filter (if one is given for that descriptor type) do not exist. The output filters mark descriptors that, along with any descriptors matching them, are not counted in the output; they are also called Don't Care filters. Each attribute type has a different set of possible filters. For a complete list, see the appendix.

Input filtering makes it appear to the evaluation program that the specified descriptors are simply not included in the data. Since the evaluation only occurs on descriptors that you specify in the EPF file, leaving them out of the EPF will skip them as well. The rule system was developed to skip items based on their static attributes. If an attribute is dynamic, it counts as passing if any of the values it takes passes the filter. A descriptor passes a filter if each of its attributes passes. The frame span can also be filtered, as demonstrated in Figure 4.
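As a sketch, a ground truth input filter might look like the following; the descriptor name Text and the attribute HEIGHT are hypothetical, and the layout assumes the filter rules follow each attribute name after a colon, as in the evaluation section.

    #BEGIN_GROUND_FILTER
    // Only read Text targets that are at least 12 pixels tall.
    OBJECT Text
        HEIGHT : >= 12
    #END_GROUND_FILTER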

Since input filtering simply prevents descriptors from being included in the performance evaluation, it can result in odd evaluations. Output filters provide a more intelligent, if more processor and memory intensive, approach. Instead of filtering descriptors during parsing, the output filter acts after the computations have been completed. For Object Analysis, matchings that involve a filtered object are not counted towards the precision and recall, and are not printed out as detected. In Pixel Analysis and Tracking Analysis, these objects are treated as "Don't Care" objects; the regions they cover are not included in many of the pixel metrics. For a complete understanding of the "Don't Care" methodology, and how to choose the don't-care threshold, see [Mariano].

Appendix

Terminology

Candidate data
Also known as result data, this is the set of descriptors generated by some algorithm that will be compared to the target data.
Detection
An object is classified as Detected if an object of its type is found on the appropriate frames. For example, if you want to retrieve all frames containing faces, you may find that Detection is the only required depth of analysis.
Results data
See Candidate data.
Target data
Also known as truth data, this is the set of descriptors that represent the true content of the media file.
Truth data
See Target data.

Property Names and Command Line Arguments

Frequently Used Properties

=================== ================== =====
Command Line Switch Property Name      Value
=================== ================== =====
-g                  gt_file            The file name of the truth file (the file 
                                       containing the target data set).
-gc                 gtconfig_file      **Deprecated** For use with the old gtf 
                                       file types, this specifies the schema file
                                       for use with the ground truth data file. If
                                       no schema is associated with the candidate file,
                                       this is used for that file, as well.
-r                  results_file       The file name of the results file (the file 
                                       containing the candidate data set).
-rc                 resultsconfig_file **Deprecated** For use with the old gtf 
                                       file types, this specifies the schema file
                                       for use with the candidate data file.
-epf                epf_file           The evaluation parameters file, including 
                                       equivalency, evaluation, and filter 
                                       information.
-o                  output_file        Where to print the human readable output 
                                       data. Defaults to standard output. Set to 
                                       `-` for standard output, and the empty 
                                       string for none.
-raw                raw_file           The file to receive the raw data output. 
                                       Defaults to none. Set to `-` for standard 
                                       output, and the empty string for none.
-b                  base               The base file name for all the other files. 
                                       Sets all of the above to this stem, except 
                                       for the configuration files. The file suffixes 
                                       are .gtf, .rdf, .epf, and .out. For example, 
                                       setting -b temp will load the target data 
                                       from temp.gtf, candidate data from temp.rdf, 
                                       set the evaluation parameters from temp.epf, 
                                       and output data to temp.out.
-L                  level              For object analyses, sets the type of 
                                       comparison to do. See the object analysis 
                                       section for a complete description of the 
                                       following options:
                                       
                                       1 = Detection: The objects overlap temporally 
                                       better than some threshold. (See range_tol, 
                                       rmetric_default)
                                       
                                       2 = Localization: Performs detection using 
                                       only the frames whose attributes meet necessary 
                                       thresholds.
                                       
                                       3 = Statistical Comparison: The average/ 
                                       median/max/min, depending on which was 
                                       selected, meets a certain threshold.
-P<propertyname>                       Specify any property by its long name.
-pr                                    The file name of the properties file.
                    verbose            Specifies longer form of output
                    target_match       If using an object evaluation in your EPF,
                                       this parameter specifies the methods of 
                                       object analysis: ALL, SINGLE, SINGLE-BEST, 
                                       or MULTIPLE.
                    attrib_width       Specifies the number of columns in the output 
                                       file before it will clip the attribute 
                                       information.
                    range_metric       The default distance metric to use for the 
                                       frame overlap test.
                    range_tol          Comparisons that indicate the distance 
                                       between two descriptor's frame span is 
                                       greater than this will be dropped.
                    level3_metric      The statistic to use when doing statistical 
                                       localization.
                    level3_tol         The tolerance to place on the above measure 
                                       for each attribute for descriptor 
                                       target/candidate pairs to count as localized.
                    <data type>_tol    The default tolerance for a given attribute 
                                       type for object localization. For a list 
                                       of metrics for each type, see the table 
                                       in the Object Analysis section.
                    <data type>_metric The default metric type to compute 
                                       attribute distance during object analysis.
=================== ================== =====

Filter Types

===================== ===================================== ========================
Rule                  Description                           Applicable Attributes
===================== ===================================== ========================
&&                    And                                   Any other rule
||                    Or                                    Any other rule
==                    Equivalence: The attribute must be    All
                      equal to the given value.
!=                    Non-equivalence: The attribute must   All
                      not equal the specified value.
>, >=, <=, <          Relational values. The attribute is   dvalue, fvalue, svalue
                      either greater than, greater than or  (strings are put in
                      equal to, less than or equal to, or   lexicographic order)
                      strictly less than the specified
                      value.
contains              The rule value is completely covered  Set values, e.g. frame
                      by the attribute value.                span or polygons
intersects            The rule value and the attribute      Set values.
                      value share at least one pixel,
                      frame, element, etc.
excludes              The rule value and the attribute      Set values.
                      value share no frames/pixels/
                      elements.
===================== ===================================== ========================

Metrics

========== ======================= ============================================================ ==========================================
Metric     Attributes              Definition                                                   Formula
========== ======================= ============================================================ ==========================================
E          All attribute types     Equivalence                                                  0 if target equals candidate, 1 otherwise.
dice       Frame span, shape types Twice the shared area, divided by the sum of the two areas.  1 - 2*sz(T^C)/(sz(T)+sz(C))
overlap    Frame span, shape types The fraction of the target that the candidate overlaps.      1 - sz(T^C)/sz(T)
maxdev     Frame span, shape types The maximum of either the candidate or target deviation.     Maximum( sz(C-T)/sz(C), sz(T-C)/sz(T) )
L          String value            Normalized Levenshtein (edit) distance                       For E = edit distance, 1 - exp(-alpha * E)
H          String value            Hamming distance                                             1 if the lengths differ; otherwise D/L, for D = number of differing characters and L = length
euclidean  Point                   Normalized Euclidean distance                                For E = Euclidean distance, 1 - exp(-alpha * E)
manhattan  Point                   Normalized Manhattan distance                                For M = Manhattan distance, 1 - exp(-alpha * M)
difference Numerics                Normalized difference                                        For D = abs(C - T), 1 - exp(-alpha * D)
========== ======================= ============================================================ ==========================================

Here sz() denotes the size of a region or frame span, T^C denotes the intersection of the target and candidate, and alpha is a normalization factor.
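As a worked example with made-up numbers: for a target frame span of 100 frames and a candidate span of 80 frames that share 60 frames, the dice distance is 1 - 2*60/(100+80) = 0.33, the overlap distance is 1 - 60/100 = 0.40, and the maximum deviation is Maximum( 20/80, 40/100 ) = 0.40.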

File Formats

The video data file formats are described in the `ViPER-GT Manual`_.

Properties File

The properties file is simply a list of <property> = <value> pairs, with optional comments, which run from a "#" to the end of the line.
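For instance, a minimal properties file might look like the following sketch; the file names reuse those from the earlier command line example, and the level and target_match values are only illustrative.

    # Minimal viper-pe properties file (a sketch).
    gt_file = all.gtf.xml
    results_file = UMD/RDF/all.rdf.xml
    epf_file = dice-graphic.epf
    output_file = dice-graphic.out
    raw_file = dice-graphic.raw
    level = 3
    target_match = SINGLE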

Evaluation Parameters File

The Evaluation Parameters file defines three items:

Equivalencies:
Marked with #BEGIN_EQUIVALENCIES and #END_EQUIVALENCIES, the equivalency section defines the mapping between names in the ground truth file and names in the candidate data file. Each line is of the form <truth name> : <candidate name>.
Filters:
The input filters determine what data is parsed, while the output filters determine which data is used in the final calculations. The four filter sections are delimited with #BEGIN_ and #END_ markers as well, in this case GROUND_FILTER and RESULT_FILTER for the truth and candidate input filters, respectively, and GROUND_OUTPUT_FILTER and RESULT_OUTPUT_FILTER for the truth and candidate output filters. Inside the section delimiters, the filters are given as a list of descriptors, with a colon and the filter rules following each line.
Evaluation:
The Evaluation section selects which descriptors and attributes to examine. Each block is a descriptor. Descriptors and attributes that are not listed are not evaluated. For object analysis, include [metric tolerance] information after each attribute. The frame range metric and tolerance are given on the main descriptor line.

Note that C++ style (//) comments are acceptable.
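Putting these pieces together, a small EPF might look like the following sketch. The descriptor and attribute names (Face, DetectedFace, BBOX, HEIGHT) are hypothetical, and the plain EVALUATION marker is an assumption based on the section names given above; the object, framewise, and tracking variants may use more specific marker names.

    #BEGIN_EQUIVALENCIES
    Face : DetectedFace
    #END_EQUIVALENCIES

    #BEGIN_GROUND_OUTPUT_FILTER
    // Treat very small faces as "don't care".
    OBJECT Face
        HEIGHT : < 16
    #END_GROUND_OUTPUT_FILTER

    #BEGIN_EVALUATION
    OBJECT Face [dice -]
        BBOX : [dice .5]
    #END_EVALUATION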

Human Readable Output File

The human readable output file is divided into two main sections: it first describes the parameters for the evaluation, then displays the results. The "INPUT PARAMETERS" section lists various properties. The four "FILTER" sections display what filters, if any, were used. Finally, the "METRICS" section lists the descriptors and attributes to be evaluated, including their metrics and tolerances. The input parameters establish how the file was generated, and repeating them is also useful for troubleshooting possible errors in the file format or in how the evaluation parameters were set.

For Object Analysis, the rest of the file describes which level each FALSE or MISSED descriptor reached, and then lists the detection matches. If a descriptor is counted as false or missed at level 0, no other object of that type occurs in the given file. If a descriptor falls out at level 1, it did not meet the frame range metric and tolerance restrictions for any object of the same type. Level 2 indicates localization failures, 3 indicates statistical failures, and 3c indicates descriptors that matched an object, but not as well as another descriptor did, as determined by the chosen target_match method.

The Pixelwise output gives a frame-by-frame commentary, listing several properties for each frame. Briefly, these are the number of pixels matched in the frame, the number missed by the candidate set, the number the algorithm detected mistakenly, a pixel accuracy and an object accuracy, a fragmentation measure, and object, average box, and localized box precision and recall.

The Tracking metric lists results for each truth object, listed by input file and id number. These numbers are temporal precision and recall, positional accuracy, size accuracy, and angle difference.

It should be noted that all three types of evaluation end with a summary of some sort. Object Analysis returns precision/recall data, Pixelwise gives a total, and Tracking gives various coefficients. If these are not printed, then there is a flaw in the code, a problem with the input parameters or data, or the computation requires too much memory to finish.

For examples, see the case studies.

Raw Output File

The raw file is a list of space- and line-delimited numbers describing the results of an evaluation. Each raw file also includes information about how the evaluation was performed. Like the Evaluation Parameters file and the old Ground Truth file formats, the raw file is divided into sections by #BEGIN_<SECTION> and #END_<SECTION> markers. It may include C++ style // comments.

PARAMETERS:
A set of <property> = <value> pairs. Currently, this includes config_file, gt_file, result_file, epf_file, log_file, output_file, and level.
GROUND_FILTER, RESULT_FILTER:
The input target and candidate filter settings, respectively.
METRICS:

A list, basically corresponding to the EVALUATIONS section of the EPF file. It is of the form:

<Descriptor> <metric> <tol> \n *( \* <attrname> <metric> <tol> \n)
GTF_INFORMATION, RDF_INFORMATION:
Data concerning the target file and the candidate file, expressed as name = value pairs. Currently NUMFRAMES and NUMFILES are the only two values in both.

<RESULTS TYPE>

RESULTS:
Object Analysis results. Each line is of the format <Type> <ID> (MISSED|FALSE|DETECT) <level> <Detected Comparison Values>. The <Type> is the descriptor type specified in the ground truth file. For MISSED and FALSE, <level> indicates the level at which the descriptor was marked as incorrect; for example, a value of 3 indicates that the descriptor did not pass statistical localization. All DETECT lines should indicate the same level. The <Detected Comparison Values> take the form <frame span distance between the target and the union of candidates> <number of candidates (1 if target_match is set to anything other than ALL)>, followed by one or more bracketed groups of the form [<cand id(s)> <frame range distance> <attribute distances>]. If ALL or DEFAULT is the target_match, there can be several bracketed matches. If MULTIPLE is the specified match, instead of single object ID numbers there may be bracketed, comma delimited lists of ID numbers; note that in this case, <number of candidates> is set to one.
SUMMARY:
Also included in Object Analysis, the Summary section lists the result by descriptor type, followed by a sum. It is in the form <Descriptor or TOTAL> <# of targets> <# of candidates> <precision> <recall>.
FRAMEWISE_RESULTS:
Each line contains the frame number, followed by the pixelwise evaluation for the frame. The last line starts with the string "Total", then includes the sum over the whole file. <ID or "Total"> <Matched Pix> <Missed Pix> <False Pix> <Object Count Accuracy> <Average Fragmentation> <Average Object Recall> <Average Object Precision> <Localized Object Recall> <Localized Object Precision> <Object Count Recall> <Object Count Precision>
TRACKING_RESULTS:
Each line contains the tracking result for a single object, except for the last line, which contains the total coefficients. <ID or "Total"> <Temporal Precision> <Temporal Recall> <Positional Accuracy> <Size Coefficient> <Orientation Coefficient>

Tools

gtf2xml, gtf2gtf, xml2gtf, xml2xml:
Takes a ground truth file from the input stream, or a list of files passed as command line arguments, and converts them to the specified output type. The gtf2gtf and xml2xml tools serve two purposes: to clean up files that contain improper formatting, and to combine multiple files into a single file. To merge multiple files in the older GTF format, make sure that they contain the FILE Information descriptor, described in the Ground Truth File Format section of the ViPER-GT manual. These tools also take two optional arguments, -split and -clip. Splitting divides objects that are not contiguous into separate objects, while clipping returns the first frame of every object. Both are useful in evaluation circumstances: -split for evaluating algorithms that do not support tracking across occlusions, and -clip for generating first-frame starting points for TRACKING evaluations (see the sketch after this list).
viper-pe:
The command to run a performance evaluation; its options are described in a separate section.
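For example, a first-frame data set for a tracking evaluation might be produced with a command along the lines of the sketch below; the file names are hypothetical, and the assumption that the converted data is written to standard output may not hold for your install.

    gtf2gtf -clip all.gtf > firstframe.gtf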
