LAMP     The Language and Media Processing Laboratory

ViPER Case Studies

When, and How, We've Used It


This document describes a few actual and imagined use cases for the ViPER set of video understanding performance evaluation tools.

Evaluating Text Detectors and Trackers

Text Detection in Video

One of the first sets of data evaluated for the VACE project was a set of frame grabs taken from various video streams. Several algorithms that attempt to detect text in video were developed by different teams [1], some with different parameters. One of the goals for evaluation was to determine the individual benefits of each algorithm. For example, we wanted to discover which algorithms perform well on text that can be automatically recognized while ignoring obscured or small text. We also wanted to examine how the algorithms differ between text that occurs within the scene, like street signs, and graphic overlay text, like names in a news broadcast.

[1] SRI, University of Maryland (Huiping Li), PSU

Example of Text Detection Output

It should also be noted that text detection is different from text recognition. Software that reads text in a video will usually follow three steps: locate the text, then binarize it, and finally pass it to an OCR package. The first step, the detection of text in a given region of the image, is what we wanted to evaluate. In order to evaluate it, we needed videos with text, and annotation describing where in the video the text occurs: the ground truth.

Ground Truth

The first step to constructing good ground truth is to design a descriptor schema. A proper schema should describe the relevant data in a way that is human understandable, evaluable, and extensible. Ground truth data for long media files is tedious to create, so it is best to have complete, accurate data from the beginning. Beyond the ViPER descriptor schema, there must be agreement about what the different attributes and descriptors contained therein mean, and a document describing how to apply that meaning when judging the quality of the annotation. ViPER-GT provides the tools to create ground truth, but it knows nothing about what the ground truth means. For each set of ground truth we create, we also create a set of rules for the ground truth authors. This improves coherency between authors and gives the authors an authority to turn to when deciding how to mark up an edge case. For our text data set, this included information such as how to select the quality level, at what granularity to place bounding boxes, and how to mark up obscured text.

The first obvious attribute for our system was Location. Specified as an oriented box (obox), it allows evaluation based on the shape and position of the boxes of text that will be fed to a recognizer. The other obvious attribute is the text itself, a string (svalue). This allows goal-based performance evaluation in addition to tests on the box. However, that is not enough. A quality level, and an indication of whether the text is overlaid on the scene or occurs within it, help to evaluate based on quality, and also answer the question of what to put as the string for illegible text (since illegible text is marked as zero quality, its strings can be ignored during evaluation). Because it may impact many algorithms, we also included the intensity of the text as an lvalue, either LIGHT or DARK.
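For concreteness, here is a filled-in record under this schema in the flat ViPER data format. The values are taken from the graphic "5:22" overlay that also appears in the evaluation output excerpted below; the five LOCATION numbers give the oriented box's position, size, and rotation.

```
OBJECT Text 200 1:1
    TYPE : "GRAPHIC"
    INTENSITY : "LIGHT"
    READABILITY : "3"
    REGION : "LINE"
    LOCATION : "31 189 33 13 0"
    CONTENTS : "5:22"
    AREA : "429"
    NCHARS : "4"
```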

It should be noted that since the data was a collection of individual frames, we decided to use all static attributes. This helps prevent a person creating or editing ground truth from accidentally having the same box span multiple key frames. It does not prevent it entirely, as it is not uncommon to have a logo overlay the same position on many frames taken from the same video stream. For the data that was video, we opted to leave all of the attributes static except for Location.

To verify the correctness of the ground truth, we used the Overlay script to draw the ground truth boxes on the key frames, printed them out, bound them, and distributed the printed frames for verification against the ground truth files.

  FILE Information
    SOURCEDIR : svalue [static]
    SOURCEFILES : svalue [static]
    SOURCETYPE : lvalue [ FRAMES SEQUENCE ] [static]
    NUMFRAMES : dvalue [static]

  OBJECT Text
    TYPE : lvalue [static] [ SCENE GRAPHIC ]
    INTENSITY : lvalue [static] [ LIGHT DARK ]
    READABILITY : lvalue [static] [ 3 2 1 0 5 4 ]
    REGION : lvalue [static] [ BLOCK LINE ]
    LOCATION : obox [static]
    CONTENTS : svalue [static]
    AREA : dvalue [static]
    NCHARS : dvalue [static]


Figure 1: Ground Truth Configuration

For detailed information about how to use the ViPER Ground Truthing tool, please see the ViPER-GT manual.

We also distributed to the ground truth authors a text file that contained a set of rules explaining how to resolve conflicts and interpret the meaning of the different attributes.

This project involves the ground truthing of text regions in video. This document summarizes the process and common questions for ground truthing text in individual frames specifically; information on tracking is added at the end.

In attempting to summarize some of the issues we ran into, and hence some of the roots of inconsistency in the ground truthing process, we identify some basic questions that should be resolved and adhered to when evaluating text detection algorithms.


1) What required and optional attributes of ground-truthed text need to be entered into the object table?

  • Type (scene or graphic)
  • Intensity (light or dark)
  • Readability (0-5)
  • Region (block, line, or word; default LINE)
  • Location (oriented box)
  • Contents (text in ground-truthed box)

Automatic (generated from the rest of the GT):
  • AREA
  • NChars


2) How small does text have to be before it is passed over for ground-truthing?

Answer: We are currently attempting to ground truth any text block that is reliably identifiable as text by a human observer, even if it is not readable. Such text will be assigned readability 0. If a human cannot identify at least the text line, then it does not need to be identified.

In general, if the line is discernible, ground truth to the line level.


3) What do you write in the content field for ground-truthed text that is not readable or understandable when it is read (i.e. a foreign language)?

Answer: Currently it is left blank, and it should remain this way.

Other options: #FOREIGN or #UNREADABLE or ???

ex. frame 3 --- books

Note: It is very important that the CONTENTS field be as accurate as possible. All characters that are readable should be included in the contents field; specify only those characters that are readable, with no extra characters.

When performing evaluation, we will likely only evaluate data with quality 2 or greater.


4) What is the granularity at which to ground truth? If a frame is covered with mostly text, should you ground-truth every word, every line, or just a few big chunks?

Answer: Currently inconsistent. Ground truthing should be done to the line level for any text where we can enter content or identify the line. For some small blocks of readability 0, with no distinct lines, a block is acceptable. Orientation must be used when possible.

  • If text is separated by one or more spaces, this is usually where separate boxes should be created.
  • If you have a scoreboard, or a table of some sort, it can be difficult to ground truth all of the characters.
  • However, for more accurate results, the coder should try to break the text down as much as possible.
  • If the text is small (e.g. some of the text on a scoreboard, or in a scene) and not totally distinguishable, it is okay to group text, even if there is space between.


5) How is light text distinguished from dark text?

Answer: If the background is darker than the foreground text, the text is light. If the background is lighter than the foreground text, the text is dark.
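This rule can be sketched as a small check. This is illustrative only: the actual ground truth intensities were judged by eye, and the function and its inputs are hypothetical.

```python
def text_intensity(pixels, mask):
    """Classify text as LIGHT or DARK per the rule above.

    `pixels` is a flat list of grayscale values for the region, and
    `mask` a parallel list of booleans marking foreground (text)
    pixels.  If the foreground is brighter on average than the
    background, the text is LIGHT; otherwise it is DARK.
    """
    fg = [p for p, m in zip(pixels, mask) if m]
    bg = [p for p, m in zip(pixels, mask) if not m]
    return "LIGHT" if sum(fg) / len(fg) > sum(bg) / len(bg) else "DARK"

# Bright glyph pixels (240) on a dark background (20) -> LIGHT
print(text_intensity([20, 240, 240, 20], [False, True, True, False]))
```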


6) How is scene text distinguished from graphic text?

Answer: Scene text is anything in the background of what is actually being filmed, even if it is generated with computer-graphics.

For example:
  • words on the bottle of water that someone is drinking
  • words on a poster that is within the picture
  • words on a sign that is part of a picture
  • words on a candy bar wrapper that is part of the picture

Graphic text is anything overlaid into the picture. Graphic text does not move with the scene.

For example:
  • television channel numbers
  • television captions
  • sports score updates


7) What are the general distinguishing factors between the different levels of readability?


0: Humans can only tell it is text from context clues. No chance of reading it in the absence of context. ex. Obj ID 28, frame 7, books
1: Slightly difficult to understand, but can understand some of the text. ex. Obj ID 22, frame 6, books
2: The entire text can be read and understood. Edges of the text are not sharp and distinct from its background. May include some scene text. Presents significant challenges for automated recognition. ex. Obj ID 27, frame 7, books
3: Easily human readable text. Much sharper and clearer version of Readability 2 type texts. Some hope of OCR with enhancement. ex. Obj ID 50, frame 16, books
4: Clearly readable text. Good separation between background and text color. Usually graphic text because of clarity and sharpness. OCRable with some basic enhancement ex. Obj ID 63, frame 25, books
5: Perfect, large text; easily segmented and read with an OCR system.

We can provide some examples.


8) At what granularity do we ground truth? (Block, Line, Word, Character?)

Unfortunately, this factor changed during the process and is a significant source of inconsistency. We have been saying that if the block is spatially compact, it can be a single block. However, it appears we should add that if properties of words or phrases in the block differ, the block should be split. For example, if size, font, or intensity differs, they should be considered different blocks.

Special Note: The most common and frequent choice will be "LINE"


Suggestions: Provide examples of each type of text as a reference.



When verifying the data, here are some things to look for. If you find additional issues, please add them to this list.

  1. Missing Boxes - All text should be identified at the block or line level
  2. Wrong Contents
  3. Inconsistent or missing quality (see examples)
  4. More than a single frame ground truthed in one record (look for strange frame ranges)
  5. Uncertain text type

The rules are useful both for the annotators and for those involved with the evaluation.

  FILE Information
    SOURCEDIR : svalue [static]
    SOURCEFILES : svalue [static]

  OBJECT TextBlock
    BBOX : bbox [static]


  OBJECT TextBlock 0 1:1
    BBOX : "212 36 79 51"

  OBJECT TextBlock 1 1:1
    BBOX : "52 56 119 55"

Figure 2: UMD Algorithm Candidate Data Sample


For each algorithm, we generated result data. Each result file required a different Evaluation Parameters file, as each algorithm has a different configuration, requiring a different EQUIVALENCIES section. Since this is the only difference, it would also be possible to set this section using variable replacement in the RunEvaluation script included with ViPER-Viz. The EVALUATION section only names the Location attribute of the Text descriptor. The GROUND_OUTPUT_FILTER section varies between runs to select different types of text: graphic overlay versus scene text, and high quality versus low quality text. ViPER-PE itself has no method of ensuring that an evaluation which returns a good result is semantically valid; it only makes it possible for people to check.

The first set of experiments used Object Analysis. The metrics and tolerances must be set for this type of evaluation. Since the data consists only of single frames, the frame span has no impact, and so is left at the defaults. For the boxes, we used a dice metric with a tolerance of .99. Since text often suffers from splits and merges, we used a MULTIPLE target match type. For PIXELWISE evaluation, we left the default values.
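As a rough sketch of the box metric, the dice coefficient is twice the intersection area over the sum of the two box areas; a candidate box counts as a match only if its score clears the tolerance. The function below handles only the unrotated case (ViPER's obox metric also accounts for rotation):

```python
def dice(a, b):
    """Dice coefficient of two axis-aligned boxes given as (x, y, w, h).

    dice = 2 * |A intersect B| / (|A| + |B|): 1.0 for identical boxes,
    0.0 for disjoint ones.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    return 2.0 * ix * iy / (aw * ah + bw * bh)

# Two boxes offset by half their width overlap in half their area:
print(dice((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.5, well below a .99 tolerance
```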

  OBJECT Text [- -]
    LOCATION : [dice .99]

  TextBlock : Text


Figure 3: Evaluation Parameters File for GRAPHIC Text with UMD Equivalency section

*                   INPUT PARAMETERS                *

  Properties: /fs/lamp/Projects/VACE/Evaluations/Text/DryRun/Properties/textdetect.pr
Ground Truth: /fs/lamp/Projects/VACE/Evaluations/Text/DryRun/GTF/all.gtf.xml
     Results: /fs/lamp/Projects/VACE/Evaluations/Text/DryRun/RUNS/UMD/RDF/all.rdf.xml
 Eval Params: /fs/lamp/Projects/VACE/Evaluations/Text/DryRun/RUNS/UMD/EPF/dice-graphic.epf
         Log: -
      Output: dice-graphic.out

       Level: 3
       Match: MULTIPLE      
      metric: median
   tolerance: 0.99
   dont_care: 1.0
   pix_local: 0.6

*              GROUND OUTPUT FILTER                *
        TYPE with rule: == "GRAPHIC"
*                       METRICS                    *

OBJECT Text     dice    0.8
        LOCATION obox   dice    0.9900000095367432

*                      LEVEL 0                      *
*                  FALSE DETECTIONS                 *
OBJECT TextBlock 85 1:1
               BBOX : "224 0 39 15"

*                     DETECTION(S)                  *
OBJECT Text 200 1:1
               TYPE : "GRAPHIC"
          INTENSITY : "LIGHT"
        READABILITY : "3"
             REGION : "LINE"
           LOCATION : "31 189 33 13 0"
           CONTENTS : "5:22"
               AREA : "429"
             NCHARS : "4"

        OBJECT TextBlock 217 1:1
                       BBOX : "32 188 31 15"
                        DISTANCE(S): 0.09843400447427297
                        AVERAGE: 0.09843400447427297
        DISTANCE: 0.0, OVERLAP: 0.0, DICE = 0.0, EXTENT = 0.0, (START: 0, END: 0)

*                  SUMMARY RESULTS                 *

For Text: Precision is 29 %  (63/211)
For Text: Recall is 71 %  (93/130)

For TOTAL: Precision is 29 %  (63/211)
For TOTAL: Recall is 71 %  (93/130)

Figure 4: Selections from an Object Analysis Output File
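The summary numbers are simple ratios over the matched boxes: precision is the fraction of result boxes that matched ground truth, and recall is the fraction of ground truth boxes that were detected. A sketch using the counts from the run above:

```python
def precision_recall(matched_results, total_results, matched_truth, total_truth):
    """Precision: fraction of result boxes that match ground truth.
    Recall: fraction of ground truth boxes that were detected."""
    return matched_results / total_results, matched_truth / total_truth

# Counts from the summary above: 63 of 211 result boxes matched,
# and 93 of 130 ground truth boxes were detected.
p, r = precision_recall(63, 211, 93, 130)
print(round(p, 3), round(r, 3))  # 0.299 0.715
```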

We divided the data into a directory structure, making it easy to write a shell script to execute the complete set of evaluations. For information about how to create these scripts, please see the ViPER-Viz document.
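As an illustration of that directory structure, the sketch below enumerates one Evaluation Parameters file per algorithm and filter, mirroring the paths shown in the input parameters above. The team names come from the footnote earlier; the "dice-scene" variant is hypothetical, and the actual ViPER-PE invocation is omitted (see the ViPER-Viz document).

```python
# Hypothetical layout: one EPF directory per algorithm under RUNS/.
base = "/fs/lamp/Projects/VACE/Evaluations/Text/DryRun"
runs = ["UMD", "SRI", "PSU"]              # algorithm teams (from footnote [1])
filters = ["dice-graphic", "dice-scene"]  # hypothetical filter variants
jobs = [f"{base}/RUNS/{run}/EPF/{f}.epf" for run in runs for f in filters]
for job in jobs:
    print(job)  # a wrapper script would run one evaluation per line
```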


Figure 5: The x-axis is each target object that was detected, sorted by correctness of the detection. [TODO: should be normalized by object count]