
Blog for work on my Master's thesis - a survey of methods for evaluating media understanding, object detection, and pattern matching algorithms. Mostly, it relates to ViPER, the Video Performance Evaluation Resource. If you find a good reference, or would like to comment, e-mail viper at cfar.umd.edu.


Media Processing Evaluation Weblog

Wednesday, August 27, 2003

Keybinding

I'm currently redoing (doing?) decent keybinding support for the apploader. Previously, I just used the menu accelerators. Java 1.3's method for keybinding is pretty complicated, but it seems to offer what I need. What I'd really like is a way to replace it completely, and I may yet do so. First I will mirror it and try to work with it; this should be faster, although it will preclude changing keybindings during a run, at least without rebuilding all the keybindings.
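
For reference, here's a minimal sketch of the Swing InputMap/ActionMap mechanism that Java 1.3 introduced; the component and action names are just illustrative, not the apploader's actual code.

import javax.swing.*;
import java.awt.event.ActionEvent;

public class KeybindingSketch {
    // Bind Ctrl+G to a 'nextFrame' action on the component, active whenever
    // the component's window has focus (not just the component itself).
    public static void bindNextFrame(JComponent c, final Runnable callback) {
        KeyStroke key = KeyStroke.getKeyStroke("control G");
        c.getInputMap(JComponent.WHEN_IN_FOCUSED_WINDOW).put(key, "nextFrame");
        c.getActionMap().put("nextFrame", new AbstractAction() {
            public void actionPerformed(ActionEvent e) {
                callback.run();
            }
        });
    }
}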

Icons

I added support for icons to the apploader yesterday. This means that the java logo in the upper-left corner of the application is a thing of the past. For now, I just replaced it with a variant of the lamp logo, the logo I've been using for the web site. I hope to make a decent icon for ViPER. I've made a couple, but they don't really look like good icons. With luck, I'll have an idea for a good one soon.

Tuesday, August 26, 2003

Okay, I did some testing. It seems that just using the video stream's seek (frame #) method to iterate through all frames, without converting them to RGB Java images, takes a little longer than 4x real time. (I tested it on a short video from the lab and on 'Duck and Cover' from the Prelinger archives on my 1.7 GHz Xeon. This is a computer that is about 50x faster than what should be required by the spec, but only about 8x faster than what is required by some existing implementations.) Dave seems to think Jonathan may have another way of seeking through a file (other than repeated calls to seek) that will be faster.
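
The timing loop itself was nothing fancy; roughly the following sketch, where VideoStream is a stand-in for the JMPEG stream interface (not its real name), so treat the method names as hypothetical.

// Hypothetical benchmark: walk every frame via seek() and report the ratio
// of decoding time to the clip's real duration.
interface VideoStream {
    int getNumFrames();
    double getFrameRate();
    void seek(int frame); // decode up to this frame; no RGB conversion
}

public class SeekBenchmark {
    public static double multipleOfRealTime(VideoStream stream) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < stream.getNumFrames(); i++) {
            stream.seek(i);
        }
        double elapsed = (System.currentTimeMillis() - start) / 1000.0;
        double realTime = stream.getNumFrames() / stream.getFrameRate();
        return elapsed / realTime; // e.g. 4.0 means four times real time
    }
}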

Monday, August 25, 2003

More Optimizing Notes

JUnitPerf

I think I need to add performance unit testing to the JMPEG code. This should help me get statistics on the problem, and maybe even fix it, eventually. The tool that I've found to do this, JUnitPerf, seems to have what I need. I'm going to have to check some MPEG files into the source tree - I suppose I should use some material from the Internet Archive.
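
If I remember the JUnitPerf API correctly, wrapping an existing JUnit test in a time limit looks something like the sketch below; the decode test case here is a placeholder, not actual JMPEG code.

import junit.framework.Test;
import junit.framework.TestCase;
import junit.framework.TestSuite;
import com.clarkware.junitperf.TimedTest;

public class JMpegPerfSuite {
    // Placeholder standing in for a real JMPEG decode test.
    public static class DecodeOneFrameTest extends TestCase {
        public DecodeOneFrameTest(String name) { super(name); }
        public void testDecodeOneFrame() {
            // decode a frame of a checked-in MPEG here; the perf wrapper
            // only cares how long this method takes
        }
    }

    public static Test suite() {
        TestSuite suite = new TestSuite();
        // Fail the decode test if it takes longer than two seconds.
        suite.addTest(new TimedTest(new DecodeOneFrameTest("testDecodeOneFrame"), 2000));
        return suite;
    }
}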

Native Code

As mentioned earlier, the new version of ViPER is too slow to be usable for real-time editing - specifically, the MPEG decoder, pure-Java clean-room cool though it is, takes too long (about a second or two) to load a frame of video into the Piccolo canvas. Profiling indicates that the problem is on the decoder end (about 90% of it, anyway). Invoking Amdahl, this means that even if the decoder were an oracle, the result would still be slow (about a third of real time, or eight frames a second), but it would be usable for purposes of tracking.
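
To spell out the Amdahl arithmetic, with p = 0.9 the profiled decoder share and s the speedup applied to it:

$$ S = \frac{1}{(1 - p) + p/s} \;\le\; \frac{1}{1 - p} = \frac{1}{0.1} = 10 $$

So even an infinitely fast decoder buys at most a 10x overall speedup: the current one to two seconds per frame drops to a tenth or two of a second, which is the eight-or-so frames a second quoted above.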

In order to improve performance, and to expand to codecs other than MPEG-1, we will need to use native implementations, preferably through something like FFmpeg. There is an abandoned project for a JMF FFmpeg wrapper; I don't know if FFmpeg is frame accurate, and I don't know if JMF lets anything be frame-accurate with any certitude, but it is worth investigating. Another thing to note is Scilla, a Java media server, which uses FFmpeg, probably via the command line.

Googlosity

Piggybacking onto Freshmeat and SF was a good move for my googlejuice. Unfortunately, with great power comes great responsibility. As the number three hit for 'java mpeg decoder,' I probably should add some information about how to include Jonathan's MPEG decoder in other code, preferably offering a jar and some javadoc. I'll prepare something. Unfortunately, it will probably have to wait until after some work is done on the main file server here at CfAR. I hope that is finished soon; the work has cut off my CVS access.

Meeting Notes: Monday, August 25, 2003

Dave discussed meeting procedures, noting the large number of people showing up (Summer is over). We'll be meeting on Mondays at 10:30 during the Fall semester. Since next Monday is Labor Day, the next meeting will be on the eighth. The order of presentations will be: Gary on the 8th, Nagia on the 15th and a Malach update on the 22nd, with the 29th still up for grabs. After that, there was a presentation by Ming Luo, who wanted to discuss some of his work at MSR Asia. Ming wanted to present both on pyramidwise structuring for soccer highlight extraction and a spatial k-means approach to image segmentation, but he only had enough time for the former. He's submitting some papers on these to PCM 2003.

Pyramidwise Structuring for Soccer Highlight Extraction (Ming Luo)

Ming presented a pragmatic approach to scanning soccer matches for important events: goals, goal attempts, etc. This allows things like context-sensitive fast-forward. He first covered some existing techniques (I didn't get any refs, sorry), which he divided into 'high level' and 'low level.' The 'high level' approach assumes extraction of semantic events (see Nagia's stuff on event detection); this often fails simply because the tower on which it is built - object tracking and identification - is far from a solved problem. The 'low level' approach, which is often more robust and is becoming more popular, uses things like a bunch of low level features connected to FSMs (which give poor precision) or HMMs (which are limited due to the high-dimensional low-level fusion problem) (see Daniel's stuff, etc.). So - a solution is to introduce intermediate level features.

Ming's solution was pretty effective. It operated on the DC images, which are significantly simpler than the whole image (basically, the average color of each 8x8 block in an MPEG). The first step is to binarize the image (something along the lines of 'field/not field', using a shade of green that is learned from the match; I think he said he took into account the striped pattern that appears with certain kinds of grass grooming, but I can't remember how). Features are extracted from the binarized image, including the base line (usually the line beneath the ads around the field), the fraction of green visible, and the object summing size ratio (a function of the size and number of the objects found). The idea is that from these features, and how they change as a function of time, goals and goal attempts can be detected. Multiple attempts can then be built into an attack, and then into a longer event referred to as a GOA, which I think stood for 'Group of Attacks.'

He achieved 100% recall and 60% precision for detecting goals, using a rapid increase in far-side detection. At the higher levels (GOA), false alarms would be goal kicks (uninteresting) and guards passing, and misses would be low-quality (low speed) attacks, captions, and places where the zooming is such that no baseline is detected. He also expanded the system to use HMMs to detect corner kicks, but didn't have much luck. In the user interface, you could divide the game into halves and select 'play back attack on team A,' thanks to the convention that teams switch sides at the 45 minute mark and that broadcasts always keep the cameras to one side (respecting the 180 degree line). Ming also noted that MSRA was working on other sports.

Friday, August 22, 2003

Options Parsing

So, one of the stages of the application loader is an options parse, which uses an ontology for command line options. (This reminds me: I should separate the ontologies (just RDF Schemata, now) into logical namespaces, instead of the current 'whichever class used the namespace first' method.) Looking around for a similar tool, I ran across Optik. It seems they did it the same way I did, which isn't surprising, as Optik and the apploader are both based on the GNU standard method for handling arguments. However, I also provide uniform access to Java properties and environment variables, and a method for internationalization. Is it worth the ridiculous overhead of loading and parsing an RDF file, then converting it into the internal format for options? For long-running programs, I think so. For short CLI apps, there should be a way of compiling the arguments.
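
The 'uniform access' part is just a fixed lookup order for each declared option; a minimal sketch, with made-up names rather than the apploader's real API, looks like this.

import java.util.Map;

public class OptionResolver {
    // Resolution order: command line beats Java system property, which beats
    // environment variable, which beats the option's declared default.
    public static String resolve(Map<String, String> cliArgs, String optionName,
                                 String propertyName, String envName, String defaultValue) {
        if (cliArgs.containsKey(optionName)) {
            return cliArgs.get(optionName);
        }
        String prop = System.getProperty(propertyName);
        if (prop != null) {
            return prop;
        }
        String env = System.getenv(envName);
        if (env != null) {
            return env;
        }
        return defaultValue;
    }
}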


AutoOpts is another tool I found on Freshmeat with similar goals. It generates C code for handling options, with optional hooks for Guile and unix man page generation. Interesting stuff - I really should look into generated documentation.


Profiling Java with Eclipse

The new version of ViPER is ridiculously slow, sometimes taking up to two seconds to load a new video frame. This is unacceptable when we are trying to get real-time editing working. Hopefully, I'll be able to optimize it a little.


Wednesday, August 20, 2003

ViPER Attribute Types

So, the viper data format is basically an externalized SQL-style relational model. Each descriptor type is a table, and each sourcefile is like another database with the same schema. One thing to note is that all attributes allow null values, something that you can specify on a per-column basis in ANSI SQL. Another thing is the presence of default values; the default default value is null, in fact. So, does there need to be a way to mark an attribute as 'default'? Or does there? Once an attribute is created, it is left with the default value. Would it be better to have all such attributes change when the default value is changed, or to have them stay the same? I am sticking with 'stay the same' for now, until someone can make a good argument that the values should change.

Tuesday, August 19, 2003

Improved caching for MPEG Frames and Indexes

JCache is a specification for caches in the Java framework. It can be used for lots of things, but it appears to provide some nice features that go beyond the current hashtable-based frame-caching viper-gt uses. There is an open source implementation of JCache, as well.
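
Even without JCache, a bounded LRU frame cache is a small step up from a plain hashtable; here is a sketch using LinkedHashMap (the capacity and value type are placeholders).

import java.awt.Image;
import java.util.LinkedHashMap;
import java.util.Map;

// Once more than MAX_FRAMES are cached, the least-recently-accessed
// frame is evicted automatically.
public class FrameCache extends LinkedHashMap<Integer, Image> {
    private static final int MAX_FRAMES = 64; // placeholder capacity

    public FrameCache() {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
    }

    protected boolean removeEldestEntry(Map.Entry<Integer, Image> eldest) {
        return size() > MAX_FRAMES;
    }
}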

Project Plan

So, we need a working v4 by the end of October. I'm going to shoot for the beginning of October, because I know there will be a lot of bugs to work out and a lot of stuff we'll have forgotten in the creation of the alpha. Since, for the first time in a while, I'm going to have some help, I think I should try to make the most of it. With that in mind, I'm actually going to bother to write up a project plan, with the help of GanttProject.

With the new system, there isn't that much interdependency. There are a few things that will rely on things that aren't feature complete.

Protégé and Jena2

Protégé, an ontology editor, uses Jena2 to interface with OWL. They are much higher profile than me, and might have some luck getting Jena to integrate a listener interface.

eclipse and haystack style layout for panels

The more I want to add to ViPER-GT (more plug-ins and views), the more I want a system to handle panels, tabs, and windows uniformly, like Eclipse.

Jog Video Control Bean

The jog viewer should be implemented as a separate control. I'd been sort of hoping to get rid of the old range slider and its buttons completely, replacing them with a nice fully-formed chronicle control, but I have realized that the chronicle widget is not well suited for all things, and 'jump to frame' and jog dials are not part of its domain. At first I thought 'maybe you can throw the time cursor at the rate you wish to move,' but I don't want to institute a weird gesture paradigm for something that I want to use a USB knob for. Plus, I don't know how to differentiate a 'throw cursor' gesture from a 'scrub' gesture.

I'm not sure how to handle interaction between playing video and scrubbing the chronicle. I think I'll have it keep the current speed, but another possibility is to have it pause. Since the 'pause' feature will likely be mapped to the 'click' of the dial, I doubt the autopause will be necessary. Also, I think pause should work like the TiVo pause, or its third or fourth fast-forward click, where it pauses not at the current frame, but at the frame that was visible a quarter of a second before.

In order to get real-time editing using the mouse, keyboard, whatever, I need to add support for some sort of automatic editing mode / state handling. Keyboard editing should be easy enough, as a keyboard event is discrete and can be applied to whatever the focus frame is at the moment (although I should perhaps apply the post hoc .25 second rule, like the pause function). I won't be able to do 'fully automatic editing' with the mouse, though; the mouse button will have to be pressed (I'd like to avoid issues where the user reaching for the on-screen pause button causes the trajectory of the object to go in that direction).

Setting up a Development Environment

Editing the Code

To have access to the code repository, you need a CfAR account, which is only available to members of CfAR. Other people may submit patches to me directly, if they like. I am currently using Eclipse to develop ViPER, although there are still some makefiles (probably out of date) if you want to work from the command line. With Eclipse installed, go to the 'CVS Repository Browsing' perspective; this might require using the Window :: Open Perspective menu. Then bring up the context menu on the CVS Repositories view, and select New :: Repository Location. The ViPER repository, hosted on bo.cfar.umd.edu and found at the /fs/lamp/Projects/cvs path, uses the extssh protocol and requires your CfAR login information. When the repository is added, open the 'HEAD' branch, bring up the context menu for the 'viper' folder, and select Check Out as Project to create a new ViPER project.

With the project set up, switch to one of the Java perspectives. The Run menu is only available from the Java perspectives and the Debug perspective. There is no way to associate 'run' commands with a project, as far as I can tell; if you know how, tell me. Anyway, to create a run configuration for viper-gt, use the Run :: Run... menu item. Select 'Java Application' from the 'Configurations' tree view, and click the New button. In the name box, type 'viper-gt4', or something similar. The project should be 'viper', and the class name is edu.umd.cfar.lamp.apploader.AppLoader. Click on the Arguments tab, and add the -ea and -Dlal.prefs="workspace\viper\gt\CONFIG\gt-config.n3" VM arguments (in the lower text box), where workspace is the absolute path to your local workspace. For running the configurator, change the name of the N3 file from gt-config.n3 to gtc-config.n3.

Annotating and Creating Bug Reports and Feature Requests

A SourceForge account is required. After you pick one up, send me an e-mail (through my mihalcid sourceforge account, preferably) requesting addition to the team. You can add requests directly to the bug list and rfe list.

Editing the Web Site

To edit the site, you need to check out the content from the SourceForge CVS server. If you have an account and are added to the project on SourceForge, you can follow SourceForge's instructions on how to use its CVS. I use Eclipse, so I have two projects in Eclipse: viper and viper-web. The site itself is pretty idiosyncratic, separating each page into config, head, and body files, and using a standard index template for all pages. The site is generated every night using the bake and publish scripts, as described in upsite.cron (all of these scripts are in web/bin). To add a new page, you must create a new directory with the three necessary files. There is a script for doing this, mkhtml, but it is only useful from a bash command line, not from Eclipse, so you should probably just copy an old directory and replace its data with new content.

Semiautomatic Mice

Something mentioned in the IPR: allowing the video to play slowly while the user traces a spatial object with the mouse on the canvas. This would be quick, and could save a lot of time. It would be useful for tracking people, for example. It is especially useful where all that is required is a centroid, although it could be used with boxes as well. One possibility is enhancing multi-pass editing, where the first pass is used to establish the extents of an object, the second pass to give the centroid of the box at half real time, the next pass to touch up the box's size, and so on.

Monday, August 18, 2003

Needs for a New Canvas

The old canvas was pretty hackish. It didn't respond well to changes, and didn't have an OO method for adding new data types or interaction types. The new canvas must have an OO method for adding types and interaction, as well as a good method for handling keyboard interaction. It can use Piccolo, if the developer so desires, but this is not a firm requirement.

There are two canvases currently around: the old, FSM-based editor canvas that uses Graphics2D, and the new, non-editable canvas that uses Piccolo. I think there is evidence that the new canvas is slower, but this might be due to the ill-advised caching model it uses for video frames. The choice is between refactoring the old canvas heavily and building the new one up into an editor - not from scratch, but from a component that doesn't yet support editing.

Unsupervised Learning

There are some course pages at UCL about it: 2002, 2001 and 2000

Meeting Notes for August 18, 2003

Everyone is getting ready for the IPR tomorrow. I've got a few little things about ViPER prepared.

Handwriting Retrieval Based on Integration of Multiple Queries, by Huiping Li et al.

The only presentation today (it went long), this consisted of Huiping detailing a method for retrieving documents based on signatures (or initials, although much of the work would apply to logos). He started with a basic overview of the problem of handwriting detection, why it is important (detecting signatures, important metadata, editing notes, etc.), and why it is often unsuccessful: the data is noisy, segmentation is difficult or impossible (connected components won't work), the handwriting style of a single person is very inconsistent, and so on. Some work has been done in the lab on extracting noise/print/annotation layers from documents, and on removing horizontal rules (as in loose leaf paper), and this work uses those techniques for cleaning and detecting annotations. However, this presentation focused on retrieval using initials.

The preliminary system used the Hausdorff distance, and Huiping didn't present results for it. In order to counteract the effects of distortion, scale, and (to a lesser extent) rotation, they used a method described by Belongie et al. (PAMI 2002). This involves finding shape contexts for points (information about the neighborhood of the pixel - a 60 element feature vector using a radial bucket system), finding the correspondences between the points, and computing the transform using a thin plate spline model. This results in two distance numbers: the point correspondence score and the cost of the TPS deformation. The distance is weighted using the Fisher criterion. Without cleaning the initials (using the binarized scans), it achieved about 53% R-Precision. With cleaning, this improved to 73%.

Next, Huiping presented suggested improvements to the algorithm. The first involved using a skeletal spline instead of a contour spline, and this improved the results by a point or two. I doubt the difference was significant, but I haven't seen the numbers. A greater improvement, around 5%, was achieved by running all queries at once. The complementary enhancement, querying using multiple examples, achieves an even greater improvement in R-precision, from about 10% for two samples to 15% for four samples.


R-precision is one of many information retrieval metrics. Given that there are N relevant documents for a query, the R-Precision is the fraction of the first N documents returned that are correct. Another option is the mean average precision (MAP), which might be more useful. Dave also suggested displaying a graphical view of how well the ranking works.
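
In symbols, with R the number of relevant documents for the query:

$$ \text{R-Precision} = \frac{\lvert \text{relevant} \cap \text{top-}R\ \text{returned} \rvert}{R} $$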

Sunday, August 17, 2003

Activity Label Distance Metrics

Most of my previous work on defining metrics for activity detection has assumed that an activity was a simple label; matching an activity meant an exact match for a single label, which is a function of the frame number / timecode. This is not how many systems work, and is certainly not how people maintain definitions of activities in their heads. A slightly more complicated definition would allow the labels to be nodes in a DAG, like C++ or RDFS classes. This can be interpreted, under the definition of RDFS, as allowing activities to have multiple labels, with certain constraints on what other labels must also appear when a given label is assigned. I immediately see two metrics for dealing with this new type of activity labeling.

The first metric is a simple set-difference metric. This would give a precision/recall on the labels properly found. If we allow one label per activity, this would give the metric described below, with up-links counting against precision and down-links against recall.
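
A minimal sketch of that set-difference metric, assuming each activity instance carries a set of label strings (the names are illustrative):

import java.util.HashSet;
import java.util.Set;

public class LabelSetMetric {
    // Precision: fraction of candidate labels that appear in the target set.
    public static double precision(Set<String> target, Set<String> candidate) {
        if (candidate.isEmpty()) return 1.0;
        Set<String> hits = new HashSet<String>(candidate);
        hits.retainAll(target);
        return (double) hits.size() / candidate.size();
    }

    // Recall: fraction of target labels that the candidate found.
    public static double recall(Set<String> target, Set<String> candidate) {
        if (target.isEmpty()) return 1.0;
        Set<String> hits = new HashSet<String>(target);
        hits.retainAll(candidate);
        return (double) hits.size() / target.size();
    }
}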

A different metric could operate on the DAG directly, traversing the links from candidate label to target, with the distance a weighted count of the links. For example, a link down would be worth less than a link up. Possible values include (1, ∞) or some system using a function of distance from a single, root 'activity' label.

These methods can be extended from operating on sets of labels to operating on bags of labels. This means that at any frame in the ground truth for label L, there will be X critical sections and Y non-critical sections, and any count of L in the candidate frame that does not satisfy X <= |L| <= X+Y will result in error. To handle 'sort of care' regions, the evaluation can be run twice, or the regions can be weighted by how much we care about them.

MPEG-7 Editing

One of the main competitors to ViPER-GT is a good MPEG-7 toolchain. There are scripts to convert some MPEG-7 data into the ViPER format, but both formats have their own internal data model, and the two are not compatible enough to allow completely automated translation. As time goes on, I expect MPEG-7 to take off and to have the ViPER format relegated to the position of a secondary format. Perhaps some day I will replace the viper format with MPEG-7, although I doubt it (I'd more likely replace it with OWL, I think).

There are a few open source MPEG-7 editors out there. These are:

Friday, August 15, 2003

BLEU: a Method for Automatic Evaluation of Machine Translation

Given the field's relatively long existence, machine translation has a decent-sized selection of evaluation techniques. The current flavor of the month, and with good reason, is BLEU. BLEU compares a candidate translation to a set of human-supplied reference translations to get a measure of how similar it is to the work of a human professional. I like the technique, as it finds a way around the 'no such thing as a right answer' problem nicely for many different kinds of data, well enough to be very useful for the problem.

The algorithm works by counting the n-grams, for n in [1-4], of a candidate sentence that are found in the set of reference sentences, taking care not to count any n-gram more times than the maximum number of times it appears in any single reference. This is called the 'modified n-gram precision.'
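
A sketch of the modified precision computation for a single n (tokenization and BLEU's brevity penalty are left out):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ModifiedNgramPrecision {
    // Count each n-gram in a token sequence.
    static Map<String, Integer> ngramCounts(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    // Candidate n-gram matches are clipped at the largest count seen in any one reference.
    public static double score(List<String> candidate, List<List<String>> references, int n) {
        Map<String, Integer> candCounts = ngramCounts(candidate, n);
        int clippedMatches = 0, total = 0;
        for (Map.Entry<String, Integer> e : candCounts.entrySet()) {
            int maxRef = 0;
            for (List<String> ref : references) {
                Integer c = ngramCounts(ref, n).get(e.getKey());
                if (c != null && c > maxRef) maxRef = c;
            }
            clippedMatches += Math.min(e.getValue(), maxRef);
            total += e.getValue();
        }
        return total == 0 ? 0.0 : (double) clippedMatches / total;
    }
}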

Reference Link
@inproceedings{papineni02,
    author = {Kishore Papineni and Salim Roukos and Todd Ward and Wei-Jing Zhu},
    title = {Bleu: a method for automatic evaluation of machine translation},
    booktitle = {Proceedings of the 40th Annual Meeting of 
        the Association for Computational Linguistics (ACL)},
    address = {Philadelphia},
    month = {jul},
    pages = {311--318},
    year = {2002},
    url = {citeseer.nj.nec.com/papineni02bleu.html}
}

Thursday, August 14, 2003

Tracking Provenance in RDF with Redland

Another feature missing from Jena: context. This isn't necessary for the AppLoader, but it would be nice, reducing the need for the PrefManager to maintain different models and integrate them every time something changes.

One-Dimensional Segmentation Evaluation

This is useful for various kinds of video evaluation. The space of segments is one half of the FxF space of ordered pairs, where F is the number of frames in the clip to be segmented. One way to look at it is as an undirected graph, with each node a frame and links existing when the two frames are in the same segment. With the start frame on the horizontal axis and the end frame on the vertical, the upper corner, the pair (0, F-1), represents the first and last frames being in the same segment. To perform ROC analysis, we would request an ordering of those tuples, with the (x,x) tuples having the greatest (and equal) values and (0, F-1) the least value. For segmenters that only identify contiguous regions as part of the same segment, this space will be shaped such that v(x,y) will be greater than v(a,b) for a <= x, b >= y.
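
As a sketch, building that same-segment relation from a ground truth segmentation (here a segmentation is just an array giving each frame's segment id) looks like this:

public class SegmentPairs {
    // segmentOf[f] is the segment id of frame f in the ground truth.
    // Returns a matrix where sameSegment[i][j] is true when frames i and j
    // (i <= j) belong to the same segment; this is the target for the
    // pairwise analysis described above.
    public static boolean[][] sameSegmentRelation(int[] segmentOf) {
        int frames = segmentOf.length;
        boolean[][] sameSegment = new boolean[frames][frames];
        for (int i = 0; i < frames; i++) {
            for (int j = i; j < frames; j++) {
                sameSegment[i][j] = (segmentOf[i] == segmentOf[j]);
            }
        }
        return sameSegment;
    }
}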

There are some flaws with using a simple ROC evaluation on the graph. First, it may be difficult to acquire information in this fashion. Also, the real evaluation should be on the cliques, not the raw thresholded numbers, as those will be the result of an evaluation at a given level. For example, a program could cheat by saying each frame is linked to every frame less than or equal to k away. Although this doesn't contain (much) information about how to perform a segmentation, situations may arise where it scores better than other, more reasonable segmentations. This is a characteristic of this evaluation: it is essentially asking 'what's related to this frame,' along whatever dimension the classifier uses.

Another possibility, one which seems to have disappeared from use, is to calculate precision and recall on the segment barriers themselves. This has at least two disadvantages: it is overly sensitive to movements of the barrier, and it doesn't allow discontiguous segments to be evaluated as one. Also, like the first segmentation evaluation described above, it may require an enhancement of the segmenter to produce the needed output.

A slightly more useful method of directly evaluating based on barrier position is to develop a 'barrier edit distance,' a metric that combines movement of barriers, creation, and deletion into a scalar representing minimal 'effort' required to transform from the candidate segmentation to the target one.

Text Segmentation Using Exponential Models

This paper describes a system for segmenting text into articles. The authors provide a probabilistic error metric: the probability that two sentences drawn randomly from the corpus are correctly identified as belonging to the same document or not. They give greater weight to sentence pairs that are nearby.
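
A sketch of the windowed form of this metric (often written Pk, where the 'randomly drawn' pair is approximated by every pair of sentences exactly k apart; the segmentations are arrays of per-sentence segment ids):

public class PkMetric {
    // ref[i] and hyp[i] are segment ids for sentence i.
    // Returns the fraction of sentence pairs k apart on which the reference
    // and hypothesis disagree about being in the same segment.
    public static double pk(int[] ref, int[] hyp, int k) {
        int disagreements = 0, pairs = 0;
        for (int i = 0; i + k < ref.length; i++) {
            boolean sameInRef = (ref[i] == ref[i + k]);
            boolean sameInHyp = (hyp[i] == hyp[i + k]);
            if (sameInRef != sameInHyp) disagreements++;
            pairs++;
        }
        return pairs == 0 ? 0.0 : (double) disagreements / pairs;
    }
}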

Reference Link
@incollection{beeferman97,
    author = {Doug Beeferman and Adam Berger and John Lafferty},
    title = {Text Segmentation Using Exponential Models},
    booktitle = {Proceedings of the Second Conference on Empirical 
        Methods in Natural Language Processing},
    publisher = {Association for Computational Linguistics},
    address = {Somerset, New Jersey},
    editor = {Claire Cardie and Ralph Weischedel},
    pages = {35--46},
    year = {1997},
    url = {citeseer.nj.nec.com/beeferman97text.html}
}

Topic Detection and Tracking Pilot Study Final Report

This paper, which outlines the fundamentals for a system like Google News, extends Lafferty et al.'s metric to take into account precision and recall, instead of a simple accuracy score. Google News uses documents to define 'clusters' around a topic, as described in this paper.

Reference Link
@inproceedings{allan98,
    author = {James Allan and Jaime Carbonell and George Doddington and 
        Jonathan Yamron and Yiming Yang},
    title = {Topic detection and tracking pilot study: Final report},
    booktitle = {Proceedings of the DARPA Broadcast News 
        Transcription and Understanding Workshop, 1998},
    year = {1998},
    url = {citeseer.nj.nec.com/allan98topic.html}
}

Image Segmentation: Quantitative Evaluation

I found a page that lists a bunch of methods for evaluating image segmentation algorithms. I haven't had much of a chance to go through the list, yet. However, it does include references to the Koester and Spann and Zhang papers I referenced earlier.

Wednesday, August 13, 2003

What is the deal with Transym OCR? And is Clara still going strong?

It is looking more and more like LaTeX should be the markup language of choice for the project, thanks to Tex4Moz and tex4ht. These projects should make it simpler, and perhaps automatable, to translate a site maintained in somewhat human-readable TeX into HTML.

Ground truth data for document image analysis (MARG and ROVER)

This article describes the MARG repository and ROVER, an outgrowth of TrueViz (description), that includes pe capabilities, or at least gtfc capabilities.

Reference Link
@inproceedings{Ford2003,
   author = {Glenn Ford and George R. Thoma},
   title = {Ground truth data for document image analysis},
   booktitle = {Proceedings of the 2003 Symposium on Document Image Understanding and Technology},
   year = {2003},
   month = {apr},
   pages = {199--205},
   address = {Greenbelt, MD}
}

Tuesday, August 12, 2003

Use Case: Activities in a Warehouse

This use case emphasizes the need for ROC evaluation in viper-gt, as well as the need for enhanced don't-care regions.

Use Case: Person Tracking

One of the things this use case scenario brings up is complex methods for generating don't care regions - essentially requesting set operators between attributes. This would be a good feature to add to viper-pe, but it would be even better to create a method for including generated attributes and other methods of inline scripts with the data or evaluation.

Use Case: Text Detection

This is a less edited version of the use case scenario that can be found on the viper web site.

This is a problem of finding text (characters, either overlaid on the video or in the scene) in a noisy environment (video, photographs, whatever). We have had some experience with this. The simplest case is finding frames that contain text, while the lowest level is detecting the pixels which belong to characters. Somewhere in between come evaluation through text extraction and evaluation using bounding boxes. ViPER supports all but the lowest level of evaluation - that at the level of the individual pixel. This could be fixed with the introduction of another data type, but is not recommended. You cannot merely ask "how good is this software?" but "how well does this software answer question X for data set Y?"

For our purposes, we are evaluating text detection at the line level. This is coarser than adding boxes for each character. When developing ground truth, it is often advisable to first develop scenarios for the evaluation, to make certain the truth meets the requirements of the test. For example, if we wish to retrieve character-level or word-level correctness statistics without the variability of OCR results, we must put boxes around individual words or characters. Our line-level truth prevents these metrics from being calculated. However, we did transcribe the text of each line, allowing OCR-based matching to take place. While the two results (character box accuracy, character recognition accuracy) are often highly correlated, this will not be true for all kinds of data, and becomes less true as the quality decreases - precisely the kind of data we were evaluating: text found in broadcast video.

In fact, it took us several iterations to arrive at line-based markup. Originally, we would mark up whole segments of text as one item. Unfortunately, this was too coarse, as some edges of text blocks are very jagged, leaving much space within the box without text beneath it. Also, it was difficult to decide how to add the text to the metadata, as the old format did not support newlines and the spreadsheet entry format is not amenable to them.

And what of text that is illegible? We developed a system of rating text quality from 0 (illegible) to 5 (well defined and clear). It is difficult to define what an OCR algorithm will find legible, and there would be some error, so running multiple evaluations accepting [5], [4-5], ... [0-5] as input will give a reasonable curve displaying how a system breaks down, but comparisons between programs at any one level would not be prudent. We attempted to record the value of any legible string, and left the illegible strings with a null value. Unfortunately, we did not mark up legible text in non-Roman scripts, instead leaving their values null, precluding goal-based evaluation on those selections.

To evaluate, we used a set of keyframes extracted from several different clips, ranging from a grainy broadcast of "Bobby's World" to American, Russian, and Arabic news footage to conference seminar footage taken with a handheld camcorder. Since the genre and content varied so widely, we arranged the data by content, allowing gtas to mark up groups of keyframes as one file by genre.

In addition to specifying legibility of lines of text and each frame's genre, we divided text into scene text (text on objects in the scene) and graphic text (text in overlay graphics, like pop-up text on CNN), and by contrast style - light text on dark background or the reverse. Both classifications may be subjective - for example, text in animations may be considered graphic, and animations may appear over a scene. Also, text may be the same intensity as the background, but vary in color or texture.

There are two major types of evaluation that viper provides that are useful for this type of data: objectwise and framewise. The object evaluation attempts to match the objects together, and counts objects as matches when the 'metric distance,' defined by the evaluator, is less than a given threshold. The frame-by-frame evaluation looks at each frame, and each pixel in the frame, and calculates a standard set of metrics on each frame. We created templates for the two evaluation types and runeval data files for several different metrics and subsets of data, as well as a few runeval scripts for generating charts comparing the different algorithms on different data sets.

The objectwise evaluation, called OBJECT_EVALUATION, has three levels. For a complete description of the evaluation, see the viper-pe user guide. The template is of the form:

#BEGIN_OBJECT_EVALUATION
OBJECT Text [- -]
    LOCATION : [<boxMetric> <boxThreshold>]
#END_OBJECT_EVALUATION

#INCLUDE "../equivalencies.txt"

#BEGIN_GROUND_OUTPUT_FILTER
OBJECT Text
    <filter>
#END_GROUND_OUTPUT_FILTER

The first block is the evaluation block - required for any evaluation. Note: newer versions of the viper-pe command line tool allow multiple evaluation blocks in a single run, but the current graphing scripts can only handle raw files with output from one run.

$PR = textdetect.pr
$GTF = all.target.xml
$RDF = all.candidate.xml

$NAME = object-highQ
* <boxMetric> = dice
* <boxThreshold> = .99
* <filter> = READABILITY: == 4 || == 5

#NAME_GRAPH object-highQuality
#RUN_EVAL
#RUN_GRAPH

#END

This small RunEvaluation script, when used with the object template, will run a single object evaluation and create the associated graphs, with a dice metric set to the very permissive threshold of .99 (this should catch most boxes that overlap at all), making the sorted distance curve more meaningful. For these evaluations, we used the Hungarian algorithm, specified by setting the target_match parameter to SINGLE_OPTIMUM. Another approach would be to set it to MULTIPLE and allow multiple matching. This would fix some problems with split or combined boxes. FIXME(Why one over the other)
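
For reference, my reading of the dice overlap between two axis-aligned boxes, with the distance taken as one minus the coefficient (see the viper-pe user guide for the authoritative definition):

public class DiceDistance {
    // Boxes are given as (x, y, width, height) in pixels.
    public static double dice(int x1, int y1, int w1, int h1,
                              int x2, int y2, int w2, int h2) {
        int ix = Math.max(0, Math.min(x1 + w1, x2 + w2) - Math.max(x1, x2));
        int iy = Math.max(0, Math.min(y1 + h1, y2 + h2) - Math.max(y1, y2));
        double intersection = (double) ix * iy;
        // Dice coefficient: twice the intersection over the sum of the areas.
        return 2.0 * intersection / ((double) w1 * h1 + (double) w2 * h2);
    }

    // Distance form used for thresholding: 0 for identical boxes, 1 for disjoint ones.
    public static double distance(int x1, int y1, int w1, int h1,
                                  int x2, int y2, int w2, int h2) {
        return 1.0 - dice(x1, y1, w1, h1, x2, y2, w2, h2);
    }
}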

For the framewise evaluation, we focused on the pixel-count metrics, as also described in the user guide.

#BEGIN_FRAMEWISE_EVALUATION
OBJECT Text 
    LOCATION : <frameMetrics>
#END_FRAMEWISE_EVALUATION

#INCLUDE "../equivalencies.txt"

#BEGIN_GROUND_OUTPUT_FILTER
OBJECT Text
    <filter>
#END_GROUND_OUTPUT_FILTER
$PR = textdetect.pr
$GTF = all.target.xml
$RDF = all.candidate.xml

$NAME = frame-highQ
* <frameMetrics> = matchedpixels missedpixels falsepixels fragmentation \
                    arearecall areaprecision [arearecall .7] [areaprecision .7]
* <filter> = READABILITY: == 4 || == 5

#NAME_GRAPH frame-highQuality
#RUN_EVAL
#RUN_GRAPH

#END

The current implementation of makegraph requires the first four frame metrics exactly as shown here. The others can be modified, but only slightly. These are the same metrics as recorded in Rangachar Kasturi's paper on the subject, which was used for the VACE evaluation of text detection.

One of the algorithms, submitted by CMU, even found a line of text that our editors missed. The text, a word written in the background, was obscured, and probably would have rated a 'one' or 'two' in terms of quality.

So, why doesn't viper support pixel-level evaluation? For a time, it did support it in the evaluation stage - all boxes and polygons were converted to bitmaps, and the evaluations were done on those. This was too slow, taking hundreds of times longer than the analytical approach. Another solution would be to use a slightly smarter method than simple bitmaps, a better data structure. However, this would likely not improve matters, as the size of the frame is constant, and a better structure would only improve things on a per-frame level in the current system, which evaluates each frame discretely. I haven't bothered to add support in viper-gt. The polygon type should be sufficient for most needs, although its user interface, as of version 3.6, is not up to the task.

Monday, August 11, 2003

Meeting notes for August 11, 2003

Surveillance Video Compression, Yang Yu and Huiping Li

Motivated by the arrival of ubiquitous CCTV, and the ensuing need for quick and efficient archival mechanisms, the lab (Yang and Huiping in particular) is looking into ways to improve surveillance video compression. The idea is that some of the same analyses that apply to understanding and vision apply to compression, e.g. background subtraction and object segmentation, especially for object-based codecs like MPEG-4. The goal of the project is to develop quick, new ways to enhance compression.

The first idea was to skip background-only frames. The simple exploratory experiment Yang presented compared MPEG-2 video compressed at a constant bitrate with the same video sans the 26% of frames that were uninteresting. Unsurprisingly, this resulted in a 26% savings in file size. She mentioned similar improvements in an MPEG-4-simple (i.e. frame-based encoding) experiment, but didn't show any results.

The second example used background subtraction to segment foreground objects from the background, then used MPEG-4 to encode the background separately from the sprites in the image. The background subtraction algorithm was trained on a sample of 100 frames, a process which took less than a minute, and then produced the segmentation. The video was recombined with the MS encoder. An object encoding with the direct segmentation resulted in an 87.9% savings over the frame encoding, but lost a lot of information. By expanding the bounding boxes around the two sprites in the video by eight pixels, most of the information was saved at a cost of 3 or 4 percent less compression. Better post-processing, like adding something to smooth out changes in the boxes over time (e.g. Kalman filtering or some other method of removing high-frequency changes), may help. However, since only one sequence was tested, more work will have to be done to answer the question of when to use object segmentation and when to fall back on frame-based encoding. The real-time requirements of the problem make slower, more accurate solutions like dove-tailing impossible.


After the presentation, we decided on a time for fall meetings: Mondays at 10:30 am. Dave also talked about ICDAR 2003; he mentioned that UMCP had plenty of talks, and a poster, but said that we should have participated in the competitions, citing the surprise language as an example of how a competition can improve your profile and increase productivity. He also encouraged submitting tech reports, and mentioned that a new focus will be on handwriting, with a new post-doc, Stefan Yager, arriving with a background in the area. Daniel also mentioned the missing water cooler, and that the default temperature has gone up; Dave says it was raised to 78 because of the budget cuts.

Sunday, August 10, 2003

Experimental Environments for Computer Vision and Image Processing

A book, published in 1994 and part eleven in World Scientific's Series in Machine Perception and Artificial Intelligence, the text presents a series of articles about computer vision systems for use in vision research.

Link Reference
@book{Christensen1994,
   author = {H I Christensen and J L Crowley},
   editor = {H I Christensen and J L Crowley},
   title = {Experimental Environments for Computer Vision and Image Processing},
   publisher = {World Scientific},
   year = {1994}
}

An Evaluation of 3D Segmentation algorithms using Seismic Variance Data


Annotated Bibliography

My current list of resources for the survey:

Fingerprint Verification System

While searching the net for references to the fingerprint image enhancement paper from the earlier note, I came across a SourceForge project for an embeddable fingerprint verification system. I'm not sure if it does much of anything yet, but it seemed interesting.

Empirical Evaluation of Laser Radar Recognition Algorithms Using Synthetic and Real Data

This paper, co-authored by CfAR alum Qinfen Zheng, creates a simulation for LADAR imaging, compares it to a few simple tests of a real LADAR rig, and then presents confusion matrices for the real and synthesized systems. Given the decrease in accuracy on real images relative to synthetic ones, the simulation does not appear accurate enough to predict outcomes, or even to compare systems (many of its results are 100% accurate - an evaluation that leaves little room for improvement), but it may be useful for debugging or for more complex situations where confusion is more likely.

Reference
@incollection{Der1998,
   author = {Sandor Der and Qinfen Zheng},
   title = {Empirical Evaluation of Laser Radar Recognition Algorithms Using Synthetic and Real Data},
   booktitle = {Empirical Evaluation Techniques in Computer Vision},
   year = {1998},
   pages = {135--147}
}

Fingerprint Image Enhancement: Algorithm and Performance Evaluation

This paper offers a description of the problem (automatic fingerprint matching) and standard techniques, and then describes a method for enhancing generally poor quality fingerprint images to improve matching. To evaluate, they used a ground-truth-based goodness index (basically a way of scoring the improvement by comparing the enhanced image to expert-derived truth), and a goal-directed evaluation using a fingerprint matching database.

Reference Link
@incollection{Hong1998,
   author = {Lin Hong and Yifei Wan and Anil Jain},
   title = {Fingerprint Image Enhancement: Algorithm and Performance Evaluation},
   booktitle = {Empirical Evaluation Techniques in Computer Vision},
   year = {1998},
   pages = {117--134}
}

Sensor Errors and the Uncertainties in Stereo Reconstruction

Presents an overview of stereo reconstruction and the authors' technique of dropping questionable reconstruction points by deriving intervals of uncertainty. It also includes a quick overview of estimating CCD sensor noise, and a few interesting examples of error in different cameras (the Hitachi KP 230(231)/TIM40 and three different Sony XC-77s).

Reference Link
@incollection{Kamberova1998,
   author = {Gerda Kamberova and Ruzena Bajcsy},
   title = {Sensor Errors and the Uncertainties in Stereo Reconstruction},
   booktitle = {Empirical Evaluation Techniques in Computer Vision},
   year = {1998},
   pages = {96--116}
}

Friday, August 01, 2003

PVR

There are several projects out there working on PVRs. The major two are MythTV and Freevo, both of which seem pretty featureful. Behind them is eBox, another hobbyist project. One of the more interesting projects is NMM, from a university in Germany. Still, none of them look like they run on Windows, where the best options for video are VirtualDub and VideoLan, neither of which offers PVR capabilities, or Microsoft's Media Center Edition, which costs money and requires an MPEG-2 card.

