Blog for work on my Masters thesis - a survey of methods for evaluating media understanding, object detection, and pattern matching algorithms. Mostly, it is related to ViPER, the Video Performance Evaluation Resource. If you find a good reference, or would like to comment, e-mail viper at cfar.umd.edu.
Media Processing Evaluation Weblog
Wednesday, September 29, 2004
City on the Edge of One Day, if We're Lucky
a.k.a. Big Brother Brunch, followed by Dave's Dissertation Dinner
So, I've been talking to Dave and Daniel about implementing a simulated multi-camera tracking system. The basic idea is that a) setting up an in-house multi-camera tracking system is infeasible; b) it would be nice to work out a system that can test ideas for user interface solutions to such problems as 'when did this person walk in the door?' and 'how many times have these two people crossed paths?', among others; c) The Sims 2 is pretty awesome; d) Rodney Brooks is a cool guy; e) tracking is annoying when there are thousands of people walking in and out of Best Buy; f) can't you use those timelines for something more interesting? g) I'd like to be able to tell the system where it is wrong; and h) maybe the system should tell me where it thinks it is wrong.
The closest analogue I can see is to document systems. Let's say I am one person and I have a giant closet full of filthy, messed-up documents. I scan all the documents, getting a loose OCR and topic track through all of them. There is one document that I know I need more information about. I can then scan back through and get into a nice feedback loop with the system: with each paper it shows me, I give it guidance to cluster topics and correct errors in the OCR/segmentation/whatever, and with each new paper it presents, it gets closer to the history of documents I would find interesting.
This sort of 'track mining' system might be useful to both the PETS and VACE communities, and may be of interest to some sort of thesis committee. The focus will be as much on the user interface as on the AI/CV issues, if not more. I think there will be some serious systems questions if it gets to that point, but that is something to think about later, as the goal here is just the DotWorld implementation, which means I don't really care about (read: don't want to have to deal with) non-AI/CV/HCI-related problems.
Right now, the first system will present a semi-random activity of dotPeops, which is then traced with a naive tracker deprived of key information to create 'track trees', a set of probabilistic tracking information. So, for example, if you ground a track on a certain frame, asking where that person is throughout the day, you will get regions of certainty, with the likelihood monotonically decreasing away from the ground point under the naive tracker (model-based trackers, i.e. ones capable of detection, may have different likelihood profiles). The system will then present the user with a choice of track at points of uncertainty: either 'choice points', like when two tracks overlap, or minimum-derivative points on the likelihood curve, i.e. where the likelihood drops like a stone. In addition to the choices presented, the user is always given the option of 'track lost', to indicate that the track has already gone wrong; the system will then back up to the next-to-last point of change, and so on until the track is correct.
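To make the choice-point idea a little more concrete, here is the kind of thing I have in mind for picking out the minimum-derivative points. This is only a sketch in Java; the likelihood array and the drop threshold are made up, and none of this is implemented yet:

    // Rough sketch: flag frames where the track likelihood drops sharply.
    // 'likelihood' is indexed by frame; 'dropThreshold' is an invented tuning knob.
    import java.util.ArrayList;
    import java.util.List;

    class ChoicePointFinder {
        static List<Integer> findChoicePoints(double[] likelihood, double dropThreshold) {
            List<Integer> choicePoints = new ArrayList<Integer>();
            for (int frame = 1; frame < likelihood.length; frame++) {
                double derivative = likelihood[frame] - likelihood[frame - 1];
                // A large negative derivative means the naive tracker has
                // suddenly become unsure, so this is a good frame to ask the user about.
                if (derivative < -dropThreshold) {
                    choicePoints.add(frame);
                }
            }
            return choicePoints;
        }
    }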
For the project, I will need a selection of useful queries and environments (basic user scenarios), as well as a good model for what the system will actually do. This includes key UI details, such as the disambiguation interface, the model of the simulation, information about the end-user display and system requirements, and other relevant information about the (dot)world. From this I should be able to start granting a mild form of life to storyboards.
- posted by David @ 6:02 PM
DotWorld and The Sims 2
Okay, so Daniel suggested using the Sims to provide a good simulation of the sort of situations that might be found in PETS or your local Tarjá. Since we can't go around collecting such video, and we don't really have the funding or patience to create ground truth for it anyway, we need a method for simulating scenarios involving large numbers of cameras.
There are a variety of existing simulation environments out there. The most complete and featureful is The Sims 2, which comes with humanoid characters capable of a wide range of emotion and activity. In the other direction, we find things like RoboCup and its companion RoboCupRescue, which provide only simple visual simulation, but realistic and complex physical interaction. Somewhere closer to the first end are MMORPGs such as Second Life and FPS games like Unreal 2004. Closer to the simulation-of-reality end come ai planet and virtual quidditch.
On one axis we can go from machinima, which can now be edited with professional, Maya-like tools, to something even more automatic, where we have little control over the actors. On another axis we go from realistic visual simulation to simulation of low-level or mid-level vision, like blobs and dots. It seems that the machinima solution doesn't offer much of a benefit over blender, spe, and makeHuman, at least for things that The Sims doesn't model. The real question is: how easy will it be to hack The Sims for multi-camera recording? Will we be able to take control of the people and record/edit/playback demos? And how difficult will it be to semi-automate ground truth extraction?
- posted by David @ 1:26 PM
Monday, September 27, 2004
Simulated Vision
One of the big problems with ground truth editing is that you have to record the events you wish to evaluate, with the appropriate number of cameras, and then go back and mark them up. Some of these events may be difficult or impossible to film, or the footage may be difficult to use once it is filmed. So, a possible solution is to simulate the event, either in reality or in some sort of computer environment. A third possibility is to work from ground truth to generate simulated output of a lower-level vision system for use in a higher-level system. We already have several videos of actors walking around performing some sort of action: pick up a suitcase, switch cars, etc. The next layer of simulation would have computer-rendered characters performing some action. The cheapest, highest-level simulation would have dots with feature vectors performing the action.
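Just to pin down what that cheapest layer might look like, here is a rough guess at the record a dot-level simulator could emit for each dot in each frame; the class and field names here are invented:

    // Hypothetical record emitted by a dot-level simulator, one per dot per frame.
    class DotObservation {
        int frame;              // frame number in the simulated video
        int dotId;              // ground-truth identity, hidden from the tracker under test
        double x, y;            // image position of the dot
        double[] featureVector; // stand-in for appearance (color histogram, size, etc.)

        DotObservation(int frame, int dotId, double x, double y, double[] featureVector) {
            this.frame = frame;
            this.dotId = dotId;
            this.x = x;
            this.y = y;
            this.featureVector = featureVector;
        }
    }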
- posted by David @ 3:53 PM
Wednesday, September 15, 2004
To Done?
So Far:
- Fixed two bugs from John McNitt (Oriented boxes stuck at current orientation; strings defaulting to "NULL" instead of the empty string)
- Added 'Script' Menu
- Ctrl+click center-editing for oboxes
- Hiding Attributes - adding show/hide toggles to attribute headers in the table
- Adding a third, 'locked' state to show/hide toggles
- Add icon for 'locked' state
- Enhance click-regions in current obox
To Do:
- Get ViPER to play well with time-only data
- Combine manual and tutorial documents
- Add shift+drag = constrain aspect ratio editing to boxes
- An actual script example (Ming Luo's shot segmentation?)
- Add prefs pane for obox editing
- posted by David @ 1:27 PM
Tuesday, September 14, 2004
To Do List
Just came back from a meeting with Dave, so I'm going to write down what we went over. There is a presentation on Tuesday, and all the data needs to be marked up by November, so this will be a busy week. In addition to fixing the things I mentioned to John yesterday, I'll have more on my plate than I've had in a while.
Scripting (and Plugging Into) ViPER
So, the current method for plugging into ViPER, writing a javabean (any Java object with a no-arg constructor) and adding it to the n3 file, requires a lot of understanding of, in no particular order: Java, N3, ViPER itself, and whatever domain knowledge is required for the plugin. But most people just want to plug in an existing tool. So, I'm thinking I'll add a 'Script' menu a la iTunes, where the menu contains all programs found in a scripts directory.
Each script will be a stand-alone program or shell script. The input will be passed on the command line: the name of the currently selected media file. The output will be in GTF or xgtf format, and will be imported directly into the currently loaded file. I could also add the current file to the input stream of the script. This should cover most of what people are asking for with ViPER scripts, or is at least a good bound on what can be accomplished without requiring much more knowledge. Here, you just need the necessary domain knowledge and the ability to write GTF, which is pretty straightforward.
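As a rough sketch of that contract (none of this is written yet, and the class and method names here are invented), the menu item would just run the external program with the media file name as its argument, slurp GTF/xgtf off its standard output, and hand that to the existing importer:

    // Sketch of how a Script menu item might invoke an external tool:
    // pass the current media file on the command line, collect GTF/xgtf on stdout,
    // and feed the result to whatever import routine ViPER already uses.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    class ScriptRunner {
        static String runScript(String scriptPath, String mediaFileName)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder(scriptPath, mediaFileName).start();
            BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
            StringBuilder gtf = new StringBuilder();
            String line;
            while ((line = out.readLine()) != null) {
                gtf.append(line).append('\n');
            }
            p.waitFor();
            return gtf.toString(); // to be parsed by the existing GTF/xgtf importer
        }
    }

Populating the Script menu itself should then just be a matter of listing the files in the scripts directory, one menu item per program found there.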
For example, three obvious tools could be integrated this way: IFrame marking, automatic shot segmentation, and Face/Person tracking. I've already got IFrame marking with a standard plug-in. Today, I'll ask Ming for information about his shot segmentation tool. The idea would be to have this working before Friday.
Adding Hierarchy to ViPER
One of the key features supported by the subset of MPEG-7 that VideoAnnEx uses, but not supported in the ViPER data format, is hierarchy. In VideoAnnEx, it seems to be restricted to keyword hierarchies, which should be simple to implement. I'll have to go through the actual XML that VAE outputs to see how it actually represents them (my theory is as XML schema classes).
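If it really is just keyword hierarchies, then on the ViPER side something as dumb as a little tree of labeled nodes might do. This is purely a guess at this point, not how VideoAnnEx represents it:

    // Speculative sketch: a keyword hierarchy as a simple tree of labeled nodes.
    import java.util.ArrayList;
    import java.util.List;

    class KeywordNode {
        String keyword;
        List<KeywordNode> children = new ArrayList<KeywordNode>();

        KeywordNode(String keyword) {
            this.keyword = keyword;
        }

        KeywordNode addChild(String childKeyword) {
            KeywordNode child = new KeywordNode(childKeyword);
            children.add(child);
            return child;
        }
    }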
Tracking Evaluation
One of the main points that came out of yesterday's meeting with Larry was the need to improve the state of tracking evaluation. The current tracking evaluations are based on aggregated box distances, and don't provide any real insight into the problem of tracking or even into the proposed solutions. Improved visualization, better characterization of the input data, and correlation of error regions with the properties of the data or across tracking systems would help redirect the experiment toward questions more interesting than 'how far away are things from tracks?'
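For reference, the kind of single-number summary I'm complaining about looks roughly like the following. This is a sketch of the style of metric, not the actual evaluation code; the box layout and the use of centroid distance are stand-ins:

    // Sketch of an aggregated box-distance score: average centroid distance
    // between result boxes and ground-truth boxes over matched frames.
    // This is the sort of single number that hides where a tracker actually fails.
    class BoxDistance {
        // each box is {x, y, width, height}; arrays are indexed by frame
        static double averageCentroidDistance(double[][] truthBoxes, double[][] resultBoxes) {
            double total = 0;
            int matched = 0;
            for (int f = 0; f < Math.min(truthBoxes.length, resultBoxes.length); f++) {
                double[] t = truthBoxes[f];
                double[] r = resultBoxes[f];
                double dx = (t[0] + t[2] / 2) - (r[0] + r[2] / 2);
                double dy = (t[1] + t[3] / 2) - (r[1] + r[3] / 2);
                total += Math.sqrt(dx * dx + dy * dy);
                matched++;
            }
            return matched == 0 ? Double.NaN : total / matched;
        }
    }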
I need to talk to other members of our group to find out what they are working on for VACE and how they plan to evaluate it, or at least what thoughts they have about evaluation. I should also make sure that the xgtf convert script supports resizing video (for frame rate and image dimension decimation, for example).
- posted by David @ 10:00 AM
Monday, September 13, 2004
ViPER Competitor: KLVs
KLV is a fairly straightforward binary metadata encoding format: triples of key, length, and value (the length, of course, is not part of the metadata itself but a feature of the encoding). By providing a simple way to encode arbitrary key/value pairs, the format offers both a natural way to encode the data as a stream and a very compact alternative to the verbose, non-streamable encoding of ViPER and the complex weight of MPEG-7, where it is often necessary to understand altogether too much about XML in order to get anywhere.
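To make that concrete, a toy reader might look like the following. Note that I'm assuming a simplified layout here (a fixed 16-byte key and a 4-byte big-endian length) rather than whatever key and length encodings the standard actually specifies, so treat this as a sketch only:

    // Toy KLV reader: repeatedly pull (key, length, value) triples off a stream.
    // Simplified layout assumed: 16-byte key, 4-byte big-endian length.
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    class KlvReader {
        static void readTriples(InputStream in) throws IOException {
            DataInputStream data = new DataInputStream(in);
            while (true) {
                byte[] key = new byte[16];
                try {
                    data.readFully(key);
                } catch (EOFException done) {
                    return; // clean end of stream
                }
                int length = data.readInt();   // length of the value only
                byte[] value = new byte[length];
                data.readFully(value);         // the metadata payload itself
                // ... hand (key, value) to whatever wants it ...
            }
        }
    }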
Another note is that SMPTE is developing a multimedia container format, MXF, which makes heavy use of KLV.
I will post more once I find some samples and a good list of some standard keys.
- posted by David @ 5:52 PM
Wednesday, September 08, 2004
ViPER Competitor: Ricoh MovieTool
The Ricoh MovieTool MPEG-7 editor, which is not available for download as far as I can tell, provides several advantages over other MPEG-7 editors. It includes a timeline view of the temporal decomposition (that includes keyframes!) and an XML editor (not just the view that Var7 provides) with an integrated MPEG-7 schema browser. Most of these features I found in the documentation.
- Cost: ?
- Source: Closed
- Platform: Windows
- Media Formats Supported: MPEG-1
- User Interface: Looks very incomplete, but includes a few intriguing features, like an editable XML view and a tree-type timeline for story segmentation.
- posted by David @ 7:07 PM
ViPER Competitor: Var7 MPEG-7 Editor
The Var7 MPEG-7 Video Annotation Tool, from Ersin Esen at TUBITAK Bilten, presents a method for describing videos with MPEG-7 in Windows, or at least it claims to. I've been unable to download it, and it doesn't seem to offer much beyond what VideoAnnEx claims, or even as much. The only notable benefits I can find are a view-source-style XML viewer, using what appears to be the standard Internet Explorer control for display, and a very minimal query browser. Its interface makes VideoAnnEx seem like an amazing feat of UI design, but I am unable to comment on performance or stability without access to a running program.
- posted by David @ 7:06 PM
ViPER Competitor: VideoAnnEx
VideoAnnEx, developed by researchers at IBM T.J. Watson - Hawthorne, presents the most usable MPEG-7 editor that I have been able to find. It automatically divides the video into shots, selecting a keyframe for each. It supports importing external shot segmentations, and allows the user to select a more appropriate keyframe and modify shots using right-click menus. In addition to segmentation, it can learn how to assign labels to regions of each keyframe.
- Cost: Free?
- Source: Closed
- Platform: Windows
- Media Formats Supported: MPEG-1, MPEG-2, probably other Windows Media formats
- Bugginess: Fairly buggy; it crashes pretty often, and doesn't seem to handle larger files well
- User Interface: While ugly, it gets the job done most of the time (sometimes widgets don't refresh properly, or the wrong frames are displayed in the selector)
- Performance: Sweet - killer fast. It relies on Windows Media to do the decoding, but still somehow supports near-random access through generated byte-offset index files.
- posted by David @ 4:45 PM
Friday, September 03, 2004
Beta 6
Well, this has gone on way too long. In a beta, you shouldn't introduce new features, and yet here I am, working on getting multiple selection to work. Other than that, this release focuses on getting bugs fixed. There should be little to no evidence that the selection model has changed, per the previous post, as we haven't implemented any of the required changes to display or select multiple objects in the video frame or timeline.