
Blog for work on my Master's thesis - a survey of methods for evaluating media understanding, object detection, and pattern matching algorithms. Mostly, it is related to ViPER, the Video Performance Evaluation Resource. If you find a good reference, or would like to comment, e-mail viper at cfar.umd.edu.


Media Processing Evaluation Weblog

Monday, February 28, 2005

Learning Gallery

The next step is a surveillance system with an editable gallery of actors and clips. Basically, the user, with help from vision tools, is building a database of facts about what the video recorded, usually facts about the location and interaction of objects. For example, the database could contain the location of each individual in a video set. With a few simple types of facts, the user can note many interesting things.

The two main data types we will focus on are 'actors' and 'clips', which are very similar. An actor is represented by a sequence of bounding boxes, and a clip is an interval within a video stream. Both can be arranged into stories. Within this framework, the user and a vision system can both make statements or suggest theories about the state of the world.

Both data types can be seen as segments, or sets of segments, possibly qualified with some spatial feature - e.g. a bounding box for the actors. It may be more useful to think of a general system, where we have 'temporal things' that can be qualified with spatial data, or with nominal, ordinal, or numeric data. This will allow for a wider range of queries. For example, a user can then associate a name with a person, or a keyword with each of a set of clips. We may wish to quickly establish a small lexicon of terms - name, keyword, likelihood, location, etc. - that will be useful for everything we wish to do, while allowing this dictionary to be extended later.
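
As a rough sketch of what I mean - the class and property names here are invented for illustration, not part of ViPER - a 'temporal thing' might look something like this:

// Hypothetical sketch of the 'temporal thing' idea; not ViPER API code.
class TemporalThing {
	// the frame (or time) intervals where this thing exists
	List segments = []
	// spatial qualification, e.g. a bounding box per frame for an actor
	Map boxesByFrame = [:]
	// nominal, ordinal, or numeric qualifications: name, keyword, likelihood, location, ...
	Map attributes = [:]
}

// An actor is a temporal thing qualified with boxes and a name;
// a clip is just a temporal thing carrying a keyword or two.
actor = new TemporalThing()
actor.attributes.name = 'person 12'
actor.attributes.likelihood = 0.8
clip = new TemporalThing()
clip.attributes.keyword = 'loading dock'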

Although the user interface presents the data as a set of facts, the gallery can be seen as a method of communication between the user and the vision system. As the user modifies the gallery, the vision system is given more information about the state of the world. The vision system may then go back and modify its results to take this into account, or in some way modify its output to fit the new data. The vision system incorporates feedback from the user as input, much as Photoshop-style interactive image segmentation accepts user-supplied keypoints. Here, we have a set of statements about where people occur in a video, and what they are doing. The vision system must be modified to take as input the video segments as well as some additional facts that must hold in the output.

The gallery is currently organized into actors and clips, both of which the user will want to correct. One possible, and common, error is when a person is identified as two separate people. This can be fixed in the interface by dragging one actor onto the other (sketched below). Another error is when an actor is misidentified, such as when a track follows the wrong actor after one actor walks in front of the other. There are several possible fixes for this: it can be corrected in some kind of timeline view, or with the assistance of track-summary overlays on one or a small number of video frames. Another error is when a person is lost in the background; since Nagia is working with output from Ahmed's tracker, the system will be unable to help here. However, since the person is unlikely to be moving much, this should not be difficult for the user to annotate.
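
Reusing the 'temporal thing' sketch from above (again, hypothetical names rather than anything in ViPER), the drag-to-merge fix for the first error is mostly just a union of the two actors' data:

// Hypothetical sketch: merge two gallery actors that turned out to be the same person.
def merge(TemporalThing a, TemporalThing b) {
	a.segments.addAll(b.segments)          // union of the time intervals
	a.boxesByFrame.putAll(b.boxesByFrame)  // combine per-frame boxes (b wins on any overlap)
	b.attributes.each { k, v ->
		if (!a.attributes.containsKey(k)) a.attributes[k] = v
	}
	return a  // b can then be dropped from the gallery
}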

With the exception of errors in the background subtraction, the basic statements we can make, from above, are 'this actor is actor X' and 'this actor is not actor X'.

This information can be incorporated in any number of ways. For example, Nagia currently uses color histograms to determine identity when two tracks diverge; the identity information can be used to directly modify the color model. Another possibility is to view the connections between tracks as a graph, with the color-histogram distance measure applied as edge weights; user-supplied identity information is then a modification of the edge weights. Choosing between these methods means weighing both the quality of the results and the expectations of the user. While modifying the detection models for each person may improve the overall quality of the output the most, it may also result in unexpected modifications to the gallery, possibly even creating errors where none existed before. This seems less likely to happen when the identity information is instead applied directly when computing identities at track splits.
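
As a sketch of the second option - the names here are invented; this is not Nagia's code or any ViPER API - the user's statements become hard constraints on the weights of the track-linking graph:

// Hypothetical sketch: tracks are nodes, candidate links at a track split are edges,
// and each edge weight starts out as the color-histogram distance between the tracks.
class TrackLink {
	def trackA, trackB
	double weight  // color-histogram distance
}

// Apply the two kinds of user statements from above to a candidate link.
def applyIdentityStatement(TrackLink link, boolean sameActor) {
	if (sameActor) {
		link.weight = 0.0               // 'this actor is actor X'
	} else {
		link.weight = Double.MAX_VALUE  // 'this actor is not actor X'
	}
}

The appeal is that a statement only perturbs the links it touches, rather than the whole model.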

The system can respond to changes immediately, or it may require some kind of direct intervention from the user. The user should be able to select whichever method is most effective, if the choice turns out to be a difficult one.

Monday, February 21, 2005

Gallery

The unstructured observations - the clips - the people - the actors and the objects - these go in the Gallery. Right now, the plan is to get a simple two-camera setup working, using video from Vinay, to perform simple gallery-type analysis, where we use the video to generate some unstructured observations.

Doug's talk: Why ask a computer when a person can ask the computer for you?

At the tech review, Doug went over some experiments with a QA system, comparing it to what unskilled humans could do. The humans were quicker and more than twice as accurate as the best QA system evaluated. Anyway, it was interesting to see the experimental design, which asked questions not far off the ones I want to ask a surveillance system - just using text retrieval as the testbed instead of video.

So, what could I learn from it? The human experiment asked 20 people 16 questions, in different orders, and then measured performance in both accuracy and time consumed. The computers were given a different, and easier, task: 14 questions, with more time and access to more external resources.

Wednesday, February 16, 2005

Surveillance Reports as Topic Tracking?

I just got back from a CLIP job talk by Yi Zhang. Her Yow Now news filter uses graphical models of belief to determine the relevance of articles, based on user-submitted feedback. My love of graphical models may still go academically unrequited, but the concept of topic tracking and news filtering still holds metaphoric relevance.

Monday, February 14, 2005

Meeting w/ Larry, Dave & Daniel

Opened with discussion of proposing an ARDA challenge. These are decided sometime in mid-April. Daniel noted he would be on a three-week cruise to S. America in March. The basic proposal would be for integration of vision and HCI. Daniel suggested bringing dw before an HCI brownbag. Dave suggested going to Catherine Plaisant for help, and Larry suggested Aaron Bobick at gatech.

Next, we discussed the system architecture. The elevator summary: three modules - video browser, gallery, and storyline. In my notes, I wrote 'where you observe, what you observe, and how you structure your observations'. Each module must work without vision; each module will be enhanced with vision.

The video browser should present the user with a variety of useful methods for quickly navigating video. It could also be integrated with a map somehow to provide spatial coherence to the displayed feeds. Scrubbing quickly in time can be accomplished in a variety of ways. The most basic way that vision can be used to enhance this is 'static video removal', with a slider determining how much dead video can be stripped (a rough sketch appears after these notes).

The gallery can be seen as an 'objects and people' view, or a 'notes' view, and it can support multiple views. With the simple gallery, the user can arrange items into lists or assign keywords, with views operating on those. With the vision-backed gallery, the user can still arrange them, but content-based queries and other techniques can speed up the process.

The storyline can be seen as a way of organizing the gallery, or it can be used exclusively. Currently, the storyline is used for generating chronological summaries of action; however, there could be uses for non-chronological editing (e.g. video abstracts).

Finally, we discussed building a test case scenario using some two-camera footage that the lab has. The basic use case would first be to have a user build a gallery, then see how much some simple vision can improve gallery creation. This should give us some insight into how a user can work the system, as well as how dotworld should be modified to more accurately simulate real surveillance scenarios.
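
To make the static-video-removal idea concrete, here is a minimal sketch, assuming we already have some per-frame activity measure (say, the fraction of foreground pixels from background subtraction); none of these names are real ViPER or tracker code:

// Hypothetical sketch: keep only the frames whose activity clears the slider threshold.
def removeStaticVideo(Map activityByFrame, double sliderThreshold) {
	activityByFrame.findAll { frame, activity ->
		activity >= sliderThreshold
	}.keySet().sort()  // the frames worth showing, in order
}

Raising the slider strips more dead video; lowering it approaches the raw feed.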

Thursday, February 03, 2005

invalidate elsewhen

A useful Groovy script to address a recent problem: this script crops all descriptors so they are only valid when the current 'play only where valid' descriptor is also valid. This makes it a little easier to evaluate i-frame data. I still need to address modifying the interface so that you can only edit, or even see, those frames that are selected, but that will probably have to wait until after 4.0.

import edu.umd.cfar.lamp.apploader.AppLoaderScript
import edu.umd.cfar.lamp.apploader.AppLoader

import edu.umd.cfar.lamp.viper.gui.players.DataPlayer

import java.util.Iterator

import viper.api.Config
import viper.api.Descriptor

class InvalidateElsewhen implements AppLoaderScript {
 static final String myIdentifier = "http://viper-toolkit.sf.net/samples#invalidateElsewhen"
 void run(AppLoader application) {
  // The mediator and chronicle beans are described in gt-config.n3.
  mediator = application.getBean("#mediator")
  chronicle = application.getBean("#chronicle").chronicle
  time = chronicle.selectionModel.selectedTime
  sf = mediator.currFile
  if (sf != null && time != null) {
   success = false;
   // Group all of the changes into a single undoable transaction.
   trans = sf.begin(myIdentifier);
   try {
    for( d in sf.children ) {
     assert d instanceof viper.api.Descriptor
     // Crop the descriptor's valid range to the selected time.
     r = new viper.api.time.InstantRange()
     r.addAll(d.validRange.intersect(time))
     d.validRange = r
    }
    success = true;
   } finally {
    if (trans != null) {
     if (success) {
      trans.commit();
     } else {
      trans.rollback();
     }
    }
   }
  }
 }
 String getScriptName() {
  "Crop Descriptors to Match Current Time Selection"
 }
}

Improved scripting in ViPER

So, I've noticed that people want more generic abilities to extend ViPER. Also, I like my Mac. So, I recently added the 'scripts' menu to viper-gt. The current version supports running arbitrary files from your ~/.viper/scripts/ directory. The new version will continue to support that functionality, but if the file name ends in .groovy, it will try to load the file as a groovy class that implements the edu.umd.cfar.lamp.apploader.AppLoaderScript interface. This is a more generic scripting method, and allows the script writer access to the internals of the program.

The AppLoaderScript Interface

public interface edu.umd.cfar.lamp.apploader.AppLoaderScript {
	public void run(AppLoader application);
	public String getScriptName();
}
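
For reference, about the smallest script that exercises this is below. It is an untested sketch (the file and class names are arbitrary), but dropped into ~/.viper/scripts/ as HelloViper.groovy it should show up in the Scripts menu and print the currently loaded file.

import edu.umd.cfar.lamp.apploader.AppLoaderScript
import edu.umd.cfar.lamp.apploader.AppLoader

class HelloViper implements AppLoaderScript {
	void run(AppLoader application) {
		// The mediator bean is the same ViperViewMediator used by the other scripts in these posts.
		def mediator = application.getBean("#mediator")
		// Print the currently loaded file, or null if nothing is loaded.
		System.out.println("Current file: " + mediator.getCurrFile())
	}
	String getScriptName() {
		"Hello, ViPER"  // the label shown in the Scripts menu
	}
}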

And the following is the Add I-Frames script. To add it, save it to 'InsertIframesDescriptor.groovy' in your ~/.viper/scripts directory.

InsertIframesDescriptor.groovy

import edu.umd.cfar.lamp.apploader.AppLoaderScript
import edu.umd.cfar.lamp.apploader.AppLoader

import edu.umd.cfar.lamp.viper.gui.players.DataPlayer

import java.util.Iterator

import viper.api.Config
import viper.api.Descriptor

class InsertIframesDescriptor implements AppLoaderScript {
	static final String DESCRIPTOR_NAME = "I-Frames";
	void run(AppLoader application) {
		// The mediator bean is an instance of ViperViewMediator.
		// For information about the loaded beans, see gt-config.n3.
		mediator = application.getBean("#mediator")
		insertIFrameDescriptor(mediator)
	}
	String getScriptName() {
		"Insert I-Frame Descriptor"
	}

	/**
	 * Creates a new type of descriptor, called 'I-Frames',
	 * if it doesn't already exist. Otherwise, locates it.
	 * @return the I-Frame descriptor config information
	 */
	Config insertIFrameDescriptorConfig(mediator) {
		V = mediator.getViperData()
		c = V.getConfig(Config.OBJECT, DESCRIPTOR_NAME)
		if (c == null) {
			// The config doesn't exist yet, so create it.
			c = V.createConfig(Config.OBJECT, DESCRIPTOR_NAME)
		}
		return c
	}
	
	/**
	 * Inserts a new I-Frame descriptor into the currently selected
	 * file, using the mediator's current DataPlayer object to  
	 * find where the iframes are.
	 */
	insertIFrameDescriptor(mediator) {
		c = insertIFrameDescriptorConfig(mediator)
		sf = mediator.getCurrFile()
		if (sf != null) { // check to make sure a media file is loaded
			allIFrames = sf.getDescriptorsBy(c)
			d = null // don't create duplicates
			if (allIFrames.hasNext())
				d = allIFrames.next()
			else 
				d = sf.createDescriptor(c)
			iframes = new viper.api.time.InstantRange()
			p = mediator.getDataPlayer()
			frameSpan = p.getRate().asFrame(p.getSpan())
			for ( f in frameSpan ) {
				if (DataPlayer.I_FRAME.equals(p.getImageType(f))) {
					iframes.add(f)
				}
			}
			d.setValidRange(iframes)
		}
	}
}

I've been doing this on the main branch, not on the beta 9 branch, so it won't be released for a while yet. If anyone needs this sooner, let me know. I'm doing some heavy refactoring of the timeline on the main branch, so merging the changes back to, say, beta 9.6 would be pretty easy.

