Peter Forret asked a question here about whether our technology could identify objects. This is actually quite difficult, as most objects are concepts, not a single item. Think of a chair: there is not one chair but rather thousands of different types. Recognizing all of them is way beyond the ability of technology today. On the other hand, think of an iPod (forget the mini and shuffle for a second). There is only one way an iPod looks from the front. Given a photo of an iPod we can find similar photos, but we can't necessarily identify that it is an iPod. So technology today can do object similarity search but not necessarily object recognition. For many applications, we have found similarity is sufficient for narrowing your search to a small, manageable set you can quickly browse.
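
To make the distinction concrete, here is a minimal sketch of similarity search. It is not our actual system: the three-number feature vectors, the file names and the `most_similar` helper are all made up for illustration, and cosine similarity is just one common choice of metric. Note that it only ranks photos by closeness to the query; nothing in it ever names the object.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, library, k=3):
    """Return the names of the k library photos closest to the query vector."""
    ranked = sorted(
        library,
        key=lambda name: cosine_similarity(query, library[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical feature vectors; a real system would compute these from pixels.
library = {
    "ipod_front.jpg": [0.9, 0.1, 0.0],
    "ipod_desk.jpg": [0.8, 0.2, 0.1],
    "chair.jpg": [0.1, 0.9, 0.3],
    "bicycle.jpg": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # features of a new iPod photo (assumed)
top_two = most_similar(query, library, k=2)
print(top_two)
```

The result is a short list to browse, which is the "narrowing" described above, rather than a label.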

Reminds me of the time I was working in speech-based biometrics: the difference between speaker identification (who is speaking) and speaker authentication (is it really that person). They use similar kinds of technology: the first must look swiftly through a vast library of voice profiles and pick the correct one (so optimize for speed), while the second just has to compare a voice sample with one 'model', come up with a match score and decide whether it is the right voice (optimize for equal error rate).
The 'similarity' technology you describe actually sounds more like the first one. So I can imagine you guys coming up with a library of easy-to-recognize icons (an iPod, a bicycle, the Eiffel tower, Mickey Mouse) with associated tags.
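A rough sketch of the two modes, just to show the shape of each. The `score` function (negative squared distance), the two-number "voice profiles" and the threshold value are all invented for illustration; real systems use far richer models.

```python
def score(sample, model):
    """Toy match score: negative squared distance, so higher means closer."""
    return -sum((s - m) ** 2 for s, m in zip(sample, model))

def identify(sample, profiles):
    """Identification (1-to-many): pick the best-matching profile in the library."""
    return max(profiles, key=lambda name: score(sample, profiles[name]))

def verify(sample, claimed_model, threshold=-0.05):
    """Authentication (1-to-1): accept only if the match score clears a threshold."""
    return score(sample, claimed_model) >= threshold

# Hypothetical profiles and sample.
profiles = {"alice": [0.1, 0.9], "bob": [0.8, 0.2]}
sample = [0.15, 0.85]

print(identify(sample, profiles))          # best match in the library
print(verify(sample, profiles["alice"]))   # is it really alice?
print(verify(sample, profiles["bob"]))     # is it really bob?
```

Moving the threshold trades false accepts against false rejects, which is where the equal error rate comes in.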
If there's any chance of getting into your beta program, I'd be delighted :-) This is really interesting stuff!
Posted by: Peter Forret | August 28, 2005 at 04:16 AM
Peter - we would love to have you in the beta - just send an email to beta at ojos-inc.com
Posted by: Munjal Shah | August 28, 2005 at 11:06 PM
Content-based image retrieval using computed similarity metrics can provide a decent starting point either for follow-up tagging (by humans) or as a way to dredge up a collection of otherwise unlocatable scenes, again for review by humans.
Some of the major weaknesses of systems implemented in the past have been reduced, due to the commoditization of computing and storage resources. More interestingly, the commoditization of communications cost to the "end user" opens the possibility of human user feedback to the image search system that hasn't been usefully present in the past, by aggregating both the image search click throughs and the clusters of associated tags accumulated over time.
At present, to me this makes the most sense as a service targeted at topical communities of users, rather than on my desktop or on my hosted photos only. I'd love to have my photos magically tag themselves based on who's in them, which isn't practical unless I'm Paris Hilton (and thus hang out with other well-known, recognizable people). It seems reasonably possible to identify an interesting set of landmarks and buildings, and perhaps a number of celebrity faces.
It didn't sound like you were trying for the Visionics-style parametric face recognition. Even without it, you could probably generate a page full of photos generally similar to, say, Elvis Presley, or photos that might be of (name-your-favorite-celebrity-here), which could then be more or less voted on (by clickthroughs and tagging) by a community of users forming an opinion about how well the images met their expectations.
The stock photo / video / audio agencies have been grappling with the tagging / keywording problem for a long time, typically ending up with a small set of people who know particular collections and associated keywords.
Image content-based search can be a useful starting point, but the 2-way web, incorporating the users' collective knowledge, is just becoming possible and could really make things interesting.
Posted by: Ho John Lee | August 28, 2005 at 11:21 PM
Exactly. All of the computer vision technologies alone can only do so well. User tagging alone doesn't scale (we are just too lazy to do it for the tens of thousands of photos we have). If you can marry computers doing 80% of the work with users who just fix errors and handle exceptions, you have a very powerful solution. This is what we are shooting for.
Posted by: Munjal Shah | August 30, 2005 at 06:04 PM
Much like the nuances of our gait affect the wear on our shoes, you leave an impression on the photos you take. Any algorithm for analyzing photos will involve configurations and thresholds. To meet the needs of a broad audience, median values may be chosen for best results most of the time. There is another solution.
The computer alone falls short and tagging doesn't scale. I agree. The solution is to use the computer to analyze but get the user to provide frequent and minimal feedback.
Take the algorithm you employ and concatenate all the variables required to configure it. Convert this concatenation into a raw string of zeros and ones. This is the DNA of an instance of the algorithm. Ship the product with a population of a couple dozen individual strings of DNA. Any time the user does a search a random individual is chosen to configure the algorithm. Then monitor the user reaction to the search. If they click on a bunch of photos maybe conclude it worked. If they start a new search right away assume it was a failed search. Give that individual DNA an appropriate score. After some number of searches throw away the lowest scoring individual DNA. Then "breed" a second generation.
Take two strings of 0's and 1's, choose a splice location, and take the front of one and the end of the other. Sprinkle in a couple of mutations and continue until you have generation 2.
Over many iterations the population of DNA should move toward individuals that perform the best searches. This will be a configuration tailored to the user. The great thing is that if the user's photo collection changes with time, so should the search.
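Here is a minimal sketch of the breeding step, assuming each individual is a plain string of '0' and '1' characters and each already has a score from monitoring searches. The function names, the mutation rate and the choice to drop exactly one individual per generation are my own illustrative assumptions.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Splice the front of one parent onto the end of the other."""
    cut = rng.randrange(1, len(parent_a))  # splice point, never at the very ends
    return parent_a[:cut] + parent_b[cut:]

def mutate(dna, rng, rate=0.02):
    """Flip each bit independently with probability `rate`."""
    flipped = []
    for bit in dna:
        if rng.random() < rate:
            bit = "1" if bit == "0" else "0"
        flipped.append(bit)
    return "".join(flipped)

def next_generation(scored, rng, drop=1):
    """scored is a list of (dna, score) pairs.

    Throw away the `drop` lowest scorers, then breed children from the
    survivors until the population is back to its original size.
    """
    ranked = [dna for dna, _ in sorted(scored, key=lambda p: p[1], reverse=True)]
    survivors = ranked[:len(ranked) - drop]
    children = []
    while len(survivors) + len(children) < len(scored):
        a, b = rng.sample(survivors, 2)
        children.append(mutate(crossover(a, b, rng), rng))
    return survivors + children

rng = random.Random(0)  # seeded only so the example is repeatable
scored = [("0000", 1), ("1111", 5), ("1010", 3), ("0101", 0)]
generation_2 = next_generation(scored, rng)
print(generation_2)
```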
Posted by: Michael Artemiw | September 02, 2005 at 07:40 PM
Michael, you seem to be describing a sort of genetic algorithm for tagging/determining relevancy. We are exploring ways to learn from user responses to make these better. This is an interesting approach. What I don't understand is exactly what to encode and what objective function to use.
Posted by: Munjal Shah | September 03, 2005 at 09:37 PM
Munjal,
Sorry I wasn't clearer in my last post.
You want to encode all inputs to your algorithm except for the image. This includes all values that might currently be hardcoded into the code.
These would include values that are used in heuristic calculations. For instance, if you use some sort of threshold to guess whether a photo is taken indoor or outdoor, this threshold value should be encoded in the "DNA".
The concatenation of all these encoded values would comprise the "DNA".
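As a sketch of what that encoding could look like, assuming each value has a known range and is quantized to a fixed number of bits. The parameter names, ranges and bit widths below are hypothetical; I don't know what your algorithm actually uses.

```python
# Hypothetical parameter layout: name -> (minimum, maximum, number of bits).
PARAMS = {
    "indoor_outdoor_threshold": (0.0, 1.0, 8),
    "color_weight": (0.0, 1.0, 8),
    "max_results": (10, 73, 6),
}

def encode(values):
    """Quantize each parameter to its bit width and concatenate into one string."""
    dna = ""
    for name, (lo, hi, bits) in PARAMS.items():
        levels = (1 << bits) - 1
        step = round((values[name] - lo) / (hi - lo) * levels)
        step = max(0, min(levels, step))  # clamp out-of-range values
        dna += format(step, f"0{bits}b")
    return dna

def decode(dna):
    """Cut the string back into fields and rescale each to its range."""
    values, pos = {}, 0
    for name, (lo, hi, bits) in PARAMS.items():
        levels = (1 << bits) - 1
        step = int(dna[pos:pos + bits], 2)
        values[name] = lo + (hi - lo) * step / levels
        pos += bits
    return values

dna = encode({"indoor_outdoor_threshold": 0.0, "color_weight": 1.0, "max_results": 10})
print(dna)
print(decode(dna))
```

Because every string of the right length decodes to a valid configuration, splicing and mutating can never produce an illegal individual.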
The objective function is a little tricky in this case. You basically want to know if this instance of the algorithm did a "good" job.
In the most basic form you could ask the user if they liked the results. This could be a little tedious. Another solution would be to infer from the user's actions whether or not the results were "good". This would be more error prone but may average out in the long run.
What you might do is monitor what they do after they run your algorithm. If they immediately run it again you might infer that they didn't like the results. If they look through the results for a while and then select some of them then this might have been a good search. If they look through the results for a long time and don't select any, then maybe that is a bad search.
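Those three rules could be sketched as a scoring function like the one below. The `SearchSession` fields, the 60-second cutoff for "a long time" and the particular score values are guesses for illustration and would need tuning against real usage.

```python
from dataclasses import dataclass

@dataclass
class SearchSession:
    """What was observed after one search (all fields are hypothetical)."""
    seconds_browsing: float
    photos_selected: int
    reran_immediately: bool

def implicit_score(session, long_browse_seconds=60.0):
    """Infer how good a search was from what the user did next."""
    if session.reran_immediately:
        return -1.0   # searched again right away: probably a failed search
    if session.photos_selected > 0:
        return 1.0    # browsed and picked something: probably a good search
    if session.seconds_browsing >= long_browse_seconds:
        return -0.5   # looked for a long time and picked nothing: probably bad
    return 0.0        # not enough evidence either way

print(implicit_score(SearchSession(5.0, 0, True)))
print(implicit_score(SearchSession(45.0, 3, False)))
print(implicit_score(SearchSession(120.0, 0, False)))
```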
This is all theoretical, but I think you might have some success with this technique. I have some experience applying genetic algorithms to graphics algorithms, feel free to email me if you would like to talk in more depth.
Good luck.
Posted by: Michael Artemiw | September 04, 2005 at 10:48 PM
Michael - the encoding seems complex, but the idea of inferring whether the system was correct from the user's implicit actions is of course a very scalable input.
Posted by: Munjal Shah | September 12, 2005 at 10:28 PM
You're correct about the complexity. The upside is that this complexity is just tacked on the front of the software. It doesn't have to affect the development of your algorithms.
You can easily design the software so that you run it regularly or within the genetic algorithm container. The genetic algorithm container contains all the encoding complexity and hides that nicely from the rest of the software.
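A minimal sketch of such a container, assuming your search is a function that takes a query plus keyword parameters. The class name, `search_fn` and `decode_fn` are placeholders I made up, not anything from your code.

```python
import random

class GeneticContainer:
    """Wraps a search function and hides the DNA bookkeeping from it."""

    def __init__(self, population, search_fn, decode_fn, seed=None):
        self.scores = {dna: [] for dna in population}  # feedback per individual
        self.search_fn = search_fn    # the unchanged search algorithm
        self.decode_fn = decode_fn    # turns a DNA string into keyword arguments
        self.rng = random.Random(seed)
        self.current = None

    def search(self, query):
        """Pick a random individual, configure the algorithm with it, and run."""
        self.current = self.rng.choice(sorted(self.scores))
        return self.search_fn(query, **self.decode_fn(self.current))

    def feedback(self, score):
        """Credit the individual that configured the most recent search."""
        if self.current is not None:
            self.scores[self.current].append(score)

# Stand-ins for the real algorithm and decoder (both hypothetical).
def fake_search(query, threshold):
    return (query, threshold)

def fake_decode(dna):
    return {"threshold": int(dna, 2)}

container = GeneticContainer(["01", "10"], fake_search, fake_decode, seed=0)
result = container.search("beach")
container.feedback(1.0)
print(result, container.scores)
```

Running the algorithm "regularly" is then just calling `fake_search` directly with fixed parameters.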
I look forward to reading more about your solution to this problem and your other adventures with Ojos.
Cheers,
Michael Artemiw
Posted by: Michael Artemiw | September 13, 2005 at 07:52 PM