(This post is a continuation of my prior posts on the transition from Riya 1.0 to what comes next. The previous episodes are:
8, 7, 6, 5, 4, 3, 2, 1.)
Episode 9: Aug 1 to Aug 30: User Tests Decide (Part 1)
August was integration month. By the end of the month we wanted to get visual search for faces out the door. While the algorithms were in good shape, we now had to integrate them with the almost 45MM faces we had crawled from MySpace.
The data handling and cleansing was a beast and consumed a good part of the month. The algorithm would guess the ethnicity of the person, but if the MySpace profile had this information we would default to that. It turned out that people tended to lie on their profiles as a joke, so that had to be adjusted. Millions of photos were just too blurry and had to be thrown out. We built up a large QA team, which Sowmya and Dinakar helped to quickly staff and run, to find all of the issues.
Vincent and Nikhil worked to the bone. I suspect there was a week or more there where they worked 20 hours a day. They never complained and just got it done. Even more so I sensed a tremendous pride. They really wanted to make this work.
On Baris’s team things were ahead of schedule, in the sense that they had a working version of our product similarity search with 250K unique SKUs in it. It wasn’t production quality like the face stuff and would still need more work, but it was something we could test on users.
Karl, our aggressive, focused, and prodigious researcher, brought in almost 60 MySpace users between the ages of 16 and 25 and showed them the MySpace search tool and the shopping tool. The comments were interesting.
a) People were initially excited by the concept of face similarity but quickly decided it was more of a gimmick than something they would use very often.
b) On the shopping side they (especially the girls) were especially excited by the concept and said they would use it all the time, as there was no real alternative for shopping visually.
Karl asked users to do 30 searches of each type and rate them on a five-point scale (with five being the best). The results were very surprising:
a) Average score for face similarity 2.59 out of 5
b) Average score for shopping similarity 4.0 out of 5
Huh? Why did faces score so badly? Vincent believed he knew why. He wanted another week to tune it and clean up the system. We agreed on the extra time (and hence expense), but I was firm that it would not ship unless it got to at least 3.75.
I sent an email to the company stating that our launch criterion was a score of at least 3.75 for any similarity product we launch (I really wanted a 4, but I was practical enough to let it drift upward over time).
While Vincent was tuning the system, I had Karl run one more test. I asked him to do what I called the “Theoretical Max test”. He hand-built “perfect” face similarity results himself (not using the computer) and asked users to score them on the same scale. It is logical to assume these human-built search results should yield a score of 4.5. They didn’t. Karl’s theoretical max test yielded only a 3.5.
What an interesting mystery. I could only theorize that either tens of millions of years of biology had perfected our visual cortex (of which a large part is dedicated to faces) to be so good that nothing we built could satisfy it, or that face similarity is not only subjective but also arbitrary. In fact Karl’s tests supported the second theory: for the exact same result, one person would give a 1 and another a 5.
It was fascinating, seminal research. Dan and Vincent came up with a neat UI trick that morphed two faces together to show you how they were similar, and we all felt this simple item might highlight the similarities in a way that improved the scores.
Finally, Vincent was ready for the re-test. Karl brought in another 50 people. All day everyone was bugging Karl for the “exit polls”. However, I instructed Karl to hold the results until he had time to double check and triple check his work and then publish them. The suspense in the office was incredible.
Finally the next day at around 2pm Karl was ready. We had improved to 3.54 but not gotten to 3.75. Arrghhh… what a bummer.
The scores did improve but not by enough. Ironically they did beat the theoretical max a little bit (we assume that was just some sampling error in one or the other), but what this meant was that they did as well as a human, which for technology results is awesome. However, for a great product you have to meet or exceed the expectations of users, not some technical milestone.
The face team was disappointed, but my sense was that they understood the decision-making process and deemed it rational. This is always key for an executive when a team works this hard and is this passionate. If instead the decision to launch was just taken by an exec arbitrarily, based upon his own opinion, you would see a loss of confidence (correctly so, in my opinion).
Vincent still believed he knew of a fix that might improve the results further. However, it was going to take 4 more weeks. Thus we didn’t can the whole face project but decided to put it on ice until after shopping goes out and re-evaluate at that time.
