Show me the content!
As I mentioned in my last post (which was so long, you’re probably still reading it), a key to both SEO and keeping users coming back for more is having lots of content that people are interested in. After talking one-on-one with some friends who were excited about the concept of the site, yet didn’t actually use it, this priority was validated as the single item needing the most immediate attention.
KitchenPC will most likely iterate and become one of many things depending on the protracted and painful process of market validation, but one thing is certain: as long as the mission of KitchenPC is to be a recipe website, having lots of recipes is kinda important. I’m also placing a huge bet that a fully normalized, relational representation of recipes is not only something that hasn’t been successfully done before, but something that, if done correctly, can unlock both features and revenue models that are not possible with traditional recipe websites.
I’ve determined three possible ways to get content on one’s site. I’ll discuss the pros and cons of each way, and what I’ve learned from trying.
Crowd Sourcing – Users will just enter content, right?
Yea, right. Crowd sourcing is the technique that led to the success of Internet megaliths such as IMDB and Wikipedia; however, this is a strategy that is, in my opinion, only applicable to maintaining momentum once you have it. In other words, only once you have a massive amount of content that can draw a multi-million eyeball user base can crowd sourcing be used to achieve consistent, reliable and up to date data. There have been many studies on the “Wiki” phenomenon, which hypothesize that when you allow a crowd to maintain a large set of data which can be modified by anyone, the “good” data will win out against the “bad” data. I’m applying this strategy to my own website, allowing anyone to edit any recipe. With that said, when a website is in its infancy, crowd sourcing is most likely not a viable approach for generating initial content. People don’t just magically appear to do work for you; however, they love to be part of a larger cause where their work will be seen by millions. If KitchenPC takes off, I think the “quality” of my recipes will improve through crowd sourced efforts, but from what I’ve seen so far, almost no one has decided to enter a recipe voluntarily on the website. It could be that they don’t because it’s a complete pain in the ass right now, but that’s a subject for another post.
Just pay people!
This is a great approach if you have deep pockets or significant funding. I’m sure if I had a $50,000 angel investment, I could hire a team of skilled workers to transcribe recipes by the thousands, maintain the ingredient database, and manufacture a large database of awesome recipes. However, I have a pretty limited budget, as I’m bootstrapping this company from my savings account.
My initial approach was a very naive one. I used the website getacoder.com to post a contract for a single data entry person to type in 10,000 recipes. For those familiar with the site, the majority of bidders are from developing countries and will work for extremely cheap (by U.S. standards). However, on this particular site I’ve found most bidders just copy and paste the same canned responses into bids, don’t really investigate the project in detail, and underbid just to get their foot in the door. You’re lucky to get someone to even start on the project, let alone finish. I learned pretty quickly that I had to break the project up into smaller amounts of work, such as 1,000 recipes. Unfortunately, the bids for 1,000 recipes are not 1/10th the price of 10,000 recipes; they’re actually about the same. I hired about five people to enter 1,000 recipes each, and three of them never started or responded to any emails once I approved their bids. One entered around 290 recipes and then quit. Another was up to around 480 after three months of working; she would disappear for days at a time and then get back to work. Right now, I haven’t heard from her in weeks despite sending several emails.
Another frustrating thing about getacoder.com is you have to put the money into an escrow account immediately, and when the coder fails to deliver, getting your money back can take weeks or months. I’ve found this company to be somewhat shady and unpleasant to deal with, and several people consider the company a giant fraud.
Recently, I’ve moved my outsourcing efforts to vWorker.com which is so far proving to be much better. Their site is faster and more responsive with a nicer UI and better tools. The feature that really sold me is the ability to accept multiple bidders on a single project. I posted a project to enter 1,000 recipes, and accepted my favorite 20 bidders, which “spawned” 20 new projects automatically that I can manage and escrow separately. Since my goal is to get 10,000 recipes on the site within the near future, I decided hiring 20 workers would be a good initial number to try. This requires thousands of dollars in escrowed cash, but I fully expect to get most of that money back as most workers will not finish this task.
Out of the 20 workers, five quit within the first few days. vWorker allows workers to quit within 24 hours without suffering a bad review or having to go through mediation. Four workers have had money escrowed for over a week, but have not responded, done any work, or even created accounts on KitchenPC. Two workers have created an account but entered zero recipes; I believe one of those two has been trying to enter recipes, but insists on pasting in ingredients, so all his recipes end up in the “manual approval” queue and never get published on the site. Nine of the twenty are indeed entering recipes, which is far better than I had predicted. The top worker has already entered 203, which is fantastic! The other workers have entered 88, 69, 40, 37, 21, 17, 7 and 7. Luckily, there are about five of these workers who I think are doing great work, entering accurate recipes, and even submitting pictures for their recipes. I’m fairly confident that at least five of them will finish their 1,000 recipe requirement, and hopefully some will agree to enter an additional 1,000. I’ve found most of these people really want to do good work, and will learn if you spend the time to correct their mistakes and coach them. However, doing this has been my full time job for the last several days!
I’ve been trying a few motivational techniques as well to inspire these workers to do great work. First, I email the entire group daily and post the current scores. This allows people to see where they place in the group, and they get excited if they’re near the top. If one worker has only posted a couple recipes and someone else has posted dozens, they feel embarrassed and will work extra hard that night.
vWorker also allows me to offer “instant bonuses”, which are amounts of cash immediately placed in the worker’s account. During the first few days, I selected one worker who was doing a particularly amazing job, gave him an extra $50 as a bonus, and congratulated him in the daily score email so that others would learn of his accomplishment and the reward.
A couple days ago, I decided to hold a contest to see who could enter the best “paella” recipe (to appease a friend who complained about the lack of this Spanish dish on my site), offering the winner a $10 bonus. Over 20 paella recipes have already been submitted, most with pictures and very carefully entered methods. I plan to hold a few more of these “contests” to fill in recipes for missing tags and round out the database where various cuisines are underrepresented.
Automated Data Entry
The third and final approach to content generation is to automatically import it from another source, or to become a portal to existing content on the Internet while adding value through aggregation. There are a ton of recipe sites that already do this, simply consolidating recipes from other sources on the Internet.
I’ve worked thus far under the assumption that automated data entry is impossible. Importing recipes that merely have to be human readable would be incredibly easy, but KitchenPC has to index raw metadata, allowing it to understand the inner workings of the recipe and how it relates to ingredients the user has to shop for, various pantry amounts, and other recipes that use similar ingredients. I believe fully automated parsing of recipes is technically possible, but doing it accurately is not something I’ve been able to accomplish yet, nor probably will be for years.
For these reasons, I’d previously dismissed this idea as a viable approach to initial content generation. However, after months of frustration getting recipes manually entered, and still having barely over a thousand recipes, I’ve decided that data entry automation is worth a closer look. Perhaps a compromise could be made: partially automate the importing of recipes while isolating just the part that computers can’t do.
The first thing I did was obtain a collection of about 8,000 recipes in XML format. I chose this set of recipes since it was already partially normalized. The amounts and units were in different XML tags and could be extracted easily. However, the ingredient name and form are still highly variable, such as “packed brown sugar” or “fillet of salmon” or “your favorite fruit”. The core of the problem would be mapping the ingredient descriptions to valid KitchenPC ingredient entities.
I created a list of the distinct ingredients across the 8,000 recipes, ignoring things like spacing and case, which resulted in about 12,000 unique ingredient descriptions across the database. I uploaded this data to Amazon Mechanical Turk, a website that allows you to create small, repetitive jobs for humans to do. This allowed me to farm out the task of mapping these descriptions to a list of “known KitchenPC ingredients” across a crowd of human workers and get somewhat accurate results.
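The deduplication step itself is simple. Here’s a rough sketch in Python of what it might look like (the data shapes and sample strings are invented for illustration; KitchenPC itself doesn’t run on Python):

```python
from collections import Counter

def normalize(raw):
    # Collapse case and internal whitespace so "Packed  Brown Sugar"
    # and "packed brown sugar" count as the same description.
    return " ".join(raw.lower().split())

def distinct_ingredients(recipes):
    # recipes: iterable of ingredient-string lists, one list per recipe.
    # Returns a Counter of normalized description -> occurrence count.
    counts = Counter()
    for ingredients in recipes:
        for raw in ingredients:
            counts[normalize(raw)] += 1
    return counts

sample = [
    ["Packed  Brown Sugar", "fillet of salmon"],
    ["packed brown sugar", "your favorite fruit"],
]
counts = distinct_ingredients(sample)
# 3 distinct descriptions; "packed brown sugar" appears twice
```

Sorting the resulting counts by frequency is also handy later, since the most common descriptions are the ones worth mapping most carefully.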
I decided to pay 3 cents per match, each of which took the average worker about 40 seconds. Hundreds of people worked on this problem overnight, and the set was fully matched in around 15 hours. The results, however, were incredibly disappointing.
The root of the problem is that workers have no incentive to care about data quality when the more matches they do, the more money they make. While there were thousands of perfect matches, the dataset was littered with randomized answers. For example, “chicken breasts” would get matched to “Baby Ruth Candy Bar”. Several hundred ingredients got matched to Cinnamon Toast Crunch cereal. I found two workers who each matched over 1,000 items to the first item on the list, which leads me to believe that people have written automated scripts that accept these jobs and submit random answers. Even those who tried would often be lazy, matching things like “white rice” to “rice vinegar” because they’d just search for the first match on the main word.
I cringed at the thought of weeding through 12,000 results by hand to remove the bogus entries, and was about to chalk it up to experience and just pay everyone anyway ($400 out the window!). The thought of paying all these idiots who just wanted to game the system for a quick buck really made me ill, but bulk rejecting every answer would be unfair to the majority of workers, who did put in valid answers. There had to be a way to clean up this data set using statistics and assumptions about human behavior; I really wish I had Steven Levitt on speed-dial, since he’d have loved this type of problem!
Luckily, a friend of mine offered to lend her Excel experience and devise a way to sniff out the bad answers using some creative data pivoting and grouping. Mainly, she looked for ingredients that got “picked” more times than usual, and for users who submitted the most work. Going through about 600 users was considerably easier than going through 12,000 matches. We could look at workers one at a time, quickly see all their answers, and decide within seconds whether they were mostly bogus or actually trying. For each user, we would either bulk reject or bulk approve all their answers at once. My friend spent around 4 or 5 hours on this spreadsheet and emailed the results back to me with the bad answers rejected. I definitely owe her dinner for this one!
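The same worker-level screening could be scripted instead of pivoted by hand. Here’s a sketch of the idea in Python; the column names and thresholds are made up for illustration (the real Mechanical Turk results CSV uses its own headers, which depend on your HIT template):

```python
from collections import defaultdict

def group_by_worker(rows):
    # rows: dicts with hypothetical "worker_id" and "answer" keys,
    # e.g. parsed from the results CSV with csv.DictReader.
    by_worker = defaultdict(list)
    for row in rows:
        by_worker[row["worker_id"]].append(row["answer"])
    return by_worker

def suspicious_workers(by_worker, max_repeat_ratio=0.5, min_answers=20):
    # Flag workers who gave the same answer for most of their HITs --
    # the "matched 1,000 ingredients to the first item on the list" pattern.
    flagged = set()
    for worker, answers in by_worker.items():
        if len(answers) < min_answers:
            continue
        most_common = max(answers.count(a) for a in set(answers))
        if most_common / len(answers) >= max_repeat_ratio:
            flagged.add(worker)
    return flagged

rows = ([{"worker_id": "A", "answer": "first item"}] * 30
        + [{"worker_id": "B", "answer": str(i)} for i in range(30)])
flagged = suspicious_workers(group_by_worker(rows))
# worker "A" is flagged; worker "B" is not
```

A second heuristic, flagging answers that get “picked” far more often than their real-world frequency would suggest, drops in the same way: aggregate by answer instead of by worker.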
I also have to hand it to Mechanical Turk for a pretty awesome user interface. I could download the results as a CSV file, open it with Excel and put x’s in the approve or reject column, and upload the file back to Amazon to process.
The problem with this approach is it’s unfair to workers who did a bad job at matching overall, but did manage to get a few right answers. Since we either paid for everything or paid for nothing, this was the compromise we had to make. It also resulted in several nasty emails from these workers wondering why their results were all rejected.
I’ve learned some valuable lessons experimenting with Mechanical Turk. First, while it’s a good tool, assume around 25% of the results you get from it are going to be completely bogus. One way to work around this is to assign each HIT to two workers, and only accept the HIT if both random parties agree on the answer. This, of course, means you’ll pay twice as much for the results. Also, since KitchenPC matching is somewhat open to personal opinion (if a recipe called for apples, I have about five different varieties, so I instructed workers to just pick their favorite or most common), this could severely limit the chances of consensus. Another approach is to “pre-screen” workers by only letting approved workers work on your HITs. One way to pre-screen is to issue a test which workers must complete successfully before they can work. The test might contain a few “tough” matches, or simply test their culinary knowledge in general. However, this would limit the pool of workers who could work on each batch, thus slowing down the delivery of results. I had about 600 people work on this set at once and it still took 15 hours. Plus, scammers could still game the system by answering your test and then bulk submitting bogus answers.
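The two-worker agreement idea is mechanical enough to sketch. Something like this (in Python, with invented HIT and worker IDs) would split results into a trusted set and a disputed set that needs a human look or a third worker:

```python
from collections import defaultdict

def consensus(assignments):
    # assignments: (hit_id, worker_id, answer) tuples; each HIT is
    # assumed to have been assigned to two different workers.
    answers = defaultdict(list)
    for hit_id, _worker, answer in assignments:
        answers[hit_id].append(answer)
    accepted, disputed = {}, []
    for hit_id, given in answers.items():
        if len(given) == 2 and given[0] == given[1]:
            accepted[hit_id] = given[0]   # both workers agreed
        else:
            disputed.append(hit_id)       # needs a tiebreaker
    return accepted, disputed

accepted, disputed = consensus([
    ("h1", "w1", "white rice"), ("h1", "w2", "white rice"),
    ("h2", "w1", "rice vinegar"), ("h2", "w3", "white rice"),
])
# h1 is accepted; h2 goes to the disputed pile
```

For opinion-heavy matches like the apple varieties, comparing canonical ingredient IDs (or even ingredient categories) instead of raw answer strings would loosen the agreement test and rescue some of those near-misses.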
Any way you look at it, the results generated by Mechanical Turk cannot be trusted as an accurate mapping to bulk import thousands of recipes.
However, I believe the results I got from Mechanical Turk (around 9,000 approved mappings) will still be useful for building a solution that makes importing recipes easier. My idea is to import each recipe one at a time, using the Turk data to select a “default mapping” for each ingredient. This would save me from having to search a list to map every ingredient by hand; I’d only have to glance at the default choice and make sure it’s right. If it weren’t right, I would change it, and that choice would become the new default mapping. Another approach would be to go through the top 1,000 or so most common ingredients in the set, map them by hand, and then bulk import any recipe that uses only these “blessed” ingredients. Using this technique, I believe I could import a few hundred recipes per day, which is better than nothing.
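The default-mapping loop might look something like this sketch in Python (every name and ID here is made up for illustration; in particular, `confirm` stands in for the human glance-and-correct step):

```python
def norm(raw):
    # Same normalization used when building the Turk mapping table.
    return " ".join(raw.lower().split())

def import_recipe(raw_ingredients, defaults, confirm):
    # defaults: {normalized description -> ingredient id}, seeded from
    # the approved Turk matches. confirm(raw, suggestion) is the human
    # step: glance at the suggested mapping and fix it when it's wrong.
    mapped = {}
    for raw in raw_ingredients:
        suggestion = defaults.get(norm(raw))
        final = confirm(raw, suggestion)
        defaults[norm(raw)] = final    # correction becomes the new default
        mapped[raw] = final
    return mapped

defaults = {"white rice": "ing:rice"}                     # from Turk data
confirm = lambda raw, s: s if s else "ing:" + norm(raw)   # stand-in for a human
mapped = import_recipe(["White Rice", "saffron"], defaults, confirm)
# "White Rice" maps via the default; "saffron" gets a fresh mapping
# that is remembered for the next recipe
```

The “blessed” variant is just the degenerate case of this loop: if every ingredient in a recipe already has a default, skip `confirm` entirely and bulk import.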
I’ve learned that no one solution can be applied to these sorts of problems. I think my quest to pull content into the website will be accomplished by a combination of many techniques in parallel, each with its own strengths and weaknesses. I’m also not hugely worried about data quality, as I’m taking the approach that quantity is actually better than quality in this particular case. Quality can be improved over time through editing of each recipe, and the recipes that are the most accurate and complete will “bubble” to the top of search results with higher ratings.
AllRecipes has several hundred thousand recipes, and most likely they have their fair share of total crap recipes too. However, you never see them because they’re buried down on page 47 of your search results.
Hopefully, my experiences will help someone who’s also looking to generate initial content on their site. Cheers!