The Itsy Bitsy Spider (Part 2)
Over the past couple weeks, I’ve had a spider crawling around the Internet (for some reason, I imagine this spider wearing a chef’s toque) looking for recipes to download. Last I checked, it had downloaded just under half a million recipes, totaling around 48,000 megabytes of data. However, as I’ve mentioned before, this is the raw HTML response from the web server. Extracting the actual recipe data from it, and normalizing it into a form which KitchenPC can work with, will be the next major challenge. That’s where the indexer fits in, which is what I’ll be talking about today.
In principle, the indexer is quite simple. It queries the database for a list of pages that have not yet been linked to a produced recipe ID. It loads a thousand pages at a time, so as not to use up too much memory, while also avoiding a new SELECT for every single page; the indexer can process around 20-30 pages per second, so per-page queries would add up fast.
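The query is along these lines (a sketch; the exact column names are simplified for illustration):

select * from Indexer.Pages where RecipeId is null limit 1000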
Once the HTML is loaded into memory, the first job is to parse it, similar to how a web browser would parse HTML into a Document Object Model. In the .NET world, the standard parser for HTML is the HTML Agility Pack. Honestly, I find this parser rather disappointing, especially compared to the raw awesomeness of BeautifulSoup (the Python-based HTML parser I use for my crawler). The library is stable, but hasn’t really been updated in over two years, and the documentation is almost non-existent. The interface is all XPath-based, and poorly implemented at that. It’s difficult to find nodes that partially match what you’re looking for, or that contain a specific CSS class.
To get around this, I wrote a series of extension methods for the Agility Pack that allow me to recursively walk each node, run an anonymous function against it, and return a collection of nodes where this function returns true. I then created wrappers around this method for common tasks such as finding a node below a parent node that contains a certain class.
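In rough form, the idea looks something like this (a simplified sketch, not the production code; FindAll and FindByClass are just the names I’m using here):

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlNodeExtensions
{
   // Recursively walk every descendant of this node, returning the ones
   // for which the predicate returns true
   public static IEnumerable<HtmlNode> FindAll(this HtmlNode node, Func<HtmlNode, bool> predicate)
   {
      foreach (var child in node.ChildNodes)
      {
         if (predicate(child))
            yield return child;

         foreach (var match in child.FindAll(predicate))
            yield return match;
      }
   }

   // Wrapper for a common task: find descendant nodes carrying a given CSS class
   public static IEnumerable<HtmlNode> FindByClass(this HtmlNode node, string className)
   {
      return node.FindAll(n =>
         n.GetAttributeValue("class", "").Split(' ').Contains(className));
   }
}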
I’m hoping these improvements to the Agility Pack can someday be released back to the community, as I think they greatly improve the library. Though this would probably be done as an extension DLL or compiled directly into the main library, part of me wants to write a full .NET port of BeautifulSoup, which might be a bit challenging due to Soup’s Pythonic architecture.
Once the DOM is parsed, I can then start extracting hRecipe-specific content. This is where things get difficult.
Where Things Get Difficult
The hRecipe Microformat spec is a complete disaster.
A major issue is that most sites only bother to embed these microformat tags so Google will display rich snippets for their results. This is made evident by the fact that these sites implement only enough of the spec to satisfy the information shown in the rich snippet. The recipe title, description, prep/cook times, and rating will be implemented perfectly, while the method, ingredients, and other more detailed aspects of the recipe are lackadaisically slapped together. For example, the hRecipe spec calls for the method to be tagged with the instructions class, but AllRecipes chooses to use directions instead.
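For reference, a well-formed hRecipe fragment looks something like this (a made-up example; the class names are what matter):

<div class="hrecipe">
   <h2 class="fn">Yummy Cake</h2>
   <span class="yield">8 servings</span>
   <ul>
      <li class="ingredient">2 cups flour</li>
      <li class="ingredient">3 large eggs</li>
   </ul>
   <!-- Per spec; AllRecipes tags this with "directions" instead -->
   <div class="instructions">Mix everything together and bake at 350.</div>
</div>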
A lot of the tags allow for arbitrary text. For example, the yield tag, which describes how many servings the recipe makes. Many sites use difficult-to-parse text here, such as “a whole cake” or “3 to 4 servings”. In other words, there is no standard way to express the number of servings, which KitchenPC needs in order to allow serving size adjustments in the UI.
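That leaves the parser guessing. Something along these lines is about the best you can do (a simplified sketch, not my exact heuristics):

using System.Text.RegularExpressions;

static byte? ParseYield(string text)
{
   // "8 servings" or "3 to 4 servings": grab the first number and hope
   var match = Regex.Match(text, @"\d+");

   byte servings;
   if (match.Success && byte.TryParse(match.Value, out servings))
      return servings;

   return null; // "a whole cake" and friends: nothing usable here
}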
The hRecipe spec also lacks standard fields for ratings, comments, or specific durations. You might be wondering, then, how Google shows this information in their rich snippets. Well, that’s because Google has proposed its own additions to the hRecipe standard. The Google additions allow specific cook times and prep times, as well as review information such as total votes and average rating. Many of Google’s additions are also supported by ZipList, which has its own interpretation of the microformat as well.
Due to Google’s weight, most sites will craft their content in a way that will appease Google, thus I took the route of modeling my parser after theirs. With any luck, these additions will eventually become part of the official hRecipe spec anyway.
if (Site == "AllRecipes") { // Le sigh
The good news is that if your content is perfectly hRecipe compliant (an emphasis on if, as I’ve yet to see such a site), my parser will be able to extract recipe content perfectly. Perhaps Opera should get into the recipe site business.
In reality, there was absolutely no way to avoid special-casing all the major recipe web sites to handle the random chaos they threw at me.
To solve this problem, I decided to implement a flexible architecture that allowed me to override individual behaviors while parsing any site. I first wrote a base class called hRecipeParser:
class hRecipeParser : IParser
{
   // Parses HTML exactly to hRecipe spec
}
This class has various virtual methods, such as ParseTitle, ParseServings, and ParseRating. The base implementation parses these properties from the HTML per spec; however, this behavior can be overridden by a subclass. That way, I can subclass hRecipeParser for other sites that mostly conform to this standard:
[Parser("www.food.com")]
class FoodDotComParser : hRecipeParser
{
   protected override byte ParseServings(HtmlNode document)
   {
      // Food.com hides its serving size in a hidden input tag
   }
}
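Fleshed out a bit, the pair might look like this (a sketch reusing the FindByClass/FindAll helpers and ParseYield from earlier; the hidden input’s name here is illustrative, not Food.com’s actual markup):

using System.Linq;
using HtmlAgilityPack;

class hRecipeParser : IParser
{
   // Per spec, the serving count lives in a node tagged with the "yield" class
   protected virtual byte ParseServings(HtmlNode document)
   {
      var yieldNode = document.FindByClass("yield").First();
      return ParseYield(yieldNode.InnerText) ?? 1;
   }

   // ... ParseTitle, ParseRating, etc. follow the same pattern
}

[Parser("www.food.com")]
class FoodDotComParser : hRecipeParser
{
   // Food.com tucks the serving count into a hidden <input> instead
   protected override byte ParseServings(HtmlNode document)
   {
      var input = document.FindAll(n =>
         n.Name == "input" &&
         n.GetAttributeValue("name", "") == "servings").First();

      return byte.Parse(input.GetAttributeValue("value", "1"));
   }
}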
I also built a ParserFactory. This factory can look at a Page record in the database, figure out the base URL the data came from, and then find and instantiate the correct parser subclass, if available, using reflection. For example, if the Page URL was http://www.food.com/Recipes/Yummy_Cake/, the ParserFactory would look at the base URL, http://www.food.com, and search for an IParser implementation tagged with the attribute [Parser("www.food.com")]. If there is none, it returns an instance of the default parser. This allows me to quickly add new parsers as I start to crawl more and more sites.
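A rough sketch of how that lookup works (simplified, with error handling omitted; IParser and hRecipeParser are the interface and default parser from above):

using System;
using System.Linq;
using System.Reflection;

[AttributeUsage(AttributeTargets.Class)]
class ParserAttribute : Attribute
{
   public string Host { get; private set; }
   public ParserAttribute(string host) { Host = host; }
}

static class ParserFactory
{
   public static IParser GetParser(string pageUrl)
   {
      var host = new Uri(pageUrl).Host; // e.g. "www.food.com"

      // Find an IParser implementation tagged with a matching [Parser] attribute
      var type = Assembly.GetExecutingAssembly()
         .GetTypes()
         .FirstOrDefault(t =>
            typeof(IParser).IsAssignableFrom(t) &&
            t.GetCustomAttributes(typeof(ParserAttribute), false)
               .Cast<ParserAttribute>()
               .Any(a => a.Host == host));

      // No site-specific parser? Fall back to the spec-compliant default
      return (IParser)Activator.CreateInstance(type ?? typeof(hRecipeParser));
   }
}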
Enter Chef Watson
The real work in recipe parsing is parsing the actual ingredient usages. Compared to that, everything else is pretty straightforward.
This is, of course, done with my natural language parser, which also powers a lot of the new KitchenPC UI. Parsing through literally millions of ingredient usages is the final test of Chef Watson’s culinary knowledge, so it was quite important to log its progress, catalog its errors, and quickly prioritize where it can be improved.
Every time the indexer comes across an ingredient usage it doesn’t understand, it logs it to a database table called ParseErrors. This table contains two text fields. The first field is the exact text that couldn’t be parsed, such as “5 large watermelons”. The second field is a hash of that text, with irrelevant data such as numbers stripped out, i.e. “# large watermelons”. This hash is very important when it comes to figuring out how to improve the NLP vocabulary. Obviously, fixing “5 large watermelons” would also fix “2 large watermelons”, thus the number is converted to a generic pound sign. This allows me to run queries such as:
select hash, count(*) as cnt from Indexer.ParseErrors group by hash order by cnt desc limit 1000
This gives me the top 1,000 missing ingredient hashes. Obviously, I don’t want to spend time fixing ingredients that are only used by 1 out of 100,000 recipes.
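The hashing itself is trivial; the point is just to genericize the amounts before storing (a minimal sketch):

using System.Text.RegularExpressions;

// "5 large watermelons" and "2 large watermelons" both hash to
// "# large watermelons", so their error counts aggregate in ParseErrors
static string HashUsage(string usage)
{
   var hash = Regex.Replace(usage, @"\d+([./]\d+)?", "#"); // amounts become #
   return Regex.Replace(hash, @"\s+", " ").Trim().ToLower();
}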
The indexer also updates the Pages table with relevant information on that page, such as the number of missing ingredients (which lets me see how many recipes I can almost parse) and any runtime exceptions that occurred while parsing that page, including a full stack trace.
When a recipe can be parsed, a KitchenPC recipe object is created in the main database, and the Pages table links to that.
So Run It Already!
I decided to test it out on the first 10,000 pages in the database, just to see how I was doing so far. The results were quite disappointing. Most of the pages were from AllRecipes, as that was the first site I crawled. Out of 10,000 recipes, only 1,504 of them could be fully parsed. I guess I had a long way to go.
One interesting trend I noticed was that the top missing ingredient was the text “&nbsp;” – that’s right, the HTML non-breaking space entity. This was found 864 times within 10,000 recipes. Apparently, AllRecipes has random “empty” ingredients in their HTML, which are of course tagged with the hRecipe ingredient class. Obviously, I would have to sanitize this input and strip out these sorts of null ingredients. Sigh.
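The sanitization is easy enough with the Agility Pack (a sketch; anything that decodes down to an empty string gets skipped):

using HtmlAgilityPack;

// Decode entities like &nbsp;, trim, and treat empty results as null ingredients
static string SanitizeIngredient(string rawText)
{
   var text = HtmlEntity.DeEntitize(rawText).Trim();
   return text.Length == 0 ? null : text;
}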
After that, I spent quite a bit of time improving logging code, error catching, and also refactoring the code to load 1,000 recipes at a time, as I obviously don’t have the memory to load in 48 gigs of HTML from the database at once. It was finally time to let it run on the entire database, without limit. This took somewhere around 7 hours (I’ll add timers for the next run so I can get an exact measurement).
These results were even more depressing. The success rate was hovering slightly over 3%, with 1,954,836 ingredient parsing errors. There are a few theories for why this was so much less successful than the 15% figure from the first run. The first is that my NLP vocabulary was initially built around sample data from AllRecipes, so Chef Watson is naturally very good with the types of ingredients AllRecipes might have; however, AllRecipes only accounts for about 10% of my total cache. The second theory is that something is going terribly wrong parsing Food.com recipes, which make up a massive chunk of the database. Perhaps they stick weird HTML codes in their ingredient lists, or some recurring exception keeps happening. With any luck, a little detective work will turn up some small adjustments that yield a much higher success rate.
No Blog Post Would Be Complete Without a Top 10 List!
Now that I’ve attempted to parse nearly half a million recipes on the Internet, I figured it would be fun to include a list of the top ten hashed ingredient usages that Chef Watson was unable to figure out.
# | Hashed Usage | Times Found |
1 | salt and pepper | 18,107 |
2 | salt | 15,601 |
3 | # garlic cloves, minced | 14,678 |
4 | # large eggs | 10,349 |
5 | pepper | 7,822 |
6 | # garlic clove, minced | 6,315 |
7 | # garlic clove | 4,532 |
8 | # medium onion, chopped | 4,482 |
9 | # large egg | 4,131 |
10 | # teaspoon fresh ground black pepper | 3,688 |
A lot of these are pretty easy to fix, such as large eggs and garlic cloves. However, the pattern you’ll notice is that a good chunk of these ingredients have no amount; for example, salt and pepper or just pepper. Unfortunately, KitchenPC needs amounts for ingredients, since the whole system is built around that concept! I need to spend some time reconsidering this limitation, as I’m potentially missing out on tens of thousands of recipes if I can’t address it. One possibility would be to special-case spices and handle these “to taste” usages.
There are currently 68 ingredient hashes that occurred over 1,000 times, and well over 1,000 hashes that occurred over 100 times. What truly scares me is the 656,528 hashes that occurred only once. Surely, fixing all of those by hand would be an exercise in data-entry futility!
So Now What?
It will be interesting to see the returns in success rate after fixing, say, the top 100 or even top 1,000 ingredient hashes. It’s tough to say where the point of diminishing returns will be; that point where continuing to fix parse errors is no longer worth the return in newly available recipes. I’m also hoping there are some grander architectural changes I can make to the indexer itself; for example, a way to ignore bogus data that doesn’t really mean anything. I’m still hoping for that one easy fix that causes the success rate to rocket up 20% or so.
My goal is to get 100,000 recipes in the database for launch, which would mean a success rate of about 20%. Right now, that seems like a tall order.