
What’s Cooking? (Part 1) – Recipe Crawler

May 9, 2011

My long-time readers might remember a post titled “Show me the content!” where I talked about various ways to grow content on a content-centric website such as KitchenPC.  Crowdsourcing doesn’t work until you first attract the crowd through great content.  Hiring people is slow and requires a sizable budget if quality is your goal.  Automated data aggregation has vast technical limitations, and it’s something I’ve long considered the Holy Grail of content generation techniques.  In fact, this is the one I’m going to talk about today.

Simply crawling every page on various recipe websites is quite easy, and I’ve already prototyped scripts that do just that for several major recipe sites.  If my goal were simply to bring human-readable recipes into my site, I’d be done.  However, KitchenPC stands apart from other recipe databases in its unique way of indexing and understanding the relationships between recipes and their ingredients.  The meal planner can take input such as a pound of cheddar cheese and locate recipes that use cups of shredded cheese, knowing how much of that pound will be used up.  Shopping lists can be generated by converting and aggregating ingredient forms between how an ingredient is used in the recipes and how it’s sold at your local grocery store.  For this reason, recipes must be stored and represented in the database in a highly normalized and relational fashion.  This allows KitchenPC to have some really kick-ass features, but also makes importing recipes from other sites incredibly difficult.
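To make the form conversion idea concrete, here is a minimal Python sketch of the cheddar example.  The table `GRAMS_PER_FORM`, the function names and the gram weights are all made up for illustration; they are not KitchenPC’s actual schema or data.

```python
from fractions import Fraction

# Hypothetical conversion data: grams in one unit of a given ingredient form.
# The weights are illustrative placeholders only.
GRAMS_PER_FORM = {
    ("cheddar cheese", "pound"): Fraction(454),
    ("cheddar cheese", "cup shredded"): Fraction(113),
}


def to_grams(ingredient, form, amount):
    """Convert an amount of one form of an ingredient to grams."""
    return GRAMS_PER_FORM[(ingredient, form)] * Fraction(amount)


def remaining_after(ingredient, have, uses):
    """Grams left over after recipes use some of what is on hand.

    have: (form, amount) as bought at the store
    uses: list of (form, amount) as called for by recipes
    """
    total = to_grams(ingredient, *have)
    used = sum(to_grams(ingredient, form, amount) for form, amount in uses)
    return total - used


left = remaining_after(
    "cheddar cheese",
    ("pound", 1),
    [("cup shredded", 2), ("cup shredded", "1/2")],
)
print(float(left), "grams of cheddar left")
```

Reducing every form to a common weight is only one possible design, but it shows why each ingredient usage has to be stored as structured data rather than as free text.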

At the core of this hypothesized recipe crawler would be a natural language parser that could understand and decompile the various recipe ingredients it came across.  A recipe might call for “1/4 cup cheddar cheese” – even though you can identify the amount (1/4), the unit (cup) and ingredient (cheddar cheese), a computer has a lot more trouble with this problem.  How about “a ripe banana”?  In this case, an article (the word “a”) is used in lieu of the number 1, and the ingredient “banana” is qualified with the adjective “ripe.”  A parser must understand that grammar and interpret this as “1 banana”, and use the word “ripe” as a prep note for the reader rather than a type of banana sold in stores.  Ingredients, of course, can exist with various synonyms as well.  “plantains” often, in the United States, refers to the common banana, and don’t get me started on all the various types of berries.  Ambiguous quantities such as “one or two tomatoes” of course present further challenges to a parser, as well as combinations of ingredients such as “salt and pepper to taste.”
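As a rough illustration of the cases above, the following Python sketch reads an amount (including fractions and articles), an optional unit, and separates prep words from the ingredient name.  The word lists and the function names `parse_amount` and `parse_simple` are assumptions for this example only, and the sketch deliberately gives up on ambiguous input rather than guessing.

```python
import re
from fractions import Fraction

# Tiny illustrative word lists; a real vocabulary would be far larger.
ARTICLES = {"a", "an"}                      # "a ripe banana" means 1 banana
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}
UNITS = {"cup", "cups", "pound", "pounds"}
PREP_WORDS = {"ripe", "chopped", "diced"}   # notes for the reader, not part of the name


def parse_amount(token):
    """Return a Fraction for '1/4', '2', 'a' or 'one'; None if not an amount."""
    token = token.lower()
    if token in ARTICLES:
        return Fraction(1)
    if token in NUMBER_WORDS:
        return Fraction(NUMBER_WORDS[token])
    if re.fullmatch(r"\d+(/[1-9]\d*)?", token):
        return Fraction(token)
    return None


def parse_simple(text):
    """Parse '<amount> [unit] [prep words] <ingredient>' or return None."""
    words = text.lower().split()
    amount = parse_amount(words[0]) if words else None
    if amount is None:
        return None
    rest = words[1:]
    if {"or", "and"} & set(rest):
        return None  # "one or two tomatoes": ambiguous, so decline to guess
    unit = None
    if rest and rest[0] in UNITS:
        unit, rest = rest[0], rest[1:]
    prep = [w for w in rest if w in PREP_WORDS]
    name = " ".join(w for w in rest if w not in PREP_WORDS)
    if not name:
        return None
    return {"amount": amount, "unit": unit, "ingredient": name, "prep": prep}


print(parse_simple("1/4 cup cheddar cheese"))
print(parse_simple("a ripe banana"))
print(parse_simple("one or two tomatoes"))
print(parse_simple("salt and pepper to taste"))
```

The last two calls return `None`, which is the behavior I want from a crawler: skip what it cannot read with certainty.  Synonyms are not handled here at all.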

Over the past few days, I’ve been exploring various NLP technologies for .NET (such as NooJ) and attempting to design a working prototype that can convert at least the most common ingredient usages to their correct KitchenPC normalized form.  More importantly, this algorithm has to know when it’s right and not import any data incorrectly.  If I had an algorithm that could understand 90% of the recipes it finds online and just skip the ones it isn’t sure about, I’d easily be able to import hundreds of thousands of recipes.  However, I don’t want to import a hundred thousand recipes if 10% of them have errors.

Most of the NLP engines I found were either too expensive or were overkill for what I really needed.  In the end, I decided to build my own solution from the ground up.  After several long nights, I eventually came up with a solution that I’m really happy with.  The grammar is completely abstracted from the vocabulary, so I can teach it various ways to parse ingredient usages while being language agnostic.  Right now, the algorithm is like a small child that asks its parent when it doesn’t know something.  It can understand basic phrases and common descriptions of ingredients, such as “a cup of milk”, since the word “cup” is recognized as a unit, “a” is recognized as an amount, and “milk” is recognized as an ingredient.  The grammar “amount unit of ingredient” is a known way to express an ingredient (I call these templates, and there are dozens of them.)  However, if the engine runs into a word it doesn’t know, such as “head” in “a head of lettuce”, it might ask “What’s a head?  And how does this relate to lettuce?”  I would then have to explain that, in this context, a head is a unit for this particular ingredient, and link it to the proper forms row in the KitchenPC ingredient database.  From that point on, the engine would understand that and be able to then parse “3 heads of lettuce” or “lettuce: 1 head” on its own.
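Here is a minimal Python sketch of that split between grammar and vocabulary.  It is not the actual engine (which is not shown in this post): the `Vocabulary` class, the `TEMPLATES` list, the token type names and the result format are all assumptions made for illustration, and it only handles single-word ingredients.

```python
import re


class Vocabulary:
    """Words the engine knows, kept separate from the grammar templates."""

    def __init__(self):
        self.units = {"cup", "cups"}
        self.ingredients = {"milk", "lettuce"}
        self.ingredient_units = {}  # ingredient -> units valid only for it

    def teach_unit(self, ingredient, *words):
        """Answer a question: these words are units for this ingredient."""
        self.ingredient_units.setdefault(ingredient, set()).update(words)


# Grammar: each template is a sequence of token types, with no actual words.
TEMPLATES = [
    ("AMOUNT", "UNIT", "OF", "INGREDIENT"),   # "a cup of milk"
    ("AMOUNT", "UNIT", "INGREDIENT"),         # "1 cup milk"
    ("INGREDIENT", "AMOUNT", "UNIT"),         # "lettuce: 1 head"
    ("AMOUNT", "INGREDIENT"),                 # "an orange"
]


def classify(word, vocab, ingredient):
    if word in ("a", "an") or re.fullmatch(r"\d+(/[1-9]\d*)?", word):
        return "AMOUNT"
    if word == "of":
        return "OF"
    if word in vocab.ingredients:
        return "INGREDIENT"
    if word in vocab.units or word in vocab.ingredient_units.get(ingredient, ()):
        return "UNIT"
    return "UNKNOWN"


def parse(text, vocab):
    words = re.findall(r"[\w/]+", text.lower())
    ingredient = next((w for w in words if w in vocab.ingredients), None)
    kinds = tuple(classify(w, vocab, ingredient) for w in words)
    unknown = [w for w, k in zip(words, kinds) if k == "UNKNOWN"]
    if unknown:
        return {"status": "question", "unknown": unknown}  # ask the parent
    if kinds not in TEMPLATES:
        return {"status": "no_template"}
    fields = {k.lower(): w for w, k in zip(words, kinds) if k != "OF"}
    return {"status": "ok", **fields}


vocab = Vocabulary()
print(parse("a cup of milk", vocab))
print(parse("a head of lettuce", vocab))     # asks about "head"
vocab.teach_unit("lettuce", "head", "heads")
print(parse("3 heads of lettuce", vocab))
print(parse("lettuce: 1 head", vocab))
```

Because the templates mention only token types, teaching the engine a new word never touches the grammar, and adding a new template never touches the vocabulary.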

The next goal will be to marry my recipe crawler and the NLP engine, providing this young child with a playground to learn and explore.  Whenever it runs into something it doesn’t understand out in the real world, it would record this inquiry in a database and I could go in and answer its questions.  Each time an entire recipe can be understood, the data would be collected and imported into the KitchenPC database.  Eventually, the grammar and vocabulary would develop from that of a young kindergartener into a well-spoken college graduate (hopefully one that graduated from culinary school!)
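The all-or-nothing import with a question queue could look something like the Python sketch below.  The table layout, the function names and the `parse_line` callback (which returns a normalized string, or `None` when a line is not understood) are hypothetical; SQLite is used here only because it ships with Python.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the two tables this sketch needs (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS questions "
        "(url TEXT, phrase TEXT, UNIQUE(url, phrase))"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS recipes (url TEXT PRIMARY KEY, ingredients TEXT)"
    )
    return db


def import_recipe(db, url, lines, parse_line):
    """Import a recipe only if every ingredient line was understood.

    Otherwise, record the lines that failed so a human can answer later.
    Returns True when the recipe was imported.
    """
    parsed = [parse_line(line) for line in lines]
    failed = [line for line, result in zip(lines, parsed) if result is None]
    with db:  # commits on success, rolls back on an exception
        if failed:
            db.executemany(
                "INSERT OR IGNORE INTO questions (url, phrase) VALUES (?, ?)",
                [(url, line) for line in failed],
            )
            return False
        db.execute(
            "INSERT OR REPLACE INTO recipes (url, ingredients) VALUES (?, ?)",
            (url, "\n".join(parsed)),
        )
        return True
```

A crawler would call `import_recipe` once per page, and I would periodically work through the `questions` table, teach the engine the missing words, and let it retry those URLs.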

At this point, I will begin phase two: incorporating this technology to benefit the KitchenPC website itself.

NLP-based User Interfaces

You’ve already seen a few of these around.  If you email your friend and say “Want to get lunch tomorrow at 2pm at McCormick & Schmick’s?” – Gmail will have a link to add “lunch, 2pm, McCormick & Schmick’s” to your Google Calendar automatically.  If you’ve uploaded your résumé to most of the major career websites these days, your résumé is parsed and your employment history and skill sets are extracted so potential employers can search by those fields.

Being able to understand ingredient expressions can pave the way towards rich, intuitive user interfaces on KitchenPC.  Rather than having to select each ingredient from a dropdown menu as it’s entered into a recipe, a user can simply copy and paste all the ingredients at once.  Any ingredients that are not understood would be flagged for the user to take action, or perhaps ignored so someone could take care of those ingredients manually on the back end.  One of the big changes to the “New Recipe” page will be the ability to just paste in a URL from another website, and KitchenPC will fill in everything for you.  This, of course, is possible with the magic of microformats, which most every website these days supports.
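As a sketch of how the microformat part might work, assuming the page marks up its recipe with hRecipe (where each ingredient element carries the class `ingredient`), the ingredient lines can be pulled out with Python’s built-in HTML parser.  The class name `IngredientExtractor` is my own for this example, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class IngredientExtractor(HTMLParser):
    """Collect the text of every element whose class list includes 'ingredient'."""

    def __init__(self):
        super().__init__()
        self.ingredients = []
        self._depth = 0   # > 0 while inside an ingredient element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1          # nested tag inside an ingredient
        elif "ingredient" in classes:
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.ingredients.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


html = """
<div class="hrecipe">
  <ul>
    <li class="ingredient">1/4 cup <span class="name">cheddar cheese</span></li>
    <li class="ingredient">a ripe banana</li>
  </ul>
</div>
"""
extractor = IngredientExtractor()
extractor.feed(html)
print(extractor.ingredients)
```

Each extracted line would then be handed to the ingredient parser.  This sketch does not cope with unclosed void tags such as a bare `<br>` inside an ingredient, and real-world markup would need more care.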

This NLP engine could also be used to improve the shopping list and pantry.  Users could just “type” their shopping list in one big text field, or add “an orange” to the pantry.  Want to add a recipe to your calendar?  Just add a URL instead, and the recipe will be imported into KitchenPC and scheduled on your calendar in one click.  Now you see why this technology is so important to a site like this.

User interface components that use natural language processing would also be a bit more “lax” on an exact match.  They could make intelligent guesses (in the event of only a partial template match) and then ask the user to double check to make sure the values are correct, whereas the web crawler would only import the recipe if it were 100% sure it understood the ingredients mentioned.  The code is designed to allow for exactly this sort of behavior, since the web crawler would employ only a subset of the templates available to the website user interface.
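One way to picture the strict and lax modes is the Python sketch below.  The `crawler_safe` flag, the mode names and the rule for a partial match (ignore unknown words and ask the user to confirm) are my assumptions for illustration, not a description of the real code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    pattern: tuple
    crawler_safe: bool  # False: only the website UI may use this template


TEMPLATES = [
    Template(("AMOUNT", "UNIT", "OF", "INGREDIENT"), True),
    Template(("AMOUNT", "INGREDIENT"), True),
    Template(("INGREDIENT",), False),  # bare "milk": the UI may guess, the crawler may not
]


def match(kinds, mode):
    """Match a sequence of token types in 'crawler' or 'ui' mode.

    Returns None when nothing matches. In 'ui' mode a partial match
    (unknown words ignored) is allowed, flagged for user confirmation.
    """
    kinds = tuple(kinds)
    allowed = [t for t in TEMPLATES if mode == "ui" or t.crawler_safe]
    for template in allowed:
        if template.pattern == kinds:
            return {
                "pattern": template.pattern,
                "needs_confirmation": not template.crawler_safe,
            }
    if mode == "ui":
        known = tuple(k for k in kinds if k != "UNKNOWN")
        for template in allowed:
            if template.pattern == known:
                return {"pattern": template.pattern, "needs_confirmation": True}
    return None


phrase = ("AMOUNT", "UNKNOWN", "INGREDIENT")   # e.g. "2 heirloom tomatoes"
print(match(phrase, "crawler"))  # crawler refuses
print(match(phrase, "ui"))       # UI guesses and asks the user to confirm
```

The same template list serves both callers; the crawler simply sees fewer templates and never accepts a guess.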

So when can you expect all this?  Hopefully soon!  The core engine is just about done, but it will take many hours of hand-holding to expand its vocabulary.  I believe I will be able to import a few hundred recipes in the near future, and then this number will grow exponentially as it’s able to understand more and more recipes.  I believe that in a matter of months, I could have a hundred thousand recipes in the database.

Why even have a recipe database anymore?

The more clever among you may be asking this question.  If KitchenPC can understand virtually every recipe on the Internet, why maintain my own proprietary database of recipes?  Why not apply meal planning and scheduling to the plethora of recipe data already out there on the web?  In a sense, KitchenPC would turn into a recipe search engine more like Google (albeit an incredibly advanced search engine with meal plan optimization and scheduling built in.)  Recipe data would of course be cached locally (just like all search engines do) and said data would be more transient in nature, updated from time to time.  Meal planning features and a quick-view of the recipe would be available through my site, and users could of course click through to the credited site if they wanted to.

This is one of the big pending decisions I foresee in my future.  I’d love to not have to worry about my own data, and simply provide a value-added experience to Internet-based recipe searches, but at the same time I’d like to do this in a way that still allows users to upload their own private recipes and manage their own personal collections.  I think what I end up doing will be a mixture of both, but we’ll see how it pans out and where the line is drawn.  The one thing I do know for sure is that this NLP parsing technology is one of the major pieces missing to really expand the site into something incredibly valuable, so getting there is my top technical priority at the moment.

When the crawler is up and running, I’ll be sure to share some screenshots of how it works and what the import process looks like.  I must say, this is perhaps some of the coolest code I’ve written and definitely much more interesting than anything I wrote during my twelve years at Microsoft.


From → Business, Technical
