The Itsy Bitsy Spider (Part 1)

May 30, 2012

Spiders have existed for millions of years, as one of the most diverse members of the animal kingdom.  Relatively recently, their digital counterparts also started crawling over a different kind of web, attempting to find every last web page on the Internet.  This has been a tool used by search engines since the early days of the Internet, employing hundreds of thousands of servers which do nothing all day but surf the web.

Fortunately, KitchenPC doesn’t need to index the entire Internet.  As a recipe search engine, however, it does need an automated way to find recipes online and cache them locally, so users can locate recipes quickly and easily.

Since I’ve never worked for a search engine company, I had pretty much zero experience building a web crawler.  I made a few mistakes along the way, which I’ve decided might be helpful or at least interesting enough to write a two-part blog series on.

My implementation is divided into two separate processes.  The first does nothing but crawl websites, slurping any HTML it can find into a database.  The second reads that database and attempts to parse any recipes it can find out of that HTML data.  This post will go into details about the first process.

The First Attempt

Having very little experience in this field, I decided it best to use something already available, and preferably open source, to actually crawl websites.  Knowing I wasn’t the first one to have such a need, I looked into a few platforms that do this sort of thing.  The one I decided upon is called Scrapy, a Python-based framework for writing fast, efficient and scalable crawlers.

Scrapy is very quick to get up and running, even if you have next to zero Python knowledge, and is also very well documented.  It has various behaviors built in, such as the ability to read sitemaps, follow any link it finds, and honor robots.txt restrictions.  Of course you can also override any of these behaviors, and use as little or as much of it as you’d like.

Scrapy has the ability to parse HTML and extract only the data it needs out of each webpage it finds.  That data can be serialized as XML or JSON, or just dumped out to the console.  One of the first mistakes I made was attempting to extract the actual recipe data at this stage.  The documented way of parsing HTML with Scrapy is using a built in DOM parser that offers an XPath style navigator.  Though this might be useful for crawling a site where the format is exactly known, I found this too limited in my case.

My case is somewhat special, as I’m not exactly trying to crawl a specific site in particular.  Instead, I want to crawl any site that exposes data using the hRecipe Microformat.  Web and SEO experts will know this is a special markup that allows parsers to extract specific data from a website, abstracted from the layout itself.  For example, certain sections of HTML can be tagged as a recipe title, an ingredient, a rating, etc.  If you’ve ever searched for recipes on Google and seen a result such as the chocolate chip cookie recipe above, this is because the site encodes its data in this format.

The issue is that the hRecipe specification is unfinished, incomplete, and implemented differently on almost every site I’ve found.  It also creates various situations where no single XPath query could handle every valid form of hRecipe markup.  I’ll get more into this in Part 2.

So I ripped out the default DOM parser in Scrapy and decided to go with a more powerful one instead.  This parser might be familiar to other Python coders: BeautifulSoup.  BeautifulSoup not only has one of the coolest names, it’s hands down the best HTML parser I’ve ever used, on any platform.  Rather than static XPath expressions, you’re able to really craft an expression using the power of Python.  For example, if I wanted to find all hyperlinks with a class attribute of Test that contained the word “foo” somewhere in the title, I could do something like:

import re
soup.find_all("a", attrs={"class": "Test", "title": re.compile("foo")})

BeautifulSoup is actually so cool, I’m tempted to write a .NET port of it if I ever find the time for such side projects.

No Soup For Me

Unfortunately, BeautifulSoup still had some issues.  I’m quite sure it would technically be possible to write an hRecipe parser in Python, but my Python is a bit weak and I kept running into little edge cases that were very difficult to work around.  For example, some recipe sites put HTML markup inside the recipe title, and this became hard to convert to raw text.
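The title edge case is easy to picture with a small sketch.  This is not the code from the original crawler; it uses only Python’s standard html.parser module, and the names TextExtractor and plain_text are made up for illustration.  It collects the text nodes of a fragment and collapses the whitespace between them:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text(fragment):
    """Strip tags from a fragment and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


title = plain_text('<span class="fn">Chocolate <b>Chip</b>\n Cookies &amp; Milk</span>')
print(title)
```

Real pages are messier than this (scripts, hidden elements, broken nesting), which is part of why these cases were painful.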

A huge reason why I halted efforts on this route was to avoid a Python reliance in the future.  Eventually, I’d like users to be able to add recipes to menus either by pasting in a URL directly or by using a browser extension that’s able to recognize recipe websites automatically.  Since I don’t want to have to call into Python code on the website, it became clear that I eventually needed to migrate my recipe parsing code to C#.

There’s also one other huge issue.  I was bound to screw up.  Crawling hundreds of thousands of webpages and extracting recipe information is something that takes weeks or even months.  Suppose I forgot one little piece of metadata, or later on I improved my parser to do x, y and z?  Surely, I didn’t want to have to re-crawl all those websites over again.  At this point, I made the decision to use Scrapy to simply store the entire HTML response rather than trying to parse anything at all.  That way I could handle the recipe extraction as a separate process offline, and redo it as many times as I wanted without re-downloading gigs of data from the Internet.

A Somewhat Working Design

I now had a design that worked pretty well.  Scrapy would crawl a website, following any link it could find, and whenever it came across a page with the word hrecipe in the HTML, it would dump that HTML to a Postgres database I had set up for indexing.
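The filter-and-store step can be sketched in a few lines.  This is only an illustration under some assumptions: sqlite3 stands in for Postgres so the example runs anywhere, and the pages table, looks_like_recipe and store_page are hypothetical names, not the real schema or the Scrapy pipeline.

```python
import sqlite3


def looks_like_recipe(html):
    """Cheap pre-filter: keep any page whose markup mentions hrecipe."""
    return "hrecipe" in html.lower()


def store_page(db, url, html):
    """Save the raw HTML if the page passes the filter; return True if it did."""
    if not looks_like_recipe(html):
        return False
    # The URL is the primary key, so re-storing the same URL is a no-op.
    db.execute("INSERT OR IGNORE INTO pages (url, html) VALUES (?, ?)", (url, html))
    db.commit()
    return True


db = sqlite3.connect(":memory:")  # stand-in for the Postgres database
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT NOT NULL)")

store_page(db, "http://example.com/cake", '<div class="hrecipe">Cake</div>')
store_page(db, "http://example.com/about", "<p>About us</p>")
stored = [row[0] for row in db.execute("SELECT url FROM pages")]
print(stored)
```

Keeping the whole response, rather than extracted fields, is what lets the parsing step be re-run later without touching the network.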

At least I could now start crawling, a process I knew needed to get underway as soon as possible.  I let the default Scrapy crawler run (which is able to download the Sitemap as well as follow any links it finds in the HTML) on a couple major websites.

After about two weeks of running nonstop, I had about 20,000 recipes in the database.  That is, until a bit of wind knocked out the power to my house.

Oh Noes!

I had one small design flaw with my crawler.  If it were interrupted for whatever reason, it had no way to resume where it left off.  I really didn’t want to waste another two weeks starting over again, so I decided to take that opportunity to improve the design a bit.

First off, the crawler was far too slow.  Looking at the output, I noticed that most pages the crawler was finding were not recipes at all.  For example, some of the more popular recipes would have 50 or 60 pages of comments, so I would find myself crawling URLs such as /Recipes/Foo/Comment.aspx?page=37 and what not.  Surely there had to be a better way.

Another huge issue surfaced when I started looking at the URLs in the SQL database.  Scrapy is smart enough not to crawl the same URL twice, but that check is only as reliable as the uniqueness of the URL itself.  Out of the 20,000 or so recipes I had, I noticed a lot were the same recipe, only with different URL parameters.  These URLs might be something like /Recipes/Cake.aspx?src=home and /Recipes/Cake.aspx?src=comment.  There were also some URLs that redirected to another, already existing URL, and Scrapy would only know about the first URL.  Long story short, I had somewhere around 9,000 recipes that were part of a duplicate set, and they were incredibly hard to untangle.  I ended up just getting rid of these, as I didn’t trust the integrity of the data.
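One common way to reduce this kind of duplication is to normalize each URL before comparing it.  The sketch below is an illustration, not what the crawler actually did: TRACKING_PARAMS is a hypothetical list of parameters assumed not to change the page content, and a real site would need its own list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: query parameters that only record where the visitor came from.
TRACKING_PARAMS = {"src", "ref", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url):
    """Lowercase scheme and host, drop tracking parameters and the fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


a = canonical_url("http://Example.com/Recipes/Cake.aspx?src=home")
b = canonical_url("http://example.com/Recipes/Cake.aspx?src=comment")
print(a == b, a)
```

This does nothing for the redirect case, where two different URLs end up at the same page; that would need the final URL after redirects to be recorded as well.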

At this point, I came to the conclusion that deep-crawling any site was the wrong approach.  Though this might work well for a search engine like Google, which simply lists all the pages that match a given query, it does not work well for a recipe search engine where I want to display each recipe only once.

Finally, a Working Design!

Most sites export a list of their URLs through a Sitemap file located in the root directory.  Search engines use this file as a “starting point” to crawl a site, and also discover pages that might not be linked to from elsewhere.  Luckily, the major recipe websites provide a list of most, if not all, of the recipes on their site.

I decided to download the Sitemap files of several major recipe websites, remove any patterns that were not actual recipes, and then add this list to a Queue table in my database.  I just did this by hand, though it could be easily automated later on.  Rather than follow any link found, I would now just query the Queue table.  This allowed me to create a SQL view called Pending, which returns the rows from Queue that do not have a matching record in the Pages table, meaning they have not yet been crawled.  I also return the rows in Pending in a random order, so as not to pound on a single site too much at once.
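The by-hand filtering could be automated along these lines.  This is a rough sketch using only the standard library; the function recipe_urls, the sample Sitemap and the URL pattern are all invented for illustration, and every site would need its own pattern.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by the standard Sitemap XML format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recipe_urls(sitemap_xml, keep_pattern):
    """Return the <loc> URLs from a Sitemap that match keep_pattern."""
    root = ET.fromstring(sitemap_xml)
    keep = re.compile(keep_pattern)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and keep.search(url):
            urls.append(url)
    return urls


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/Recipes/Cake.aspx</loc></url>
  <url><loc>http://example.com/Recipes/Foo/Comment.aspx?page=37</loc></url>
  <url><loc>http://example.com/About.aspx</loc></url>
</urlset>"""

# Hypothetical pattern: a single path segment under /Recipes/ ending in .aspx.
queue = recipe_urls(SAMPLE, r"/Recipes/[^/]+\.aspx$")
print(queue)
```

The resulting list is what would be inserted into the Queue table.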

I then modified my Scrapy script to SELECT * FROM Pending and start its work.  Now if the power went out again, the script could be resumed right where it left off!
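The Queue, Pages and Pending arrangement can be sketched as follows.  The table and view names come from the post, but the column names are assumptions, and sqlite3 stands in for Postgres so the example is self-contained (the view definition uses SQL that both accept).

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the Postgres database
db.executescript("""
    CREATE TABLE Queue (url TEXT PRIMARY KEY);
    CREATE TABLE Pages (url TEXT PRIMARY KEY, html TEXT NOT NULL);

    -- Rows in Queue with no matching row in Pages, in random order.
    CREATE VIEW Pending AS
        SELECT q.url
        FROM Queue AS q
        LEFT JOIN Pages AS p ON p.url = q.url
        WHERE p.url IS NULL
        ORDER BY RANDOM();
""")

db.executemany(
    "INSERT INTO Queue (url) VALUES (?)",
    [("http://a.example/1",), ("http://a.example/2",), ("http://b.example/1",)],
)
# Pretend one page was already crawled before the power went out.
db.execute(
    "INSERT INTO Pages (url, html) VALUES (?, ?)",
    ("http://a.example/2", "<html>...</html>"),
)

pending = {row[0] for row in db.execute("SELECT * FROM Pending")}
print(pending)
```

Because a URL leaves Pending the moment its row lands in Pages, restarting the script never needs any separate bookkeeping.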

100,000 Recipes Crawled!

This design was far more efficient.  In less than a day, I was already back at the 20,000 mark, which took me over two weeks to get to the first time.  After about a week or so, I had crawled over 100,000 recipes from various sites on the Internet, for a total of about 16 billion bytes of data downloaded.

Finally, I had a working crawler that would efficiently gather recipes from the Internet, could be stopped and restarted at any time, and saved HTML in a database to be parsed later down the line.

A Few Ideas

Obviously, this is just a rough prototype of a fully automated crawler that would require no human interaction and support any number of websites.  I already have a few ideas in mind that would make this crawler even better.

First off, the crawler needs to be able to re-crawl a URL after a certain period.  I might want to re-check a URL after 30 days or so, and if the recipe has been changed, import the new HTML and update the existing linked recipe on KitchenPC.  Right now, I store timestamps for everything so it wouldn’t be too much work to modify the Pending view to also return URLs that were already crawled, but have a LastCrawled date of more than 30 days ago.
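The re-crawl rule itself is simple enough to sketch.  This is only the decision logic, written in Python rather than as a change to the Pending view; is_due and RECRAWL_AFTER are hypothetical names, and the 30-day window is the figure mentioned above.

```python
from datetime import datetime, timedelta
from typing import Optional

RECRAWL_AFTER = timedelta(days=30)


def is_due(last_crawled: Optional[datetime], now: datetime) -> bool:
    """A URL is due if it was never crawled or was crawled over 30 days ago."""
    if last_crawled is None:
        return True
    return now - last_crawled > RECRAWL_AFTER


now = datetime(2012, 5, 30)
history = {
    "http://example.com/never": None,
    "http://example.com/stale": now - timedelta(days=45),
    "http://example.com/fresh": now - timedelta(days=5),
}
due = sorted(url for url, seen in history.items() if is_due(seen, now))
print(due)
```

In the database, the same test would become an extra condition in the view comparing LastCrawled against the current date.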

Second, disk space is potentially an issue if I truly want to index millions of recipes.  Luckily, Postgres will TOAST these rows automatically, compressing the data if possible.  The 16 gigs I’ve crawled only take up about 7 gigs on the disk.  I’ve considered using a byte array in the database instead, and storing the HTML data as a compressed gzip stream, but I’m unsure if I’d get any better compression out of that.
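The gzip idea is cheap to try on a handful of pages before committing to it.  A minimal sketch, with pack and unpack as hypothetical helper names and a made-up, highly repetitive sample that compresses far better than a real page would:

```python
import gzip


def pack(html):
    """Compress a page to bytes suitable for a byte array column."""
    return gzip.compress(html.encode("utf-8"))


def unpack(blob):
    """Reverse of pack: bytes from the database back to the HTML string."""
    return gzip.decompress(blob).decode("utf-8")


# Artificial sample; real measurements should use actual crawled pages.
sample = '<li class="ingredient">1 cup flour</li>\n' * 500
blob = pack(sample)
ratio = len(blob) / len(sample.encode("utf-8"))
print(f"compressed to {ratio:.1%} of the original size")
```

Whether this beats what Postgres already does on its own can only be answered by running both against the same real pages.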

Though storing the entire HTML has its advantages, it might be overkill especially once the parser itself is relatively stable.  I could potentially modify the crawler to only store the HTML related to the hRecipe microformat itself.  This would save considerable disk space, while still allowing me to re-parse recipes if my indexing code changed.

I’m also still worried about duplicate recipes.  One solution would be to implement a hash algorithm at the recipe level.  I could combine the ingredients, title, description, method, etc. and calculate an MD5 hash of the data.  I would store this hash within the Recipes table, and check to see if the hash already existed before adding a new recipe to KitchenPC.  I spent quite a bit of time mucking with Sitemaps in a text editor (some of them millions of lines long!) attempting to remove duplicate recipes or irrelevant URLs, all because I was so nervous about having the same recipe twice in KitchenPC.  Adding this hashing check would definitely ease my fears.
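A recipe-level hash might look something like the sketch below.  The choice of fields follows the paragraph above, but the normalization (lowercasing, collapsing whitespace, sorting ingredients) and the names normalize and recipe_hash are assumptions made for the example.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def recipe_hash(title, description, ingredients, method):
    """MD5 over the normalized fields, with ingredients in sorted order."""
    fields = [normalize(title), normalize(description), normalize(method)]
    fields += sorted(normalize(item) for item in ingredients)
    # A separator that won't appear in the text keeps field boundaries unambiguous.
    payload = "\x1f".join(fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


first = recipe_hash(
    "Chocolate Chip Cookies", "Classic.", ["1 cup flour", "2 eggs"], "Mix and bake."
)
second = recipe_hash(
    "  chocolate chip   cookies", "classic.", ["2 eggs", "1 cup  flour"], "Mix and bake."
)
print(first == second)
```

A hash like this only catches recipes that are identical after normalization.  Two sites that word the same recipe slightly differently would still produce different hashes.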

Stay Tuned!

In the next post, I’ll be talking about the actual indexer.  This code is written in C#, and is able to actually parse the HTML obtained from the web crawler into valid KitchenPC formatted recipes.  Exciting stuff!
