Lately, I’ve been trying to bring order to chaos. It turns out it’s not very easy to prioritize 2.3 million parse errors across half a million recipes. I know that a 3% success rate is nowhere near where I want to be, but I also know that trying to manually address every issue would be an exercise in futility.
That raises the question: where can I spend my time to get the most return?
In my last post, I went over some of the errors due to missing NLP data as well as the top ten parse errors. Luckily, out of the top 1,000 parse errors, over 60% could be fixed with zero changes to the engine.
Unfortunately, there are still a lot of errors that cannot be fixed simply by adding new ingredients, synonyms or ingredient form mappings. These are things that will require actual changes to the NLP engine and perhaps even the underlying architecture of KitchenPC itself.
First, I noticed a huge number of parse errors where the ingredients simply had no amounts listed. For example, “pepper to taste” came up quite a bit. I ran the following SQL query to estimate how common these usages are.
select count(distinct PageId) from Indexer.ParseErrors where usage !~ E'\\d+';
For those unfamiliar with Postgres, this counts the crawled pages that have at least one ingredient usage that does not match the regular expression \d+ (in other words, a usage that contains no digits).
This yielded 167,890 recipes that were not being processed due to missing amounts. Ouch! This is well over 30% of the crawled data made useless since KitchenPC requires amounts. Can I really live with this?
I’ve decided the answer is no. Thus, KitchenPC needs to be designed to handle these recipes in some way that still makes sense. I found two options for this.
Option 1 was to assume a default amount on a per-ingredient basis. For example, every time the parser ran into “pepper to taste”, I could assume an amount of one tablespoon. This amount would be added to the shopping list, but the recipe itself would simply say “pepper to taste” and indicate no amount. This would avoid massive architectural changes to both the code base and the database, since there would still be real numbers under the covers. However, it would also require me to provide default amounts for potentially hundreds or even thousands of ingredients, some of which are almost impossible to estimate.
Option 2 was even scarier. This was to make KitchenPC work without ingredient amounts. However, what would this mean for ingredient aggregation? How would the “What Can I Make?” feature work with these recipes?
In the end, I decided to go with the second option. I like the idea of KitchenPC being a bit less strict. Recipes need to be natural, and not restrictive. The technology needs to adapt to how humans express themselves, not the other way around.
It took an entire evening, but I was able to successfully redesign the KitchenPC database and business objects to allow the concept of a null amount. But how about the shopping list and modeler?
The modeler was actually fairly easy. When a recipe lists an ingredient with an unspecified amount, the modeler scores it at 25% efficiency. For example, if you have 12 eggs, it will treat a recipe that has “eggs” as a recipe that requires “3 eggs”. Why special-case the scoring for this? Well, if I treated “eggs” recipes as a perfect score, it would force these recipes to bubble to the top of the results and overpower recipes that specify an exact number of eggs. Plus, if 25% doesn’t work too well, I can fine-tune this constant until I’m happy with the results.
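To make the 25% rule concrete, here is a minimal Python sketch. The names `UNSPECIFIED_FRACTION` and `assumed_amount` are made up for illustration, and the real modeler certainly does more than this; the sketch only shows how a null amount could be turned into a number for scoring purposes.

```python
from typing import Optional

# Tunable constant (hypothetical name): an ingredient with no amount is
# assumed to use this fraction of whatever the user has on hand.
UNSPECIFIED_FRACTION = 0.25


def assumed_amount(on_hand: float, recipe_amount: Optional[float]) -> float:
    """Return the amount the modeler treats the recipe as requiring."""
    if recipe_amount is None:
        # Null amount: pretend the recipe needs a fixed share of the pantry.
        return on_hand * UNSPECIFIED_FRACTION
    return recipe_amount


print(assumed_amount(12, None))  # "eggs" with 12 on hand is treated as 3.0
print(assumed_amount(12, 2))     # an explicit "2 eggs" is left alone
```

Keeping the fraction in a single constant is what makes it cheap to tune later.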
The shopping list was a bit more difficult. What happens if you combine one recipe that calls for “3 eggs” and one that just calls for “eggs”? Obviously, I can’t just show “3 eggs” as this would not be enough. I didn’t want to show “3 eggs” and then a note telling the user they might need more as well. In the end, I decided the only approach that makes any sense is to simply lose precision. In other words, KitchenPC does not have the information it requires to aggregate an amount, thus it cannot. Combining “3 eggs” and “eggs” will yield the shopping list item “eggs” and that’s that.
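The “lose precision” rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `aggregate` is hypothetical, amounts are plain numbers, and unit conversion is ignored entirely. `None` stands in for a null amount.

```python
from typing import Dict, Iterable, Optional, Tuple

Usage = Tuple[str, Optional[float]]  # (ingredient name, amount or None)


def aggregate(usages: Iterable[Usage]) -> Dict[str, Optional[float]]:
    """Sum amounts per ingredient; any null amount makes the total null."""
    totals: Dict[str, Optional[float]] = {}
    for ingredient, amount in usages:
        if ingredient not in totals:
            totals[ingredient] = amount
            continue
        current = totals[ingredient]
        if current is None or amount is None:
            # Not enough information to add, so the total stays unknown.
            totals[ingredient] = None
        else:
            totals[ingredient] = current + amount
    return totals


shopping_list = aggregate([("eggs", 3), ("eggs", None), ("flour", 2), ("flour", 1)])
print(shopping_list)  # eggs collapses to None, flour sums to 3
```

Once an ingredient's total becomes null it stays null no matter what is added afterwards, which matches the “and that’s that” behavior described above.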
I think these changes will make KitchenPC a more flexible site that can handle more types of recipes. In the end, most users won’t care a whole lot about amounts on every ingredient in their shopping list, and the pre-built meal plans will of course only contain perfect recipes.
Another very typical parse error was range amounts. An example of this would be something like “2-3 cups of flour.” I was able to get a rough estimate of these using:
select count(distinct PageId) from Indexer.ParseErrors where usage ~ $$^\s*\d+\s*\-\s*\d+$$;
For those unfamiliar with the $$Foo$$ syntax in Postgres, this provides a way to represent a string constant without having to escape anything. Perfect for regular expressions!
This yielded 50,407 recipes that include at least one ingredient with a range amount. Not quite as prevalent as null amounts, but still worth fixing.
For this one, I also considered a few different options. First, I could average the amount. This would be a change to the parser itself, and only affect the NLP engine. The average amount would be stored in the database, and nothing else would change. Alternatively, I could just take the high amount.
For the same reasons as the null ingredient amounts, I decided against this. Making KitchenPC flexible and able to display recipes in the way they were written is important. So once again, the database and backend need to be redesigned to handle this.
This work item has not yet been completed, but will involve storing a high and low value in the database and adding code at the NLP engine level to parse and extract these amounts.
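Since this work is not done yet, the following is only a rough Python sketch of what extracting a low and high value might look like. It reuses the pattern from the query above with capture groups added; the function name `parse_range` is hypothetical, and like the query it only handles whole numbers.

```python
import re
from typing import Optional, Tuple

# Same shape as the pattern in the SQL query, with groups for both ends.
RANGE_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)")


def parse_range(usage: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) if the usage starts with a range, else None."""
    match = RANGE_RE.match(usage)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_range("2-3 cups of flour"))  # (2, 3)
print(parse_range("1 cup sugar"))        # None
```

One caveat: a usage like “1-1/2 cups” also matches this pattern, so a real parser would need to tell mixed fractions apart from true ranges.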
In this case, shopping list aggregation will be easy. Simply aggregate the high amount, as that’s the worst case – better to buy too much than too little. The modeler is also pretty easy; just assume the average or the high amount and call it good.
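As a sketch of both ideas, assuming each amount is stored as a (low, high) pair and a fixed amount simply has low equal to high. The names `shopping_amount` and `modeler_amount` are illustrative, and units are ignored.

```python
from typing import Iterable, Tuple

Range = Tuple[float, float]  # (low, high); a fixed amount has low == high


def shopping_amount(ranges: Iterable[Range]) -> float:
    """Shopping list: add up the high ends, since that is the worst case."""
    return sum(high for _low, high in ranges)


def modeler_amount(amount: Range, use_high: bool = False) -> float:
    """Modeler: assume either the midpoint or the high end of the range."""
    low, high = amount
    return high if use_high else (low + high) / 2


print(shopping_amount([(2, 3), (1, 1)]))       # 4
print(modeler_amount((2, 3)))                  # 2.5
print(modeler_amount((2, 3), use_high=True))   # 3
```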
Right now, the top parse error is “salt and pepper”. There are also a good number of other usages that should be parsed as multiple ingredients, but I’m not quite sure of the best approach to fix this.
It could possibly be done at the parser level. If an ingredient could not be parsed, but it contained the word “and”, the parser could split on that word and try to parse each side separately. Though I know for a fact I need to handle “salt and pepper”, since it occurs in over 20,000 recipes, I still haven’t decided if I want a general solution for these types of ingredient usages. This one seems to be a one-off case that happens to be very popular.
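Here is roughly what the split-and-retry idea might look like in Python. Everything in it is hypothetical: `parse_one` stands in for the real NLP parser (returning `None` on failure), and the joining word is a parameter so other conjunctions could be tried.

```python
import re
from typing import Callable, List, Optional


def parse_with_split(
    usage: str,
    parse_one: Callable[[str], Optional[str]],
    conjunction: str = "and",
) -> Optional[List[str]]:
    """Parse the usage whole; if that fails, split on the conjunction."""
    whole = parse_one(usage)
    if whole is not None:
        return [whole]
    pattern = r"\s+" + re.escape(conjunction) + r"\s+"
    parts = re.split(pattern, usage.strip(), flags=re.IGNORECASE)
    if len(parts) < 2:
        return None
    parsed = [parse_one(part) for part in parts]
    # Accept the split only if every piece parses on its own.
    if any(item is None for item in parsed):
        return None
    return [item for item in parsed if item is not None]


# Toy stand-in for the real parser: it only knows two ingredients.
KNOWN = {"salt", "pepper"}


def toy_parse(text: str) -> Optional[str]:
    name = text.strip().lower()
    return name if name in KNOWN else None


print(parse_with_split("salt and pepper", toy_parse))      # ['salt', 'pepper']
print(parse_with_split("macaroni and cheese", toy_parse))  # None
```

Requiring every piece to parse keeps the fallback conservative, so a usage is never half-accepted.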
Hopefully, fixing these design limitations as well as tackling missing NLP vocabulary will eventually start increasing the success rate of the indexer. If not, well, I might have to re-assess the overall strategy.
I’m a bit depressed that this particular task is slowing down the launch of the new website. Everything else is ready to go, I just need the content and the categorization! Sigh. Hopefully stuff will start to turn around soon and work in my favor.