The first few days…
Well, I thought I’d write a quick post about the first few days of KitchenPC Beta. I’m exhausted, cranky, stressed out, on edge, sleep deprived, craving real food, and “paranoid” (I’ll explain in a bit) all at the same time.
First off, the good news. Everyone who’s used KitchenPC loves it and has had tons of good things to say. Granted, they are mostly my friends, and prying “honest” feedback out of a good friend is like getting a straight answer from a politician. With that said, I think the initial impression people have of the site is great. Whether it will start to show signs of any exponential growth remains to be seen.
Now a list of all the @!#$ that has gone wrong so far.
Bugs Preventing User Signups
Within a few hours of launching the site, I began to see some errors in the log about crashes in the CreateUser method, which was used to create a new user account based on a Facebook logon. The exception was a simple “Null reference exception”, and of course the .NET stack trace won’t give you anything useful like what object was null.
I started adding more and more logging to track down the issue, making sure I was checking every possible variable for a null value. The most annoying thing was that this was working for some people (like me!) and not others. Thus, something about certain Facebook accounts caused the crash. After adding the Facebook email to the trace logs, I noticed a friend of mine was attempting to create an account. Luckily, he was on Facebook at the time, so I sent him an instant message and asked if he could be my guinea pig as I tracked down the issue. After a bit more debugging and logging, I noticed the crash came right after the code tried to import the account’s “current location” data, which I was using as the default setting for the “Location” profile data on KitchenPC. Turns out, I had overlooked a single period in code which said “if(user.location.current_city…”, and user.location was of course null. My brain kept seeing “location” and “current_city” as all one property name. Ugh. So checking for that fixed the problem, and I was ready to go. I’m not sure what actually causes this, but suffice it to say that on certain Facebook accounts applications can read this information, and on others it’s blocked. Yay for Facebook’s convoluted security model.
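To make the fix concrete, here is a minimal sketch of the guard in Python rather than the site's actual .NET code. The function name default_location and the dictionary layout of the Facebook profile are assumptions for illustration; only the location and current_city names come from the bug described above.

```python
from typing import Any, Dict, Optional


def default_location(fb_user: Dict[str, Any]) -> Optional[str]:
    """Return the profile's current city, or None when Facebook withholds it.

    Hypothetical helper: the real code was .NET and its object model is not
    shown in the post. The point is to check each level before going deeper.
    """
    location = fb_user.get("location")
    if not location:
        # Some accounts do not expose location to applications at all,
        # so the whole object can be missing or null.
        return None
    # The nested property can be absent too; treat empty as "no default".
    return location.get("current_city") or None
```

With a guard like this, an account that hides its location simply gets no default “Location” value instead of crashing the signup.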
Meal Planner Engine Bugs
This doozy of a bug totally sucked all the life out of my Saturday night; however, it was somewhat of a “fun” bug from a hacker perspective. I had a few people complain about the meal planner not working for them. It would either take too long and they’d give up, or the page load would just time out. Either way, it was working fine for me when I tried. I finally found a repro case that would cause the meal planner to lock up every time. All I had to do was demand 7 recipes that will make use of 3 beets.
Now, I use bitmasks to store the allowed tags and the tags a particular ingredient can link to. That way, I can just say (AllowedTags & IngredientYouHave.LinkedTags) and if the result is nonzero, at least one recipe has that ingredient in that tag set. This is an incredibly fast way to start narrowing down the recipes we can consider, and also to detect whether a query is impossible.
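A small sketch of that bitmask test, written in Python for illustration. The Tag values and the function name can_match are made up; only the AND-and-compare-to-zero idea comes from the description above.

```python
from enum import IntFlag


class Tag(IntFlag):
    # Hypothetical tag set; each tag occupies one bit.
    BREAKFAST = 1
    LUNCH = 2
    DINNER = 4
    DESSERT = 8


def can_match(allowed_tags: int, linked_tags: int) -> bool:
    """True when at least one allowed tag is also linked to the ingredient."""
    # A nonzero AND means the two masks share at least one bit.
    return (allowed_tags & linked_tags) != 0
```

For example, can_match(Tag.LUNCH | Tag.DINNER, Tag.DINNER | Tag.DESSERT) is true because both masks contain DINNER, while can_match(Tag.BREAKFAST, Tag.DINNER) is false, which is the signal that a query can be rejected up front.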
At first, I thought my bitmasks were off by one. When the planner begins on a query, it checks these bitmasks to see whether there are indeed any recipes in the database that meet your criteria, and if not, it throws an ImpossibleQueryException, which translates into a polite “No recipes found” error for the user. I watched under the debugger as it was trying to find a matching recipe but kept failing, and figured the query was impossible and my code wasn’t detecting that up front. Soon, I noticed the query was indeed possible; however, there was only 1 matching “beet” recipe in the entire database and it kept on picking that. When trying to find a second beet recipe, it would only come back to the same one again and loop around, since the algorithm skips a recipe that already exists in the set and goes looking for a new one. This caused an infinite loop.
So basically, if there are zero recipes that match your criteria, I handle this case up front. If there are 3 recipes and you need 7, crash boom. I was able to come up with a decent mechanism to catch this behavior for now, but I think the ideal solution will be to return the “partial” set to the user and say “Sorry, this was the best I could do.”
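The “partial set” idea could look something like the sketch below. This is not the planner's real algorithm, which the post doesn't show; it is a simplified Python stand-in that caps the request at the number of distinct matching recipes so the selection can never spin forever. The names pick_recipes and the partial flag are assumptions, and the exception is named after the ImpossibleQueryException mentioned above.

```python
import random
from typing import List, Optional, Sequence, Tuple


class ImpossibleQueryException(Exception):
    """Raised when no recipe at all matches the query."""


def pick_recipes(
    candidates: Sequence[str],
    needed: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], bool]:
    """Pick up to `needed` distinct recipes; flag the result if it is partial."""
    rng = rng or random.Random()
    # Deduplicate while keeping order so each recipe is considered once.
    unique = list(dict.fromkeys(candidates))
    if not unique:
        raise ImpossibleQueryException("No recipes found")
    # Never ask for more distinct picks than exist; this bounds the work.
    count = min(needed, len(unique))
    picked = rng.sample(unique, count)
    return picked, count < needed
```

Asking for 7 recipes when only one beet recipe matches would return that single recipe with the partial flag set, which the page could turn into the “Sorry, this was the best I could do” message.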
Server Problems
The site was down for a couple of hours this afternoon due to an Apache issue. I use Ubuntu Server as my front end load balancer, running Apache and mod_proxy. I have exceptions for the /scripts and /images directories, so the Apache server hosts all the static files and any dynamic page requests are routed to one of two Windows servers. This is cool because I’m prepared for a big traffic spike, and also I can take down one server by commenting it out of a conf file, and mess with stuff or upgrade it.
For now, I also run PostgreSQL on this Unix server. That won’t work out for the long term, but right now my database is only a couple megs and Postgres requires around 10 megs of RAM to run, so it just wasn’t worth paying for another server instance for a tiny little database. When the time comes, provisioning a new database server will be easy; I just set up Postgres, migrate the DB contents over (or use streaming replication, setting up a warm stand-by) and then point the web servers over to the new database box. No DNS changes or anything.
However, due to a bug in Postgres 9.0’s installer (fixed in 9.0.2, thanks guys) there was a dependency on libuuid 1.6, which the installer didn’t install and which of course doesn’t ship with Ubuntu 10.04, so I had to build it myself. Somehow during this process, libuuid.so.1 got linked to 1.6, which Postgres was quite happy about. Apache, not so much. However, Apache didn’t show any signs of discontent for two days, at which point it decided to crash and not restart. Even more fun, it would attempt to restart, crash, then leave the process hung in memory. I had like 20 Apache processes going on.
For this issue, I had to go bug my friend Brian (a Unix guru) to ssh in to the box and help figure that one out. I was not in a good mood, as the production server was totally offline the entire time. Ugh! Brian finally worked it out and I totally owe him a beer next time I’m in Vegas.
IIS Being.. well.. IIS.
I’ve also been running into some quite annoying problems with IIS deciding it was bored and no longer answering requests. It’s as if IIS goes “Ooo look at the squirrel” and ignores the socket. When this happens, the site just doesn’t pick up and eventually the browser times out. I was tailing the Apache logs to make sure mod_proxy was indeed forwarding the requests to IIS, which it was. However, IIS’s logs said nothing at all about the requests, as if they never happened. Apache said the socket was closed unexpectedly or some such thing. This seems to happen after a few hours or maybe a day of uptime, and then I have to reset IIS to get it working again.
The workaround I’ve found for now is to write a JMeter (I love this program) script that loads the Login page every 60 seconds, which seems to keep IIS nice and busy and responding fast. This thing’s been running for days now, and the problem doesn’t occur as long as it’s running. However, I still need to get to the bottom of this nonsense. I do sometimes miss Microsoft, where I could just track down the guy who wrote the thing that doesn’t work and get him to help out.
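For anyone without JMeter handy, the same keep-alive trick can be sketched in a few lines of Python. This is a stand-in for the JMeter script, not a copy of it; the function names, the default timeout, and any URL you pass in are assumptions, and the fetch and sleep parameters exist only so the loop can be exercised without a real server.

```python
import time
import urllib.request
from typing import Callable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Load the page once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def keep_warm(
    url: str,
    interval_seconds: float = 60.0,
    max_pings: Optional[int] = None,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Request `url` every `interval_seconds`; return how many pings failed.

    With max_pings=None the loop runs until the process is stopped.
    """
    failures = 0
    sent = 0
    while max_pings is None or sent < max_pings:
        try:
            fetch(url)
        except OSError:
            # urllib's URLError/HTTPError and socket timeouts are OSErrors.
            failures += 1
        sent += 1
        if max_pings is None or sent < max_pings:
            sleep(interval_seconds)
    return failures
```

Calling keep_warm with the Login page's address and the default 60 second interval would mimic the workaround. Like the JMeter script, it only masks the symptom and says nothing about why IIS stops answering.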
Advertising
I don’t really know what defines “success” as far as a product launch goes, but I also seem to be off to a bit of a slow start as far as user signups. I’m just now seeing a few signups by people I don’t know, which is great because it means people are telling their friends, or word is getting around on Facebook through “So and so likes KitchenPC (Website)” posts. However, I’m still only at around 70 user accounts total, and the number doesn’t really seem to be picking up very quickly.
This morning I emailed around 200 people from the survey; however, these efforts might have been thwarted by both the weekend and the hours of downtime caused by the libuuid problems.
I plan on putting a lot of focus into getting the word out during this week. Hopefully I won’t be slowed down by any more annoying server problems.
There’s this whole other level of stress that accompanies a launch that I never saw coming. I thought that launching was a milestone, a point which represents a finality; where stress was relieved, not accrued. It just seems that everything matters more. There’s this frantic rush to fix serious bugs or get servers back online, since people are now counting on the site being up and people who I don’t know are using it. Every time I click on the link, I get paranoid that I’ll see a server error, or the recipe modeler will get stuck in another infinite loop, or Apache will be dead again or IIS will be looking at more woodland creatures. I guess this is just life. Hopefully I can start getting some traction going and start focusing on how to actually improve the product, rather than just treading water to keep the thing online and working.
Next, I’ll be blogging about my next priorities to move the business forward. Stay tuned!