Data, data everywhere: Using DBLink for PostgreSQL
Those who know me know I’m a huge fan of PostgreSQL. In fact, KitchenPC uses it exclusively for all its database needs. Recently, I developed a solution to easily move data between my production and staging databases, as the latter is currently being used to test the re-invention of the site.
Since the two database schemas are often slightly different, traditional approaches of exporting and importing database dumps don’t work very well, and would be cumbersome if I just wanted to pull in a few rows of data. One scenario in particular where this ability comes in handy is creating meal plans. Those who have recently volunteered to help me build these plans are able to enter recipes using the current implementation of the site, and I can then run a single SQL query to pull all new recipes from production into my staging database. To do this, I’m using a PostgreSQL package called DBLink. I figured I’d write a quick tutorial on how to set this up, in case anyone runs into a similar need.
Step 1: Set Up DBLink
With Postgres 9.0 and above, the binary modules required for linking across databases are installed by default. However, you’ll need to run a SQL script to create the necessary functions to access them from your database. This SQL script is called dblink.sql and should exist on both Windows and Unix installations of Postgres in the contrib directory. (On Postgres 9.1 and later, you can skip the script entirely and simply run CREATE EXTENSION dblink; instead.) Depending on your OS and installation, you might have to install a separate contrib package to get this file. For example, on Ubuntu you can install the necessary package using:
sudo apt-get install postgresql-contrib
Once you find dblink.sql, you can simply run the file using psql, or your query tool of choice. This command will look something like:
psql -h localhost -p 5432 -f dblink.sql -d dbname -U root -W
The -h option specifies the hostname; here we connect to the local instance. The -p option specifies the port, which will usually be 5432. The -f option tells psql to run the SQL commands in a file. It is assumed you’ll be running this command from the contrib directory, so I didn’t fully qualify the path. The -W option forces a password prompt. Replace dbname with the database you’d like to create the dblink functions in, and you can replace root with any user with sufficient privileges in that database.
When you run this command, you will hopefully see a stream of CREATE FUNCTION output, similar to the picture above, as each dblink function is created.
Step 2: Making Sure It Works
Next, open up your favorite database program, such as pgAdmin, and connect to your database. You should now be able to run a command such as:
select * from dblink(
  'hostaddr=10.0.0.1 dbname=ProductionDB user=Website password=secret',
  'select * from users'
) as t1(
  userid uuid,
  email varchar(50),
  alias varchar(50),
  fullname varchar(50),
  password varchar(100)
);
The dblink function above takes two parameters. The first is a connection string and the second is a SQL command to run on the remote server. You’ll also need to specify the columns and data types being returned by your query.
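If you find yourself repeating the same connection string in every call, dblink also provides dblink_connect, which establishes a named connection you can reuse. A quick sketch (the host, database, and credentials below are placeholders, following the example above):

```sql
-- Establish a named connection once (connection details are placeholders)
SELECT dblink_connect('prod',
  'hostaddr=10.0.0.1 dbname=ProductionDB user=Website password=secret');

-- Subsequent queries refer to the connection by name
SELECT * FROM dblink('prod', 'select userid, email from users')
  AS t1(userid uuid, email varchar(50));

-- Close the connection when done
SELECT dblink_disconnect('prod');
```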
Once you get to this point, the rest is easy!
The KitchenPC Implementation of DBLink
What I chose to do was create a new DB schema called ProdLink that mirrors the table schema of production. Within this schema, I created a series of views that return the matching data from the remote database. First, I created the schema:
CREATE SCHEMA ProdLink;
Next, I create a VIEW within this schema for each table. For example:
CREATE OR REPLACE VIEW ProdLink.Recipes AS
select * from dblink(
  'hostaddr=220.127.116.11 dbname=KitchenPC user=mike password=pwd',
  'select * from Recipes'
) as t1(
  recipeid uuid,
  title varchar(100),
  description varchar(255),
  dateentered timestamptz,
  imageurl varchar(100),
  credit varchar(100),
  crediturl varchar(255),
  servingsize smallint,
  preptime smallint,
  cooktime smallint,
  rating smallint,
  steps text,
  ownerid uuid,
  publicedit boolean,
  textsearch tsvector
);
This view simply runs a query through DBLink to pull in every recipe from the production database. Once that view is created, I can now just run:
SELECT * FROM ProdLink.Recipes;
and instantly see every recipe in production. Since this is simply a database view, I can now do anything I could with a normal view. For example, if I want to pull in any recipe from production that doesn’t already exist in my staging database, I can run:
INSERT INTO Recipes SELECT * FROM ProdLink.Recipes L WHERE NOT EXISTS (select 1 from Recipes where Recipes.RecipeId = L.RecipeId);
One nice thing about this method is the schemas between databases can be out of sync, and I can handle this through the INSERT statement itself. I can specify default values for newly created columns, use conditional logic, etc.
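For instance, suppose staging’s Recipes table had gained a new column that production lacks. Listing the columns explicitly lets the INSERT supply a value for it. A sketch, assuming a hypothetical new column named cuisineid (not part of the actual KitchenPC schema):

```sql
-- Hypothetical: staging has a new CuisineId column that production lacks,
-- so we list columns explicitly and supply a default for the new one.
INSERT INTO Recipes (RecipeId, Title, Description, DateEntered, CuisineId)
SELECT L.RecipeId, L.Title, L.Description, L.DateEntered,
       NULL  -- default for the column production doesn't have yet
FROM ProdLink.Recipes L
WHERE NOT EXISTS
  (SELECT 1 FROM Recipes WHERE Recipes.RecipeId = L.RecipeId);
```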
After creating a VIEW within the ProdLink schema for each production table, I then created a series of commands to freshen the data in my staging database with production data, which I can run on a daily basis or as I see fit.
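The refresh itself can be little more than one INSERT ... WHERE NOT EXISTS per table, run in dependency order inside a transaction. A sketch, assuming a ProdLink.Users view created the same way as ProdLink.Recipes (your own tables and key columns will differ):

```sql
-- Sketch of a refresh script: pull in new rows table by table,
-- parents before children so foreign keys are satisfied.
BEGIN;

INSERT INTO Users
SELECT * FROM ProdLink.Users L
WHERE NOT EXISTS
  (SELECT 1 FROM Users WHERE Users.UserId = L.UserId);

INSERT INTO Recipes
SELECT * FROM ProdLink.Recipes L
WHERE NOT EXISTS
  (SELECT 1 FROM Recipes WHERE Recipes.RecipeId = L.RecipeId);

COMMIT;
```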
My goal is to keep my staging database up to date with the latest data from production so that when the new site launches, I can minimize downtime by simply migrating the staging data itself into the new production database, rather than having to worry about upgrading the old production database to the new schema.
This may or may not be the best way to handle data migration between production and an evolving schema under development, but so far it has worked pretty well for me. It’s quite possible that this is an awful solution for massive databases, as each of these views is going to “download” all the data every time; however, that could probably be improved with some sort of timestamp or watermark. Luckily, I’m not yet at the point where I need to worry about huge amounts of data.
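A watermark could be as simple as filtering the remote query on a timestamp column, so only rows newer than what staging already holds come across the wire. A sketch, assuming the DateEntered column shown earlier (the connection details are placeholders); format() with %L safely quotes the cutoff into the remote query string:

```sql
-- Sketch: only fetch recipes newer than the latest one already in staging.
-- The remote query is a string, so the cutoff is interpolated with format().
SELECT * FROM dblink(
  'hostaddr=10.0.0.1 dbname=KitchenPC user=mike password=pwd',
  format('select recipeid, title, dateentered from Recipes where dateentered > %L',
         (SELECT coalesce(max(DateEntered), '-infinity'::timestamptz) FROM Recipes))
) AS t1(recipeid uuid, title varchar(100), dateentered timestamptz);
```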
Not a Postgres fan? Other major database systems have similar database linking features, so it’s quite possible to adopt these techniques using your database of choice. However, I’ll leave that up to you!