Archive for October, 2003

Database versioning techniques

Friday, October 17th, 2003

Today I had an interesting discussion with one of my colleagues. The problem being discussed had already troubled my mind a few months back, but was driven out of my mind by other, more pressing, problems (deadlines, etc.). The problem, as the title of this article suggests, is database versioning techniques. Vie searched the internet for a solution to this problem and what I found was a lot of misconception. So to eliminate that, I start of describing the problem first. Be careful, I smell a long article coming up.

A good developer keeps all his source code in an SCM tool (like CVS, Perforce, Arch, Bitkeeper, Subversion, but not Source Safe). There are some very good reasons to do this, which Imp not going to discuss here. Most applications handle data. There are even those amongst mankind that go as far as saying software and data are the same, which is true. (If you like that kind of conceptual difference, read Gödel, Escher, Bach: An Eternal Golden Braid by Douglas Hofstadter. It seems to happen in human cell replication all the time.) A large portion of this data handling software manipulates data in a convenient way: enter the database! Your software is coupled to this database. And as your software evolves (new features!) your database does too (new and/or altered tables!). So both evolve (hold that idea please).

Now, to keep the software part under version control there are three different things at work:

a) the plain text source code
b) a build script (arguably a meta-program, see G.E.B.)
c) the output of a) and b) combined, the actual program

Now since both a) and b) are under source control, c) is always achievable. (Well, in practice you need to version you build tools, which are used by the build script, too. But if you go down this road you’d have to version your entire build machine.)

Now consider the case for that database. The following aspects are at work here:

I) a SQL scripts

II) a build script that loads a) into a particular database.

Uh… (thinking) Nope, won’t work. Because most systems have a live (or production) environment. Once data has been entered into the database, you can’t recreate it, because that information is not in either I nor II! And this is exactly the problem. Most developers don’t go this far, and only provide I and II with their installation. But once you have to keep your production environment data alive, you have to reconsider your solution. Here we go:

I) a SQL script

II) incremental SQL scripts that alter a live environment from version x to y

III) a build script that will run I and/or II as needed, depending on the already existing database and its version

Already better. Now this is what you see with most applications that have encountered this problem. And this will work, if your don’t have 100 production databases. Problem solved, but not for us. (By the way, above solution will introduce some nasty dependencies when you are using multiple branches to store your source code and database scripts in your SCM!). So we need a different solution.

Lets iterate some goals this solution must provide:

– it should be highly scriptable, because were using this with a lot of databases
– we should be able to replay the action, to be able to test the solutions’ successfulness
– it doesn’t have to be reversible (this is plain out impossible to achieve)

After some thinking I came up with the following solution. I was intrigued by some discussions on this topic and one of the problems seems to be to have the ability to diff SQL dump files. That people are looking at this for their solution to this problem is pretty easy to see: SQL dump files can be stored in an SCM tool like any other plain text file. But my quick searches on Google and Joel didn’t turn up such a tool. Apparently this is something pretty hard to develop. So then it hit me: there are tools to diff instances of SQL dumps. With instances I mean real databases. Such tools take database A and database B and output a script that contains SQL instructions to get from A to B. Some of these tools don’t only look at the database schema, but also at the data itself! That was halve the puzzle, I thought.

So, I think the following is good (final) solution for this particular problem:

1) With each software version make a database dump, and store it along with the source code in the SCM tool of your liking.

2) During a build, create a (temporary) database from this file, under a different name than your production database

3) Next step during the build is to fire up the diff tool and let it generate a script

4) Run the generated script

Whether this will work depends on the quality of the diff tool, obviously. But those tools are not unheard of. So there is a good chance that they exist, hopefully scriptable too. Then, there is still one last issue to resolve. Look at the goals, there it said that it is plain out impossible to reverse versions (this is like a one way street folks!). In order to ascertain that we don’t try to downgrade our database the SQL dump file (step 1) could include a timestamp. Then we can insert an extra step between steps 2 and 3, that checks to see if the production database already has a newer timestamp. If so, don’t run steps 3 and 4.

Well, maybe those timestamps don’t suffice alone. Maybe we need to stick in some branch or version information too, but you get the idea.

The good part of this solution is that you can automate every step of this solution. Even better we can reproduce it, which aids testing before putting your stuff up on a production environment.

Well, imp calling it a night. Tomorrow ill devote some time at tricking a good diff tool for MySQL databases and see if I can prototype this kind of thing. (If anybody has some good ideas about this problem, solution of diff tool, drop me a line!).

(BTW: I wrote this a quick rant. I still have to see if this will actually work! But be sure to follow this blog because I will report my findings.)

Edit: Hmm, almost within 10 minutes I detected a flaw in this setup. I’ll write about it at some later time.

Mutex, oh mutex, why didn’t I think about you?

Thursday, October 16th, 2003

Today something reminded me of something I picked up from Jeffrey Richter, who talked about Mutexes in his book (Programming Windows Applications). I just failed to apply the principle to something that lives in a different context. The mutexes which I learned about were Windows API programming. The problem I faced was in shell scripting under Linux.

One thing with updating a live environment (read: a production server) is that you need to be in control of that environment to install a new version. We actually have this part pretty well covered, so that is positive. But as always, there are things to improve upon.

For our application we use a series of cronjobs that do all the maintenance-like tasks belonging to our application. In order to update the live environment we disable all the cronjobs. However, there is still this chance that one of those cronjobs already started and is still busy doing its job, while we try to update the environment. Not good. So that is why I wrote some shell scripts (we are on Linux) that parses the crontab file and uses a listing of programs that are currently running in the system (‘ps’ anyone?) and compares the two. If one of the cronjobs is found running on the system, we wait for it to finish and repeat the procedure, until all cronjobs have finished. This worked pretty well, and since I don’t write Bash scripts and Awk scripts everyday I picked up some new skills.

But then a colleague of mine told about how he fixed the same problem in a different project. That was when it finally hit me: instead of going through the trouble of parsing some text file, parsing output of other programs, comparing and waiting; all pretty error prone I should have thought about mutexes!

A mutex gets its name from MUTually EXclusive (other one or the other-type of thing). In windows there are API’s to create these things and you can use them between multiple threads or processes. One of those threads gets the handle to a mutex and the others are able to wait upon that mutex to be returned to the operating system, who on it’s turn decides which one of the waiting threads get the handle to that mutex next. This is used go protect a resource (like accessing a shared variable or the like).

My situation with the cronjobs was pretty similar: either the cronjobs got control of the production system or our installation script. Seems pretty mutually exclusive to me, in after sight at least.

So what to do about this? Linux doesn’t support mutexes (at least I haven’t heard of these things) in a shell (sure they are implemented in C/Java/etc.). So we have to settle for something that simulates it’s behavior, something that was the defacto practice 10, maybe 15 years ago: usage of the file system to communicate between threads and/or processes (so called semaphores).

Every cronjob will create a file in a specific directory. This should be the very first this the cronjob does. The last thing it does is removing this particular file. Then, when our install script is going to run it just needs to keep monitoring this directory, until all files are gone! If you’d like true mutex behavior, you’d have to let the install script write a file of it’s own and prohibit all the cronjobs from running while this file is present. This, however, is not needed in our situation because we first turn off the cronjobs all together.

One last thing that this solution doesn’t provide is something the windows API variant did provide: atomicity! But, I leave that as an exercise to the reader (if they are not thoroughly bored by now)!