Notebook
October 17th, 2003 by Jilles

Today I had an interesting discussion with one of my colleagues. The problem being discussed had already troubled my mind a few months back, but was driven out of my mind by other, more pressing, problems (deadlines, etc.). The problem, as the title of this article suggests, is database versioning techniques. Vie searched the internet for a solution to this problem and what I found was a lot of misconception. So to eliminate that, I start of describing the problem first. Be careful, I smell a long article coming up.

A good developer keeps all his source code in an SCM tool (like CVS, Perforce, Arch, Bitkeeper, Subversion, but not Source Safe). There are some very good reasons to do this, which Imp not going to discuss here. Most applications handle data. There are even those amongst mankind that go as far as saying software and data are the same, which is true. (If you like that kind of conceptual difference, read Gödel, Escher, Bach: An Eternal Golden Braid by Douglas Hofstadter. It seems to happen in human cell replication all the time.) A large portion of this data handling software manipulates data in a convenient way: enter the database! Your software is coupled to this database. And as your software evolves (new features!) your database does too (new and/or altered tables!). So both evolve (hold that idea please).

Now, to keep the software part under version control there are three different things at work:

a) the plain text source code
b) a build script (arguably a meta-program, see G.E.B.)
c) the output of a) and b) combined, the actual program

Now since both a) and b) are under source control, c) is always achievable. (Well, in practice you need to version you build tools, which are used by the build script, too. But if you go down this road you’d have to version your entire build machine.)

Now consider the case for that database. The following aspects are at work here:

I) a SQL scripts

II) a build script that loads a) into a particular database.

Uh… (thinking) Nope, won’t work. Because most systems have a live (or production) environment. Once data has been entered into the database, you can’t recreate it, because that information is not in either I nor II! And this is exactly the problem. Most developers don’t go this far, and only provide I and II with their installation. But once you have to keep your production environment data alive, you have to reconsider your solution. Here we go:

I) a SQL script

II) incremental SQL scripts that alter a live environment from version x to y

III) a build script that will run I and/or II as needed, depending on the already existing database and its version

Already better. Now this is what you see with most applications that have encountered this problem. And this will work, if your don’t have 100 production databases. Problem solved, but not for us. (By the way, above solution will introduce some nasty dependencies when you are using multiple branches to store your source code and database scripts in your SCM!). So we need a different solution.

Lets iterate some goals this solution must provide:

– it should be highly scriptable, because were using this with a lot of databases
– we should be able to replay the action, to be able to test the solutions’ successfulness
– it doesn’t have to be reversible (this is plain out impossible to achieve)

After some thinking I came up with the following solution. I was intrigued by some discussions on this topic and one of the problems seems to be to have the ability to diff SQL dump files. That people are looking at this for their solution to this problem is pretty easy to see: SQL dump files can be stored in an SCM tool like any other plain text file. But my quick searches on Google and Joel didn’t turn up such a tool. Apparently this is something pretty hard to develop. So then it hit me: there are tools to diff instances of SQL dumps. With instances I mean real databases. Such tools take database A and database B and output a script that contains SQL instructions to get from A to B. Some of these tools don’t only look at the database schema, but also at the data itself! That was halve the puzzle, I thought.

So, I think the following is good (final) solution for this particular problem:

1) With each software version make a database dump, and store it along with the source code in the SCM tool of your liking.

2) During a build, create a (temporary) database from this file, under a different name than your production database

3) Next step during the build is to fire up the diff tool and let it generate a script

4) Run the generated script

Whether this will work depends on the quality of the diff tool, obviously. But those tools are not unheard of. So there is a good chance that they exist, hopefully scriptable too. Then, there is still one last issue to resolve. Look at the goals, there it said that it is plain out impossible to reverse versions (this is like a one way street folks!). In order to ascertain that we don’t try to downgrade our database the SQL dump file (step 1) could include a timestamp. Then we can insert an extra step between steps 2 and 3, that checks to see if the production database already has a newer timestamp. If so, don’t run steps 3 and 4.

Well, maybe those timestamps don’t suffice alone. Maybe we need to stick in some branch or version information too, but you get the idea.

The good part of this solution is that you can automate every step of this solution. Even better we can reproduce it, which aids testing before putting your stuff up on a production environment.

Well, imp calling it a night. Tomorrow ill devote some time at tricking a good diff tool for MySQL databases and see if I can prototype this kind of thing. (If anybody has some good ideas about this problem, solution of diff tool, drop me a line!).

(BTW: I wrote this a quick rant. I still have to see if this will actually work! But be sure to follow this blog because I will report my findings.)

Edit: Hmm, almost within 10 minutes I detected a flaw in this setup. I’ll write about it at some later time.

October 16th, 2003 by Jilles

Today something reminded me of something I picked up from Jeffrey Richter, who talked about Mutexes in his book (Programming Windows Applications). I just failed to apply the principle to something that lives in a different context. The mutexes which I learned about were Windows API programming. The problem I faced was in shell scripting under Linux.

One thing with updating a live environment (read: a production server) is that you need to be in control of that environment to install a new version. We actually have this part pretty well covered, so that is positive. But as always, there are things to improve upon.

For our application we use a series of cronjobs that do all the maintenance-like tasks belonging to our application. In order to update the live environment we disable all the cronjobs. However, there is still this chance that one of those cronjobs already started and is still busy doing its job, while we try to update the environment. Not good. So that is why I wrote some shell scripts (we are on Linux) that parses the crontab file and uses a listing of programs that are currently running in the system (‘ps’ anyone?) and compares the two. If one of the cronjobs is found running on the system, we wait for it to finish and repeat the procedure, until all cronjobs have finished. This worked pretty well, and since I don’t write Bash scripts and Awk scripts everyday I picked up some new skills.

But then a colleague of mine told about how he fixed the same problem in a different project. That was when it finally hit me: instead of going through the trouble of parsing some text file, parsing output of other programs, comparing and waiting; all pretty error prone I should have thought about mutexes!

A mutex gets its name from MUTually EXclusive (other one or the other-type of thing). In windows there are API’s to create these things and you can use them between multiple threads or processes. One of those threads gets the handle to a mutex and the others are able to wait upon that mutex to be returned to the operating system, who on it’s turn decides which one of the waiting threads get the handle to that mutex next. This is used go protect a resource (like accessing a shared variable or the like).

My situation with the cronjobs was pretty similar: either the cronjobs got control of the production system or our installation script. Seems pretty mutually exclusive to me, in after sight at least.

So what to do about this? Linux doesn’t support mutexes (at least I haven’t heard of these things) in a shell (sure they are implemented in C/Java/etc.). So we have to settle for something that simulates it’s behavior, something that was the defacto practice 10, maybe 15 years ago: usage of the file system to communicate between threads and/or processes (so called semaphores).

Every cronjob will create a file in a specific directory. This should be the very first this the cronjob does. The last thing it does is removing this particular file. Then, when our install script is going to run it just needs to keep monitoring this directory, until all files are gone! If you’d like true mutex behavior, you’d have to let the install script write a file of it’s own and prohibit all the cronjobs from running while this file is present. This, however, is not needed in our situation because we first turn off the cronjobs all together.

One last thing that this solution doesn’t provide is something the windows API variant did provide: atomicity! But, I leave that as an exercise to the reader (if they are not thoroughly bored by now)!

July 13th, 2003 by Jilles

Developing… I’d like to contribute!

I’ve made small contributions to open source projects. Someone on the translation front, others by contributing idea’s and/or actual code. But I’d like to do more (after all I’ve been using open source software so much!). One of the things I always dread is to ‘get into the project’. You have to read so much: how is the architecture, get all the dependencies, read the policies regarding commit access, testing, etc. So I’m trying to select a project on which I want to focus for the next few months, trying to come up with some useful patches in my free time. A couple of candidates: Apache, Subversion, Anthill, Phing/Binary Cloud, PHP5 and Mozilla. All these projects touch my professional work as well, and that is A Good Thing(tm). I’d like to walk through these projects one by one. (This it actually not to inform you of something I’m going to do, just my ‘internal’ decision making process, exposed to the rest of the world. Read it or disregard it as you see fit :-o).

What would I like to find in a project on which I’m going to work in my (so scarce) free time?

Multiplatform: I’d like to do something that I could use/develop on Linux as well as on windows (the latter having less priority of course).

Perspective: It should be a project that has a lifetime in front of it (not something that is going to be replaced in a few years, e.g. CVS).

Language: Preferably a compiled language (C/C++/Java/etc.) instead of an interpreted language (PHP/Perl/etc) because that’s what I work with professionally. I’d like to keep my skills sharp over the entire line.

Documentation: A project with a good organization has the advantage (e.g. RFC’s/bug reports etc. are input to a starting contributor). A HACKING file or some other documentation to orientate starting developers is an advantage!

Easy Compile-Run-Debug cycle: I’d like to contribute to the Linux kernel but compiling the kernel, installing it, rebooting it… that’s too long a cycle for yours truly.

Coolness factor: this property of a project is undefined but working on some projects is just cooler than working on other projects (e.g. Linux kernel is way cool, writing man pages is definitely not :-?).

Need for developers: some projects are developed into their 2.x versions. Clearly they don’t need the developers as much as a 0.1beta project. If you want to contribute something the latter has more need of you.

Albright, here is my decision matrix(tm).

Apache

Subversion

Anthill

Phing/BC

PHP5

Mozilla

Multiplatform

7

7

7

7

6

7

Perspective

7

8

6

6

8

7

Language

7

7,5

6

6,5

7

7,5

Documentation

8

7

6

6,5

6,5

7

Devel-cycle

7

8

8

8

6,5

7

Coolness

7,5

9

7

7

8

8

Developers needed

6

8

7

8

8

7

Totals

49,5

54,5

47

49

50

50,5

(Blab bla: my opinion, would like comments, this is my opinion, these values may or may not be based on the truth)

Well if you’d take this table, I’d go working on Subversion… I’d like that very much. But, this table itself is not definitive. So if you, the reader, know any topics on which the above projects should be judged, before choosing a project, please tell me and I’ll include them in the above table. The same goes if you know of any projects that might be cool for me to join. Mail me! or place a comment here, or backtrack, whatever you like.

First I’ve got some other stuff to finish, and in the meantime I can read documentation and stuff, to get into this new project…

July 12th, 2003 by Jilles

I’ve been playing around with SCM systems (Software Configuration Management) earlier (played around with CVS, VSS and BitKeeper). Of these three CVS has in my opinion the best thing going for it (large, very large, adoption easily extendable, etc.). At that time (a few years back now) I already noticed Subversion, abbreviated to SVN. At that moment they had all the plans ready (architecture, etc.) and I think they were in their 0.1x’s (version). Now I’m running a decently sized project on CVS I’m keeping my eyes open for alternatives, and SVN is one of the best candidates (BitKeeper looses it because of the licenses).

One of the best features of SVN (in my opinion) is the branching strategy they’ve taken. Although I already had experience with CVS, other team-members didn’t have that when we switched to CVS. Explaining all the stuff and the command line options is simple enough. But the one thing I’ve noticed that is the hardest to pick up for new users is branching. Of course, you’ll draw some pretty tree-like pictures and start explaining the stuff; everyone says he/she gets it. But when it is put to practice I see enough things going wrong regarding branches. One of the biggest obstacles, imho, is the fact that branches are on another dimension in CVS. People forget to switch between branches and trunk because it’s not obvious on which one they are working. This is not the case in SVN, here a branch is just a full copy of the trunk, originating from a certain revision. You can do stuff like this with SVN:

\Project1
\trunk
\src
\doc
\branches
\feature1
\src
\doc
\jilles-playing-ground
\src
\doc

This is a huge improvement over CVS and it will be so much easier to explain this to people new to SCM/CVS/SVN. Perfect. Another improvement is very obvious: every change to the repository results in a new revision of the entire repository. This way directories and such can also be versioned.

Why am I still using CVS? Well, I’m still missing some features from SVN, but those will come in time. There is a nice cvs2svn script that I will check out sometime. But paraphrasing Joel Spolsky: a good tactic to convert people to your product is to make it easy to switch back. That’s why I’d like to see a svn2cvs script. Having such a script enables people to easy switch back to CVS once they decide that they like that better (unlikely) or they decide that SVN doesn’t have the features they want yet (more likely that the previous reason).

One other disadvantage of SVN: the tags. Under CVS a branch and a tag are two different beasts. Under SVN (as far as I see it/read it in the docs) they are both the same. Besides the directories ‘trunk’ and ‘braches’ you can create a directory ‘tags’ and essentially create branches. So far so good, thanks to SVN’s shallow copying technique this is fast and doesn’t consume much disk space. My take on a tag is “a constant snapshot of the tree at a certain moment”. Note the word constant. Once I declare version 1.4 of my product and build, package and ship it I do not want the option/feature of changing the 1.4 tag in the repository and thereby creating a difference between the 1.4 version in the tree and the 1.4 version that is installed at the customer. Of course, an administrator should be able to move that tag around in case of an erroneous tag command but this should not be made easy. But in subversion a tag is tag only because of the way the developer looks at it. I’d like a command in SVN that would enable me to “freeze” a branch: such that once I create a tag (read: branch) I can freeze it and thereby disallowing all developers from committing to that branch, ensuring that version X in the tree is the same version X that I’ll ship.

(Of course, I know that if you create a ‘tag’ and some developer commits changes to it, you can easily back out of those changes, but that would just be mending your wounds instead of preventing the wound ever from taking place.)

All in all though, I like SVN (much better than CVS). I’m just biding my time till they reach a version more close to their 1.0 release