Archive for the ‘Website architecture’ Category

A Sneak Peek at Marktplaats

Monday, April 28th, 2008

On April 12th I presented at PFCongrez, a yearly gathering of PHPFreakz. Three other presentations were given during the day. The first was by Peter-Paul Koch (ppk for short), who spoke about unobtrusive JavaScript. He presented with a lot of energy, which made it very enjoyable!

After that, I presented a sneak peek at Marktplaats, in which I gave some insight into what it takes to run one of the biggest sites in the Netherlands. The talk covers the high-level production setup, highlights some of the challenges of operating hundreds of database deployments, and goes into some of the issues Marktplaats runs into while using PHP.

The slides are embedded below, or up for download here. Or join the after party discussion.

The other presentations during the day that can be found online:

Next to that, I am trying to gauge interest in our tool for managing database schemas: DBC. It keeps database schemas in sync with your application and allows developers to branch off the database schema, much as a version control system lets you branch your code. If you are interested in this tool, please contact me at “jilles &at& marktplaats . nl”. We are looking to see if there is enough interest to open source the tool.

Software development is hard

Saturday, October 6th, 2007

Kyle Wilson recently wrote a really nice piece on why software development is so hard. For me it didn’t contain new insights (I’m already convinced), but it did a really nice job of quantifying the problem space, something I had not seen articulated so clearly before. If you’re in this line of business, it’s a must-read.

The thing that makes this article so interesting is that for some reason Kyle has access to information about five large software development projects: Chandler (the OSS Exchange replacement), Myst Online, Fracture (a new game), the software that controls an F-22 fighter jet and the FBI’s Virtual Case File.

After describing some of the pitfalls the Chandler team fell into, he goes on to outline why Lines of Code (LOC) is a useless metric for determining the complexity of a software program. More importantly, he throws in some statistics on the aforementioned projects that really drive this home.

Short list of conclusions:

• LOC is useless as a means to describe either the complexity of the program or the amount of effort that went into producing it
• Project teams need an economic framework (in the broadest sense of the word) in order to be successful. Otherwise there is no forcing function for decisions (like design choices, feature sets and release dates).
• In theory the complexity of a well-structured program should be O(n), where n is the number of lines of code (each line tightly coupled only with the lines directly before and after it). A poorly structured program would be O(n^2), since any one line can have dependencies throughout the codebase.
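To get a feel for the gap between O(n) and O(n^2) in that last point, here is a minimal sketch that simply counts couplings under two assumed models: a chain where each line only touches its neighbour, and a worst case where every line may depend on every other line. The function names are made up for illustration and do not come from Kyle's article.

```python
def chain_couplings(n: int) -> int:
    """Well-structured case: each line is coupled only to its neighbour."""
    return max(n - 1, 0)


def pairwise_couplings(n: int) -> int:
    """Poorly structured case: any line may depend on any other line."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Print how both counts grow as the codebase gets larger.
    for lines in (10, 1_000, 100_000):
        print(lines, chain_couplings(lines), pairwise_couplings(lines))
```

Going from 10 to 1,000 lines multiplies the chain count by roughly a hundred, while the pairwise count grows by roughly ten thousand.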

Favorite quote, from the 1968 NATO Software Engineering Conference: “We undoubtedly produce software by backward techniques. […] We build systems like the Wright brothers built airplanes: build the whole thing, push it off the cliff, let it crash, and start over again”.

And this one: “Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves” — Alan Kay (the father of Smalltalk).

Vendors vs Application providers

Saturday, October 6th, 2007

Werner Vogels (the CTO of Amazon) published a paper that describes Dynamo, Amazon’s highly available, eventually consistent data store that scales incrementally. It was an excellent read, and if you’re in the business of providing a high-traffic, highly available application (web or otherwise) I suggest you take a look!

That post reinforced a point I have come across before: why is it that a company like Amazon is building these types of infrastructure components? There are other examples like it, such as the excellent, world-class technology within eBay. Or, more publicly, why did LiveJournal.com develop memcached or MogileFS? Why did Google write GFS? And the list goes on. This, by the way, does not pertain only to storage solutions. Within eBay I see some really cool technology in different areas that could be spun off into separate products, but I am not in a position to disclose those.

I do see that having such a technology could be a competitive advantage (for a while), but at this scale I’m not sure that really holds. For example, both Amazon and Google currently have a highly scalable data store (Dynamo and GFS respectively). (They are a bit different, with Dynamo storing data smaller than 1MB.)

Those technologies are really cool, and they scratch an itch that is absolutely there for these companies. But bottom line, eBay, Google and LiveJournal should be adding features and improving the user experience rather than writing infrastructure components. Now, in order to either a) write those features or b) bring operational cost down (or availability up) you might need these technologies, but that does not translate 1:1 into actually writing them yourself. Ideally, an application provider such as Amazon should be able to come up with a cool feature, buy the technology needed to back that feature up and develop the feature using the technology bought.

Now, why is it then that no 3rd party vendor has stepped into this space and provided similar technology? Why is it that no one from these companies has set off on their own and started a company providing a technology like Dynamo? Why doesn’t a big database or storage vendor step into this space? Clearly there are some big companies out there that need this technology (Amazon, eBay, Google, Yahoo, and there are certainly more). So, really, why has nobody stepped into this space? Or, in reverse, which companies provide these types of products?

Memcached usage across large web properties

Tuesday, May 29th, 2007

Lately a discussion has started on the memcached mailing list in which, for example, the guys behind facebook.com and bloglines.com are participating and sharing some of their experiences. I don’t think this is rocket science, but I’d like to quote some of the things that are being said and provide some links to the relevant discussions.

About the general “would you want to bet your uptime on memcached as an infrastructure component?”-question:

We consider memcached a critical part of our infrastructure. The benefit of memcached in a typical setup is to reduce the amount of database hardware you need to support an application; if you have enough database horsepower to run unimpaired with most of your memcached servers out of service, then there’s probably no point using memcached at all, since it without a doubt adds extra complexity to your application code. [link]

If you shard all your data, etc. etc., is memcached still worth it?

Question:
And you would split (federate) your database into 100 chunks (the remaining 100 would be hot spares of the first 100 and could even be used to serve reads), wouldn’t that take care of all your database load needs and pretty much eliminate the need for memcache? Wouldn’t 50 such boxes be enough in reality?
Answer:
Don’t forget about latency. At Hi5 we cache entire user profiles that are composed of data from up to a dozen databases. Each page might need access to many profiles. Getting these from cache is about the only way you can achieve sub 500ms response times, even with the best DBs. [link]
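The latency argument in that answer can be sketched in a few lines. This is only an illustration of the general cache-aside idea and not Hi5's actual code: a plain dict stands in for memcached, and `make_source`, `get_profile` and the `stats` counter are hypothetical names that simulate a profile assembled from a dozen databases.

```python
from typing import Callable, Dict, List

# Plain dict standing in for a memcached client (assumption for this sketch).
cache: Dict[str, dict] = {}
# Counts simulated database round trips so the effect of the cache is visible.
stats = {"queries": 0}


def make_source(field: str) -> Callable[[int], dict]:
    """Build a hypothetical fetch function that simulates one database."""
    def fetch(user_id: int) -> dict:
        stats["queries"] += 1
        return {field: f"{field}-of-user-{user_id}"}
    return fetch


def get_profile(user_id: int, sources: List[Callable[[int], dict]]) -> dict:
    """Return the cached profile, assembling it from every source on a miss."""
    key = f"profile:{user_id}"
    profile = cache.get(key)
    if profile is None:
        profile = {}
        for fetch in sources:
            profile.update(fetch(user_id))
        cache[key] = profile
    return profile


if __name__ == "__main__":
    sources = [make_source(f"field{i}") for i in range(12)]
    get_profile(42, sources)
    print("queries after first view:", stats["queries"])
    get_profile(42, sources)
    print("queries after second view:", stats["queries"])
```

The first view pays for a round trip to every database, while the second is served entirely from the cache. With many profiles per page, that difference is what the answer is pointing at.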

Also, there is a lot of talk about a FUSE (Filesystem in Userspace) filesystem built on top of memcached. Not only would that make caching available to applications you do not control (black boxes), but it would also have some really great advantages for your generic PHP app:

Over the last two weeks i spent a lot of time discussing a memcachefs (fuse-based) with two fellow geeks - applications that came to mind were (a) the smarty cache (b) php sessions; for both cases, losing files (as a whole, not random parts inside) is ok and readdir is irrelevant, which allows cutting a lot of corners. [link]

PHP vs Ruby on Rails

Tuesday, May 29th, 2007

Terry Chay over at “The Woodwork” has a lengthy but nicely written blog post about a PHP vs Ruby on Rails discussion. If you’re interested in that kind of stuff, read the article: it has some juicy humor sprinkled into it as well; it’s a bit flame bait too…

Favourite quote (quoting another quote):

“First they ignore you, then they laugh at you, then they fight you, then you win.”
—Mahatma Ghandi

OSCON 2005:

“Unless you’re Ruby.”
—Danny O’Brien, “On Evil”

And:

I can’t speak for Alex, but what I’m saying is look at the top 100 websites on the internet: about 40% of them are written in PHP and 0% of them are written in Rails. (Yes, I can (and am) using this statistic to grind you Ruby fuckers into the dust.)

Database? No database?

Friday, April 28th, 2006

There have been a lot of interesting discussions and posts lately on how certain high-profile websites handle their data, particularly a series of posts on O’Reilly Radar. The first post is about Second Life, where they talk about MySQL (separate master-slave pairs handling the data partitions, with one master-slave pair indicating where what data lives).
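As I read that description, the setup boils down to a lookup table in front of the partitions. Below is a rough sketch of that idea, not Second Life's actual implementation: dicts stand in for the master-slave pairs, the class name is made up, and the round-robin placement of new keys is purely an assumption to keep the example small.

```python
from typing import Dict, List, Optional


class DirectoryShardedStore:
    """Toy model of data partitions plus one directory that locates each key."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for one master-slave pair holding a partition.
        self.shards: List[Dict[str, str]] = [{} for _ in range(shard_count)]
        # Stands in for the extra pair that records where each key lives.
        self.directory: Dict[str, int] = {}
        self._next_shard = 0

    def put(self, key: str, value: str) -> None:
        shard_id = self.directory.get(key)
        if shard_id is None:
            # New key: place it round-robin (assumed policy) and record it.
            shard_id = self._next_shard
            self._next_shard = (self._next_shard + 1) % len(self.shards)
            self.directory[key] = shard_id
        self.shards[shard_id][key] = value

    def get(self, key: str) -> Optional[str]:
        # Every read first asks the directory which partition to query.
        shard_id = self.directory.get(key)
        if shard_id is None:
            return None
        return self.shards[shard_id].get(key)


if __name__ == "__main__":
    store = DirectoryShardedStore(shard_count=3)
    for name in ("alice", "bob", "carol", "dave"):
        store.put(name, f"inventory of {name}")
    print(store.directory)
    print(store.get("carol"))
```

The upside of the explicit directory is that a key can be moved to another partition by updating one row. The downside is that every request depends on that one lookup pair being available.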

The second installment features Bloglines which uses MySQL (Users and passwords in one master-slave, feed information in another) but also large parts in some file storage.

A third posting talks about Flickr, who basically started out with the “one database fits all” methodology on MySQL. And this is where I think experience really comes into play. If you were designing the Flickr database with, for example, Bloglines’ experience under your belt, you would not have started out that way. But Flickr couldn’t escape the segmentation either, and divided their data up into what they call “shards” (separate master-slave pairs, as I read it).

So, it is pretty apparent that MySQL is being used for some fairly large sites, while most of these employ the strategy of segmenting the data across several master-slave combinations. Some are actually using LiveJournal’s memcached too!