Recent reading

Cool articles! You must read them.

Friday, May 19, 2006

Continuous Integration

Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly. This article is a quick overview of Continuous Integration summarizing the technique and its current usage.


I vividly remember one of my first sightings of a large software project. I was taking a summer internship at a large English electronics company. My manager, part of the QA group, gave me a tour of a site and we entered a huge depressing warehouse stacked full with cubes. I was told that this project had been in development for a couple of years and was currently integrating, and had been integrating for several months. My guide told me that nobody really knew how long it would take to finish integrating. From this I learned a common story of software projects: integration is a long and unpredictable process.

But this needn't be the way. Most projects done by my colleagues at ThoughtWorks, and by many others around the world, treat integration as a non-event. Any individual developer's work is only a few hours away from a shared project state and can be integrated back into that state in minutes. Any integration errors are found rapidly and can be fixed rapidly.

This contrast isn't the result of an expensive and complex tool. The essence of it lies in the simple practice of everyone on the team integrating frequently, usually daily, against a controlled source code repository.

When I've described this practice to people, I commonly find two reactions: "it can't work (here)" and "doing it won't make much difference". What people find when they try it is that it's much easier than it sounds, and that it makes a huge difference to development. Thus the third common reaction is "yes we do that - how could you live without it?"

The term 'Continuous Integration' originated with the Extreme Programming development process, as one of its original twelve practices. When I started at ThoughtWorks, as a consultant, I encouraged the project I was working with to use the technique. Matthew Foemmel turned my vague exhortations into solid action and we saw the project go from rare and complex integrations to the non-event I described. Matthew and I wrote up our experience in the original version of this paper, which has been one of the most popular papers on my site.

See Related Article: Continuous Integration (original version)

The original article on Continuous Integration from this website. If you are following most links, this is the article they intended you to read (although I think the new one is better). It describes the experiences we went through as Matt helped put continuous integration into place on a project at ThoughtWorks in the early 00's.

Although Continuous Integration is a practice that requires no particular tooling to deploy, we've found that it is useful to use a Continuous Integration server. The best known such server is CruiseControl, an open source tool originally built by several people at ThoughtWorks and now maintained by a wide community. The original CruiseControl is written in Java but is also available for the Microsoft platform as CruiseControl.net.


Building a Feature with Continuous Integration

The easiest way for me to explain what CI is and how it works is to show a quick example of how it works with the development of a small feature. Let's assume I have to do something to a piece of software; it doesn't really matter what the task is. For the moment I'll assume it's small and can be done in a few hours. (We'll explore longer tasks, and other issues, later on.)

I begin by taking a copy of the current integrated source onto my local development machine. I do this by checking out a working copy from the mainline using a source code management system.

The above paragraph will make sense to people who use source code control systems, but be gibberish to those who don't. So let me quickly explain that for the latter. A source code control system keeps all of a project's source code in a repository. The current state of the system is usually referred to as the mainline. At any time a developer can take a controlled copy of the mainline onto their own machine; this is called checking out. The copy on the developer's machine is called a working copy. (Most of the time you actually update your working copy to the mainline - in practice it's the same thing.)

Now I take my working copy and do whatever I need to do to complete my task. This will consist of both altering the production code, and also adding or changing automated tests. Continuous Integration assumes a high degree of tests which are automated into the software: a facility I call self-testing code. Often these use a version of the popular XUnit testing frameworks.
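
To make "self-testing code" concrete, here is a minimal sketch of such a test written with NUnit, the .NET member of the XUnit family. The PriceCalculator class is a hypothetical example for illustration, not something from a real project.

using NUnit.Framework;

// A hypothetical class under test; the name and behavior are purely illustrative.
public class PriceCalculator
{
    public decimal Total(decimal unitPrice, int quantity)
    {
        return unitPrice * quantity;
    }
}

[TestFixture]
public class PriceCalculatorTests
{
    // A self-checking test: it reports pass or fail on its own, with no human inspection.
    [Test]
    public void TotalMultipliesUnitPriceByQuantity()
    {
        PriceCalculator calculator = new PriceCalculator();
        Assert.AreEqual(30m, calculator.Total(10m, 3));
    }
}

The build script runs the whole suite of such tests and treats any test failure as a failed build, which is what the next step relies on.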

Once I'm done (and usually at various points when I'm working) I carry out an automated build on my development machine. This takes the source code in my working copy, compiles and links it into an executable, and runs the automated tests. Only if it all builds and tests without errors is the overall build considered to be good.

With a good build, I can then think about committing my changes into the repository. The twist, of course, is that other people may, and usually have, made changes to the mainline before I get a chance to commit. So first I update my working copy with their changes and rebuild. If their changes clash with my changes, it will manifest as a failure either in the compilation or in the tests. In this case it's my responsibility to fix this and repeat until I can build a working copy that is properly synchronized with the mainline.

Once I have made my own build of a properly synchronized working copy I can then finally commit my changes into the mainline, which then updates the repository.

However my commit doesn't finish my work. At this point we build again, but this time on an integration machine based on the mainline code. Only when this build succeeds can we say that my changes are done. There is always a chance that I missed something on my machine and the repository wasn't properly updated. Only when my committed changes build successfully on the integration machine is my job done. This integration build can be executed manually by me, or done automatically by CruiseControl.

If a clash occurs between two developers, it is usually caught when the second developer to commit builds their updated working copy. If not the integration build should fail. Either way the error is detected rapidly. At this point the most important task is to fix it, and get the build working properly again. In a Continuous Integration environment you should never have a failed integration build stay failed for long. A good team should have many correct builds a day. Bad builds do occur from time to time, but should be quickly fixed.

The result of doing this is that there is a stable piece of software that works properly and contains few bugs. Everybody develops off that shared stable base and never gets so far away from that base that it takes very long to integrate back with it. Less time is spent trying to find bugs because they show up quickly.


Practices of Continuous Integration

The story above is the overview of CI and how it works in daily life. Getting all this to work smoothly obviously takes rather more than that. I'll focus now on the key practices that make up effective CI.

Maintain a Single Source Repository

Software projects involve lots of files that need to be orchestrated together to build a product. Keeping track of all of these is a major effort, particularly when there's multiple people involved. So it's not surprising that over the years software development teams have built tools to manage all this. These tools - called Source Code Management tools, configuration management, version control systems, repositories, or various other names - are an integral part of most development projects. The sad and surprising thing is that they aren't part of all projects. It is rare, but I do run into projects that don't use such a system and use some messy combination of local and shared drives.

So as a simple basis make sure you get a decent source code management system. Cost isn't an issue as good quality open-source tools are available. The current open source repository of choice is Subversion. (The older open-source tool CVS is still widely used, and is much better than nothing, but Subversion is the modern choice.) Interestingly as I talk to developers I know most commercial source code management tools are liked less than Subversion. The only tool I've consistently heard people say is worth paying for is Perforce.

Once you get a source code management system, make sure it is the well known place for everyone to go get source code. Nobody should ever ask "where is the foo-whiffle file?" Everything should be in the repository.

Although many teams use repositories, a common mistake I see is that they don't put everything in the repository. If people use one they'll put code in there, but everything you need to do a build should be in there, including: test scripts, properties files, database schema, install scripts, and third party libraries. I've known projects that check their compilers into the repository (important in the early days of flaky C++ compilers). The basic rule of thumb is that you should be able to walk up to the project with a virgin machine, do a checkout, and be able to fully build the system. Only a minimal amount of things should be on the virgin machine - usually things that are large, complicated to install, and stable. An operating system, Java development environment, or base database system are typical examples.

You must put everything required for a build in the source control system, however you may also put other stuff that people generally work with in there too. IDE configurations are good to put in there because that way it's easy for people to share the same IDE setups.

One of the features of version control systems is that they allow you to create multiple branches, to handle different streams of development. This is a useful, nay essential, feature - but it's frequently overused and gets people into trouble. Keep your use of branches to a minimum. In particular have a mainline: a single branch of the project currently under development. Pretty much everyone should work off this mainline most of the time. (Reasonable branches are bug fixes of prior production releases and temporary experiments.)

In general you should store in source control everything you need to build anything, but nothing that you actually build. Some people do keep the build products in source control, but I consider that to be a smell - an indication of a deeper problem, usually an inability to reliably recreate builds.

Automate the Build

Getting the sources turned into a running system can often be a complicated process involving compilation, moving files around, loading schemas into the databases, and so on. However like most tasks in this part of software development it can be automated - and as a result should be automated. Asking people to type in strange commands or to click through dialog boxes is a waste of time and a breeding ground for mistakes.

Automated environments for builds are a common feature of systems. The Unix world has had make for decades, the Java community developed Ant, the .NET community has had NAnt and now has MSBuild. Make sure you can build and launch your system using these scripts with a single command.

A common mistake is not to include everything in the automated build. The build should include getting the database schema out of the repository and firing it up in the execution environment. I'll elaborate my earlier rule of thumb: anyone should be able to bring in a virgin machine, check the sources out of the repository, issue a single command, and have a running system on their machine.

Build scripts come in various flavors and are often particular to a platform or community, but they don't have to be. Although most of our Java projects use Ant, some have used Ruby (the Ruby Rake system is a very nice build script tool). We got a lot of value from automating an early Microsoft COM project with Ant.

A big build often takes time, and you don't want to do all of these steps if you've only made a small change. So a good build tool analyzes what needs to be changed as part of the process. The common way to do this is to check the dates of the source and object files and only compile if the source date is later. Dependencies then get tricky: if one object file changes, those that depend on it may also need to be rebuilt. Compilers may handle this kind of thing, or they may not.
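
As a rough illustration of that date check, here is a small C# sketch. Compile is a hypothetical placeholder for invoking the real compiler, and real build tools of course track whole dependency graphs rather than a single pair of files.

using System.IO;

// A toy version of the staleness check a build tool makes before recompiling a file.
public static class IncrementalBuild
{
    public static void BuildIfStale(string sourceFile, string objectFile)
    {
        // Rebuild when there is no object file yet, or when the source is newer than it.
        bool stale = !File.Exists(objectFile) ||
                     File.GetLastWriteTimeUtc(sourceFile) > File.GetLastWriteTimeUtc(objectFile);
        if (stale)
        {
            Compile(sourceFile, objectFile);
        }
    }

    private static void Compile(string sourceFile, string objectFile)
    {
        // Placeholder: invoke the real compiler here.
    }
}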

Depending on what you need, you may need different kinds of things to be built. You can build a system with or without test code, or with different sets of tests. Some components can be built stand-alone. A build script should allow you to build alternative targets for different cases.

Many of us use IDEs, and most IDEs have some kind of build management process within them. However these files are always proprietary to the IDE and often fragile. Furthermore they need the IDE to work. It's okay for IDE users to set up their own project files and use them for individual development. However it's essential to have a master build that is usable on a server and runnable from other scripts. So on a Java project we're okay with having developers build in their IDE, but the master build uses Ant to ensure it can be run on the development server.

Make Your Build Self-Testing

Traditionally a build means compiling, linking, and all the additional stuff required to get a program to execute. A program may run, but that doesn't mean it does the right thing. Modern statically typed languages can catch many bugs, but far more slip through that net.

A good way to catch bugs more quickly and efficiently is to include automated tests in the build process. Testing isn't perfect, of course, but it can catch a lot of bugs - enough to be useful. In particular the rise of Extreme Programming (XP) and Test Driven Development (TDD) have done a great deal to popularize self-testing code and as a result many people have seen the value of the technique.

Regular readers of my work will know that I'm a big fan of both TDD and XP; however I want to stress that neither of these approaches is necessary to gain the benefits of self-testing code. Both of these approaches make a point of writing tests before you write the code that makes them pass - in this mode the tests are as much about exploring the design of the system as they are about bug catching. This is a Good Thing, but it's not necessary for the purposes of Continuous Integration, where we have the weaker requirement of self-testing code. (Although TDD is my preferred way of producing self-testing code.)

For self-testing code you need a suite of automated tests that can check a large part of the code base for bugs. The tests need to be able to be kicked off from a simple command and to be self-checking. The result of running the test suite should indicate if any tests failed. For a build to be self-testing the failure of a test should cause the build to fail.

Over the last few years the rise of TDD has popularized the XUnit family of open-source tools which are ideal for this kind of testing. XUnit tools have proved very valuable to us at ThoughtWorks and I always suggest to people that they use them. These tools, pioneered by Kent Beck, make it very easy for you to set up a fully self-testing environment.

XUnit tools are certainly the starting point for making your code self-testing. You should also look out for other tools that focus on more end-to-end testing; there's quite a range of these out there at the moment, including FIT, Selenium, Sahi, Watir, FITnesse, and plenty of others that I'm not trying to comprehensively list here.

Of course you can't count on tests to find everything. As it's often been said: tests don't prove the absence of bugs. However perfection isn't the only point at which you get payback for a self-testing build. Imperfect tests, run frequently, are much better than perfect tests that are never written at all.

Everyone Commits Every Day

Integration is primarily about communication. Integration allows developers to tell other developers about the changes they have made. Frequent communication allows people to know quickly as changes develop.

The one prerequisite for a developer committing to the mainline is that they can correctly build their code. This, of course, includes passing the build tests. As with any commit cycle the developer first updates their working copy to match the mainline, resolves any conflicts with the mainline, then builds on their local machine. If the build passes, then they are free to commit to the mainline.

By doing this frequently, developers quickly find out if there's a conflict between two developers. The key to fixing problems quickly is finding them quickly. With developers committing every few hours a conflict can be detected within a few hours of it occurring, at that point not much has happened and it's easy to resolve. Conflicts that stay undetected for weeks can be very hard to resolve.

The fact that you build when you update your working copy means that you detect compilation conflicts as well as textual conflicts. Since the build is self-testing, you also detect conflicts in the running of the code. The latter conflicts are particularly awkward bugs to find if they sit for a long time undetected in the code. Since there's only a few hours of changes between commits, there's only so many places where the problem could be hiding. Furthermore since not much has changed you can use diff-debugging to help you find the bug.

My general rule of thumb is that every developer should commit to the repository every day. In practice it's often useful if developers commit more frequently than that. The more frequently you commit, the fewer places you have to look for conflict errors, and the more rapidly you fix conflicts.

Frequent commits encourage developers to break down their work into small chunks of a few hours each. This helps track progress and provides a sense of progress. Often people initially feel they can't do something meaningful in just a few hours, but we've found that mentoring and practice helps them learn.

Every Commit Should Build the Mainline on an Integration Machine

Using daily commits, a team gets frequent tested builds. This ought to mean that the mainline stays in a healthy state. In practice, however, things still do go wrong. One reason is discipline, people not doing an update and build before they commit. Another is environmental differences between developers' machines.

As a result you should ensure that regular builds happen on an integration machine and only if this integration build succeeds should the commit be considered to be done. Since the developer who commits is responsible for this, that developer needs to monitor the mainline build so they can fix it if it breaks. A corollary of this is that you shouldn't go home until the mainline build has passed with any commits you've added late in the day.

There are two main ways I've seen to ensure this: using a manual build or a continuous integration server.

The manual build approach is the simplest one to describe. Essentially it's a similar thing to the local build that a developer does before the commit into the repository. The developer goes to the integration machine, checks out the head of the mainline (which now houses his last commit) and kicks off the integration build. He keeps an eye on its progress, and if the build succeeds he's done with his commit. (Also see Jim Shore's description.)

A continuous integration server acts as a monitor to the repository. Every time a commit against the repository finishes the server automatically checks out the sources onto the integration machine, initiates a build, and notifies the committer of the result of the build. The committer isn't done until she gets the notification - usually an email.
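
As a sketch of that cycle, here is a deliberately simplified polling loop in C#. Every method it calls (GetLatestRevision, CheckOut, RunBuild, Notify) is a hypothetical placeholder; this is not the design of CruiseControl or any real server, which add scheduling, queuing, and richer notification on top of the same basic idea.

using System;
using System.Threading;

// A toy illustration of a continuous integration server's main loop.
// Every method called below is a hypothetical placeholder.
public class ToyIntegrationServer
{
    private int lastBuiltRevision;

    public void Run()
    {
        while (true)
        {
            int head = GetLatestRevision();            // ask the repository for the newest commit
            if (head > lastBuiltRevision)
            {
                CheckOut(head);                        // fresh checkout onto the integration machine
                bool succeeded = RunBuild();           // run the automated, self-testing build
                Notify(head, succeeded);               // tell the committer whether their commit is done
                lastBuiltRevision = head;
            }
            Thread.Sleep(TimeSpan.FromMinutes(1));     // then poll the repository again
        }
    }

    private int GetLatestRevision() { return lastBuiltRevision; }   // placeholder
    private void CheckOut(int revision) { }                         // placeholder
    private bool RunBuild() { return true; }                        // placeholder
    private void Notify(int revision, bool succeeded) { }           // placeholder
}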

At ThoughtWorks, we're big fans of continuous integration servers - indeed we led the development of CruiseControl and CruiseControl.net, the widely used open-source CI servers. ThoughtWorkers like Paul Julius, Jason Yip, and Owen Rodgers are still active committers to these open source projects. We use CruiseControl on nearly every project we do and have been very happy with the results.

Not everyone prefers to use a CI server. Jim Shore gave a well argued description of why he prefers the manual approach. I agree with him that CI is much more than just installing Cruise Control. All the practices here need to be in play to do Continuous Integration effectively. But equally many teams who do CI well find Cruise Control a helpful tool.

Many organizations do regular builds on a timed schedule, such as every night. This is not the same thing as a continuous build and isn't enough for continuous integration. The whole point of continuous integration is to find problems as soon as you can. Nightly builds mean that bugs lie undetected for a whole day before anyone discovers them. Once they are in the system that long, it takes a long time to find and remove them.

A key part of doing a continuous build is that if the mainline build fails, it needs to be fixed right away. The whole point of working with CI is that you're always developing on a known stable base. It's not a bad thing for the mainline build to break, although if it's happening all the time it suggests people aren't being careful enough about updating and building locally before a commit. When the mainline build does break, however, it's important that it gets fixed fast.

When teams are introducing CI, often this is one of the hardest things to sort out. Early on a team can struggle to get into the regular habit of working mainline builds, particularly if they are working on an existing code base. Patience and steady application does seem to regularly do the trick, so don't get discouraged.

Keep the Build Fast

The whole point of Continuous Integration is to provide rapid feedback. Nothing sucks the blood of a CI activity more than a build that takes a long time. Here I must admit a certain crotchety old guy amusement at what's considered to be a long build. Most of my colleagues consider a build that takes an hour to be totally unreasonable. I remember teams dreaming that they could get it so fast - and occasionally we still run into cases where it's very hard to get builds to that speed.

For most projects, however, the XP guideline of a ten minute build is perfectly within reason. Most of our modern projects achieve this. It's worth putting in concentrated effort to make it happen, because every minute you shave off the build time is a minute saved for each developer every time they commit. Since CI demands frequent commits, this adds up to a lot of time.

If you're staring at a one hour build time, then getting to a faster build may seem like a daunting prospect. It can even be daunting to work on a new project and think about how to keep things fast. For enterprise applications, at least, we've found the usual bottleneck is testing - particularly tests that involve external services such as a database.

Probably the most crucial step is to start working on setting up a staged build. The idea behind a staged build (also known as build pipeline) is that there are in fact multiple builds done in sequence. The commit to the mainline triggers the first build - what I call the commit build. The commit build is the build that's needed when someone commits to the mainline. The commit build is the one that has to be done quickly, as a result it will take a number of shortcuts that will reduce the ability to detect bugs. The trick is to balance the needs of bug finding and speed so that a good commit build is stable enough for other people to work on.

Once the commit build is good then other people can work on the code with confidence. However there are further, slower, tests that you can start to do. Additional machines can run further testing routines on the build that take longer to do.

A simple example of this is a two stage build. The first stage would do the compilation and run tests that are more localized unit tests with the database completely stubbed out. Such tests can run very fast, keeping within the ten minute guideline. However any bugs that involve larger scale interactions, particularly those involving the real database, won't be found. The second stage build runs a different suite of tests that do hit the real database and involve more end-to-end behavior. This suite might take a couple of hours to run.
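
One way to partition the suites for such a staged build, assuming NUnit, is to tag the slow tests with a category so the commit build can leave them out and the secondary build can pick them up. The test names here are purely illustrative, and the exact runner options for filtering by category vary by version.

using NUnit.Framework;

[TestFixture]
public class OrderTests
{
    // Fast, in-memory test with the database stubbed out: part of the commit build.
    [Test]
    public void DiscountIsAppliedToLargeOrders()
    {
        // ... assert against plain objects ...
    }

    // Slow, end-to-end test against the real database: run only in the secondary build.
    [Test, Category("Secondary")]
    public void OrderIsPersistedToTheDatabase()
    {
        // ... exercise the real persistence layer ...
    }
}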

In this scenario people use the first stage as the commit build and use this as their main CI cycle. The second-stage build is a secondary build which runs when it can, picking up the latest good commit build for further testing. If the secondary build fails, then this doesn't have the same 'stop everything' quality, but the team does aim to fix such bugs as rapidly as possible, while keeping the commit build running. Indeed the secondary build doesn't have to stay good, as long as each known bug is identified and dealt with in the next few days.

If the secondary build detects a bug, that's a sign that the commit build could do with another test. As much as possible you want to ensure that any secondary build failure leads to new tests in the commit build that would have caught the bug, so the bug stays fixed in the commit build. This way the commit tests are strengthened whenever something gets past them. There are cases where there's no way to build a fast-running test that exposes the bug, so you may decide to only test for that condition in the secondary build. Most of the time, fortunately, you can add suitable tests to the commit build.

This example is of a two-stage build, but the basic principle can be extended to any number of later builds. The builds after the commit build can also be done in parallel, so if you have two hours of secondary tests you can improve responsiveness by having two machines that run half the tests each. By using parallel secondary builds like this you can introduce all sorts of further automated testing, including performance testing, into the regular build process. (I've run into a lot of interesting techniques around this as I've visited various ThoughtWorks projects over the last couple of years - I'm hoping to persuade some of the developers to write these up.)

Test in a Clone of the Production Environment

The point of testing is to flush out, under controlled conditions, any problem that the system will have in production. A significant part of this is the environment within which the production system will run. If you test in a different environment, every difference results in a risk that what happens under test won't happen in production.

As a result you want to set up your test environment to be as exact a mimic of your production environment as possible. Use the same database software, with the same versions; use the same version of the operating system. Put all the appropriate libraries that are in the production environment into the test environment, even if the system doesn't actually use them. Use the same IP addresses and ports, and run it on the same hardware.

Well, in reality there are limits. If you're writing desktop software it's not practicable to test in a clone of every possible desktop with all the third party software that different people are running. Similarly some production environments may be prohibitively expensive to duplicate (although I've often come across false economies by not duplicating moderately expensive environments). Despite these limits your goal should still be to duplicate the production environment as much as you can, and to understand the risks you are accepting for every difference between test and production.

If you have a pretty simple setup without many awkward communications, you may be able to run your commit build in a mimicked environment. Often, however, you need to use test doubles because systems respond slowly or intermittently. As a result it's common to have a very artificial environment for the commit tests for speed, and use a production clone for secondary testing.
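
Here is a minimal sketch of such a test double, assuming a hypothetical payment service as the slow external dependency: the commit tests run against the fake, while the secondary tests in the production clone use the real implementation.

// A hypothetical interface over a slow or intermittent external service.
public interface IPaymentGateway
{
    bool Charge(string accountId, decimal amount);
}

// The double used by the fast commit tests: deterministic and instant.
public class FakePaymentGateway : IPaymentGateway
{
    public bool Charge(string accountId, decimal amount)
    {
        return amount > 0;   // canned behavior, no network call
    }
}

// The real implementation, exercised only by the slower secondary tests.
public class RemotePaymentGateway : IPaymentGateway
{
    public bool Charge(string accountId, decimal amount)
    {
        // ... call the real payment service over the network ...
        return true;
    }
}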

I've noticed a growing interest in using virtualization to make it easy to put together test environments. Virtualized machines can be saved with all the necessary elements baked into the virtualization. It's then relatively straightforward to install the latest build and run tests. Furthermore this can allow you to run multiple tests on one machine, or simulate multiple machines in a network on a single machine. As the performance penalty of virtualization decreases, this option makes more and more sense.

Make it Easy for Anyone to Get the Latest Executable

One of the most difficult parts of software development is making sure that you build the right software. We've found that it's very hard to specify what you want in advance and be correct; people find it much easier to see something that's not quite right and say how it needs to be changed. Agile development processes explicitly expect and take advantage of this part of human behavior.

To help make this work, anyone involved with a software project should be able to get the latest executable and be able to run it: for demonstrations, exploratory testing, or just to see what changed this week.

Doing this is pretty straightforward: make sure there's a well known place where people can find the latest executable. It may be useful to put several executables in such a store. For the very latest you should put the latest executable to pass the commit tests - such an executable should be pretty stable providing the commit suite is reasonably strong.

If you are following a process with well defined iterations, it's usually wise to also put the end of iteration builds there too. Demonstrations, in particular, need software whose features are familiar, so it's usually worth sacrificing the very latest for something that the demonstrator knows how to operate.

Everyone can see what's happening

Continuous Integration is all about communication, so you want to ensure that everyone can easily see the state of the system and the changes that have been made to it.

One of the most important things to communicate is the state of the mainline build. If you're using CruiseControl there's a built in web site that will show you if there's a build in progress and what was the state of the last mainline build. Many teams like to make this even more apparent by hooking up a continuous display to the build system - lights that glow green when the build works, or red if it fails, are popular. A particularly common touch is red and green lava lamps - not only do these indicate the state of the build, but also how long it's been in that state. Bubbles on a red lamp indicate the build's been broken for too long. Each team makes its own choices on these build sensors - it's good to be playful with your choice (recently I saw someone experimenting with a dancing rabbit).

If you're using a manual CI process, this visibility is still essential. The monitor of the physical build machine can show the status of the mainline build. Often you have a build token to put on the desk of whoever's currently doing the build (again something silly like a rubber chicken is a good choice). Often people like to make a simple noise on good builds, like ringing a bell.

CI servers' web pages can carry more information than this, of course. CruiseControl provides an indication not just of who is building, but what changes they made. Cruise also provides a history of changes, allowing team members to get a good sense of recent activity on the project. I know team leads who like to use this to get a sense of what people have been doing and keep a sense of the changes to the system.

Another advantage of using a web site is that those that are not co-located can get a sense of the project's status. In general I prefer to have everyone actively working on a project sitting together, but often there are peripheral people who like to keep an eye on things. It's also useful for groups to aggregate together build information from multiple projects - providing a simple and automated status of different projects.

Good information displays are not only those on computer screens. One of my favorite displays was for a project that was getting into CI. It had a long history of being unable to make stable builds. We put a calendar on the wall that showed a full year with a small square for each day. Every day the QA group would put a green sticker on the day if they had received one stable build that passed the commit tests, otherwise a red one. Over time the calendar revealed the state of the build process, showing a steady improvement until green squares were so common that the calendar disappeared - its purpose fulfilled.

Automate Deployment

To do Continuous Integration you need multiple environments: one to run commit tests, and one or more to run secondary tests. Since you are moving executables between these environments multiple times a day, you'll want to do this automatically. So it's important to have scripts that will allow you to deploy the application into any environment easily.

A natural consequence of this is that you should also have scripts that allow you to deploy into production with similar ease. You may not be deploying into production every day (although I've run into projects that do), but automatic deployment helps both speed up the process and reduce errors. It's also a cheap option since it just uses the same capabilities that you use to deploy into test environments.

If you deploy into production, one extra automated capability you should consider is automated rollback. Bad things do happen from time to time, and if smelly brown substances hit rotating metal, it's good to be able to quickly go back to the last known good state. Being able to automatically revert also reduces a lot of the tension of deployment, encouraging people to deploy more frequently and thus get new features out to users quickly. (The Ruby on Rails community developed a tool called Capistrano that is a good example of a tool that does this sort of thing.)

In clustered environments I've seen rolling deployments where the new software is deployed to one node at a time, gradually replacing the application over the course of a few hours.

See Related Article: Evolutionary Database Design

A common roadblock for many people doing frequent releases is database migration. Database changes are awkward because you can't just change database schemas, you also have to ensure data is correctly migrated. This article describes techniques used by my colleague Pramod Sadalage to do automated refactoring and migration of databases. The article is an early attempt to capture the information that's described in more detail in Pramod and Scott Ambler's book on refactoring databases[ambler-sadalage].

A particularly interesting variation of this that I've come across with public web applications is the idea of deploying a trial build to a subset of users. The team then sees how the trial build is used before deciding whether to deploy it to the full user population. This allows you to test out new features and user-interfaces before committing to a final choice. Automated deployment, tied into good CI discipline, is essential to making this work.


Benefits of Continuous Integration

On the whole I think the greatest and most wide ranging benefit of Continuous Integration is reduced risk. My mind still floats back to that early software project I mentioned in my first paragraph. There they were at the end (they hoped) of a long project, yet with no real idea of how long it would be before they were done.

The trouble with deferred integration is that it's very hard to predict how long it will take to do, and worse it's very hard to see how far you are through the process. The result is that you are putting yourself into a complete blind spot right at one of the tensest parts of a project - even if you're in one of the rare cases where you aren't already late.

Continuous Integration completely finesses this problem. There's no long integration; you completely eliminate the blind spot. At all times you know where you are, what works, what doesn't, and what outstanding bugs you have in your system.

Bugs - these are the nasty things that destroy confidence and mess up schedules and reputations. Bugs in deployed software make users angry with you. Bugs in work in progress get in your way, making it harder to get the rest of the software working correctly.

Continuous Integration doesn't get rid of bugs, but it does make them dramatically easier to find and remove. In this respect it's rather like self-testing code. If you introduce a bug and detect it quickly it's far easier to get rid of. Since you've only changed a small bit of the system, you don't have far to look. Since that bit of the system is the bit you just worked on, it's fresh in your memory - again making it easier to find the bug. You can also use diff debugging - comparing the current version of the system to an earlier one that didn't have the bug.

Bugs are also cumulative. The more bugs you have, the harder it is to remove each one. This is partly because you get bug interactions, where failures show as the result of multiple faults - making each fault harder to find. It's also psychological - people have less energy to find and get rid of bugs when there are many of them - a phenomenon that the Pragmatic Programmers call the Broken Windows syndrome.

As a result projects with Continuous Integration tend to have dramatically fewer bugs, both in production and in process. However I should stress that the degree of this benefit is directly tied to how good your test suite is. You should find that it's not too difficult to build a test suite that makes a noticeable difference. Usually, however, it takes a while before a team really gets to the low level of bugs that they have the potential to reach. Getting there means constantly working on and improving your tests.

If you have continuous integration, it removes one of the biggest barriers to frequent deployment. Frequent deployment is valuable because it allows your users to get new features more rapidly, to give more rapid feedback on those features, and generally to become more collaborative in the development cycle. This helps break down the barriers between customers and development - barriers which I believe are the biggest barriers to successful software development.


Introducing Continuous Integration

So you fancy trying out Continuous Integration - where do you start? The full set of practices I outlined above give you the full benefits - but you don't need to start with all of them.

There's no fixed recipe here - much depends on the nature of your setup and team. But here are a few things that we've learned to get things going.

One of the first steps is to get the build automated. Get everything you need into source control and get it so that you can build the whole system with a single command. For many projects this is not a minor undertaking - yet it's essential for any of the other things to work. Initially you may only do a build occasionally on demand, or just do an automated nightly build. While these aren't continuous integration, an automated nightly build is a fine step on the way.

Introduce some automated testing into your build. Try to identify the major areas where things go wrong and get automated tests to expose those failures. Particularly on an existing project it's hard to get a really good suite of tests going rapidly - it takes time to build tests up. You have to start somewhere though - all those cliches about Rome's build schedule apply.

Try to speed up the commit build. Continuous Integration on a build of a few hours is better than nothing, but getting down to that magic ten minute number is much better. This usually requires some pretty serious surgery on your code base to do as you break dependencies on slow parts of the system.

If you are starting a new project, begin with Continuous Integration from the beginning. Keep an eye on build times and take action as soon as you start going slower than the ten minute rule. By acting quickly you'll make the necessary restructurings before the code base gets so big that it becomes a major pain.

Above all get some help. Find someone who has done Continuous Integration before to help you. Like any new technique it's hard to introduce it when you don't know what the final result looks like. It may cost money to get a mentor, but you'll also pay in lost time and productivity if you don't do it. (Disclaimer / Advert - yes we at ThoughtWorks do do some consultancy in this area. After all we've made most of the mistakes that there are to make.)


Final Thoughts

In the years since Matt and I wrote the original paper on this site, Continuous Integration has become a mainstream technique for software development. Hardly any ThoughtWorks project goes without it - and we see others using CI all over the world. I've hardly ever heard negative things about the approach - unlike some of the more controversial Extreme Programming practices.

If you're not using Continuous Integration I strongly urge you to give it a try. If you are, maybe there are some ideas in this article that can help you do it more effectively. We've learned a lot about Continuous Integration in the last few years; I hope there's still more to learn and improve.



Acknowledgments

First and foremost to Kent Beck and my many colleagues on the Chrysler Comprehensive Compensation (C3) project. This was my first chance to see Continuous Integration in action with a meaningful amount of unit tests. It showed me what was possible and gave me an inspiration that led me for many years.

Thanks to Matt Foemmel, Dave Rice, and everyone else who built and maintained Continuous Integration on Atlas. That project was a sign of CI on a larger scale and showed the benefits it made to an existing project.

Paul Julius, Jason Yip, Owen Rodgers, Mike Roberts and many other open source contributors have participated in building some variant of Cruise Control. Although the tool isn't essential, many teams find it helpful. Cruise has played a big part in popularizing and enabling software developers to use Continuous Integration.

One of the reasons I work at ThoughtWorks is to get good access to practical projects done by talented people. Nearly every project I've visited has given tasty morsels of continuous integration information.

Significant Revisions

01 May 06: Complete rewrite of article to bring it up to date and to clarify the description of the approach.

10 Sep 00: Original version published.

© Copyright Martin Fowler, all rights reserved

Sunday, February 26, 2006

URL Rewriting in ASP.NET

Scott Mitchell

4GuysFromRolla.com


March 2004


Applies to:

Microsoft® ASP.NET


Summary: Examines how to perform dynamic URL rewriting with Microsoft ASP.NET. URL rewriting is the process of intercepting an incoming Web request and automatically redirecting it to a different URL. Discusses the various techniques for implementing URL rewriting, and examines real-world scenarios of URL rewriting. (31 printed pages)


Download the source code for this article.


Contents


Introduction

Common Uses of URL Rewriting

What Happens When a Request Reaches IIS

Implementing URL Rewriting

Building a URL Rewriting Engine

Performing Simple URL Rewriting with the URL Rewriting Engine

Creating Truly "Hackable" URLs

Conclusion

Related Books


Introduction


Take a moment to look at some of the URLs on your website. Do you find URLs like http://yoursite.com/info/dispEmployeeInfo.aspx?EmpID=459-099&type=summary? Or maybe you have a bunch of Web pages that were moved from one directory or website to another, resulting in broken links for visitors who have bookmarked the old URLs. In this article we'll look at using URL rewriting to shorten those ugly URLs to meaningful, memorable ones, by replacing http://yoursite.com/info/dispEmployeeInfo.aspx?EmpID=459-099&type=summary with something like http://yoursite.com/people/sales/chuck.smith. We'll also see how URL rewriting can be used to create an intelligent 404 error page.


URL rewriting is the process of intercepting an incoming Web request and redirecting the request to a different resource. When performing URL rewriting, typically the URL being requested is checked and, based on its value, the request is redirected to a different URL. For example, in the case where a website restructuring caused all of the Web pages in the /people/ directory to be moved to a /info/employees/ directory, you would want to use URL rewriting to check if a Web request was intended for a file in the /people/ directory. If the request was for a file in the /people/ directory, you'd want to automatically redirect the request to the same file, but in the /info/employees/ directory instead.


With classic ASP, the only way to utilize URL rewriting was to write an ISAPI filter or to buy a third-party product that offered URL rewriting capabilities. With Microsoft® ASP.NET, however, you can easily create your own URL rewriting software in a number of ways. In this article we'll examine the techniques available to ASP.NET developers for implementing URL rewriting, and then turn to some real-world uses of URL rewriting. Before we delve into the technological specifics of URL rewriting, let's first take a look at some everyday scenarios where URL rewriting can be employed.


Common Uses of URL Rewriting


Creating data-driven ASP.NET websites often results in a single Web page that displays a subset of the database's data based on querystring parameters. For example, in designing an e-commerce site, one of your tasks would be to allow users to browse through the products for sale. To facilitate this, you might create a page called displayCategory.aspx that would display the products for a given category. The category's products to view would be specified by a querystring parameter. That is, if the user wanted to browse the Widgets for sale, and all Widgets had a CategoryID of 5, the user would visit: http://yousite.com/displayCategory.aspx?CategoryID=5.


There are two downsides to creating a website with such URLs. First, from the end user's perspective, the URL http://yousite.com/displayCategory.aspx?CategoryID=5 is a mess. Usability expert Jakob Nielsen recommends that URLs be chosen so that they:



  • Are short.

  • Are easy to type.

  • Visualize the site structure.

  • "Hackable," allowing the user to navigate through the site by hacking off parts of the URL.


I would add to that list that URLs should also be easy to remember. The URL http://yousite.com/displayCategory.aspx?CategoryID=5 meets none of Nielsen's criteria, nor is it easy to remember. Asking users to type in querystring values makes a URL hard to type and makes the URL "hackable" only by experienced Web developers who have an understanding of the purpose of querystring parameters and their name/value pair structure.


A better approach is to allow for a sensible, memorable URL, such as http://yoursite.com/products/Widgets. By just looking at the URL you can infer what will be displayed—information about Widgets. The URL is easy to remember and share, too. I can tell my colleague, "Check out yoursite.com/products/Widgets," and she'll likely be able to bring up the page without needing to ask me again what the URL was. (Try doing that with, say, an Amazon.com page!) The URL also appears "hackable," and should behave that way. That is, if the user hacks off the end of the URL and types in http://yoursite.com/products, they should see a listing of all products, or at least a listing of all categories of products they can view.


Note For a prime example of a "hackable" URL, consider the URLs generated by many blog engines. To view the posts for January 28, 2004, one visits a URL like http://someblog.com/2004/01/28. If the URL is hacked down to http://someblog.com/2004/01, the user will see all posts for January 2004. Cutting it down further to http://someblog.com/2004 will display all posts for the year 2004.


In addition to simplifying URLs, URL rewriting is also often used to handle website restructuring that would otherwise result in numerous broken links and outdated bookmarks.


What Happens When a Request Reaches IIS


Before we examine exactly how to implement URL rewriting, it's important that we have an understanding of how incoming requests are handled by Microsoft® Internet Information Services (IIS). When a request arrives at an IIS Web server, IIS examines the requested file's extension to determine how to handle the request. Requests can be handled natively by IIS—as are HTML pages, images, and other static content—or IIS can route the request to an ISAPI extension. (An ISAPI extension is an unmanaged, compiled class that handles an incoming Web request. Its task is to generate the content for the requested resource.)


For example, if a request comes in for a Web page named Info.asp, IIS will route the message to the asp.dll ISAPI extension. This ISAPI extension will then load the requested ASP page, execute it, and return its rendered HTML to IIS, which will then send it back to the requesting client. For ASP.NET pages, IIS routes the message to the aspnet_isapi.dll ISAPI extension. The aspnet_isapi.dll ISAPI extension then hands off processing to the managed ASP.NET worker process, which processes the request, returning the ASP.NET Web page's rendered HTML.


You can customize IIS to specify what extensions are mapped to what ISAPI extensions. Figure 1 shows the Application Configuration dialog box from the Internet Information Services Administrative Tool. Note that the ASP.NET-related extensions—.aspx, .ascx, .config, .asmx, .rem, .cs, .vb, and others—are all mapped to the aspnet_isapi.dll ISAPI extension.




Figure 1. Configured mappings for file extensions


A thorough discussion of how IIS manages incoming requests is a bit beyond the scope of this article. A great, in-depth discussion, though, can be found in Michele Leroux Bustamante's article Inside IIS and ASP.NET. It's important to understand that the ASP.NET engine gets its hands only on incoming Web requests whose extensions are explicitly mapped to the aspnet_isapi.dll in IIS.


Examining Requests with ISAPI Filters


In addition to mapping the incoming Web request's file extension to the appropriate ISAPI extension, IIS also performs a number of other tasks. For example, IIS attempts to authenticate the user making the request and determine if the authenticated user has authorization to access the requested file. During the lifetime of handling a request, IIS passes through several states. At each state, IIS raises an event that can be programmatically handled using ISAPI filters.


Like ISAPI extensions, ISAPI filters are blocks of unmanaged code installed on the Web server. ISAPI extensions are designed to generate the response for a request to a particular file type. ISAPI filters, on the other hand, contain code to respond to events raised by IIS. ISAPI filters can intercept and even modify the incoming and outgoing data. ISAPI filters have numerous applications, including:



  • Authentication and authorization.

  • Logging and monitoring.

  • HTTP compression.

  • URL rewriting.


While ISAPI filters can be used to perform URL rewriting, this article examines implementing URL rewriting using ASP.NET. However, we will discuss the tradeoffs between implementing URL rewriting as an ISAPI filter versus using techniques available in ASP.NET.


What Happens When a Request Enters the ASP.NET Engine


Prior to ASP.NET, URL rewriting on IIS Web servers needed to be implemented using an ISAPI filter. URL rewriting is possible with ASP.NET because the ASP.NET engine is strikingly similar to IIS. The similarities arise because the ASP.NET engine:



  1. Raises events as it processes a request.

  • Allows an arbitrary number of HTTP modules to handle the events that are raised, akin to IIS's ISAPI filters.

  3. Delegates rendering the requested resource to an HTTP handler, which is akin to IIS's ISAPI extensions.


Like IIS, during the lifetime of a request the ASP.NET engine fires events signaling its change from one state of processing to another. The BeginRequest event, for example, is fired when the ASP.NET engine first responds to a request. The AuthenticateRequest event fires next, which occurs when the identity of the user has been established. (There are numerous other events—AuthorizeRequest, ResolveRequestCache, and EndRequest, among others. These events are events of the System.Web.HttpApplication class; for more information consult the HttpApplication Class Overview technical documentation.)


As we discussed in the previous section, ISAPI filters can be created to respond to the events raised by IIS. In a similar vein, ASP.NET provides HTTP modules that can respond to the events raised by the ASP.NET engine. An ASP.NET Web application can be configured to have multiple HTTP modules. For each request processed by the ASP.NET engine, each configured HTTP module is initialized and allowed to wire up event handlers to the events raised during the processing of the request. Realize that there are a number of built-in HTTP modules utilized on each and every request. One of the built-in HTTP modules is the FormsAuthenticationModule, which first checks to see if forms authentication is being used and, if so, whether the user is authenticated or not. If not, the user is automatically redirected to the specified logon page.


Recall that with IIS, an incoming request is eventually directed to an ISAPI extension, whose job it is to return the data for the particular request. For example, when a request for a classic ASP Web page arrives, IIS hands off the request to the asp.dll ISAPI extension, whose task it is to return the HTML markup for the requested ASP page. The ASP.NET engine utilizes a similar approach. After initializing the HTTP modules, the ASP.NET engine's next task is to determine what HTTP handler should process the request.


All requests that pass through the ASP.NET engine eventually arrive at an HTTP handler or an HTTP handler factory (an HTTP handler factory simply returns an instance of an HTTP handler that is then used to process the request). The final HTTP handler renders the requested resource, returning the response. This response is sent back to IIS, which then returns it to the user that made the request.


ASP.NET includes a number of built-in HTTP handlers. The PageHandlerFactory, for example, is used to render ASP.NET Web pages. The WebServiceHandlerFactory is used to render the response SOAP envelopes for ASP.NET Web services. The TraceHandler renders the HTML markup for requests to trace.axd.


Figure 2 illustrates how a request for an ASP.NET resource is handled. First, IIS receives the request and dispatches it to aspnet_isapi.dll. Next, the ASP.NET engine initializes the configured HTTP modules. Finally, the proper HTTP handler is invoked and the requested resource is rendered, returning the generated markup back to IIS and back to the requesting client.




Figure 2. Request processing by IIS and ASP.NET


Creating and Registering Custom HTTP Modules and HTTP Handlers


Creating custom HTTP modules and HTTP handlers is a relatively simple task that involves creating a managed class that implements the correct interface. HTTP modules must implement the System.Web.IHttpModule interface, while HTTP handlers and HTTP handler factories must implement the System.Web.IHttpHandler interface and System.Web.IHttpHandlerFactory interface, respectively. The specifics of creating HTTP handlers and HTTP modules are beyond the scope of this article. For a good background, read Mansoor Ahmed Siddiqui's article, HTTP Handlers and HTTP Modules in ASP.NET.
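
As an illustration of the handler side, here is a minimal HTTP handler in C#. It simply writes a fixed page and is offered as a sketch, not code from this article's download. (A module sketch appears later, alongside the RewritePath() discussion.)

using System.Web;

// A minimal HTTP handler. IHttpHandler requires ProcessRequest, which renders
// the response, and IsReusable, which tells ASP.NET whether one instance may
// serve multiple requests.
public class HelloHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/html";
        context.Response.Write("<html><body>Hello from a custom HTTP handler.</body></html>");
    }

    public bool IsReusable
    {
        get { return true; }
    }
}

Before it will be used for any request, a handler like this must be registered, as described next.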


Once a custom HTTP module or HTTP handler has been created, it must be registered with the Web application. Registering HTTP modules and HTTP handlers for an entire Web server requires only a simple addition to the machine.config file; registering an HTTP module or HTTP handler for a specific Web application involves adding a few lines of XML to the application's Web.config file.


Specifically, to add an HTTP module to a Web application, add the following lines in the Web.config's configuration/system.web section:


<httpModules>
   <add type="type" name="name" />
</httpModules>

The type value provides the assembly and class name of the HTTP module, whereas the name value provides a friendly name by which the HTTP module can be referred to in the Global.asax file.


HTTP handlers and HTTP handler factories are configured by the <httpHandlers> tag in the Web.config's configuration/system.web section, like so:


<httpHandlers>
   <add verb="verb" path="path" type="type" />
</httpHandlers>

Recall that for each incoming request, the ASP.NET engine determines what HTTP handler should be used to render the request. This decision is made based on the incoming request's verb and path. The verb specifies what type of HTTP request was made—GET or POST—whereas the path specifies the location and filename of the file requested. So, if we wanted to have an HTTP handler handle all requests—either GET or POST—for files with the .scott extension, we'd add the following to the Web.config file:


<httpHandlers>
   <add verb="*" path="*.scott" type="type" />
</httpHandlers>

where type is the type of our HTTP handler.


Note When registering HTTP handlers, it is important to ensure that the extensions used by the HTTP handler are mapped in IIS to the ASP.NET engine. That is, in our .scott example, if the .scott extension is not mapped in IIS to the aspnet_isapi.dll ISAPI extension, a request for the file foo.scott will result in IIS attempting to return the contents of the file foo.scott. In order for the HTTP handler to process this request, the .scott extension must be mapped to the ASP.NET engine. The ASP.NET engine, then, will route the request correctly to the appropriate HTTP handler.

For more information on registering HTTP modules and HTTP handlers, be sure to consult the <httpModules> element documentation along with the <httpHandlers> element documentation.


Implementing URL Rewriting


URL rewriting can be implemented either with ISAPI filters at the IIS Web server level, or with either HTTP modules or HTTP handlers at the ASP.NET level. This article focuses on implementing URL rewriting with ASP.NET, so we won't be delving into the specifics of implementing URL rewriting with ISAPI filters, although a number of third-party ISAPI filters for URL rewriting are available.



Implementing URL rewriting at the ASP.NET level is possible through the System.Web.HttpContext class's RewritePath() method. The HttpContext class contains HTTP-specific information about a specific HTTP request. With each request received by the ASP.NET engine, an HttpContext instance is created for that request. This class has properties like: Request and Response, which provide access to the incoming request and outgoing response; Application and Session, which provide access to application and session variables; User, which provides information about the authenticated user; and other related properties.


With the Microsoft® .NET Framework Version 1.0, the RewritePath() method accepts a single string, the new path to use. Internally, the HttpContext class's RewritePath(string) method updates the Request object's Path and QueryString properties. In addition to RewritePath(string), the .NET Framework Version 1.1 includes another form of the RewritePath() method, one that accepts three string input parameters. This alternate overloaded form not only sets the Request object's Path and QueryString properties, but also sets internal member variables that are used to compute the Request object's values for its PhysicalPath, PathInfo, and FilePath properties.
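As a rough sketch of the difference between the two overloads, assuming the rewriting decision has already been made and using placeholder paths, the calls look something like this (in practice you would use only one of them):

using System.Web;

// Illustrative only: shows the two RewritePath() forms described above.
public class RewritePathExamples
{
   public static void RewriteEmployeeUrl(HttpContext context)
   {
      // .NET Framework 1.0 form: a single string holding the new path and querystring.
      context.RewritePath("/info/employee.aspx?empID=1001");

      // .NET Framework 1.1 form: file path, path info, and querystring supplied
      // separately, which also updates the Request object's PhysicalPath,
      // PathInfo, and FilePath values.
      context.RewritePath("/info/employee.aspx", string.Empty, "empID=1001");
   }
}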


To implement URL rewriting in ASP.NET, then, we need to create an HTTP module or HTTP handler that:



  1. Checks the requested path to determine if the URL needs to be rewritten.

  2. Rewrites the path, if needed, by calling the RewritePath() method.


For example, imagine that our website had information about each employee, accessible through /info/employee.aspx?empID=employeeID. To make the URLs more "hackable," we might decide to have employee pages accessible by /people/EmployeeName.aspx. Here is a case where we'd want to use URL rewriting. That is, when the page /people/ScottMitchell.aspx was requested, we'd want to rewrite the URL so that the page /info/employee.aspx?empID=1001 was used instead.
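A bare-bones sketch of that check-and-rewrite logic, written as an HTTP module that hooks BeginRequest, might look like the following. The hard-coded employee table is a placeholder for whatever database or dictionary lookup a real site would use, and this is not the rule-driven rewriting engine built later in this article.

using System;
using System.Collections;
using System.Web;

// Hypothetical, hard-coded illustration of the two steps listed above.
public class EmployeeUrlRewriteModule : IHttpModule
{
   // Placeholder mapping from employee name to employee ID.
   private static Hashtable employees = new Hashtable();

   static EmployeeUrlRewriteModule()
   {
      employees["scottmitchell"] = 1001;
   }

   public void Init(HttpApplication app)
   {
      app.BeginRequest += new EventHandler(this.OnBeginRequest);
   }

   public void Dispose() { }

   private void OnBeginRequest(object sender, EventArgs e)
   {
      HttpApplication app = (HttpApplication) sender;
      string path = app.Request.Path.ToLower();

      // Step 1: does the request look like /people/SomeName.aspx?
      if (path.StartsWith("/people/") && path.EndsWith(".aspx"))
      {
         string name = path.Substring("/people/".Length,
            path.Length - "/people/".Length - ".aspx".Length);

         // Step 2: rewrite the path if we recognize the employee name.
         if (employees.ContainsKey(name))
         {
            app.Context.RewritePath(
               "/info/employee.aspx?empID=" + employees[name]);
         }
      }
   }
}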


URL Rewriting with HTTP Modules


When performing URL rewriting at the ASP.NET level you can use either an HTTP module or an HTTP handler to perform the rewriting. When using an HTTP module, you must decide at what point during the request's lifecycle to check to see if the URL needs to be rewritten. At first glance, this may seem to be an arbitrary choice, but the decision can impact your application in both significant and subtle ways. The choice of where to perform the rewrite matters because the built-in ASP.NET HTTP modules use the Request object's properties to perform their duties. (Recall that rewriting the path alters the Request object's property values.) These germane built-in HTTP modules and the events they tie into are listed below:


  • FormsAuthenticationModule (AuthenticateRequest event): Determines whether the user is authenticated using forms authentication. If not, the user is automatically redirected to the specified logon page.

  • FileAuthorizationModule (AuthorizeRequest event): When using Windows authentication, this HTTP module checks to ensure that the Microsoft® Windows® account has adequate rights for the resource requested.

  • UrlAuthorizationModule (AuthorizeRequest event): Checks to make sure the requestor can access the specified URL. URL authorization is specified through the <authorization> and <location> elements in the Web.config file.

Recall that the BeginRequest event fires before AuthenticateRequest, which fires before AuthorizeRequest.


One safe place to perform URL rewriting is in the BeginRequest event. That way, if the URL needs to be rewritten, it will have been rewritten by the time any of the built-in HTTP modules run. The downside to this approach arises when using forms authentication. If you've used forms authentication before, you know that when the user visits a restricted resource, they are automatically redirected to a specified login page. After successfully logging in, the user is sent back to the page they attempted to access in the first place.


If URL rewriting is performed in the BeginRequest or AuthenticateRequest events, the login page will, when submitted, redirect the user to the rewritten page. That is, imagine that a user types into their browser window, /people/ScottMitchell.aspx, which is rewritten to /info/employee.aspx?empID=1001. If the Web application is configured to use forms authentication, when the user first visits /people/ScottMitchell.aspx, first the URL will be rewritten to /info/employee.aspx?empID=1001; next, the FormsAuthenticationModule will run, redirecting the user to the login page, if needed. The URL the user will be sent to upon successfully logging in, however, will be /info/employee.aspx?empID=1001, since that was the URL of the request when the FormsAuthenticationModule ran.


Similarly, when performing rewriting in the BeginRequest or AuthenticateRequest events, the UrlAuthorizationModule sees the rewritten URL. That means that if you use <location> elements in your Web.config file to specify authorization for specific URLs, you will have to refer to the rewritten URL.


To fix these subtleties, you might decide to perform the URL rewriting in the AuthorizeRequest event. While this approach fixes the URL authorization and forms authentication anomalies, it introduces a new wrinkle: file authorization no longer works. When using Windows authentication, the FileAuthorizationModule checks to make sure that the authenticated user has the appropriate access rights to access the specific ASP.NET page.


Imagine if a set of users does not have Windows-level file access to C:\Inetpub\wwwroot\info\employee.aspx; if such users attempt to visit /info/employee.aspx?empID=1001, they will get an authorization error. However, if we move the URL rewriting to the AuthorizeRequest event, when the FileAuthorizationModule checks the security settings, it still thinks the file being requested is /people/ScottMitchell.aspx, since the URL has yet to be rewritten. Therefore, the file authorization check will pass, allowing this user to view the content of the rewritten URL, /info/employee.aspx?empID=1001.


So, when should URL rewriting be performed in an HTTP module? It depends on what type of authentication you're employing. If you're not using any authentication, then it doesn't matter if URL rewriting happens in BeginRequest, AuthenticateRequest, or AuthorizeRequest. If you are using forms authentication and are not using Windows authentication, place the URL rewriting in the AuthorizeRequest event handler. Finally, if you are using Windows authentication, schedule the URL rewriting during the BeginRequest or AuthenticateRequest events.


URL Rewriting in HTTP Handlers


URL rewriting can also be performed by an HTTP handler or HTTP handler factory. Recall that an HTTP handler is a class responsible for generating the content for a specific type of request; an HTTP handler factory is a class responsible for returning an instance of an HTTP handler that can generate the content for a specific type of request.


In this article we'll look at creating a URL rewriting HTTP handler factory for ASP.NET Web pages. HTTP handler factories must implement the IHttpHandlerFactory interface, which includes a GetHandler() method. After initializing the appropriate HTTP modules, the ASP.NET engine determines what HTTP handler or HTTP handler factory to invoke for the given request. If an HTTP handler factory is to be invoked, the ASP.NET engine calls that HTTP handler factory's GetHandler() method passing in the HttpContext for the Web request, along with some other information. The HTTP handler factory, then, must return an object that implements IHttpHandler that can handle the request.


To perform URL rewriting through an HTTP handler, we can create an HTTP handler factory whose GetHandler() method checks the requested path to determine if it needs to be rewritten. If it does, it can call the passed-in HttpContext object's RewritePath() method, as discussed earlier. Finally, the HTTP handler factory can return the HTTP handler returned by the System.Web.UI.PageParser class's GetCompiledPageInstance() method. (This is the same technique by which the built-in ASP.NET Web page HTTP handler factory, PageHandlerFactory, works.)
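A stripped-down sketch of such a handler factory follows. This is not the factory from the article's download (that one is driven by configurable rules); it just illustrates the GetHandler()/GetCompiledPageInstance() pattern with one hard-coded rewrite, and it assumes the application is installed at the Web site root.

using System.Web;
using System.Web.UI;

// Hypothetical, simplified handler factory for illustration only.
public class SimpleRewriterHandlerFactory : IHttpHandlerFactory
{
   public IHttpHandler GetHandler(HttpContext context, string requestType,
      string url, string pathTranslated)
   {
      // Hard-coded rewrite; a real factory would consult its rewriting rules here.
      if (url.ToLower().EndsWith("/people/scottmitchell.aspx"))
         context.RewritePath("/info/employee.aspx?empID=1001");

      // Return the compiled page for whatever path we ended up with.
      return PageParser.GetCompiledPageInstance(context.Request.Path,
         context.Server.MapPath(context.Request.Path), context);
   }

   public void ReleaseHandler(IHttpHandler handler)
   {
      // Nothing to clean up in this sketch.
   }
}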


Since all of the HTTP modules will have been initialized prior to the custom HTTP handler factory being instantiated, using an HTTP handler factory presents the same challenge as performing the URL rewriting in the later pipeline events—namely, file authorization will not work. So, if you rely on Windows authentication and file authorization, you will want to use the HTTP module approach for URL rewriting.


In the next section we'll look at building a reusable URL rewriting engine. Following our examination of the URL rewriting engine—which is available in this article's code download—we'll spend the remaining sections examining real-world uses of URL rewriting. First we'll see how to use the URL rewriting engine with a simple example. Following that, we'll utilize the power of the rewriting engine's regular expression capabilities to provide truly "hackable" URLs.


Building a URL Rewriting Engine


To help illustrate how to implement URL rewriting in an ASP.NET Web application, I created a URL rewriting engine. This rewriting engine provides the following functionality:



  • The ASP.NET page developer utilizing the URL rewriting engine can specify the rewriting rules in the Web.config file.

  • The rewriting rules can use regular expressions to allow for powerful rewriting rules.

  • URL rewriting can be easily configured to use an HTTP module or an HTTP handler.


In this article we will examine URL rewriting with just the HTTP module. To see how HTTP handlers can be used to perform URL rewriting, consult the code available for download with this article.


Specifying Configuration Information for the URL Rewriting Engine


Let's examine the structure of the rewrite rules in the Web.config file. First, you'll need to indicate in the Web.config file whether you want to perform URL rewriting with the HTTP module or the HTTP handler. In the download, the Web.config file contains two entries that have been commented out:


<!--
<httpModules>
   <add type="URLRewriter.ModuleRewriter, URLRewriter"
        name="ModuleRewriter" />
</httpModules>
-->

<!--
<httpHandlers>
   <add verb="*" path="*.aspx"
        type="URLRewriter.RewriterFactoryHandler, URLRewriter" />
</httpHandlers>
-->

Uncomment the <httpModules> entry to use the HTTP module for rewriting, or uncomment the <httpHandlers> entry instead to use the HTTP handler.


In addition to specifying whether the HTTP module or HTTP handler is used for rewriting, the Web.config file contains the rewriting rules. A rewriting rule is composed of two strings: the pattern to look for in the requested URL, and the string to replace the pattern with, if found. This information is expressed in the Web.config file using the following syntax:


<RewriterConfig>
   <Rules>
      <RewriterRule>
         <LookFor>pattern to look for</LookFor>
         <SendTo>string to replace pattern with</SendTo>
      </RewriterRule>
      <RewriterRule>
         <LookFor>pattern to look for</LookFor>
         <SendTo>string to replace pattern with</SendTo>
      </RewriterRule>
      ...
   </Rules>
</RewriterConfig>

Each rewrite rule is expressed by a <RewriterRule> element. The pattern to search for is specified by the <LookFor> element, while the string to replace the found pattern with is entered in the <SendTo> element. These rewrite rules are evaluated from top to bottom. If a match is found, the URL is rewritten and the search through the rewriting rules terminates.


When specifying patterns in the <LookFor> element, realize that regular expressions are used to perform the matching and string replacement. (In a bit we'll look at a real-world example that illustrates how to search for a pattern using regular expressions.) Since the pattern is a regular expression, be sure to escape any characters that are reserved characters in regular expressions. (Some of the regular expression reserved characters include: ., ?, ^, $, and others. These can be escaped by being preceded with a backslash, like \. to match a literal period.)
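The effect of escaping is easy to see outside of the rewriting engine with a couple of lines of standalone C# (illustrative values only):

using System;
using System.Text.RegularExpressions;

public class EscapeDemo
{
   public static void Main()
   {
      // Unescaped: the . matches ANY character, so this unintended URL matches too.
      Console.WriteLine(Regex.IsMatch("/Products/BeveragesXaspx",
         "^/Products/Beverages.aspx$", RegexOptions.IgnoreCase));    // True

      // Escaped: \. matches only a literal period.
      Console.WriteLine(Regex.IsMatch("/Products/BeveragesXaspx",
         @"^/Products/Beverages\.aspx$", RegexOptions.IgnoreCase));  // False
   }
}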


URL Rewriting with an HTTP Module


Creating an HTTP module is as simple as creating a class that implements the IHttpModule interface. The IHttpModule interface defines two methods:



  • Init(HttpApplication). This method fires when the HTTP module is initialized. In this method you'll wire up event handlers to the appropriate HttpApplication events.

  • Dispose(). This method is invoked when the request has completed and been sent back to IIS. Any final cleanup should be performed here.


To facilitate creating an HTTP module for URL rewriting, I started by creating an abstract base class, BaseModuleRewriter. This class implements IHttpModule. In its Init() method, it wires up the HttpApplication's AuthorizeRequest event to the BaseModuleRewriter_AuthorizeRequest method. The BaseModuleRewriter_AuthorizeRequest method calls the class's Rewrite() method, passing in the requested Path along with the HttpApplication object that was passed into the Init() method. The Rewrite() method is abstract, meaning that in the BaseModuleRewriter class, the Rewrite() method has no method body; rather, the class being derived from BaseModuleRewriter must override this method and provide a method body.


With this base class in place, all we have to do now is to create a class derived from BaseModuleRewriter that overrides Rewrite() and performs the URL rewriting logic there. The code for BaseModuleRewriter is shown below.


public abstract class BaseModuleRewriter : IHttpModule
{
   public virtual void Init(HttpApplication app)
   {
      // WARNING!  This does not work with Windows authentication!
      // If you are using Windows authentication,
      // change to app.BeginRequest
      app.AuthorizeRequest += new
         EventHandler(this.BaseModuleRewriter_AuthorizeRequest);
   }

   public virtual void Dispose() {}

   protected virtual void BaseModuleRewriter_AuthorizeRequest(
      object sender, EventArgs e)
   {
      HttpApplication app = (HttpApplication) sender;
      Rewrite(app.Request.Path, app);
   }

   protected abstract void Rewrite(string requestedPath,
      HttpApplication app);
}

Notice that the BaseModuleRewriter class performs URL rewriting in the AuthorizeRequest event. Recall that if you use Windows authentication with file authorization, you will need to change this so that URL rewriting is performed in either the BeginRequest or AuthenticateRequest events.


The ModuleRewriter class extends the BaseModuleRewriter class and is responsible for performing the actual URL rewriting. ModuleRewriter contains a single overridden method—Rewrite()—which is shown below:


protected override void Rewrite(string requestedPath,
   System.Web.HttpApplication app)
{
   // get the configuration rules
   RewriterRuleCollection rules =
      RewriterConfiguration.GetConfig().Rules;

   // iterate through each rule...
   for (int i = 0; i < rules.Count; i++)
   {
      // get the pattern to look for, and
      // Resolve the Url (convert ~ into the appropriate directory)
      string lookFor = "^" +
         RewriterUtils.ResolveUrl(app.Context.Request.ApplicationPath,
         rules[i].LookFor) + "$";

      // Create a regex (note that IgnoreCase is set...)
      Regex re = new Regex(lookFor, RegexOptions.IgnoreCase);

      // See if a match is found
      if (re.IsMatch(requestedPath))
      {
         // match found - do any replacement needed
         string sendToUrl =
            RewriterUtils.ResolveUrl(app.Context.Request.ApplicationPath,
            re.Replace(requestedPath, rules[i].SendTo));

         // Rewrite the URL
         RewriterUtils.RewriteUrl(app.Context, sendToUrl);
         break;      // exit the for loop
      }
   }
}

The Rewrite() method starts by getting the set of rewriting rules from the Web.config file. It then iterates through the rewrite rules one at a time, and for each rule, it grabs its LookFor property and uses a regular expression to determine if a match is found in the requested URL.


If a match is found, a regular expression replace is performed on the requested path with the value of the SendTo property. This replaced URL is then passed into the RewriterUtils.RewriteUrl() method. RewriterUtils is a helper class that provides a couple of static methods used by both the URL rewriting HTTP module and HTTP handler. The RewriteUrl() method simply calls the HttpContext object's RewritePath() method.


Note You may have noticed that when performing the regular expression match and replacement, a call to RewriterUtils.ResolveUrl() is made. This helper method simply replaces any instances of ~ in the string with the value of the application's path.
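One plausible shape for that helper, guessed at from the description above rather than copied from the download, is shown below.

// Sketch of a ~-resolving helper along the lines described in the note above.
public class ResolveUrlSketch
{
   public static string ResolveUrl(string appPath, string url)
   {
      // Nothing to resolve if the URL doesn't start with ~.
      if (url == null || url.Length == 0 || url[0] != '~')
         return url;

      // "~" by itself maps to the application path.
      if (url.Length == 1)
         return appPath;

      // Replace the leading "~/" with the application path, avoiding a double slash.
      if (appPath.EndsWith("/"))
         return appPath + url.Substring(2);

      return appPath + "/" + url.Substring(2);
   }
}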

The entire code for the URL rewriting engine is available for download with this article. We've examined the most germane pieces, but there are other components as well, such as classes for deserializing the XML-formatted rewriting rules in the Web.config file into an object, as well as the HTTP handler factory for URL rewriting. The remaining three sections of this article examine real-world uses of URL rewriting.


Performing Simple URL Rewriting with the URL Rewriting Engine


To demonstrate the URL rewriting engine in action, let's build an ASP.NET Web application that utilizes simple URL rewriting. Imagine that we work for a company that sells assorted products online. These products are broken down into the following categories:


Category ID    Category Name
1              Beverages
2              Condiments
3              Confections
4              Dairy Products
...            ...

Assume we already have created an ASP.NET Web page called ListProductsByCategory.aspx that accepts a Category ID value in the querystring and displays all of the products belonging to that category. So, users who wanted to view our Beverages for sale would visit ListProductsByCategory.aspx?CategoryID=1, while users who wanted to view our Dairy Products would visit ListProductsByCategory.aspx?CategoryID=4. Also assume we have a page called ListCategories.aspx, which lists the categories of products for sale.


Clearly this is a case for URL rewriting, as the URLs a user is presented with do not carry any significance for the user, nor do they provide any "hackability." Rather, let's employ URL rewriting so that when a user visits /Products/Beverages.aspx, their URL will be rewritten to ListProductsByCategory.aspx?CategoryID=1. We can accomplish this with the following URL rewriting rule in the Web.config file:


<RewriterConfig>
   <Rules>
      <!-- Rules for Product Lister -->
      <RewriterRule>
         <LookFor>~/Products/Beverages\.aspx</LookFor>
         <SendTo>~/ListProductsByCategory.aspx?CategoryID=1</SendTo>
      </RewriterRule>
   </Rules>
</RewriterConfig>

As you can see, this rule searches to see if the path requested by the user was /Products/Beverages.aspx. If it was, it rewrites the URL as /ListProductsByCategory.aspx?CategoryID=1.


Note Notice that the <LookFor> element escapes the period in Beverages.aspx. This is because the <LookFor> value is used as a regular expression pattern, and the period is a special character in regular expressions that means "match any character," so without the escape a URL like /Products/BeveragesQaspx would also match. By escaping the period (using \.) we indicate that we want to match a literal period, not just any character.

With this rule in place, when a user visits /Products/Beverages.aspx, they will be shown the beverages for sale. Figure 3 shows a screenshot of a browser visiting /Products/Beverages.aspx. Notice that in the browser's Address bar the URL reads /Products/Beverages.aspx, but the user is actually seeing the contents of ListProductsByCategory.aspx?CategoryID=1. (In fact, there doesn't even exist a /Products/Beverages.aspx file on the Web server at all!)




Figure 3. Requesting category after rewriting URL


Similar to /Products/Beverages.aspx, we'd next add rewriting rules for the other product categories. This simply involves adding additional <RewriterRule> elements within the <Rules> element in the Web.config file. Consult the Web.config file in the download for the complete set of rewriting rules for the demo.


To make the URL more "hackable," it would be nice if a user could simply hack off the Beverages.aspx from /Products/Beverages.aspx and be shown a listing of the product categories. At first glance, this may appear a trivial task—just add a rewriting rule that maps /Products/ to /ListCategories.aspx. However, there is a fine subtlety—you must first create a /Products/ directory and add an empty Default.aspx file in the /Products/ directory.


To understand why these extra steps need to be performed, recall that the URL rewriting engine operates at the ASP.NET level. That is, if the ASP.NET engine is never given the opportunity to process the request, there's no way the URL rewriting engine can inspect the incoming URL. Furthermore, remember that IIS hands off incoming requests to the ASP.NET engine only if the requested file has an appropriate extension. So if a user visits /Products/, IIS doesn't see any file extension, so it checks the directory to see if there exists a file with one of the default filenames (Default.aspx, Default.htm, Default.asp, and so on; these default filenames are defined on the Documents tab of the Web site's Properties dialog box in the IIS administration tool). Of course, if the /Products/ directory doesn't exist, IIS will return an HTTP 404 error.


So, we need to create the /Products/ directory. Additionally, we need to create a single file in this directory, Default.aspx. This way, when a user visits /Products/, IIS will inspect the directory, see that there exists a file named Default.aspx, and then hand off processing to the ASP.NET engine. Our URL rewriter, then, will get a crack at rewriting the URL.


After creating the directory and Default.aspx file, go ahead and add the following rewriting rule to the <Rules> element:


<RewriterRule>
   <LookFor>~/Products/Default\.aspx</LookFor>
   <SendTo>~/ListCategories.aspx</SendTo>
</RewriterRule>

With this rule in place, when a user visits /Products/ or /Products/Default.aspx, they will see the listing of product categories, shown in Figure 4.




Figure 4. Adding "hackability" to the URL


Handling Postbacks


If the page you are rewriting to contains a server-side Web Form and performs postbacks, the underlying URL will be used when the form posts back. That is, if our user enters /Products/Beverages.aspx into their browser, they will still see /Products/Beverages.aspx in their browser's Address bar, but they will be shown the content for ListProductsByCategory.aspx?CategoryID=1. If ListProductsByCategory.aspx performs a postback, the user will be posted back to ListProductsByCategory.aspx?CategoryID=1, not /Products/Beverages.aspx. This won't break anything, but it can be disconcerting from the user's perspective to see the URL change suddenly upon clicking a button.


This happens because, when the Web Form is rendered, it explicitly sets its action attribute to the value of the file path in the Request object. Of course, by the time the Web Form is rendered, the URL has been rewritten from /Products/Beverages.aspx to ListProductsByCategory.aspx?CategoryID=1, meaning the Request object reports that the user is visiting ListProductsByCategory.aspx?CategoryID=1. This problem can be fixed by having the server-side form simply not render an action attribute. (Browsers, by default, post back to the current URL when the form doesn't contain an action attribute.)


Unfortunately, the Web Form does not allow you to explicitly specify an action attribute, nor does it allow you to set some property to disable the rendering of the action attribute. Rather, we'll have to extend the System.Web.UI.HtmlControls.HtmlForm class ourselves, overriding the RenderAttributes() method and explicitly indicating that it not render the action attribute.


Thanks to the power of inheritance, we can gain all of the functionality of the HtmlForm class and only have to add a scant few lines of code to achieve the desired behavior. The complete code for the custom class is shown below:


namespace ActionlessForm
{
   public class Form : System.Web.UI.HtmlControls.HtmlForm
   {
      protected override void RenderAttributes(HtmlTextWriter writer)
      {
         writer.WriteAttribute("name", this.Name);
         base.Attributes.Remove("name");

         writer.WriteAttribute("method", this.Method);
         base.Attributes.Remove("method");

         this.Attributes.Render(writer);

         base.Attributes.Remove("action");

         if (base.ID != null)
            writer.WriteAttribute("id", base.ClientID);
      }
   }
}

The code for the overridden RenderAttributes() method simply contains the exact code from the HtmlForm class's RenderAttributes() method, but without setting the action attribute. (I used Lutz Roeder's Reflector to view the source code of the HtmlForm class.)


Once you have created this class and compiled it, to use it in an ASP.NET Web application, start by adding it to the Web application's References folder. Then, to use it in place of the HtmlForm class, simply add the following to the top of your ASP.NET Web page:


<%@ Register TagPrefix="skm" Namespace="ActionlessForm"      Assembly="ActionlessForm" %>  

Then, where you have <form runat="server">, replace that with:


<skm:Form id="Form1" method="post" runat="server">  

and replace the closing </form> tag with:


</skm:Form>  

You can see this custom Web Form class in action in ListProductsByCategory.aspx, which is included in this article's download. Also included in the download is a Visual Studio .NET project for the action-less Web Form.


Note If the URL you are rewriting to does not perform a postback, there's no need to use this custom Web Form class.

Creating Truly "Hackable" URLs


The simple URL rewriting demonstrated in the previous section showed how easily the URL rewriting engine can be configured with new rewriting rules. The true power of the rewriting rules, though, shines when using regular expressions, as we'll see in this section.


Blogs are becoming more and more popular these days, and it seems everyone has their own blog. If you are not familiar with blogs, they are often-updated personal pages that typically serve as an online journal. Most bloggers simply write about their day-to-day happenings, while others focus on a specific theme, such as movie reviews, a sports team, or a computer technology.


Depending on the author, blogs are updated anywhere from several times a day to once every week or two. Typically the blog homepage shows the most recent 10 entries, but virtually all blogging software provides an archive through which visitors can read older postings. Blogs are a great application for "hackable" URLs. Imagine while searching through the archives of a blog you found yourself at the URL /2004/02/14.aspx. Would you be terribly surprised if you found yourself reading the posts made on February 14th, 2004? Furthermore, you might want to view all posts for February 2004, in which case you might try hacking the URL to /2004/02/. To view all 2004 posts, you might try visiting /2004/.


When maintaining a blog, it would be nice to provide this level of URL "hackability" to your visitors. While many blog engines provide this functionality, let's look at how it can be accomplished using URL rewriting.


First, we need a single ASP.NET Web page that will show blog entries by day, month, or year. Assume we have such a page, ShowBlogContent.aspx, that takes in querystring parameters year, month, and day. To view the posts for February 14th, 2004, we could visit ShowBlogContent.aspx?year=2004&month=2&day=14. To view all posts for February 2004, we'd visit ShowBlogContent.aspx?year=2004&month=2. Finally, to see all posts for the year 2004, we'd navigate to ShowBlogContent.aspx?year=2004. (The code for ShowBlogContent.aspx can be found in this article's download.)
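The real page ships with the article's download; the sketch below simply shows the general idea of how such a page might read its year, month, and day querystring values, treating month and day as optional.

using System;
using System.Web.UI;

// Sketch of the querystring handling a page like ShowBlogContent.aspx might use.
public class ShowBlogContentSketch : Page
{
   protected void Page_Load(object sender, EventArgs e)
   {
      int year = int.Parse(Request.QueryString["year"]);

      // month and day are optional; -1 here stands for "not supplied".
      int month = (Request.QueryString["month"] != null)
         ? int.Parse(Request.QueryString["month"]) : -1;
      int day = (Request.QueryString["day"] != null)
         ? int.Parse(Request.QueryString["day"]) : -1;

      // A real page would now query the posts for the given year (and, when
      // supplied, month and day) and bind the results to a data control.
   }
}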


So, if a user visits /2004/02/14.aspx, we need to rewrite the URL to ShowBlogContent.aspx?year=2004&month=2&day=14. All three cases—when the URL specifies a year, month, and day; when the URL specifies just the year and month; and when the URL specifies only the year—can be handled with three rewrite rules:


<RewriterConfig>
   <Rules>
      <!-- Rules for Blog Content Displayer -->
      <RewriterRule>
         <LookFor>~/(\d{4})/(\d{2})/(\d{2})\.aspx</LookFor>
         <SendTo>~/ShowBlogContent.aspx?year=$1&amp;month=$2&amp;day=$3</SendTo>
      </RewriterRule>
      <RewriterRule>
         <LookFor>~/(\d{4})/(\d{2})/Default\.aspx</LookFor>
         <SendTo><![CDATA[~/ShowBlogContent.aspx?year=$1&month=$2]]></SendTo>
      </RewriterRule>
      <RewriterRule>
         <LookFor>~/(\d{4})/Default\.aspx</LookFor>
         <SendTo>~/ShowBlogContent.aspx?year=$1</SendTo>
      </RewriterRule>
   </Rules>
</RewriterConfig>

These rewriting rules demonstrate the power of regular expressions. In the first rule, we look for a URL with the pattern (\d{4})/(\d{2})/(\d{2})\.aspx. In plain English, this matches a string that has four digits, followed by a forward slash, followed by two digits, followed by a forward slash, followed by two digits, followed by .aspx. The parentheses around each digit grouping are vital—they allow us to refer to the matched characters inside those parentheses in the corresponding <SendTo> value. Specifically, we can refer back to the matched parenthesized groupings using $1, $2, and $3 for the first, second, and third grouping, respectively.
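The $1, $2, and $3 tokens are ordinary .NET regular expression substitutions; this small standalone snippet (illustrative values only) performs the same transformation outside of the rewriting engine:

using System;
using System.Text.RegularExpressions;

public class BackreferenceDemo
{
   public static void Main()
   {
      string requestedPath = "/2004/02/14.aspx";
      string pattern = @"^/(\d{4})/(\d{2})/(\d{2})\.aspx$";

      // $1, $2, and $3 refer back to the three parenthesized groups in the pattern.
      string rewritten = Regex.Replace(requestedPath, pattern,
         "/ShowBlogContent.aspx?year=$1&month=$2&day=$3");

      // Prints: /ShowBlogContent.aspx?year=2004&month=02&day=14
      Console.WriteLine(rewritten);
   }
}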


Note Since the Web.config file is XML-formatted, characters like &, <, and > in the text portion of an element must be escaped. In the first rule's <SendTo> element, each & is escaped as &amp;. In the second rule's <SendTo>, an alternative technique is used: by wrapping the contents in a <![CDATA[...]]> section, nothing inside needs to be escaped. Either approach is acceptable and accomplishes the same end.

Figures 5, 6, and 7 show the URL rewriting in action. The data is actually being pulled from my blog, http://ScottOnWriting.NET. In Figure 5, the posts for November 7, 2003 are shown; in Figure 6 all posts for November 2003 are shown; Figure 7 shows all posts for 2003.




Figure 5. Posts for November 7, 2003




Figure 6. All posts for November 2003




Figure 7. All posts for 2003


Note The URL rewriting engine expects a regular expression pattern in the <LookFor> elements. If you are unfamiliar with regular expressions, consider reading an earlier article of mine, An Introduction to Regular Expressions. Also, a great place to get your hands on commonly used regular expressions, as well as a repository for sharing your own crafted regular expressions, is RegExLib.com.

Building the Requisite Directory Structure


When a request comes in for /2004/03/19.aspx, IIS notes the .aspx extension and routes the request to the ASP.NET engine. As the request moves through the ASP.NET engine's pipeline, the URL will get rewritten to ShowBlogContent.aspx?year=2004&month=03&day=19 and the visitor will see those blog entries for March 19, 2004. But what happens when the user navigates to /2004/03/? Unless there is a directory /2004/03/, IIS will return a 404 error. Furthermore, there needs to be a Default.aspx page in this directory so that the request is handed off to the ASP.NET engine.


So with this approach, you have to manually create a directory for each year in which there are blog entries, with a Default.aspx page in the directory. Additionally, in each year directory you need to manually create twelve more directories—01, 02, …, 12—each with a Default.aspx file. (Recall that we had to do the same thing—add a /Products/ directory with a Default.aspx file—in the previous demo so that visiting /Products/ correctly displayed ListCategories.aspx.)


Clearly, adding such a directory structure can be a pain. A workaround to this problem is to have all incoming IIS requests map to the ASP.NET engine. This way, when the URL /2004/03/ is visited, IIS faithfully hands the request off to the ASP.NET engine even though no /2004/03/ directory exists. Using this approach, however, makes the ASP.NET engine responsible for handling all types of incoming requests to the Web server, including images, CSS files, external JavaScript files, Macromedia Flash files, and so on.


A thorough discussion of handling all file types is far beyond the scope of this article. For an example of an ASP.NET Web application that uses this technique, though, look into .Text, an open-source blog engine. .Text can be configured to have all requests mapped to the ASP.NET engine. It can handle serving all file types by using a custom HTTP handler that knows how to serve up typical static file types (images, CSS files, and so on).
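To give a feel for what serving static files from the ASP.NET engine involves, here is a minimal, hypothetical handler with a few hard-coded content types; it is not .Text's actual code, just a sketch of the idea.

using System.IO;
using System.Web;

// Hypothetical illustration of a handler that serves a handful of static file
// types when every request is mapped to the ASP.NET engine.
public class StaticFileHandler : IHttpHandler
{
   public void ProcessRequest(HttpContext context)
   {
      string physicalPath = context.Request.PhysicalPath;

      if (!File.Exists(physicalPath))
      {
         context.Response.StatusCode = 404;
         return;
      }

      // A real handler would use a fuller extension-to-MIME-type map.
      switch (Path.GetExtension(physicalPath).ToLower())
      {
         case ".css": context.Response.ContentType = "text/css"; break;
         case ".gif": context.Response.ContentType = "image/gif"; break;
         case ".jpg": context.Response.ContentType = "image/jpeg"; break;
         case ".js":  context.Response.ContentType = "application/x-javascript"; break;
         default:     context.Response.ContentType = "application/octet-stream"; break;
      }

      context.Response.WriteFile(physicalPath);
   }

   public bool IsReusable
   {
      get { return true; }
   }
}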


Conclusion


In this article we examined how to perform URL rewriting at the ASP.NET level through the HttpContext class's RewritePath() method. As we saw, RewritePath() updates the particular HttpContext's Request properties, changing what file and path is being requested. The net effect is that, from the user's perspective, they are visiting a particular URL, but a different URL is actually being requested on the Web server side.


URLs can be rewritten either in an HTTP module or an HTTP handler. In this article we examined using an HTTP module to perform the rewriting, and looked at the consequences of performing the rewriting at different stages in the pipeline.


Of course, with ASP.NET-level rewriting, the URL rewriting can only happen if the request is successfully handed off from IIS to the ASP.NET engine. This naturally occurs when the user requests a page with a .aspx extension. However, if you want the person to be able to enter a URL that might not actually exist, but would rather rewrite to an existing ASP.NET page, you have to either create mock directories and Default.aspx pages, or configure IIS so that all incoming requests are blindly routed to the ASP.NET engine.


Related Books


ASP.NET: Tips, Tutorials, and Code


Microsoft ASP.NET Coding Strategies with the Microsoft ASP.NET Team


Essential ASP.NET with Examples in C#


Works consulted


URL rewriting is a topic that has received a lot of attention both for ASP.NET and competing server-side Web technologies. The Apache Web server, for instance, provides a module for URL rewriting called mod_rewrite. mod_rewrite is a robust rewriting engine, providing rewriting rules based on conditions such as HTTP headers and server variables, as well as rewriting rules that utilize regular expressions. For more information on mod_rewrite, check out A User's Guide to URL Rewriting with the Apache Web Server.


There are a number of articles on URL rewriting with ASP.NET. Rewrite.NET - A URL Rewriting Engine for .NET examines creating a URL rewriting engine that mimics mod_rewrite's regular expression rules. URL Rewriting With ASP.NET also gives a good overview of ASP.NET's URL rewriting capabilities. Ian Griffiths has a blog entry on some of the caveats associated with URL rewriting with ASP.NET, such as the postback issue discussed in this article. Both Fabrice Marguerie (read more) and Jason Salas (read more) have blog entries on using URL rewriting to boost search engine placement.




About the author

Scott Mitchell, author of five books and founder of 4GuysFromRolla.com, has been working with Microsoft Web technologies for the past five years. Scott works as an independent consultant, trainer, and writer. He can be reached at mitchell@4guysfromrolla.com or through his blog, which can be found at http://ScottOnWriting.NET.