Monthly Columns
 

Managing websites using Unix

Copyright © 1999 Nik Clayton

This is the first of several articles explaining how to use the tools provided by Unix and clones (such as the free BSD implementations, and the various different Linux distributions) to manage the contents of a website, such as the free webspace that ISPs often give to their customers.

There is nothing about the techniques described here (and in future articles) that limits them to small, personal websites. The author has successfully used these approaches to manage sites with thousands of pages and half a dozen active webmasters working on the site.

In this article the problem to be solved is explained, and the reader is introduced to the CVS suite of commands.


Introduction

Do you have some web space somewhere? Perhaps your ISP gives you a few megabytes of space to play with, or maybe your college or university lets you put up your own home page. If so, you may well already have experimented and created your own set of home pages.

You have doubtless discovered that the time required to maintain your site keeps growing as you change existing content, add new content, and delete or archive old content. Uploading pages to your webserver takes more and more time, and it can be difficult to keep track of what has changed since you last made an update.

Most worryingly, there is always the risk that you will accidentally delete something important before you have uploaded it, or remove something from your site before you should. Of course, everyone takes backups, but this is still an inconvenience.

This can get very complicated very quickly, even for sites that have only a few tens of pages. And if you have more (and have ever had to do this yourself) you can appreciate how large the job can get.

All this for something that is meant to be a fun pastime.

Fortunately, you already have all the tools you need to greatly ease this burden at your disposal. These articles will introduce you to those tools and show how they can be used together to make life easier for you. Eventually, you will be able to update your site with just one command.


Pre-requisites

It is assumed you already know HTML.

You don't need to have a web server installed. All the examples will work using the file:// protocol.

You should be comfortable working at the command prompt of whatever shell you like using. You should understand how to redirect input from files to commands and output from commands to files, and how to pipe input and output between different commands.
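
If any of those terms are unfamiliar, the following sketch shows all three forms in one place. The scratch directory and the file name fruit.txt are made up for the illustration.

```shell
# Work in a throwaway directory so nothing of yours is touched.
scratch=$(mktemp -d)

# Output from a command to a file (>).
printf '%s\n' pear apple fig > "$scratch/fruit.txt"

# Input from a file to a command (<).
sorted=$(sort < "$scratch/fruit.txt")

# A pipe (|): the output of sort becomes the input of wc.
count=$(sort < "$scratch/fruit.txt" | wc -l | tr -d ' ')

echo "$sorted"
echo "$count lines"
```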


The problem

Before we even start thinking about the tools to use, it is important to understand the problem to solve.

Think this through. You have a website. You need to maintain this website. Maintaining the site involves adding new files, editing existing files, and removing (or moving) outdated files.

You probably don't want to be editing files on your live web site. Firstly, this is dangerous, because any mistakes you make are immediately visible to the outside world. Secondly, you might not even be able to do this if your ISP only lets you change your site using FTP (i.e., you don't have a shell account).

So you will need a work area, in which you can edit your site, experiment with new content and new file organisation for the site before they go live.

This work area will probably be cluttered with these experimental versions of your site, so you probably also need a test area (or staging area). When you want to test your site you will have to arrange to copy the appropriate files from your work area to the staging area.

When you are happy with the contents of the staging area you will need a way to copy files from the staging area to the live site. This might also involve removing files from the live site.

As well as your work area and staging area, you will need somewhere to store old copies of files that have been on your site. This might be older versions of your front page that you no longer use, or different designs that have been used across your site.

Or perhaps you have made some changes on your site that will only last for a short while (maybe it's Christmas, and you've changed the background on each page) and then you need to set everything back the way it was. We'll call this archive area ``the repository''.

Finally, you will probably find that there are a lot of repetitive tasks that you carry out when you check your site before making it live. For example, double-checking that all the internal links work, or validating your HTML.

I've found those to be the bare minimum for maintaining a web site. For small sites you can get away with not having a staging area, and test things in your work area as well. I find that quickly becomes unwieldy, and since small sites have a habit of becoming quite large quite quickly it is better to try and make things manageable from the start.

So the requirements list for a solution is starting to come together.

  • A repository for all the versions of your files.

  • A work area, to make changes in.

  • A staging area, to test in.

  • A way of copying different versions of files from the repository to the work area.

  • A way of copying files from the work area to the staging area.

  • A way of copying files from the staging area to the live site.

  • A way of automating certain tasks (like checking for dead links).

Keep this list in mind throughout the rest of these articles, as we'll steadily be ticking items off it.


The repository, and how to manage it

The repository is the area that is set aside to store all the versions of every file in your web site. Doing this means you can easily recover previous versions of files should you make mistakes, or should you need to restore the site to how it used to be on a particular date.

You are probably expecting this to be a vast directory structure, with you having to tediously copy files in and out of the repository as you change them in order to keep things in step.

Fortunately, this is not the case.

The repository is going to be managed by a package called CVS. You may have heard of it. The source code for FreeBSD, NetBSD, and OpenBSD is maintained using it[1], allowing the developers to examine any version of any file maintained by CVS, and make their changes so that other people can see the changes, who made them, what changed, and other useful pieces of information.

This obviously means that this is a big and scary program with hundreds of options and lots of documentation, right?

Right.

I don't think you were expecting that. It is true though. CVS is a large application (actually, it is a collection of different applications) and the manual is very long. Fortunately, there are only a few key concepts you need to understand at the moment.

Concept number 1. CVS maintains the file repository. The repository stores a copy of every single version of every file you have told CVS to add to the repository. Files in the repository are stored in a special format. You do not need to worry about this; CVS handles converting files from the repository format to your format and back again automatically[2].

Concept number 2. You don't edit the files in the repository directly. Instead, you tell CVS that you would like to check out one or more files from the repository. CVS retrieves them for you, and places them in the current directory. After you have edited the files and confirmed that the changes are correct, you can commit your changes. CVS then puts the changed files back in the repository, and handles updating version numbers, log entries, and so on.

Concept number 3. CVS identifies files by their filename, and by their revision number. The revision number starts at 1.1. The right hand number is incremented by 1 every time you commit to that file. When you check out a file from the repository CVS will normally give you the most recent revision of the file. However, you can tell CVS to give you the version of the file that corresponds with a specific revision number.

Concept number 4. Each time you commit a change, CVS also records a log entry. This is where you can provide additional information about what the change is, and why you made it. CVS gets the log entry by running your chosen editor (the one named in the EDITOR environment variable). If you haven't set this variable then CVS will use vi(1).

That is all you need to know for the time being about how CVS works.


Creating the repository, the work area, and the staging area

Before you can start work you need to create the repository, the work area, and the staging area. The work area and staging area are normal directories. However, the repository must be created and configured by using CVS commands.

The rest of this article will assume that you are using ~/www/[3] as the base directory for your work. You need three directories in here, one for the repository, one for the work area, and one for the staging area.

Note: There is no reason why your repository, work area, and staging area need to be subdirectories from the same parent directory; it just makes these examples more convenient. Indeed, later articles will show you how the repository, work area, and staging area can be stored in different places, and even on different computers.

Example 1. Creating the initial directory structure

% cd
% mkdir www
% cd www
% mkdir cvs-rep
% mkdir mywebsite
% mkdir mystage

cvs-rep is the directory that will hold the contents of the CVS repository, mywebsite is the work area, and mystage is the staging area.
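
The same layout can be created in one step with mkdir -p, which also creates any missing parent directories. This is only a sketch: the base directory here is a scratch location (an assumption made so that it is safe to try), and you would set base to "$HOME/www" to match the rest of this article.

```shell
# Assumption: a scratch directory stands in for ~/www.
base=$(mktemp -d)/www

# -p creates the parent (www) as well as each subdirectory.
for dir in cvs-rep mywebsite mystage; do
    mkdir -p "$base/$dir"
done

ls "$base"
```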

Tip: mywebsite and mystage are not particularly good names. For example, what if you wanted to store several sites in the repository?

I normally name the work area after the host and domain name I am working on. If I'm working on the site for www.foo.com then I would call the mywebsite directory www.foo.com.

You can follow this convention too if you like. However, for simplicity's sake, the rest of this article will assume that the work area is called mywebsite.

Having created the directory structure, you now need to create the CVS repository in which all your files will be stored. To do this, you will need to run some CVS commands.

cvs(1) is the master CVS command. Typically, you will run it like this:

cvs [options] {sub-command} [sub-command options]

Three of the CVS sub-commands you will shortly use are init, checkout and commit. They would be run as

% cvs init
% cvs checkout
% cvs commit
respectively.

Each CVS sub-command can take several parameters that control what it does. These parameters change depending on the sub-command you are using. In addition, the main cvs(1) command can also take parameters.

One of those parameters is -d /path/to/repository. Use this to specify the path to the CVS repository that you want the sub-command to operate on.

Because this is a parameter to cvs(1) it always comes between the cvs and the sub-command.

Use the init sub-command to tell CVS to create a new repository.

Example 2. Creating a new CVS repository

% cvs -d ~/www/cvs-rep init

Tip: Whenever you give CVS commands it needs to know where the repository is. You could use -d /path/to/repository each time. However, you can also use the CVSROOT environment variable. If you put the path to your repository in CVSROOT then CVS will use it for each command.

% setenv CVSROOT ~/www/cvs-rep
for csh(1) and tcsh(1) users, and
% CVSROOT=~/www/cvs-rep
% export CVSROOT
for sh(1), bash(1), and ksh(1) users. These commands can be added to your shell startup files (.login for the csh family, .profile for the sh family).

The rest of this article will assume that you have done this.

If you inspect the contents of ~/www/cvs-rep/, you should see that a new CVSROOT directory has been created. This is an administrative directory, used by the CVS commands themselves. You can edit these files to change CVS' behaviour, but that is a topic for another article.
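
To see the CVSROOT variable at work without touching ~/www, the same steps can be tried against a scratch repository. This is only a sketch: the scratch path is an assumption, and the cvs commands are skipped if cvs(1) is not installed.

```shell
# Assumption: a scratch directory stands in for ~/www.
scratch=$(mktemp -d)
mkdir "$scratch/cvs-rep"

# sh(1)-style setting; csh(1) users would use setenv instead.
CVSROOT="$scratch/cvs-rep"
export CVSROOT

if command -v cvs >/dev/null 2>&1; then
    # No -d here: cvs takes the repository path from $CVSROOT.
    cvs init
    # List the administrative directory that init created.
    ls "$CVSROOT/CVSROOT"
else
    echo "cvs is not installed, skipping" >&2
fi
```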


Adding your work area to the repository

Before you can add files to the repository you need to add your work area to the repository so that CVS can start to manage it.

CVS does have an add subcommand. However, you don't use it for this step. Instead, you must use the import subcommand.

The import subcommand imports all the files in the current directory (and all subdirectories) into the repository. The files will be placed in their own directory, and will be identified by a vendor tag and a release tag.

Vendor tag and release tag are CVS terms. They are particularly useful when you are using CVS to manage software releases, or the source code to third party software. We are doing neither, and this is the only time you will be using them, so you can safely ignore what they mean.

Your import command should look like this:

Example 3. cvs import

% cd mywebsite
% cvs import mywebsite my-name start

First of all, change into the mywebsite directory (your work area, remember?). Then run the CVS command.

This will import all the files in the current directory (mywebsite) into a directory called mywebsite in the repository (the first parameter to the import subcommand). The vendor tag for these files will be my-name, and you should change this to something that represents you, such as your login username. The release tag for these files will be start.

Note: You need to explicitly include the name mywebsite on the command line, even though that's the name of the current directory. CVS will not use the current directory name automatically.

There is nothing that requires the name of the directory in the repository to be the same as the name of the current directory. I find it makes things easier to understand though.

CVS will prompt you for the log message. I normally type in Initial import and leave it at that.

When the import has finished, CVS will display:

No conflicts created by this import

If you look inside the cvs-rep directory you will see that a mywebsite directory has been created in there, which is exactly what you want.

There is one more thing you need to do. Although CVS has imported the mywebsite directory into the repository, it has no knowledge about the existing mywebsite directory.

You must remove the existing mywebsite directory, and then instruct CVS to checkout a copy of the mywebsite directory from the repository. CVS will then do this, and include some of its own administrative files in the newly checked out mywebsite directory. These will allow CVS to use your work area.

% cd ..
% rmdir mywebsite
% cvs checkout mywebsite

If you look, you should see that the mywebsite directory has been recreated. And if you look inside the mywebsite directory you should see a CVS subdirectory, which contains CVS' administrative files. This means that CVS can now operate inside this directory.

This is quite an involved procedure, but you only need to do it the first time you create a new repository.
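
The rmdir step above only succeeds because the work area was empty when it was imported. If your work area already holds files, one cautious variation is to move the old directory aside rather than delete it. The following is a sketch under stated assumptions: it runs against a scratch repository, the name mywebsite.orig is arbitrary, and the cvs commands are skipped if cvs(1) is not installed.

```shell
# Assumption: a scratch directory stands in for ~/www.
scratch=$(mktemp -d)
CVSROOT="$scratch/cvs-rep"
export CVSROOT
mkdir "$CVSROOT" "$scratch/mywebsite"

# Unlike the example above, this work area is not empty.
echo '<p>An existing page.</p>' > "$scratch/mywebsite/index.html"

# rmdir would refuse to remove a non-empty directory, so after the import
# the originals are kept under another name before checking out.
if command -v cvs >/dev/null 2>&1; then
    (
        cvs init &&
        cd "$scratch/mywebsite" &&
        cvs import -m "Initial import" mywebsite my-name start &&
        cd .. &&
        mv mywebsite mywebsite.orig &&
        cvs checkout mywebsite
    )
fi
```

Once you are satisfied that the checked out copy is complete, the mywebsite.orig directory can be deleted.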


Putting files in the work area, adding them to the repository

Now that you have created the repository you can go into the work area and create a new file. Then you can add this file to the repository.

For this test, create a simple HTML file in the work area:

% cd mywebsite
% cat > index.html
<title>A quick test</title>

<p>This is a very small test file.</p>
^D

Note: ^D means hold down the CTRL key and press D.

You should now have index.html in mywebsite. You probably want to commit this to the repository. cvs commit index.html, right?

Almost right. Before you can commit a file for the first time, you need to tell CVS that this file exists. If CVS doesn't know that the file exists then it will ignore it. The subcommand to add a file to the repository is add.

Example 4. cvs add

% cvs add index.html
cvs add: scheduling file `index.html' for addition
cvs add: use 'cvs commit' to add this file permanently

Now that CVS knows that the file exists you can commit the file. When you commit the file CVS will want to know what entry you want in the log for this commit (the same way it wanted to know a log entry for the initial import). When you exit the editor CVS will finish committing the file.

Example 5. cvs commit

% cvs commit index.html
RCS file: /tmp/www/cvs-rep/mywebsite/index.html,v
done
Checking in index.html;
/tmp/www/cvs-rep/mywebsite/index.html,v  <--  index.html
initial revision: 1.1
done

Tip: If your commit messages are short, such as ``Fixed typo'' or ``Initial import'' you can list them on the command line. The commit and import subcommands have a -m commit message parameter that you can use. The previous commit could have been written as:

% cvs commit -m "Initial creation" index.html
which would have committed the file with a log entry of ``Initial creation''.

Don't forget to use quotes around the message, and be careful if your message includes characters that are special to your shell, such as *. Read your shell's manual to determine how best to include these, or just leave out the commit message and enter it using an editor as normal.
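
As an illustration of the quoting advice for sh(1)-style shells, the sketch below stores an awkward message in a variable and checks that it would reach a command as a single argument. The count_args function is a hypothetical stand-in for cvs, so this can be tried outside a work area.

```shell
# Stand-in for cvs: reports how many arguments it received.
count_args() { echo "$#"; }

# Single quotes keep *, ; and $ literal when the message is stored.
msg='Fixed *.html links; price is now $5'

# Double quotes around the variable pass the message on as one argument.
nargs=$(count_args commit -m "$msg" index.html)
echo "$nargs arguments, message: $msg"

# In a real work area the equivalent would be:
#   cvs commit -m "$msg" index.html
```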

The file is now committed. Doubtless you want proof of this. The simplest way is to ask CVS to show you the commit log (or just log) for the file. The subcommand for this is log, so:

Example 6. cvs log

% cvs log index.html

RCS file: /tmp/www/cvs-rep/mywebsite/index.html,v
Working file: index.html
head: 1.1
branch:
locks: strict
access list:
symbolic names:
keyword substitution: kv
total revisions: 1;     selected revisions: 1
description:
----------------------------
revision 1.1
date: 1999/01/06 19:37:29;  author: nik;  state: Exp;
Initial creation.
======================================================================

The log output includes some status information about the file. You can ignore this for the time being. It also includes every log message entered for this file.


Removing the work area

When you have finished your work for the day you can remove your work area. You don't have to do this, but you might want to leave things tidy. The subcommand to do this is release, and it takes an optional parameter, -d, indicating that not only do you want to tell CVS that you've finished, but you want to delete your work area as well.

% cd ..
% cvs release -d mywebsite
You have [0] altered files in this repository.
Are you sure you want to release (and delete) directory `mywebsite': y

Retrieving files from the repository

Now that you've removed your work area you probably want it back. As you did originally, you can:

% cvs checkout mywebsite
cvs checkout: Updating mywebsite
U mywebsite/index.html

Notice that this time you have also checked out the files in the mywebsite directory (in this case just index.html). The U by the filename indicates that CVS is updating this file from the repository, overwriting any file by that name that might already have existed.


Managing revisions with CVS

CVS' real power comes to light when you have made several changes to files, and want to manage those changes.

First, make a few changes to index.html. It doesn't matter what these changes are---you could change the text, or add some new paragraphs. Just make the changes, and use cvs commit index.html to commit them. As before, you will be prompted for a log message to use for this commit. Enter an appropriate message and exit your editor for the commit to complete[4].

Now use the log subcommand to view the log for index.html (you might have to run the result through a pager, such as less(1), or more(1), to see the output a page at a time). You will see your new log message included (as well as the time and date it was made).

% cvs log index.html | more
RCS file: /home/nik/www/cvs-rep/mywebsite/index.html,v
Working file: index.html
head: 1.2
branch:
locks: strict
access list:
symbolic names:
keyword substitution: kv
total revisions: 2;     selected revisions: 2
description:
----------------------------
revision 1.2
date: 1999/01/06 19:47:20;  author: nik;  state: Exp;  lines: +2 -0
Added second paragraph.
----------------------------
revision 1.1
date: 1999/01/06 19:37:29;  author: nik;  state: Exp;
Initial creation.
======================================================================

These are my log messages. Notice that each log message is attached to a particular revision of the file. Also, see how the entry for revision 1.2 indicates that I added 2 lines to the file, and removed none (the lines: entry). I didn't enter this information myself; CVS worked it out for me.

One of the most frequent operations on different revisions of the same file is to see what has changed between revisions. The CVS subcommand diff shows the difference between two revisions of the same file. It takes two parameters, each starting with -r. The first parameter is the revision number that is to be considered to be the original version, and the second parameter is the revision number that should be considered as the newer version.

To see what changed between revisions 1.1 and 1.2 of index.html, do the following:

Example 7. cvs diff

% cvs diff -r 1.1 -r 1.2 index.html
Index: index.html
===================================================================
RCS file: /home/nik/www/cvs-rep/mywebsite/index.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -r1.1 -r1.2
3a4,5
> 
> <p>This file has two paragraphs.</p>

The output from this command is commonly called a diff.

This diff shows the changes between revision 1.1 and 1.2. The first few lines indicate which file this diff is for, and the actions that CVS is taking. The actual changes themselves are on the last three lines.

The line that starts 3a4,5 gives the location of the change: after line 3 of the old revision, lines 4 and 5 of the new revision were added. The lines that start with > are lines that were added. If you deleted any lines from your file then they would be preceded by < characters.
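
The same format is produced by the ordinary diff(1) command, so you can experiment with it outside CVS. The sketch below recreates the two revisions as plain files in a scratch directory; the names old.html and new.html are made up for the illustration.

```shell
scratch=$(mktemp -d)

# old.html stands in for revision 1.1 (three lines).
printf '%s\n' '<title>A quick test</title>' '' \
    '<p>This is a very small test file.</p>' > "$scratch/old.html"

# new.html stands in for revision 1.2 (two more lines appended).
cp "$scratch/old.html" "$scratch/new.html"
printf '%s\n' '' '<p>This file has two paragraphs.</p>' >> "$scratch/new.html"

# diff exits with status 1 when the files differ, hence the "|| true".
changes=$(diff "$scratch/old.html" "$scratch/new.html" || true)
echo "$changes"
```

The first line of the output should read 3a4,5, as in the CVS example.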

There are a number of different formats you can select for your diff output. Many people find the unified output easier to read. You can select the unified diff output format by passing the -u parameter to the diff subcommand, like so:

Example 8. cvs diff -u

% cvs diff -u -r 1.1 -r 1.2 index.html
Index: index.html
===================================================================
RCS file: /tmp/www/cvs-rep/mywebsite/index.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- index.html  1999/01/06 19:37:29     1.1
+++ index.html  1999/01/06 19:47:20     1.2
@@ -1,3 +1,5 @@
 <title>A quick test</title>
 
 <p>This is a very small test file.</p>
+
+<p>This file has two paragraphs.</p>

As you can see, the output is similar, but a little more comprehensible. For example, this output includes a few lines of context around my changes, so I can see where in the file they were made. It also includes the date and time the changes were made.

Perhaps most importantly, there is an explicit indication (the + signs) that lines have been added. If lines had been deleted they would have been preceded by - signs.
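
Because additions and deletions are marked so plainly, they are easy to count. The sketch below does this for a unified diff produced by plain diff(1), giving the same kind of figure that appears in the lines: entry of the CVS log. It assumes that no changed line in your file itself begins with "-- " or "++ ", since such a line would be mistaken for a header; the file names are made up for the illustration.

```shell
scratch=$(mktemp -d)
printf '%s\n' '<title>A quick test</title>' '' \
    '<p>This is a very small test file.</p>' > "$scratch/old.html"
cp "$scratch/old.html" "$scratch/new.html"
printf '%s\n' '' '<p>This file has two paragraphs.</p>' >> "$scratch/new.html"

# Count the lines marked + and -, ignoring the two file header lines.
summary=$(diff -u "$scratch/old.html" "$scratch/new.html" |
    awk '/^--- / || /^\+\+\+ / { next }   # skip the file headers
         /^\+/ { added++ }
         /^-/  { removed++ }
         END   { printf "+%d -%d\n", added, removed }' || true)
echo "$summary"
```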

The order that you give the two -r parameters to the diff subcommand is important. In the previous examples (-r 1.1 -r 1.2) the output shows you how, given revision 1.1, you would go to revision 1.2.

If you reverse the ordering of the two parameters the output will show you how to go back to revision 1.1 from revision 1.2.

This is how you ``undo'' changes you have made to a file.

Suppose that you want to remove the changes you made at revision 1.2, and go back to revision 1.1. To do this, do the following:

Note: There is more than one way to do this. However, this is probably the simplest to understand.

First, take a unified diff between revisions 1.2 and 1.1 (yes, in that order).

% cvs diff -u -r 1.2 -r 1.1 index.html
Index: index.html
===================================================================
RCS file: /tmp/www/cvs-rep/mywebsite/index.html,v
retrieving revision 1.2
retrieving revision 1.1
diff -u -r1.2 -r1.1
--- index.html  1999/01/06 19:47:20     1.2
+++ index.html  1999/01/06 19:37:29     1.1
@@ -1,5 +1,3 @@
 <title>A quick test</title>

 <p>This is a very small test file.</p>
-
-<p>This file has two paragraphs.</p>

As you can see, this diff suggests that to go back to revision 1.1 from revision 1.2 I need to delete two lines (the lines preceded by - signs---obviously your output will be different depending on the changes you made).

Rerun this command, saving the output to a file, diff.out:

% cvs diff -u -r 1.2 -r 1.1 index.html > diff.out

Now you need to know about another program, patch(1). patch(1) was written by Larry Wall (creator of, amongst other things, Perl) and is a tremendously useful program. It can read in a diff file and apply the changes in the diff to the original file. This is how updates to free software were distributed to users earlier in the life of the Internet---instead of having to redistribute the entire source code to the software again, you just distributed patches to the previous release. Typically these were much smaller than the size of the original release. FreeBSD actually started out as a collection of a hundred or so patches to an earlier OS, 386BSD.

Recall that diff.out contains the instructions to convert revision 1.2 of index.html back to 1.1. You need to apply these instructions using patch(1):

Example 9. patch

% patch < diff.out
patching file `index.html'

If you now look at index.html you should see that it is as it was at revision 1.1.
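
If you would like to try the diff and patch round trip without involving the repository, the following sketch does the same thing with plain files in a scratch directory. The file saved-1.1.html is a made-up name playing the part of revision 1.1, and the patch step is skipped if patch(1) is not installed.

```shell
scratch=$(mktemp -d)

# saved-1.1.html plays the part of revision 1.1, index.html of revision 1.2.
printf '%s\n' '<title>A quick test</title>' '' \
    '<p>This is a very small test file.</p>' > "$scratch/saved-1.1.html"
cp "$scratch/saved-1.1.html" "$scratch/index.html"
printf '%s\n' '' '<p>This file has two paragraphs.</p>' >> "$scratch/index.html"

if command -v patch >/dev/null 2>&1; then
    (
        cd "$scratch" || exit 1
        # Newer file first, older file second: the diff describes going back.
        # diff exits with status 1 when the files differ, hence "|| true".
        diff -u index.html saved-1.1.html > diff.out || true
        # Name the file to patch explicitly rather than relying on the
        # file names recorded in the diff header.
        patch index.html < diff.out
    )
    # cmp is silent and succeeds only if the two files are identical.
    cmp "$scratch/index.html" "$scratch/saved-1.1.html" && echo "reverted"
fi
```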

Finally, you need to commit the new index.html. As normal, you will need to provide a commit log message---tradition dictates that this follows the form Reverted to revision x.y.

% cvs commit index.html

You might be asking ``Why not simply delete revision 1.2 and pretend it never happened? By doing it like this you clutter up the repository files with mistakes.''

You could do this. CVS has a mechanism for removing complete revisions from the files so you can pretend they never happened (CVS calls this ``outdating'' a revision). However, that is not part of the philosophy behind CVS. The aim is that all your commits stay visible, even if you later decide to back them out. This can be useful historical information when you later come to look at a file's evolution.

index.html should now have three revisions (1.1, 1.2, and 1.3) and revisions 1.1 and 1.3 should be identical. You can check this for yourself by using the diff subcommand again. Since 1.1 and 1.3 are identical there should be no differences between them:

% cvs diff -u -r 1.1 -r 1.3 index.html
Index: index.html
===================================================================
RCS file: /home/nik/www/cvs-rep/mywebsite/index.html,v
retrieving revision 1.1
retrieving revision 1.3
diff -u -r1.1 -r1.3

CVS shows no differences, as expected.


Removing files

There may come a time when you need to remove a file from your work area and from the repository. Naturally, CVS supports this.

First, you must actually remove the file from your work area:

% rm index.html

Now you must tell CVS that you want the file removed. The subcommand to do this is remove:

Example 10. cvs remove

% cvs remove index.html
cvs remove: scheduling `index.html' for removal
cvs remove: use 'cvs commit' to remove this file permanently

Finally, you must confirm to CVS that you want this file removed using the commit subcommand. As with every other commit you should include a log message indicating why you have removed the file.

% cvs commit index.html

And the file is now removed.


Restoring removed files

Sooner or later you are going to delete a file from your work area that you didn't want to remove, and that you need to recover. Fortunately, CVS has mechanisms to let you do this, since files are not normally completely removed by CVS. All of these mechanisms use the update subcommand.

If you have deleted a file, and then used cvs remove to delete it as well, CVS won't have actually deleted your file from the repository. Instead, your file has been moved to a special area in the repository called the Attic.

To bring a file back from the Attic and into the work area you need to know the revision number you want to resurrect.

For example, index.html (which we removed a few moments ago) had 4 revisions. 3 of those revisions were changes we made to the file. The fourth revision was the actual deletion of the file. This fourth revision will therefore have revision number 1.4, so the revision we want to retrieve is one less than that, or 1.3.

% cvs update -p -r 1.3 index.html > index.html

The -p parameter instructs the update subcommand to send the contents of the file to stdout, where you can redirect it back to a file to recreate it.

As normal, the -r parameter specifies the revision number you are interested in.

Once you have recreated the file you must use the add subcommand to tell CVS that it is back.

% cvs add index.html
cvs add: re-adding file index.html (in place of dead revision 1.4)
cvs add: use 'cvs commit' to add this file permanently

Notice how CVS has told you that this revision of the file will replace the dead revision 1.4.

Finally, you can commit this file back to the repository.

% cvs commit index.html

As normal, enter a log message explaining why you brought it back.
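
The ``one less than the dead revision'' step above can be worked out in the shell if you prefer not to do it by hand. The prev_revision function below is a hypothetical helper, not part of CVS. It assumes you already know the number of the dead revision, and that the revision before it differs only in the final number (as it does for the simple revision numbers used in this article).

```shell
# prev_revision REV: print the revision immediately before REV.
prev_revision() {
    branch=${1%.*}      # everything before the last dot, e.g. "1"
    last=${1##*.}       # the number after the last dot, e.g. "4"
    if [ "$last" -le 1 ]; then
        echo "no revision before $1" >&2
        return 1
    fi
    echo "$branch.$((last - 1))"
}

prev_revision 1.4

# In a real work area the result could be used as:
#   cvs update -p -r "$(prev_revision 1.4)" index.html > index.html
```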

If you have accidentally deleted a file from your work area that is in the repository, but have not yet used cvs remove then the operation is simpler. Again, use the update subcommand:

% rm index.html
% cvs update index.html
cvs update: warning: index.html was lost
U index.html

CVS warns you that the file appears to have been lost from the work area, and then updates the work area for you, bringing the file back from the repository.

Obviously, this will not restore any uncommitted changes you might have made to the file.


What now?

Remember that check list of things we needed?

  • A repository for all the versions of your files.

  • A work area, to make changes in.

  • A staging area, to test in.

  • A way of copying different versions of files from the repository to the work area.

  • A way of copying files from the work area to the staging area.

  • A way of copying files from the staging area to the live site.

  • A way of automating certain tasks (like checking for dead links).

You can check off the following:

  • A repository for all the versions of your files.

  • A work area, to make changes in.

  • A staging area, to test in.

  • A way of copying different versions of files from the repository to the work area.

The remaining items will be dealt with in later articles. In the meantime I will leave you with two more pieces of information.

First, you probably have a wealth of CVS information at your fingertips, and already installed on your system. Try typing info cvs to bring up the GNU info browser on your system, and start working through the CVS manual. If you don't have the GNU info browser installed you can instead direct your web browser to the online CVS manual.

Second, you might already have a set of web pages that you want to experiment with, probably in your home directory in a directory called public_html. If this is the case, you can import all these files into CVS and experiment with CVS on your own files (without the risk of damaging them) by doing the following:

% cd ~/public_html
% cvs import mywebsite my-name start

This will import your public_html directory, its files, and all the subdirectories, into the CVS repository into the mywebsite directory. You can then change to the www directory created earlier and checkout mywebsite, which will now contain your personal web pages:

% cd ~/www
% cvs checkout mywebsite

Then go into the mywebsite directory and experiment with committing changes to your files, using cvs log, removing files, bringing them back, and so on. Because this all happens away from your live site you can't do any damage.

This will only work if the repository directory and the public_html directory are on the same computer.


The next article

The next article in this series will start to look at make(1), and how you can use it to help transfer files from your work area to your staging area, and more importantly, why you would choose to use make(1) instead of a simpler shell script.


Acknowledgements

My thanks to David Wolfskill, Jonathan Michaels, K. Marsh, and Tom Hukins, who were generous enough to review early drafts of this article series and offer valuable comments and criticisms.

Notes

[1]

As well as many other software projects, including GNOME, KDE, SGMLTools, Mozilla...

[2]

If you are interested, the repository file format is documented in the rcsfile(5) manual page.

[3]

~ is the synonym for the path to your home directory. If your home directory is /home/nik/, then ~/www/ is the same as /home/nik/www/.

[4]

Or use the -m parameter, as described earlier.

Nik Clayton, nik@freebsd.org