This paper was originally published in the
Proceedings for The Third World Conference on System
Administration, Networking and Security, Washington DC,
The Good, The Bad and The Ugly.
Anecdotes from the System Administration Trenches
Making mistakes is never difficult. Making mistakes when
administering many UNIX systems is certainly easy. In
spite of this, most papers tend to focus only on what
works. This paper describes approaches I have seen and
experienced that do not work.
I have been working with UNIX system administration and
security for over a decade, and like this environment
very much. In some ways, the UNIX operating system
is like a human being. It can be made to do almost any task
adequately; however, it will do almost nothing perfectly.
UNIX also allows most tasks to be performed in a large
variety of ways. Unfortunately, this also leaves plenty of
room for mistakes.
According to an old saying, the way to avoid making
mistakes is to have experience. And how does one get
experience? By making mistakes! Learning from other
people's mistakes sounds good in theory, but unfortunately
does not work too well in practice. In spite of this, I
hope some people may gain some useful insights from
reading this paper.
This paper is a tour down memory lane, revisiting some
of my mistakes and experiences. It is mostly about
mistakes made by system administrators, users or
management. However, it is also about how such mistakes
can be avoided, and how, in a general sense, the work
in the trenches can be made more bearable.
Cleaning up the Hard Disk
I don't believe that there is any real UNIX System
Administrator anywhere who has not, at one time or
another, executed the famous rm -rf * in the
wrong place, and deleted a large number of files which
should never have been removed. In fact, when I meet
system administrators who claim that they have never ever
done anything like this, I make a mental note to make
sure that they never get access to one of my systems;
having an inexperienced system administrator can be bad,
but having one who is afraid of admitting mistakes can be
downright lethal.
The all-time worst I have seen was at one computer
installation where management had, in their infinite
wisdom, decided that all key engineers should have the
root password for all the key servers. This
specific event happened late on a Friday afternoon, when an
engineer discovered that a server was almost
running out of space in the root directory. Being a
genuinely nice guy, he decided that he did not need to
disturb me, and that he could fix this himself. He
searched for some large file in the root directory which
could be removed. The two largest files he could find
were /boot and /vmunix, and as they
had not been modified for a long time, they would surely
be on the backup tape, and could therefore be removed
safely. After the files had been removed, just to be sure
that the disk space was recovered, he decided to reboot
the system. Much to his surprise, the system would not
boot, and I got to spend too much of my weekend fixing
the problem (it was back in the ``good'' old days, when
you had to boot the system from tape).
This experience was the drop that made the cup flow
over for me. Ever since, I have been a firm believer in
users not having the root password for
any mission-critical machine. Of course, the politics of
root password and superuser access has always
been at the forefront of the life in the system
The Root Of All Evil
It seems that almost everybody who uses a UNIX system
wants to be able to have root access. In some ways this is
very understandable, for it is part of human nature to
strive to become better at whatever we do. For a UNIX
user, it seems almost natural that the way to grow is
to go from being a normal user to becoming a superuser.
What most people forget is that the increased power
that a superuser enjoys, being able to play God in the
Unixverse, also comes with a big responsibility to
balance the needs of all users of the system.
I don't know if I have just become better over the years
at playing the game of Root Politics, or if
there has finally come a better understanding in the user
community of how high reliability of UNIX systems is
best reached. Here are some of the various means by which I
and other sysadmins have dealt with this dragon:
- In small companies which do not have a large staff of
system administrators, it is impossible to have a
system administrator on duty all the time. On the
other hand, it is highly undesirable to hand out the
root password to everybody who might work
late at one time or another. A good compromise can be
reached by giving the security guard a sealed
envelope with the current root password.
If a user comes into a situation where they think
they need the root password, they can obtain
it at any time from the security guard, but will be
required to sign for the envelope, and will also be
required to file a report of the incident within 24
hours. The sheer intimidation of having to go through
signing and filling out a report will limit this to
very few cases, which are often justified.
If the company is too small for a guard, a similar
system can be established where the envelope is kept
in Mr. BigBoss's office (don't use your own office,
it won't work nearly as well).
- As an alternative, require everybody who
has the root password to carry a beeper and be
available for on-call duty on weekends and evenings.
Answering any user request
for the root password with ``as soon as I can
get you a beeper'' has the interesting effect of most
people quickly determining that ``oh, by the way, I don't
need it any longer''.
- On the more humorous side, one system administrator
I met at a conference some years ago claimed that he
had solved this particular problem by renaming the
root account to clerk. While many
people would like to be able to become superuser or
root, the desire of being clerk was
To Know or Not to Know, That is the Question...
One of the biggest problems, which I have encountered so
many times that I have lost count, is the new UNIX system
administrator who has to cope with maintaining the
systems while not having the necessary knowledge.
In many cases, these people have been given
responsibilities way beyond their capacity, mostly out of
desperation on the part of the organization's management
and user community, as the explosive growth in UNIX systems
and Internet connectivity has made it hard to find qualified
system administrators. For UNIX system administrators in
the early eighties, there were no books available on any
topic; anything you did not know you learned from the
UNIX manual or from the system itself. Today,
besides conferences and support from the Internet, a
large number of good books (and an
even larger number of mediocre ones) have been published
on various UNIX system administration topics. In one way
this makes it a lot easier to learn to administer the
systems well, but in other ways it makes me sometimes think
that it must be almost like drinking from a fire hose. In
many cases the new people are very anxious to learn what is
necessary, and often deal with the steep learning curve
remarkably well. It helps that most old-time system
administrators still remember how it was to be new to
UNIX and not have anybody to ask questions. They are
therefore more than willing to give advice and share
information. However, there is still ample room for
One common mistake I see newer system administrators
make is not learning to install the systems from scratch.
Even worse, they may not even have a clue where the
necessary system software is stored, or even know if the
installation media is present anywhere locally. In such
cases, even if they are able to call in a consultant who
is capable of installing the system, it might not be
possible to complete the installation because of lack of
the necessary software.
One situation which especially comes to mind was at a
financial institution with offices in San Francisco and
Los Angeles. These two locations were connected by a
leased line, but nobody onsite had any clue about the
hardware or software used to make the connection work.
While this was bad enough by itself, there was no
documentation on either hardware or software to be found
anywhere. When I stressed the need to get the
information, and get the hardware on a service contract,
the answer was ``Why should we, it never breaks''. When
it did break, it took almost three weeks before that
network connection was back in operation.
I have also seen that the lack of knowledge and the
lack of documentation can lead to almost overpowering
procrastination. At one client site, some old Sun 4 file
servers with equally old disk drives
were using some special software to concatenate two of
the drives to get one large file system. I had seen
messages in the system log which indicated that one of
the drives was failing. As those specific drives had a
notoriously low Mean Time Between Failures, it was to be
expected that the drive would fail completely
within a few months at the most. However, because nobody in
the local technical staff understood the concatenation
software or how the drives were configured, management
decided that it was better not to replace the
malfunctioning drive, out of fear that it might not come
back online again. When the drive did fail a few weeks
later, it happened at the worst possible time, during the
end-of-month processing, a big all-weekend event. When I
got the go-ahead to replace the failed drive, it took
only a few hours to get the system fully operational again,
while it took much longer to redo the end-of-month process.
If the replacement had been done as scheduled
maintenance, it would have been less visible in high
places, and much less painful to everybody involved.
It has been my experience that upper management is
almost always without technical knowledge, and will most
often not want to know anything about the technical
issues. This can make life difficult for a system
administrator. However, some of the experiences I have had
as a consultant with the management of various
organizations have been almost ridiculous. I remember more
than one time when
management insisted that somebody should follow me
around and keep an eye on what I did. While this may be
reasonable if the person accompanying me has some
UNIX background, in several cases that person did not
have a clue about UNIX. That I had the root
password to all the servers, and therefore could do
anything I wanted, should I have wanted to do something
bad, never seemed to occur to anybody.
What do I care
While lack of skilled people can be bad, being in a
situation where nobody cares is even worse.
I remember one client site where the newly hired
sole system administrator told me that she ``did not do
networks''. Part of her responsibility was to maintain
the organization's Internet connection, so I had to tell
her ``well, you do now''. When their e-mail connectivity
broke a month later (their primary name server was out
of operation for some time) she got the point that it
might be a good idea to learn a little something about
networking. However, at that time upper management
objected to having to pay for her education. In many
cases a bad situation can be remedied and improved, but
there are also situations where getting uninvolved as fast
as possible is the only feasible alternative. This was
certainly one of these.
Over the years, I have talked with many system
administrators who have been in a bad situation because
upper management or their boss makes their life
miserable. It is very difficult to fight your boss and
win; if the work situation is really unsustainable, get
Operators can also be a source of problems. They
usually do not have much knowledge about the system they are
running, and are often easily intimidated by an outside
consultant. This can make them defensive and difficult to
work with, at least initially. However, if they get
inventive, especially on their own, it can be much worse.
One system administrator I talked with at a conference
told me about how one of their operators had decided that
it was much easier if he just labeled the backup tapes
and put them directly into the storage vault; actually
doing the backup took a long time and was a lot of work.
Nobody found out until one day when it was necessary to
do a restore. Then they discovered that the needed
backup tape was blank, and in fact all the backup tapes
were blank. It apparently did not occur to anybody that
the real problem was that nobody had ever taken the
time to explain to the operator the purpose and
function of the backup. If he had understood the
consequences of his little shortcut it would probably
never have occurred.
Backups are important! While doing backups is a boring
routine job, it is also the only thing standing between a
fast, painless recovery and massive loss of data when a
mistake is made. As long as we have good backups
available, we will be able to recover from almost any
mistake, be it a human, software, electrical or
mechanical failure. Also, as a more practical
consideration, people can lose their jobs because of bad
backups. Besides the two cases I have heard about from
other administrators, I have personally been witness to
two cases. In the first case, the operations manager at a
large hardware company lost his job when a disk drive
failed, and it was discovered that it had never been backed
up. As the drive was the home of the release source
tree, it had major repercussions everywhere. The root of
the problem was that the engineering department had
purchased and installed that drive, but never notified the
operations manager about the new addition to the machine.
There are two lessons to be learned here:
- Don't let users do any maintenance on the machines
you have responsibility for. If your management forces
that decision on you, make clear that you cannot be
held responsible for users' actions.
- And make sure you know your machines, how they are
configured, and what software they are running. This is
important for maintenance, for reliability and for
In the second case, where a system administrator got
fired, he was much more actively involved in the
failure. In my experience, whenever something goes
wrong, it requires at least three independent
failures. In this case, the system administrator had
written his own backup script without testing it
properly, he used the old AT&T cpio
archive program instead of the much more reliable
dump program, he did the backup over NFS,
and he ran the backup from cron, redirecting
the diagnostic output to /dev/null, the
UNIX equivalent of a black hole. In other words, a
count of no less than four mistakes, which in combination
were a disaster waiting to happen.
As more disks were added to the systems, the
total amount of data started to exceed the capacity
of the tape, but all messages from the system were
redirected to /dev/null, so nobody was any
wiser. The problem was first discovered when the
CEO accidentally removed an important file, and could
not get it back, because it had been backed up somewhere
past the end of the tape.
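The /dev/null part of this story is the easiest one to avoid. As a minimal sketch, assuming the backup is started from a small Python wrapper rather than directly from cron, the diagnostics can be kept in a log file and the exit status checked. The names run_backup and log_path are hypothetical, and the command passed in stands for whatever backup program a site actually uses.

```python
import subprocess
from datetime import datetime


def run_backup(command, log_path):
    """Run a backup command, append its diagnostics to log_path,
    and return True only if the command exited with status 0.

    `command` is a list such as ["some-backup-program", "arg"];
    the program itself is site specific and not assumed here.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "a") as log:
        log.write(
            f"{datetime.now().isoformat()} exit={result.returncode} "
            f"cmd={' '.join(command)}\n"
        )
        # Keep stderr instead of discarding it: this is where an
        # "end of tape" style complaint would normally show up.
        if result.stderr:
            log.write(result.stderr.rstrip("\n") + "\n")
    return result.returncode == 0
```

A cron job could call this wrapper and mail the log to a human whenever it returns False. The point is only that somebody gets to see the message; how it is delivered matters much less.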
So if for no other reason, simple job security
makes it worthwhile to spend some extra time to make sure
that a good control system is in place. Being
careless, or leaving backup to unsupervised operators
or junior system administrators can lead to a rude
and sudden awakening.
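One control which would have caught the tape problem above is a check of the amount of data against the tape capacity before the backup starts. The following is only a sketch under stated assumptions: the function names and the 90% safety margin are my own illustrative choices, and the used space is read with Python's shutil.disk_usage rather than taken from the backup program.

```python
import shutil


def used_bytes(mount_points):
    """Return a dict mapping each mount point to its used bytes."""
    return {m: shutil.disk_usage(m).used for m in mount_points}


def backup_fits(filesystem_sizes, tape_capacity, safety_margin=0.9):
    """Check whether all file systems fit on one tape.

    filesystem_sizes: dict of mount point -> size, in the same
    unit as tape_capacity. safety_margin is an assumed fraction
    of the nominal capacity that is trusted to be usable.
    Returns (fits, total_size).
    """
    total = sum(filesystem_sizes.values())
    return total <= tape_capacity * safety_margin, total
```

For example, backup_fits(used_bytes(["/", "/home"]), tape_capacity) would be called with the tape capacity in bytes, and a False answer should stop the backup and notify somebody rather than let the data run off the end of the tape.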
I usually recommend to my clients that they establish
a system which is centered around a check list. Some
of the actions must then include:
- It is a good idea to restore a complete
partition to a spare disk drive once a month. This helps
to ensure that the people who are responsible for the
restore actually know how to do it, and that they can
do it in minimal time.
- Finally, if incremental backup is used, it is a
good idea to use a simple scheme to keep down the
number of tapes used. E.g. if dump is used, doing daily
dumps at level 9, weekly dumps at level 5 and monthly
dumps at level 0 will be sufficiently simple that you
will most likely not goof up when doing a full restore
at 3 AM. Using the Tower of Hanoi algorithm
suggested in the dump man page will make it difficult
and time consuming to do the restore correctly at a
time when you are not at your brightest.
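The simplicity of the 0/5/9 scheme can be sketched in a few lines. The sketch assumes that the monthly dump runs on the first day of the month and the weekly dump on Sundays; both are illustrative choices, as are the function names. It relies on the usual dump rule that a dump at a given level covers everything changed since the last dump at a lower level.

```python
from datetime import date, timedelta


def dump_level(day, weekly_weekday=6):
    """Return the dump level for a given date.

    Assumed schedule: level 0 on the 1st of the month, level 5
    on the weekly day (6 = Sunday in date.weekday()), level 9
    on every other day.
    """
    if day.day == 1:
        return 0
    if day.weekday() == weekly_weekday:
        return 5
    return 9


def restore_chain(history):
    """Return the dumps needed for a full restore, oldest first.

    history is a list of (date, level) tuples, oldest first.
    Walking backwards, a dump is needed only if its level is
    lower than every level already selected; a level 0 dump
    ends the search.
    """
    chain = []
    lowest = 10  # dump levels run from 0 to 9
    for day, level in reversed(history):
        if level < lowest:
            chain.append((day, level))
            lowest = level
        if level == 0:
            break
    chain.reverse()
    return chain


# Ten days of dumps starting Monday, January 1, 2024.
start = date(2024, 1, 1)
history = [(start + timedelta(days=i), dump_level(start + timedelta(days=i)))
           for i in range(10)]
for day, level in restore_chain(history):
    print(day.isoformat(), "level", level)
```

With this scheme a full restore never needs more than three tapes: the last monthly, the last weekly and the last daily dump.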
The New Gurus
One thing which is always scary is when people have
learned just a little bit about a topic, and then
consider themselves fully qualified on all aspects of
it. I still remember one place where I was helping
an organization to bridge the gap after their sysadmins
had left for other jobs, and was also spending some time
educating the new system administrator, who had been
promoted from within but had little prior experience
with maintaining UNIX systems. I still vividly remember
his first comment to me when he returned from a one-week
course introducing him to UNIX system administration.
Literally, he greeted me with the words ``Where is
the kernel software, I want to reconfigure our kernel!''
when he walked in the door.
It has sometimes been difficult to make new people
understand that before you start changing a system, you
had better understand the one which is already in place.
How someone can think that they can improve a system they
do not yet understand is simply beyond my comprehension.
One area where I see a lot of ``New Gurus'' is in
connection with firewall technology. I have seen people
who have never worked with firewalls before starting to
make authoritative statements about what can and cannot
be done, without understanding the implications of their
statements. It seems to be another human trend that when
a new, hot topic appears, everybody wants to be able to
speak about it with authority, even if they do not fully
understand all the issues.
Another area where a lot of people speak loudly
without understanding their topic is Policies and
Procedures. More and more people are referring to this in
a way where it almost becomes a slogan. Unfortunately, many
UNIX people seem to think that implementing a policy
only consists of finding a policy somebody else wrote,
and maybe making a few changes here and there. There are
even large FTP archives on the Internet, with big
collections of policies, where you can go and pick one to
your liking. In the real world it does not work like
this. There is no computer installation anywhere which
is without policies and procedures! They may not be
written down anywhere, people may not be in agreement
about what they are, and they may not be very helpful in
getting real work done, but nevertheless, they are
there. Implementing new explicitly documented policies
which do not work well within the implicit policies and
the corporate culture will not work! If you want policies
which actually work, you need to determine the
existing set of policies and then set out to document and
change whatever is there so it becomes something useful.
Alternatively, if you have a sufficient level of power (the
CEO level), you can just lay down the law, if you are
willing to enforce it completely (do this or you are fired).
However, it might not be too good for working morale.
Implementing policies is a slow and painstaking process,
but if it is done right, it will pay off in the end.
Firewalls are just one small aspect of security, however
interesting they may be. There are many other aspects to
security, and many ways to do it well. However, it is
important to monitor the security of systems and
networks. One of the jokes in the firewall community is
that the reason many sites think they have never been
broken into is that they have no means in place to
detect when it takes place. Unfortunately, this does not
only apply to firewalls; it often applies to almost any
computer security issue at most of the sites I have been to.
It does not help either that security is often the
inverse of convenience. When a security feature is
installed, it always means some level of inconvenience to
the user. This will be the case when a user is required
to use non-reusable passwords when logging in over the
network; when the backup is stored in a vault for fire
and theft protection; and when implicit trust between
machines has been removed. This is why many people are
still getting caught using old-style passwords and having
their accounts compromised by crackers.
This is why the World Trade Center building bombing
left some companies not only without working computers,
but also without any recent backups to make new machines
take over. And this is why Tsutomu Shimomura was
enabling trust between his internal machines, thereby
enabling Kevin Mitnick to compromise them.
And in the End ...
Let me finish with one of the funniest system
administration tales I ever remember hearing. This one
comes from Steve Simmons, another old-time UNIX system
administrator turned consultant. One of his clients had
problems with their DEC line printer. DEC hardware
maintenance had been called time and again, in order to
fix the printer, but each time it broke down again
shortly after the maintenance people had left. Finally,
the lead sysadmin got so fed up with the problem, that he
decided to do something to really get the attention of
the maintenance people.
He removed the front cover of the printer, took it
down to his car and drove to the local shooting range,
where he put several bullets through the printer cover.
Afterwards he drove back, mounted the cover back on the
printer, and then called DEC hardware maintenance,
telling them that the printer was down again.