
  This paper was originally published in the Proceedings of The Third World Conference on System Administration, Networking and Security, Washington DC, 1994.

The Good, The Bad and The Ugly.
Anecdotes from the System Administration Trenches

Linda Bissum

ABSTRACT

Making mistakes is never difficult. Making mistakes when administering many UNIX systems is certainly easy. In spite of this, most papers tend to focus only on what works. This paper describes approaches I have seen and experienced which do not work.

Introduction

I have been working with UNIX system administration and security for over a decade, and like this environment very much. In some ways, the UNIX operating system is like a human being. It can be made to do almost any task adequately; however, it will do almost nothing perfectly. UNIX also allows most tasks to be performed in a large variety of ways; unfortunately, this also leaves plenty of room for mistakes.

According to an old saying, the way to avoid making mistakes is to have experience. And how does one get experience? By making mistakes! Learning from other people's mistakes sounds good in theory, but unfortunately does not work too well in practice. In spite of this, I hope some people may gain some useful insights from reading this paper.

This paper is a tour down memory lane, revisiting some of my mistakes and experiences. It is mostly about mistakes made by system administrators, users, or management. However, it is also about how such mistakes can be avoided, and how, in a general sense, the work in the trenches can be made more bearable.

Cleaning up the Hard Disk

I don't believe that there is any real UNIX System Administrator anywhere who has not, at one time or another, executed the famous rm -rf * in the wrong place and deleted a large number of files which should never have been removed. In fact, when I meet system administrators who claim that they have never done anything like this, I make a mental note to make sure that they never get access to one of my systems; having an inexperienced system administrator can be bad, but having one who is afraid of admitting mistakes can be downright lethal.

The all-time worst case I have seen was at a computer installation where management had, in their infinite wisdom, decided that all key engineers should have the root password for all the key servers. This specific event happened late on a Friday afternoon, when an engineer discovered that a server was almost out of space in the root file system. Being a genuinely nice guy, he decided that he did not need to disturb me and that he could fix this himself. He searched for some large files in the root directory which could be removed. The two largest files he could find were /boot and /vmunix, and as they had not been modified for a long time, they would surely be on the backup tape and could therefore be removed safely. After the files had been removed, just to be sure that the disk space was recovered, he decided to reboot the system. Much to his surprise, the system would not boot, and I got to spend too much of my weekend fixing the problem (this was back in the ``good'' old days, when you had to boot the system from tape).

This experience was, for me, the straw that broke the camel's back. Ever since, I have been a firm believer in users not having the root password for any mission-critical machine. Of course, the politics of the root password and superuser access has always been at the forefront of life in the system administration trenches.

The Root Of All Evil

It seems that almost everybody who uses a UNIX system wants to have root access. In some ways this is very understandable, for it is part of human nature to strive to become better at whatever we do. For a UNIX user, it seems almost natural that the way to grow is to go from being a normal user to becoming a superuser. What most people forget is that the increased power a superuser enjoys, being able to play God in the Unixverse, also comes with a big responsibility to balance the needs of all users of the system.

I don't know if I have just become better over the years at playing the game of Root Politics, or if the user community has finally come to a better understanding of how high reliability of UNIX systems is best achieved. Here are some of the ways in which I and other sysadmins have dealt with this dragon:

  • In small companies which do not have a large staff of system administrators, it is impossible to have a system administrator on duty all the time. On the other hand, it is highly undesirable to hand out the root password to everybody who might work late at one time or another. A good compromise can be reached by giving the security guard a sealed envelope with the current root password.

    If a user comes into a situation where they think they need the root password, they can obtain it at any time from the security guard, but will be required to sign for the envelope, and will also be required to file a report of the incident within 24 hours. The sheer intimidation of having to go through the signing and the filing of a report will limit this to the very few cases which are usually justified.

    If the company is too small for a guard, a similar system can be established where the envelope is kept in Mr. BigBoss's office (don't use your own office, it won't work nearly as well).

  • As an alternative, require everybody who has the root password to carry a beeper and be available for on-call duty on weekends and evenings. Answering any user request for the root password with ``as soon as I can get you a beeper'' has the interesting effect that most people quickly decide that ``oh, by the way, I don't need it any longer''.
  • On the more humorous side, one system administrator I met at a conference some years ago claimed that he had solved this particular problem by renaming the root account to clerk. While many people would like to be able to become superuser or root, the desire to be clerk was much smaller (see the sketch below).
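
For the curious: the rename is nothing more exotic than editing the name field of the UID 0 entry in /etc/passwd. A minimal sketch, with purely illustrative field values; note that it is the UID of 0, not the name root, which grants the privileges:

    root:*:0:0:Operator:/:/bin/sh         (before)
    clerk:*:0:0:Operator:/:/bin/sh        (after)

Anything that looks up the account by name, such as su invoked without an argument, will of course need attention afterwards.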

To Know or Not to Know, That is the Question...

One of the biggest problems I have encountered, so many times that I have lost count, is the new UNIX system administrator who has to cope with maintaining systems without having the necessary knowledge.

In many cases, these people have been given responsibilities way beyond their capacity, mostly out of desperation on the part of the organization's management and user community, as the explosive growth in UNIX systems and Internet connectivity has made it hard to find qualified system administrators. For UNIX system administrators in the early eighties, there were no books available on any topic; anything you did not know, you learned from the UNIX manual or from the system itself. Today, besides conferences and support from the Internet, a large number of good books (and an even larger number of mediocre ones) have been published on various UNIX system administration topics. In one way this makes it a lot easier to learn to administer the systems well, but in other ways it sometimes makes me think that it must be almost like drinking from a fire hose. In many cases the new people are very anxious to learn what is necessary, and often deal with the steep learning curve remarkably well. It helps that most old-time system administrators still remember what it was like to be new to UNIX and have nobody to ask questions of; they are therefore more than willing to give advice and share information. However, there is still ample room for mistakes.

One common mistake I see newer system administrators make is not learning to install their systems from scratch. Even worse, they may not have a clue as to where the necessary system software is stored, or even know whether the installation media is present anywhere locally. In such cases, even if they are able to call in a consultant who is capable of installing the system, it might not be possible to complete the installation for lack of the necessary software.

One situation which especially comes to mind was at a financial institution with offices in San Francisco and Los Angeles. The two locations were connected by a leased line, but nobody on site had any clue about the hardware or software used to make the connection work. While this was bad enough by itself, there was no documentation on either the hardware or the software to be found anywhere. When I stressed the need to get the information and get the hardware on a service contract, the answer was ``Why should we, it never breaks''. When it did break, it took almost three weeks before the network connection was back in operation.

I have also seen that lack of knowledge and lack of documentation can lead to almost overpowering procrastination. At one client site, some old Sun 4 file servers with equally old disk drives were using special software to concatenate two of the drives into one large file system. I had seen messages in the system log indicating that one of the drives was failing. As those specific drives had a notoriously low Mean Time Between Failures, it was to be expected that the drive would probably fail completely within a few months at the most. However, because nobody on the local technical staff understood the concatenation software or how the drives were configured, management decided it was better not to replace the malfunctioning drive, out of fear that it might not come back online again. When the drive did fail a few weeks later, it happened at the worst possible time, during the end-of-month processing, a big all-weekend event. When I got the go-ahead to replace the failed drive, it took only a few hours to get the system back fully operational, while it took much longer to redo the end-of-month processing. If the replacement had been done as scheduled maintenance, it would have been less visible in high places and much less painful to everybody involved.

It has been my experience that upper management is almost always without technical knowledge, and most often does not want to know anything about the technical issues. This can make life difficult for a system administrator. However, some of my experiences as a consultant with the management of various organizations have been almost ridiculous. I remember more than one occasion where management insisted that somebody should follow me around and keep an eye on what I did. While this may be reasonable if the person accompanying me has some UNIX background, in several cases that person did not have a clue about UNIX. That I had the root password to all the servers, and therefore could do anything I wanted had I wanted to do something bad, never seemed to occur to anybody.

What do I care

While lack of skilled people can be bad, being in a situation where nobody cares is even worse.

I remember one client site where the newly hired sole system administrator told me that she ``did not do networks''. Part of her responsibility was to maintain the organization's Internet connection, so I had to tell her ``well, you do now''. When their e-mail connectivity broke a month later (their primary name server was out of operation for some time), she got the point that it might be a good idea to learn a little something about networking. However, at that time upper management objected to having to pay for her education. In many cases a bad situation can be remedied and improved, but there are also situations where getting uninvolved as fast as possible is the only feasible alternative. This was certainly one of them.

Over the years, I have talked with many system administrators who have been in a bad situation because upper management or their boss made their life miserable. It is very difficult to fight your boss and win; if the work situation is truly unsustainable, get another job.

Backup, Backup

Operators can also be a source of problems. They usually do not have much knowledge about the systems they are running, and are often easily intimidated by an outside consultant. This can make them defensive and difficult to work with, at least initially. However, if they get inventive, especially on their own, it can be much worse. One system administrator I talked with at a conference told me how one of their operators had decided that it was much easier if he just labeled the backup tapes and put them directly into the storage vault; actually doing the backup took a long time and was a lot of work. Nobody found out until one day when it was necessary to do a restore. Then they discovered that the needed backup tape was blank, and in fact all the backup tapes were blank. It apparently did not occur to anybody that the real problem was that nobody had ever taken the time to explain to the operator the purpose and function of the backups. If he had understood the consequences of his little shortcut, it would probably never have occurred.

Backups are important! While doing backups is a boring routine job, it is also the only thing standing between a fast, painless recovery and a massive loss of data when a mistake is made. As long as we have good backups available, we will be able to recover from almost any mishap, be it a human, software, electrical, or mechanical failure. Also, as a more practical consideration, people can lose their jobs because of bad backups. Besides cases I have heard about from other administrators, I have personally been witness to two. In the first case, the operations manager at a large hardware company lost his job when a disk drive failed and it was discovered that it had never been backed up. As the drive was the home of the release source tree, it had major repercussions everywhere. The root of the problem was that the engineering department had purchased and installed that drive, but never notified the operations manager about the new addition to the machine. There are two lessons to be learned here:

  • Don't let users do any maintenance on the machines you are responsible for. If your management forces that decision on you, make it clear that you cannot be held responsible for the users' actions.
  • And make sure you know your machines, how they are configured, and what software they are running. This is important for maintenance, for reliability and for security.
In the second case where a system administrator got fired, he was much more actively involved in the failure. In my experience, whenever something goes badly wrong, at least three independent failures are involved. In this case, the system administrator had written his own backup script without testing it properly; he used the old AT&T cpio archive program instead of the much more reliable dump program; he did the backup over NFS; and he did the backup from cron, redirecting the diagnostic output to /dev/null, the UNIX equivalent of a black hole. In other words, no fewer than four mistakes which in combination were a disaster waiting to happen (a sketch of the cron mistake, and its fix, follows below).

As more disks were added to the systems, the total amount of data started to exceed the capacity of the tape, but all messages from the system were redirected to /dev/null, so nobody was any the wiser. The problem was first discovered when the CEO accidentally removed an important file and could not get it back, because it had been backed up somewhere past the end of the tape.

So if for no other reason, simple job security makes it worth spending some extra time to make sure that a good control system is in place. Being careless, or leaving backups to unsupervised operators or junior system administrators, can lead to a rude and sudden awakening.
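
To make the /dev/null trap concrete, here is a sketch of the difference as it might look in root's crontab; the times, tape device, file system, and the ``backup'' mail alias are all made up for illustration:

    # The disaster waiting to happen: every complaint from dump,
    # including tape write errors, vanishes into the black hole.
    0 2 * * 1-5   dump 9uf /dev/nrst0 /home > /dev/null 2>&1

    # Safer: the diagnostics are mailed to whoever checks the backups.
    # (Leaving the output alone also works, since cron mails whatever
    # a job prints to the owner of the crontab.)
    0 2 * * 1-5   dump 9uf /dev/nrst0 /home 2>&1 | mailx -s "nightly dump of /home" backup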

I usually recommend to my clients that they establish a system centered around a checklist. The actions must then include:

  • The operator checks off each backup on the checklist as it is performed.
  • All status messages from the backup system are mailed to another system administrator, who checks them for any problems. He then signs the checklist, to indicate that all backups were completed correctly.
  • In addition, every day a single file, chosen at random, is restored from last night's backup. The restore should be done on a randomly chosen tape drive, but not the one used to write the tape; see the sketch below.

    This procedure helps ensure both that the tapes have valid data on them and that they can be read on drives other than the one they were created on.
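
What the daily spot check looks like in practice depends on the backup program; the following is a minimal sketch for dump and restore, in which the tape device, the file system, and the way the random file is picked are all assumptions made for illustration:

    #!/bin/sh
    # Daily backup spot check (sketch). Deliberately uses a tape
    # drive other than the one that wrote last night's tape.
    TAPE=/dev/nrst1
    FS=/home                # the file system dumped last night

    # Pick one file at random from the live file system.
    FILE=`find $FS -type f -print | \
        awk 'BEGIN { srand() } { f[NR] = $0 } END { print f[int(rand() * NR) + 1] }'`

    # dump records paths relative to the file system root, so strip
    # the mount point and extract into a scratch directory. (restore
    # may prompt for a volume number and for the owner/mode of `.'.)
    REL=.`echo $FILE | sed "s;^$FS;;"`
    cd /tmp && restore xf $TAPE $REL

    # A difference (or a failed extract) means the tape is suspect.
    cmp /tmp/$REL $FILE && echo "backup spot check OK: $FILE"

If the tape cannot produce even one file on demand, it is better to find out now than during a disaster recovery.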

It is also a good idea to restore a complete partition to a spare disk drive once a month. This helps ensure that the people who are responsible for restores actually know how to do them, and that they can do them in minimal time.

Finally, if incremental backup is used, it is a good idea to use a simple scheme to keep down the number of tapes used. E.g., if dump is used, doing daily dumps at level 9, weekly dumps at level 5, and monthly dumps at level 0, as sketched below, will be sufficiently simple that you will most likely not goof up when doing a full restore at 3 AM. Using the Tower of Hanoi algorithm suggested in the dump man page will make it difficult and time consuming to do the restore correctly at a time when you are not at your brightest.
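
As a concrete sketch of such a schedule (the times, device, and file system are assumed for illustration), the whole scheme fits in three crontab lines:

    # Level 9 incrementals Tuesday through Friday nights.
    0 2 * * 2-5   dump 9uf /dev/nrst0 /home
    # Level 5 on Monday nights, catching everything from the past week.
    0 2 * * 1     dump 5uf /dev/nrst0 /home
    # Level 0 on the first of every month. (A real crontab would keep
    # this run from colliding with the others on that date.)
    0 4 1 * *     dump 0uf /dev/nrst0 /home

A full restore then never needs more than three tapes: the last level 0, the last level 5 after it, and the last level 9 after that.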

The New Gurus

One thing which is always scary is when people have learned just a little bit about a topic and then consider themselves fully qualified on all aspects of it. I still remember one place where I was helping an organization bridge the gap after their sysadmin had left for another job, and was also spending some time educating the new system administrator, who had been promoted from within but had little prior experience with maintaining UNIX systems. I still vividly remember his first comment to me when he returned from a one-week course introducing him to UNIX system administration. Literally, he greeted me with the words ``Where is the kernel software, I want to reconfigure our kernel!'' as he walked in the door.

It has sometimes been difficult to make new people understand that before you start changing a system, you had better understand the one already in place. How someone can think that they can improve a system they do not yet understand is simply beyond my comprehension.

One area where I see a lot of ``New Gurus'' is in connection with firewall technology. I have seen people who have never worked with firewalls start to make authoritative statements about what can and cannot be done, without understanding the implications of their statements. It seems to be another human trend that when a new, hot topic appears, everybody wants to be able to speak about it with authority, even if they do not fully understand all the issues.

Another area where a lot of people speak loudly without understanding their topic is Policies and Procedures. More and more people refer to these in a way where they almost become slogans; unfortunately, many UNIX people seem to think that implementing a policy consists only of finding a policy somebody else wrote, and maybe making a few changes here and there. There are even large FTP archives on the Internet, with big collections of policies, where you can go and pick one to your liking. In the real world it does not work like this. There is no computer installation anywhere which is without policies and procedures! They may not be written down anywhere, the people may not be in agreement on what they are, and they may not be very helpful in getting real work done, but nevertheless, they are there. Implementing new, explicitly documented policies which do not work well within the implicit policies and the corporate culture will not work! If you want policies which actually work, you either need to determine the existing set of policies and then set out to document and change whatever is there so it becomes something useful, or, if you have a sufficient level of power (the CEO level), you can simply lay down the law, provided you are willing to enforce it completely (do this or you are fired). However, that might not be too good for morale. Implementing policies is a slow and painstaking process, but if it is done right, it will pay off in the end.

Security

Firewalls are just one small aspect of security, however interesting they may be. There are many other aspects to security, and many ways to do it well. Whatever is done, it is important to monitor the security of systems and networks. One of the jokes in the firewall community is that the reason many sites think they have never been broken into is that they have no means in place to detect a break-in when it takes place. Unfortunately, this does not only apply to firewalls; it applies to almost any computer security issue at most of the sites I have been to.

It does not help either that security is often the inverse of convenience. When a security feature is installed, it always means some level of inconvenience to the user. This is the case when a user is required to use non-reusable passwords when logging in over the network; when the backups are stored in a vault for fire and theft protection; and when implicit trust between machines has been removed. This is why many people still get caught using old-style passwords and having their accounts compromised by crackers.

This is why the World Trade Center bombing left some companies not only without working computers, but also without any recent backups with which to bring new machines online. And this is why Tsutomu Shimomura had enabled trust between his internal machines, thereby enabling Kevin Mitnick to compromise them.

And in the End ...

Let me finish with one of the funniest system administration tales I remember hearing. It comes from Steve Simmons, another old-time UNIX system administrator turned consultant. One of his clients had problems with their DEC line printer. DEC hardware maintenance had been called time and again to fix the printer, but each time it broke down again shortly after the maintenance people had left. Finally, the lead sysadmin got so fed up with the problem that he decided to do something to really get the attention of the maintenance people.

He removed the front cover of the printer, took it down to his car, and drove to the local shooting range, where he put several bullets through the printer cover. Afterwards he drove back, mounted the cover back on the printer, and then called DEC hardware maintenance, telling them that the printer was down again.