Linda Bissum



  This paper was originally published in the Proceedings of the First World Conference on System Administration, Networking and Security, Washington DC, 1992.

Policies and Procedures.
The Human Aspect of UNIX System Administration

ABSTRACT

When asked, most people would describe the task of UNIX system administration as purely technical in nature. While there is definitely a need for good technical skills and a good, broad understanding of how the systems work and interoperate, this is only part of the required skills. In addition, a UNIX system administrator must, on an everyday basis, interact with people who represent both management and users. Therefore, good people skills are necessary as well. Any system administrator who attempts to provide a technical solution to a problem which is of a people nature will fail miserably. This paper describes how to create realistic policies and procedures, and how they can be used to assist the system administrator in providing people-oriented solutions.

Introduction

Too many of today's UNIX system administrators are badly overworked, spending a major part of their time fighting fires, which leaves too little time to work on the solutions that could reduce the number of fires to fight. Strategic, long-range planning is the only way I know of to reduce the number of emergencies which require a here-and-now solution.

In addition, most UNIX system administrators see themselves basically as providers of technical solutions. However, all such solutions are ultimately applied to support the users, who are people. Therefore, many problems a system administrator will encounter are people-type problems. A technical solution to such a problem will work no better than a non-technical solution to a purely technical problem (of course, very few problems encountered by a system administrator will ever be purely one or the other).

Many UNIX installations, however, lack the framework required to make smooth decisions about people problems. This paper describes such a framework, based on explicit, written and published policies and procedures.

Why Policies and Procedures

Policies and procedures provide a framework for making decisions. As there is some similarity between the two, it is important to understand the differences.

Policies are the rules laid down by upper management. They delegate authority and outline requirements, enabling the system administrator to make decisions which are in line with the overall purpose of the organization.

Procedures are the detailed documentation which describes how to do specific tasks, like performing backups or adding users.

Because upper management are the people running the company, setting goals and strategies, it is important that all policies are signed off by them. This also ensures that such policies cannot easily be overturned, as they can be if decided at too low a level in the organizational hierarchy. However, upper management will most likely not have the necessary knowledge and understanding to provide such policies without assistance. It will therefore be an important task of the system administrator to provide draft policies, and to be available to teach and guide.

In addition, it is very useful to get broad support from the users for any policies which are being written. Not only will it be a lot easier to enforce such policies, it will also be significantly easier to get attention and approval from upper management if broad support from the user base can be shown for such steps.

We have to keep in mind, however, that users view the computer systems very differently than we do as system administrators or system programmers. Most users are not interested at all in anything which has to do with the operating system itself, beyond the minimal knowledge they need to do their daily job. Whenever a change occurs at the site, expect resistance unless the user clearly and directly benefits from it. Again, expect to teach and guide, in order to show why certain steps must be taken. It will most likely help the situation to make it clear that any change which later proves to have undesired side effects is open to improvement at any time.

It is important that the system administrator takes the initiative to get the necessary policies implemented. Neither upper management nor users will have the understanding needed to start such a development. In fact, the examples I have seen of policies which have been dictated from "above" have almost all been either very difficult to enforce, or have completely failed to solve the real problems!

One area of common failure, although not part of system administration, is the use of coding standards. I have only seen such an attempt actually work once. In that case, the coding standard was developed by the programmers, and simple peer pressure was the main enforcement. This should represent an important lesson for any system administrator about to make changes such as those discussed in this paper.

Analysis of Site Requirements

Anybody who sets out to implement better policies and procedures in an environment where none previously existed has taken on a major project. However, even where no explicit written policies were previously available, a number of unwritten policies are bound to exist. The task of the site analysis is to investigate the day-to-day operations, and to document the existing policies and procedures in a reasonable manner. This document can then be used as a base for discussing and implementing the new policies and procedures. One way to obtain such information is to interview upper and middle management, as well as the key users. The information to look for is the various requirements and problems, from either users or management. When discussing problems, it is important to look for both real and perceived problems. This matters because a problem imagined by a user or manager is still very real to that person, and will need to be addressed in one form or another.

Some Brief Examples of Policies

In the following, a few examples of policies are outlined. As it is not possible to give complete coverage in a short paper such as this, only the following policies are discussed:

  • Site Policy
  • Security Policy
  • Disaster Policy

The Site Policy

The site policy is useful for stating the general terms for the site. It can almost be seen as the "constitution" for the site. It states the overall policy in general terms. It must be non-technical in content, in order to be understood by everybody. For example, it will state the requirement of daily backup, but will not give any detailed specification. These details can be found in the backup policy and in the backup procedures. The site policy is also a good place to state requirements for additional policies. It will take a good deal of time to get all policies implemented, and the site policy is a good place to state the requirements for such additional work.

As mentioned above, it is extremely important to have upper management sign off on the policy. This will create a clear and visible statement of support from upper management. A system administrator who tries to implement this kind of policy without such support will find that it will only last until the first confrontation with a key user.

Security Policy

Having a good security policy is important these days. Every site should have one, whether or not it is connected to the Internet. (The required security policy will be very different for a site connected to the Internet than for a UUCP-only site. However, both need one.)

The need for a security policy and its purpose are often viewed very differently by a system administrator than by upper management. The best approach is to express the requirements in terms upper management can understand. Amazingly enough, a statement that the site is vulnerable to intruders will in many cases provoke very little reaction. The same information, expressed in the form of a requirement to provide necessary protection of assets, together with a discussion of data integrity, will on the other hand almost always get attention.

Statistical Perspective of Security

It is necessary to keep in mind the purpose of a security policy. Good security is expensive, and can be viewed as unnecessary overhead by some. It is therefore necessary to focus on the purpose of the security. It appears that most security efforts are aimed at outside intrusion. While this is clearly important, it must also be kept in mind that most security incidents are unrelated to outside intruders (Datamation, 1990).

Types of Security Incidents
Insider threats     70% to 80%
Physical threats    15% to 30%
Outsider threats     1% to 5%

In addition, most inside incidents are mistakes:

Types of Inside Security Incidents
Human error, accidents    50% to 60%
Dishonest employees       10%
Disgruntled employees     10%
All others                20% to 30%

Many companies have an open personnel policy, and almost view internal security as an evil thing. However, if reliable computing resources are necessary, internal security must be addressed. As the statistics show, most incidents are simple mistakes, often made by well-meaning users without the understanding needed to make good decisions.

RFC 1244 gives some very good guidelines for implementing a security policy. The examples given here are therefore kept brief. Some examples of the content of a security policy would be:

  • Requirement of a password on all accounts.
  • Security guidelines for modems and Internet.
  • Who has the root password.
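
As a minimal sketch of how the first requirement in the list above might be checked, the following Python fragment scans text in the classic colon-separated passwd layout and reports accounts whose password field is empty. The function name and the sample entries are illustrative assumptions, not part of any particular system. On systems which keep passwords in a separate shadow file, the field examined here holds only a placeholder, so the check would have to be adapted.

```python
def accounts_without_password(passwd_text):
    """Return login names whose password field is empty.

    Assumes the classic layout: name:password:uid:gid:gecos:home:shell
    (hypothetical helper, for illustration only).
    """
    flagged = []
    for line in passwd_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 2:
            continue  # malformed entry; a real audit would report it
        name, password = fields[0], fields[1]
        if password == "":
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Made-up sample entries, not taken from a real system.
    sample = (
        "root:Xq1aB2cD3eF4g:0:0:Super User:/:/bin/sh\n"
        "guest::100:100:Guest account:/usr/guest:/bin/sh\n"
    )
    for name in accounts_without_password(sample):
        print("no password set:", name)
```

A check like this is only useful if the policy also states who receives the report and what happens to the accounts it lists.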

The question of who has root privileges always seems to be controversial. It can at times be very difficult to convince managers and users alike that it is best for the site when as few people as possible know this password. Here are a few inventive methods which can help to counteract some requests for the root password (unfortunately, I no longer remember who contributed these suggestions):

  • Rename user root to clerk - nobody wants to be clerk.
  • When somebody requests the root password, reply: "Yes, as soon as I can get you a beeper, for as you surely know, all people with the root password are on call".
  • Leave the root password in a sealed envelope with the security guard.

I have actually used the last suggestion successfully at a large company which had security guards on duty 24 hours a day. The intimidation factor worked so well that the envelope was very seldom used.

Disaster Recovery Plan

Many UNIX people assume a disaster recovery plan is something for very large installations. Such disaster recovery plans include standby computer centers, or other similarly expensive solutions, just waiting for the disaster to hit. However, even very small organizations will do well to have a disaster recovery plan. Also, a disaster is often associated with large events, like an earthquake or a fire destroying the plant. However, much smaller events can prove to be a disaster if they are not planned for in advance. Consider for example:

  • Losing the root disk on a main server.
  • Losing the only Exabyte drive.
  • Insufficient backups or an untested backup program.
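
The last item, the untested backup, is the easiest to guard against. As a rough sketch, assuming the backup is a tar archive of a single directory tree, the Python fragment below restores the archive into a scratch directory and compares checksums against the original files. The helper names are hypothetical, and a real site would test its own backup program and media instead of the tarfile module used here.

```python
import hashlib
import os
import tarfile
import tempfile


def file_digests(root):
    """Map each file's path relative to root to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Reads each file whole; acceptable for a sketch.
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def make_backup(source_dir, archive_path):
    """Write the contents of source_dir to an uncompressed tar archive."""
    with tarfile.open(archive_path, "w") as archive:
        archive.add(source_dir, arcname=".")


def backup_restores_cleanly(source_dir, archive_path):
    """Restore the archive into a scratch directory and compare it with
    source_dir. Returns True only if the file sets and contents match."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r") as archive:
            archive.extractall(scratch)
        return file_digests(source_dir) == file_digests(scratch)
```

Comparing against the live directory only makes sense immediately after the backup is taken. For older backups, the digests would have to be recorded at backup time and compared against that record.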

Good planning ahead of time can save much time when the planned-for event occurs, because the system administrator will know specifically what to do, and the necessary resources will be available. For example, replacing a failed disk can be a very time-consuming procedure. However, if a formatted spare drive is available on the site, the affected system can be back online in a significantly shorter time. If the drive first needed to be formatted, this would add time to the replacement work. The tradeoff here is the investment in a spare drive versus the time to recover after a failure. The answers to such questions will be different for each organization, as they depend on available capital, cost of downtime, time pressure on projects, etc.
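
The spare drive tradeoff can be put into rough numbers. The sketch below is one simple way to frame it, not a complete cost model: it compares the price of the spare with the downtime cost it is expected to save. All parameter names are assumptions made for illustration, and the figures in the example are hypothetical.

```python
def expected_downtime_saving(hours_saved_per_failure,
                             downtime_cost_per_hour,
                             expected_failures):
    """Downtime cost the spare is expected to save over the planning
    period (all inputs are the site's own estimates)."""
    return hours_saved_per_failure * downtime_cost_per_hour * expected_failures


def spare_is_worthwhile(spare_cost, hours_saved_per_failure,
                        downtime_cost_per_hour, expected_failures):
    """True if the expected saving exceeds the price of the spare."""
    saving = expected_downtime_saving(hours_saved_per_failure,
                                      downtime_cost_per_hour,
                                      expected_failures)
    return saving > spare_cost


if __name__ == "__main__":
    # Hypothetical figures: the spare saves 4 hours per failure, an hour
    # of downtime costs 500, and 0.5 failures are expected in the period.
    print(spare_is_worthwhile(800, 4, 500, 0.5))
```

The calculation matters less than the fact that it forces the estimates of downtime cost and failure rate to be written down and discussed with management.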

Procedures

Procedures are as important to get documented as the policies, although they serve a very different purpose. A procedure gives step-by-step instructions for how a specific task must be executed. Well-documented procedures mean more work can be done by less skilled people. One good example of this is that in many installations, workstations are installed and configured by system administrators. However, with a good procedure, workstations can easily be installed by an operator. This would leave the system administrator free to do work better suited to that person's skill level. Similarly, a system administrator should only need to get involved with backups and restores in a planning capacity, or in case problems occur. The actual work of performing the task should be done by an operator. Many other tasks are often done by overqualified personnel, because the procedure, instead of being documented, is almost treated as a black art.

In addition to the advantages of utilizing people better and providing better work satisfaction, a well-documented procedure lends itself well to automation. Many attempts I have seen to automate simple procedures, such as adding and deleting accounts, have failed because the procedure was not well understood prior to the start of the automation effort.

It is also important to ensure that management is educated about the purpose and expected return of these tasks. I know of at least one company where good procedures were implemented, followed by a decision from upper management that the system administrator was no longer needed. Upper management's assumption proved to be right for the first six months. At that point the reliability of the systems started to deteriorate very fast. The cost of repairing the damage created by this shortsighted decision was significantly greater than the cost of keeping the system administrator during those six months would have been.

Communication

As can be seen, the cornerstone of the process of introducing good policies and procedures is communication. During the process, many requirements for technical changes are sure to arise. It is extremely important to get cooperation from users by letting them know ahead of time of any upcoming changes. A very common and very reasonable complaint I have heard from the user community is that changes take place without any advance notice. A system administrator does not always have the option of giving advance notice. However, this should be possible in the vast majority of cases. Nothing can be more frustrating for a user than coming in on the weekend to catch up on a late project, only to find the main file server down for upgrades.

Such notification can be done through e-mail, a small newsletter, /etc/motd or message boards outside the computer room(s). However, an increasing number of companies are starting to have user committees, which are used as the main point of contact with the user community. Such a committee can be used to clear plans for major changes and to ensure the timing of such changes will have the smallest possible impact on the users' work schedules. A small newsletter is extremely useful for broadcasting such decisions to the user community at large.
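
As an illustration of how little is needed for such a notice, the following sketch formats a short plain-text announcement of planned downtime, suitable for mailing or for adding to /etc/motd by hand. The function name, the field layout and the example values (system name, dates and contact address) are all assumptions made for this example.

```python
from datetime import datetime


def downtime_notice(system, start, end, reason, contact):
    """Format a plain-text notice of planned downtime.

    Hypothetical helper; the layout is only a suggestion.
    """
    stamp = "%Y-%m-%d %H:%M"
    return "\n".join([
        "*** PLANNED DOWNTIME: " + system + " ***",
        "From:    " + start.strftime(stamp),
        "Until:   " + end.strftime(stamp),
        "Reason:  " + reason,
        "Contact: " + contact,
        "",
    ])


if __name__ == "__main__":
    # Made-up example values.
    print(downtime_notice(
        "fileserver",
        datetime(1992, 6, 13, 8, 0),
        datetime(1992, 6, 13, 14, 0),
        "operating system upgrade",
        "sysadmin@example.com",
    ))
```

The important parts are the ones a user needs for planning: which system, when, for how long, and whom to ask.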

It is very important to establish ways for the users to communicate problems and wishes to the system administrator in an informal manner. While most requests can and should be handled through the official procedures, it has been my experience that it is also important to have an unofficial channel through which the users can express themselves directly to the system administrator. One example of this has been reported by Max Vasilatos from her work on creating the computing center at OSF. While there was a large amount of work which had to be done to build this installation, she found that the users were generally more satisfied if she spent a few hours every day just working the hallways, being available for small talk, than when that time was used to solve prevailing problems.

While good communication is important, it will not be possible to solve all problems. The reason could be that the resource asked for is unavailable, or maybe that the specific user is just a pain in the neck and cannot be accommodated. Whatever the reason for rejecting a user's request, two things can help. First, explain to the user why the rejection is necessary (policy, budget, time constraints, etc.), and encourage the user to escalate the problem to his or her manager. Second, immediately let your own manager know what has occurred, and your reasoning for the rejection. I have found that when this is done every time, the manager will gain increased trust in your judgment, and will seldom overrule it without good reason.

The same procedure should be used when confronted with a problem which is not within the scope of the system administrator's responsibility. The key is to keep the manager informed of ongoing events. Managers hate to get surprises, especially bad ones.

Even with all these methods employed, it is not always possible to get satisfactory results. In such cases, it can be necessary to use CYA methods to avoid becoming a scapegoat at a later time.

In such cases, memos on specific problems can be helpful if done right (that is, professional and informative). However, one of the best methods in my experience is to use weekly status reports, stating progress and upcoming plans. If a problem occurs for which it is impossible to get a solution, place it at the bottom of the status report and leave it there every week until it is resolved.

One example of where this strategy paid off was at a small startup company which had 25 XENIX machines and only two 40 Mbyte cartridge tape drives. With this equipment, it was impossible to perform a reasonable backup, and I requested a 9-track tape drive (which was the high capacity, high speed equipment of the time). Upper management decided that the $10,000 necessary was too expensive, and that I would have to make do with the existing equipment. Knowing that the current backup scheme was extremely insufficient, I started to state on my status report every week that the backup scheme was insufficient and needed resolution. This continued for almost four months, in spite of many remarks from my manager that he already knew about the backup, and that I need not tell him every week.

However, after four months, the inevitable happened. One of the machines had a disk crash, and Murphy's law went into full effect immediately. First, the disk contained the software for a major release, announced to take place three days later; second, the machine was scheduled for backup that night, making the last backup two weeks old; and on top of that, the last backup was impossible to restore, requiring the previous backup to be restored instead (this one was four weeks old). After the dust had settled, I could prove that I had warned against such an event on my status report for the past four months. If this had not been the case, I would probably have lost my job. Instead, because of the CYA effect, I got $10,000 to purchase a new tape drive.