Automating frequent tasks

Number of processes generated

When you execute large numbers of short commands, the actual execution time of the commands might be dominated by the overhead of creating processes. The procedures that incur significant amounts of such overhead are those that perform much looping, and those that generate command sequences to be interpreted by another shell.

If you are worried about efficiency, it is important to know which commands are currently built into the shell, and which are not. Here is an alphabetical list of those that are built in to the Korn shell and Bourne shell (select is Korn shell only):

break     case      cd        continue  echo
eval      exec      exit      export    for
if        read      readonly  return    select
set       shift     test      times     trap
umask     until     wait      while     .
:         {}

Note that echo and test also exist as external programs. Some other commands that are normally external have also been built into these shells, but this is nonstandard: scripts that rely on them being built in will run more slowly on systems where they are still external programs.

Parentheses, (), are built into the shell, but commands enclosed within them are executed as a child process; that is, the shell does a fork, but no exec. Any command not in the above list requires both fork and exec. The disadvantage of this is that when another process is execed it is necessary to perform a disk I/O request to load the new program. Even if the program is already in the buffer cache (an area of memory used by the system to store frequently accessed parts of the filesystem for rapid retrieval) this will increase the overhead of the shell script.
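The difference between the two kinds of grouping can be seen in a short sketch. The variable names here are purely illustrative. An assignment made inside braces is still visible afterwards, because no child process is created; one made inside parentheses is lost when the child exits.

```shell
#!/bin/sh
# Braces group commands in the current shell: no fork takes place,
# so the assignment is still visible afterwards.
count=0
{ count=1; }
after_braces=$count

# Parentheses run the commands in a child process (fork, no exec),
# so the assignment is lost when the child exits.
count=0
( count=1 )
after_parens=$count

echo "after braces: $after_braces"    # prints "after braces: 1"
echo "after parens: $after_parens"    # prints "after parens: 0"
```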

You should always have at least a vague idea of the number of processes generated by a shell procedure. In the bulk of observed procedures, the number of processes created (not necessarily simultaneously) can be described by the following:

processes = (k*n) + c

where k and c are constants for any given script, and n can be the number of procedure arguments, the number of lines in some input file, the number of entries in some directory, or some other obvious quantity. Efficiency improvements are most commonly gained by reducing the value of k, sometimes to zero. Any procedure whose complexity measure includes n squared terms or higher powers of n is likely to be intolerably expensive.
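As a sketch of what reducing k to zero can look like, the two functions below both print the last component of each pathname they are given. The function names are hypothetical. The first runs the external basename program once per argument, so k is at least 1; the second uses the ${name##pattern} parameter expansion, which is available in the Korn shell (but not in the original Bourne shell) and creates no processes at all.

```shell
#!/bin/sh
# names_external: at least one new process per argument (k >= 1),
# because basename is an external program.
names_external() {
    for pathname in "$@"
    do
        basename "$pathname"
    done
}

# names_builtin: no new process per argument (k = 0); the shell
# itself strips everything up to and including the last slash.
names_builtin() {
    for pathname in "$@"
    do
        echo "${pathname##*/}"
    done
}

names_external /usr/bin/tr /bin/mv
names_builtin /usr/bin/tr /bin/mv
```

The two are not identical in every case: for a pathname with a trailing slash, basename still prints the last component, whereas the expansion yields an empty string.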

As an example, here is an analysis of a procedure named file2lower, whose text is as follows:

   #!/bin/ksh
   #
   # file2lower --	renames files in parameter list to
   #		all-lowercase names if appropriate
   #
   PATH=/bin:/usr/bin
   for oldname in "$@"
   do
   	newname=`echo "$oldname" | tr "[A-Z]" "[a-z]"`
   	if [ "$newname" != "$oldname" ]
   	then
   	{
   		if [ ! -d "$oldname" ]
   		then
   		{
   			mv "$oldname" "$newname"
   			print "Renamed $oldname to $newname"
   		}
   		else
   			print "Error: $oldname is a directory" >&2
   		fi
   	}
   	fi
   done

This shell script checks each name in its parameter list; if the name contains uppercase letters and does not refer to a directory, the file is renamed to the all-lowercase equivalent. The script does not itself test whether the file exists or is writable; mv reports an error in those cases. This is useful when copying files from a DOS filesystem, because files imported from DOS have all uppercase names.

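For each iteration of the main for loop, there is at least one if statement. In the worst case, there are two ifs, an mv and a print. Of these, only mv is not built into the shell, so it costs one process. The command substitution that computes newname runs on every iteration and typically costs three more: one for the backquoted subshell, one for echo as a member of the pipeline, and one for tr. If n is the number of files named by the parameter list, the number of processes therefore tends towards (4n)+0. (The c term of the equation given above applies to commands executed once before or after the loop; this script has none.)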
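One way to reduce k for this procedure is to run tr once for the whole parameter list instead of once per file. The following is a minimal sketch of that idea, not a replacement for file2lower: the function name lower_all is hypothetical, it uses echo and printf rather than print so that it also runs in shells other than the Korn shell, and it assumes that no filename contains a newline.

```shell
#!/bin/sh
# lower_all (hypothetical name): rename each non-directory argument to
# its all-lowercase name, using a single tr process for the whole list.
# Assumption: no filename contains a newline character.
lower_all() {
    [ "$#" -gt 0 ] || return 0
    # One line per argument goes through tr exactly once.
    printf '%s\n' "$@" | tr "[A-Z]" "[a-z]" | {
        for oldname in "$@"
        do
            # Read the lowercased name that matches this argument.
            IFS= read -r newname || break
            [ "$newname" != "$oldname" ] || continue
            if [ -d "$oldname" ]
            then
                echo "Error: $oldname is a directory" >&2
            else
                # Keep mv from reading the piped list of names.
                mv "$oldname" "$newname" </dev/null &&
                    echo "Renamed $oldname to $newname"
            fi
        done
    }
}

lower_all "$@"
```

With this arrangement only mv remains inside the loop, so k falls from 4 to at most 1, while c becomes a small constant (roughly three, for the members of the pipeline). In the Korn shell, declaring newname with typeset -l would make the shell do the case conversion itself and remove the tr process as well.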

Some types of procedures should not be written using the shell. For example, if one or more processes are generated for each character in some file, it is a good indication that the procedure should be rewritten in C or awk. Shell procedures should not be used to scan or build files a character at a time.
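As an illustration of handing character-level work to awk, the sketch below counts the uppercase letters in a file with a single awk process, however large the file is. The function name count_upper is hypothetical, and LC_ALL=C is set on the assumption that the range [A-Z] should mean the ASCII uppercase letters only.

```shell
#!/bin/sh
# count_upper (hypothetical name): print the number of uppercase
# letters in the file named by $1. The shell creates one awk process;
# all per-character work happens inside awk.
count_upper() {
    # gsub returns the number of matches it replaced on each line.
    LC_ALL=C awk '{ total += gsub(/[A-Z]/, "") }
                  END { print total + 0 }' "$1"
}
```

For example, count_upper /etc/motd prints a single number. The shell's share of the cost is one fork and one exec, regardless of how many characters the file contains.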
