Using awk

Spanning multiple lines

Multiline records are a suitable solution for handling data that is regular in form, but sometimes it is necessary to handle records that have no fixed length. Under these circumstances, when the input data is irregular, setting RS and FS is not going to be very helpful.

An example of this kind of problem occurs in ``Writing a readability analysis program: an example'', where one of the tasks is to count the number of sentences in a text file. Although the file consists of lines of text, each terminated by a newline, there is no guarantee that a sentence occupies a single line. Sentences start with a word that has an initial capital letter and end with a period; they may occupy one or more lines.

This kind of problem can be handled by using a variable as a buffer. A line is read in and appended to the buffer; the entire buffer is then searched to see if it contains a sentence. If no sentence is contained, another line is read, and so on. If a sentence is matched, it is counted and all the data in the buffer up to the end of the sentence is deleted.

For example:

#!/usr/bin/awk -f
BEGIN {
 init="(^[A-Za-z1-90][.])|([[:space:]]|[.])(([A-Za-z0-9]|[A-Za-z0-9][a-z0-9])[.])"
 sent="([A-Za-z1-90]+([[:space:]])*)+[.]"
 sentences = 0
 target = ""
 marker="+X+"
 }
 {    initials = gsub(init, "", $0)
 target = target " " $0
 hit = gsub(sent, marker, target)
 sentences += hit
 if (hit != 0) {
     for (i=0; i< hit; i++) {
         found = index(target, marker)
         target = substr(target, (found+3))
     } # end for
 } # end if
 hit = 0
 }
END {    print sentences " sentences counted"
  }

The BEGIN section is used to define a sentence (in variable sent). For the purpose of this program, a sentence is a regular expression consisting of a sequence of words terminated by a period. (A word is one or more letters or digits followed by optional whitespace. Because awk matches the left-most longest pattern, in practice we can expect awk to choose the longest series of letters it can find that is terminated by a space.)
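
As a rough illustration of left-most longest matching, the following sketch applies the sent expression from the program above to one line of input and prints the text that was matched. The function name show_match and the sample text are assumptions made for this illustration only; match(), RSTART and RLENGTH are standard awk.

```shell
#!/bin/sh
# show_match is a hypothetical helper, not part of the original program.
show_match() {
    awk '
    BEGIN { sent = "([A-Za-z1-90]+([[:space:]])*)+[.]" }   # copied from the program
    {
        # match() sets RSTART and RLENGTH to the left-most longest match
        if (match($0, sent))
            print substr($0, RSTART, RLENGTH)
        else
            print "(no match)"
    }'
}

# Sample input: one complete sentence followed by an unfinished one
printf '%s\n' 'the cat sat on the mat. and then' | show_match
```

With this input the match starts at the first word and runs up to the first period, so the line printed is "the cat sat on the mat."; the trailing words are not part of the match because no period follows them.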

Sentences are not the only entities terminated by a period; initials and ellipses contain periods, and must therefore be removed before the input is tested to determine if it is a sentence. The BEGIN section defines the variable init as a regular expression that matches a set of initials.

Initials consist of a letter or digit followed by a period, or a more complex format (one or two letters or digits followed by a period, as in Ph.D.). This expression is not infallible, but traps most initials.
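
The effect of the init expression can be checked in isolation. The sketch below is an illustration only: it copies init from the program, and the function name strip_initials, the sample name and the output format are assumptions introduced here. It reports how many substitutions gsub made and shows what is left of the line.

```shell
#!/bin/sh
# strip_initials is a hypothetical helper, not part of the original program.
strip_initials() {
    awk '
    BEGIN {
        # copied from the program above
        init = "(^[A-Za-z1-90][.])|([[:space:]]|[.])(([A-Za-z0-9]|[A-Za-z0-9][a-z0-9])[.])"
    }
    {
        n = gsub(init, "", $0)               # delete every match, count deletions
        printf "%d removed: [%s]\n", n, $0   # brackets make leading spaces visible
    }'
}

printf '%s\n' 'J. R. Smith arrived.' | strip_initials
```

For this input the output is "2 removed: [ Smith arrived.]": the leading "J." is matched by the first alternative and " R." (including the space before it) by the second, leaving only the period that ends the sentence.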

Each line in the standard input ($0) is read in and scanned for initials. These are replaced by the empty string (""), which deletes them. The line is then appended to the variable target (line 10 of the program).

On line 11, target is scanned and every occurrence of a sentence (as defined by sent) is replaced by a marker (defined by marker). The total sentence count is incremented by the number of sentences found in target. Then, for each hit, the target variable is truncated: the text up to and including the marker is discarded, so that after the loop only the text following the last marker remains. (That is, the matched sentences are removed from target, by the substr function on line 16.) The program then reads the next line. At the end of the input, the script displays a total sentence count.
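
The marker and truncation steps can be traced on a single line of input. This is a minimal sketch rather than the program itself: trace_buffer and the sample text are hypothetical, the initials step is left out, and found + length(marker) is used in place of found+3 (the two are equal for the three-character marker).

```shell
#!/bin/sh
# trace_buffer is a hypothetical helper that shows the buffer after each step.
trace_buffer() {
    awk '
    BEGIN {
        sent   = "([A-Za-z1-90]+([[:space:]])*)+[.]"   # copied from the program
        marker = "+X+"
    }
    {
        target = " " $0                       # append the line to an empty buffer
        hit = gsub(sent, marker, target)      # replace each sentence by the marker
        printf "after gsub: [%s] hits=%d\n", target, hit
        for (i = 0; i < hit; i++) {           # drop text up to and including each marker
            found  = index(target, marker)
            target = substr(target, found + length(marker))
        }
        printf "after trim: [%s]\n", target
    }'
}

printf '%s\n' 'It rains. It pours. And then' | trace_buffer
```

The two complete sentences are counted and removed, and the unfinished text " And then" stays in the buffer to be joined with the next input line:

after gsub: [ +X+ +X+ And then] hits=2
after trim: [ And then]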

It is worth comparing this method of crossing line boundaries with the method using sed in ``Hold and get functions''.

Syllables are handled in a similar manner, but it is not necessary to handle multiple lines. Instead, a regular expression that matches a generic syllable is defined, and for each input line all syllables are globally replaced with a marker character while a counter is incremented:

#!/bin/sh
#
# First, define syllabic consonants
#
CONS="[bcdfghjklmnpqrstvwxyz]|ll|ght|qu|([wstgpc]h)|sch"
#
# Next, define syllabic vowels
#
VOWL="[aeiou]+|ly"
#
# The definition of a syllable (after Webster's Collegiate Dictionary):
# a syllable is one or more consonants or vowels, optionally preceded by
# and optionally followed by a consonant.
#
SYL="(${CONS})*\
((${CONS})|((${VOWL})+))\
(${CONS})*"
#
sylcount=`awk ' BEGIN { sylcount = 0 }
                { target = $0
                  incr = gsub(syllable, "*", target)
                  sylcount += incr
                }
          END   { print sylcount }
' syllable="$SYL" < "$1"`
echo "There were $sylcount syllables in $1"

Note that for the purposes of matching a syllable, we need to use syllabic consonants and syllabic vowels. These correspond to the written representations of speech sounds, rather than to the letters of the alphabet. As a result, the syllable definition given above is so complex that it is better to build it up from component expressions stored in shell variables (as above) than to try to write it out at length.

After processing the specified input file ($1), the script displays a count of all the syllables located.
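
The same counting step can be tried on a single word. The following is a sketch under stated assumptions, not a replacement for the script above: the function name count_syllables is hypothetical, the expression is passed with the standard -v option instead of a trailing assignment, and input is read from standard input. The CONS, VOWL and SYL definitions are copied from the script; they contain only lowercase letters, so the sample word is lowercase.

```shell
#!/bin/sh
# Definitions copied from the script above
CONS="[bcdfghjklmnpqrstvwxyz]|ll|ght|qu|([wstgpc]h)|sch"
VOWL="[aeiou]+|ly"
SYL="(${CONS})*((${CONS})|((${VOWL})+))(${CONS})*"

# count_syllables is a hypothetical helper: it reads text on standard input
# and prints the number of matches of the syllable expression.
count_syllables() {
    awk -v syllable="$SYL" '
        { target = $0
          sylcount += gsub(syllable, "*", target)   # gsub returns the match count
        }
    END { print sylcount + 0 }                      # + 0 prints 0 for empty input
    '
}

printf '%s\n' 'banana' | count_syllables
```

For "banana" the expression matches "ban", "an" and "a" in turn, so the count printed is 3.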