Automating frequent tasks

Readability analysis

Four different readability statistics are calculated within analyze. Readability statistics measure quantifiable features of a text, such as the average number of words per sentence, the average length of words, the number of syllables per word, and so on, to derive a formulaic estimate of the ``readability'' of the text. They do not take into account less quantifiable elements such as semantic content, grammatical correctness, or meaning. Thus, there is no guarantee that a text a readability test identifies as easy to understand actually is readable. In practice, however, real documents that the tests identify as ``easy to read'' are likely to be easier to comprehend at a structural level.

The four test formulae used in the analyze function are as follows:

Automated Readability Index
The Automated Readability Index (ARI) is based on text from grades 0 to 7, and is intended for easy automation. ARI tends to produce scores that are higher than Kincaid and Coleman-Liau but lower than Flesch.

Kincaid formula
The Kincaid formula is based on Navy training manuals ranging in grade level from 5.5 to 16.3. The score reported by the formula tends to be in the mid-range of the four formulae. Because it is based on adult training manuals rather than schoolbook text, this formula is most applicable to technical documents.

Coleman-Liau Formula
The Coleman-Liau formula is based on text ranging in grade level from .4 to 16.3. This formula usually yields the lowest grade when applied to technical documents.

Flesch Reading Ease Score
The Flesch formula is based on grade school text covering grades 3 to 12. The difficulty score is reported in the range 0 (very difficult) to 100 (very easy).
To calculate these metrics, analyze first counts the number of words, lines, and characters in the target file, generating output like the following:
   File rap-bat.wc contains:
           243             words
           95              lines
           1768            characters
Sentences are counted using a custom awk script, explained in ``Spanning multiple lines''. Then the number of letters is established (by stripping the white space from the file and counting the remaining characters), and the number of syllables is estimated using another awk script. Finally, these values are fed into four calculations that make use of bc, the SCO OpenServer binary calculator.
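
As a sketch of these counting steps (the file name and contents below are hypothetical, and analyze's actual commands may differ), the basic counts can be gathered with wc(C) and tr(C):

```shell
file=rap-bat.wc                        # hypothetical sample file
printf 'The cat sat.\nDogs bark loudly.\n' > "$file"

# word, line, and character counts, as reported by wc
words=`wc -w < "$file"`
lines=`wc -l < "$file"`
chars=`wc -c < "$file"`

# number of letters: delete all white space, count what remains
letters=`tr -d '[:space:]' < "$file" | wc -c`

echo "$words words, $lines lines, $chars characters, $letters letters"
```

Deleting the white space before counting is what distinguishes the letter count from the raw character count above.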

bc is a simple programming language for calculations; it recognizes a syntax similar to that of C or awk, and can use variables and functions. It is fully described in bc(C), and is used here because, unlike the shell's expr command, it can handle floating point arithmetic (that is, numbers with a decimal point are not truncated). Because bc is interactive and reads commands from its standard input, the basic readability variables are substituted into a here-document which is fed to bc, and the output is captured in a shell variable. For example:

   Flesch=`bc << %%
   w = ($wordcount / $sentences)
   s = ($sylcount / $wordcount)
   206.835 - 84.6 * s - 1.015 * w
   %%
   `
analyze also prints the output from the tests, as follows:
   ARI = -10.43
   Kincaid= -7.01
   Coleman-Liau = -17.00
   Flesch Reading Ease = 184.505
Depending on the setting of $LOG (the variable that controls file logging), the output is printed to the terminal, or printed to the terminal and a logfile (the name of which is set by the variable $LOGFILE).
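
A minimal sketch of this logging logic, using the variable names from the text (the exact test analyze performs, and the logfile name, are assumed here):

```shell
LOG=yes                         # assumed setting; controls file logging
LOGFILE=analyze.log             # assumed logfile name
report="Flesch Reading Ease = 184.505"

if [ "$LOG" = "yes" ]
then
        # print to the terminal and append a copy to the logfile
        echo "$report" | tee -a "$LOGFILE"
else
        echo "$report"
fi
```

tee(C) is the conventional way to send one stream to both the terminal and a file; the -a flag appends rather than overwriting, so repeated runs accumulate in the logfile.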

© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003