Using awk

Generating reports

awk is especially useful for producing reports that summarize and format information. Suppose you want to produce a report from the file countries that lists the continents alphabetically and, after each continent, its countries in decreasing order of population, like this:

   Africa:
           Sudan          19
           Algeria        18

   Asia:
           China         866
           India         637
           CIS           262

   Australia:
           Australia      14

   North America:
           USA           219
           Canada         24

   South America:
           Brazil        116
           Argentina      26

As with many data processing tasks, it is much easier to produce this report in several stages. First, create a list of continent-country-population triples, in which each field is separated by a colon. To do this, use the following program, triples, which uses an array pop, indexed by subscripts of the form 'continent:country' to store the population of a given country.

The print statement in the END section of the program creates the list of continent-country-population triples that are piped to the sort routine:

   BEGIN  { FS = "\t" }
          { pop[$4 ":" $1] += $3 }
   END    { for (cc in pop)
               print cc ":" pop[cc] | "sort -t: +0 -1 +2nr" }
The arguments for sort deserve special mention. The -t: argument tells sort to use : as its field separator. The +0 -1 arguments make the first field the primary sort key. In general, +i -j makes fields i+1, i+2, ..., j the sort key. If -j is omitted, the fields from i+1 to the end of the record are used. The +2nr argument makes the third field, numerically decreasing, the secondary sort key (n is for numeric, r for reverse order). Invoked on the file countries, this program produces as output:
   Africa:Sudan:19
   Africa:Algeria:18
   Asia:China:866
   Asia:India:637
   Asia:CIS:262
   Australia:Australia:14
   North America:USA:219
   North America:Canada:24
   South America:Brazil:116
   South America:Argentina:26
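
The +i -j key notation is the historical sort syntax, and a current sort may not accept it. As a sketch, assuming a sort that follows the POSIX -k key syntax, the same ordering can be requested with -k1,1 -k3nr. The variable names and the five-line sample below are illustrative only and are not part of the triples program:

```shell
# Illustrative sample: a few unsorted triples taken from the listing above.
triples='Asia:India:637
Africa:Algeria:18
Asia:China:866
Africa:Sudan:19
Asia:CIS:262'

# -t:    use ":" as the field separator
# -k1,1  primary key: field 1 only (stands in for +0 -1)
# -k3nr  secondary key: field 3 to end of line, numeric, reversed
#        (stands in for +2nr)
sorted=$(printf '%s\n' "$triples" | sort -t: -k1,1 -k3nr)
printf '%s\n' "$sorted"
```

Inside triples, the corresponding change would be to pipe the print output to "sort -t: -k1,1 -k3nr"; the triples should then come out in the same order as the listing above.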
This output is in the right order but the wrong format. To transform the output into the desired form, run it through a second awk program, format:
   BEGIN  { FS = ":" }
   {      if ($1 != prev) {
               print "\n" $1 ":"
               prev = $1
          }
          printf "\t\t%-10s %6d\n", $2, $3
   }
This is a control-break program that prints only the first occurrence of a continent name and formats the country-population lines associated with that continent in the desired manner. The following command line produces the report:

   awk -f triples countries | awk -f format

As this example suggests, complex data transformation and formatting tasks can often be reduced to a few simple awk and sort operations.
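
To try the two stages end to end without creating separate program files, both awk programs can be written inline in a small shell script. The following is a minimal sketch, assuming a POSIX sh, awk, and sort. The sample input reuses the populations and continents from the report above, but the area column (field 2) is filled with a placeholder 0 because its values are not given in this section and neither program reads it. The -k sort keys are used in place of +0 -1 +2nr on the assumption that the local sort expects the newer syntax.

```shell
# Sample input, one record per country, tab-separated:
# country, area (placeholder 0), population, continent.
countries=$(printf '%s\t0\t%s\t%s\n' \
    CIS 262 Asia \
    Canada 24 'North America' \
    China 866 Asia \
    USA 219 'North America' \
    Brazil 116 'South America' \
    Australia 14 Australia \
    India 637 Asia \
    Argentina 26 'South America' \
    Sudan 19 Africa \
    Algeria 18 Africa)

# Stage 1 (triples): build continent:country:population lines and sort them.
# Stage 2 (format): print each continent once, then its countries.
report=$(printf '%s\n' "$countries" |
awk 'BEGIN { FS = "\t" }
           { pop[$4 ":" $1] += $3 }
     END   { for (cc in pop)
                 print cc ":" pop[cc] | "sort -t: -k1,1 -k3nr" }' |
awk 'BEGIN { FS = ":" }
     {
         if ($1 != prev) {       # control break: continent changed
             print "\n" $1 ":"
             prev = $1
         }
         printf "\t\t%-10s %6d\n", $2, $3
     }')

printf '%s\n' "$report"
```

Keeping the sort inside the first awk program, as the original triples does, means the second stage only ever sees lines that are already grouped by continent, which is what the control-break test on prev relies on.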



© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003