Both this chapter and the previous chapter (A Library of awk Functions) present a large number of awk programs.
If you want to experiment with these programs, it is tedious to have to type
them in by hand. Here we present a program that can extract parts of a
Texinfo input file into separate files.
This Web page is written in Texinfo, the GNU project's document formatting language. A single Texinfo source file can be used to produce both printed and online documentation. Texinfo is fully documented in the book Texinfo--The GNU Documentation Format, available from the Free Software Foundation.
For our purposes, it is enough to know three things about Texinfo input files:
- The "at" symbol (@) is special in Texinfo, much as the backslash (\) is in C or awk. Literal @ symbols are represented in Texinfo source files as @@.
- Comments start with either @c or @comment. The file-extraction program works by using special comments that start at the beginning of a line.
- Lines containing @group and @end group commands bracket example text that should not be split across a page boundary. (Unfortunately, TeX isn't always smart enough to do things exactly right, and we have to give it some help.)
The following program, extract.awk, reads through a Texinfo source file and does two things, based on the special comments. Upon seeing @c system ..., it runs a command, by extracting the command text from the control line and passing it on to the system function (see Input/Output Functions). Upon seeing @c file filename, each subsequent line is sent to the file filename, until @c endfile is encountered. The rules in extract.awk match either @c or @comment by letting the omment part be optional. Lines containing @group and @end group are simply removed. extract.awk uses the join library function (see Merging an Array into a String).
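For readers who do not have the library at hand, the following is a minimal sketch of a join function with the behavior extract.awk relies on: an empty separator means a single space, and passing SUBSEP means "use no separator at all". The shell wrapper join_demo and the sample data are hypothetical and exist only so the sketch can be run directly; the library's own version should be preferred when available.

```shell
# Sketch only: join_demo is a hypothetical wrapper around a stand-in
# for the join library function.
join_demo() {
    awk '
    # Join array[start..end] into one string, separated by sep.
    # An empty sep means a single space; sep == SUBSEP means no separator.
    function join(array, start, end, sep,    result, i)
    {
        if (sep == "")
            sep = " "
        else if (sep == SUBSEP)
            sep = ""
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }

    BEGIN {
        n = split("BEGIN :{ print :}", parts, ":")
        print join(parts, 1, n, SUBSEP)   # pieces glued with nothing between
        print join(parts, 1, n, "-")      # pieces glued with "-"
    }'
}

join_demo
```

The first line printed is the three pieces run together with nothing between them, which is the behavior extract.awk needs when it reassembles a line.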
The example programs in the online Texinfo source for GAWK: Effective AWK Programming (gawk.texi) have all been bracketed inside file and endfile lines. The gawk distribution uses a copy of extract.awk to extract the sample programs and install many of them in a standard directory where gawk can find them.
The Texinfo file looks something like this:
    ...
    This program has a @code{BEGIN} rule,
    that prints a nice message:

    @example
    @c file examples/messages.awk
    BEGIN @{ print "Don't panic!" @}
    @c endfile
    @end example

    It also prints some final advice:

    @example
    @c file examples/messages.awk
    END @{ print "Always avoid bored archeologists!" @}
    @c endfile
    @end example
    ...
extract.awk begins by setting IGNORECASE to one, so that mixed upper- and lowercase letters in the directives won't matter.
The first rule handles calling system, checking that a command is given (NF is at least three) and also checking that the command exits with a zero exit status, signifying OK:
    # extract.awk --- extract files and run programs
    #                 from texinfo files

    BEGIN    { IGNORECASE = 1 }

    /^@c(omment)?[ \t]+system/    \
    {
        if (NF < 3) {
            e = (FILENAME ":" FNR)
            e = (e ": badly formed `system' line")
            print e > "/dev/stderr"
            next
        }
        $1 = ""
        $2 = ""
        stat = system($0)
        if (stat != 0) {
            e = (FILENAME ":" FNR)
            e = (e ": warning: system returned " stat)
            print e > "/dev/stderr"
        }
    }
The variable e is used so that the rule fits nicely on the screen.
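The way the first rule isolates the command text may be easier to see in a small experiment. The sketch below uses a hypothetical helper, show_command, that applies the same pattern and the same field assignments but prints the resulting record in brackets instead of passing it to system. The sample directive lines are made up for illustration.

```shell
# Sketch only: show_command is a hypothetical helper that prints what
# would be handed to system(), rather than running it.
show_command() {
    awk '
    /^@c(omment)?[ \t]+system/ {
        if (NF < 3) {
            print "badly formed system line"
            next
        }
        $1 = ""        # blank out "@c" or "@comment"
        $2 = ""        # blank out "system"
        print "[" $0 "]"
    }'
}

printf '%s\n' '@c system echo   hello' '@comment system' | show_command
```

Assigning to $1 and $2 makes awk rebuild $0 from its fields joined by OFS, so the first line prints as `[  echo hello]`: the two emptied fields leave two leading spaces, and the run of spaces inside the command collapses to one. The second line has only two fields and is reported as badly formed.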
The second rule handles moving data into files. It verifies that a
file name is given in the directive. If the file named is not the
current file, then the current file is closed. Keeping the current file
open until a new file is encountered allows the use of the > redirection for printing the contents, keeping open file management simple.
The for loop does the work. It reads lines using getline (see Explicit Input with getline).
For an unexpected end of file, it calls the unexpected_eof function. If the line is an "endfile" line, then it breaks out of the loop.
If the line is an @group or @end group line, then it ignores it and goes on to the next line.
Similarly, comments within examples are also ignored.
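The order of these tests matters, because an "endfile" line is itself a comment line. A small sketch, using a hypothetical classify helper and made-up input lines, applies the same three regular expressions in the same order and reports what the loop would do with each line:

```shell
# Sketch only: classify is a hypothetical helper that names the action
# the extraction loop would take for each input line.
classify() {
    awk '{
        if ($0 ~ /^@c(omment)?[ \t]+endfile/)
            print "endfile"          # leave the loop
        else if ($0 ~ /^@(end[ \t]+)?group/)
            print "skip group"       # @group or @end group
        else if ($0 ~ /^@c(omment)?[ \t]+/)
            print "skip comment"     # any other comment
        else
            print "copy"             # ordinary program text
    }'
}

printf '%s\n' '@group' 'x = 1' '@c a remark' '@end group' \
    '@comment endfile' | classify
```

If the general comment test came first, the last input line would be skipped as a comment and the loop would never see the end of the file section.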
Most of the work is in the following few lines. If the line has no @ symbols, the program can print it directly.
Otherwise, each leading @ must be stripped off.
To remove the @ symbols, the line is split into separate elements of the array a, using the split function (see String Manipulation Functions).
The @ symbol is used as the separator character.
Each element of a that is empty indicates two successive @ symbols in the original line. For each two empty elements (@@ in the original file), we have to add a single @ symbol back in.
When the processing of the array is finished, join is called with the value of SUBSEP, which tells it to use no separator, to rejoin the pieces back into a single line. That line is then printed to the output file:
    /^@c(omment)?[ \t]+file/    \
    {
        if (NF != 3) {
            e = (FILENAME ":" FNR ": badly formed `file' line")
            print e > "/dev/stderr"
            next
        }
        if ($3 != curfile) {
            if (curfile != "")
                close(curfile)
            curfile = $3
        }

        for (;;) {
            if ((getline line) <= 0)
                unexpected_eof()
            if (line ~ /^@c(omment)?[ \t]+endfile/)
                break
            else if (line ~ /^@(end[ \t]+)?group/)
                continue
            else if (line ~ /^@c(omment)?[ \t]+/)
                continue
            if (index(line, "@") == 0) {
                print line > curfile
                continue
            }
            n = split(line, a, "@")
            # if a[1] == "", means leading @,
            # don't add one back in.
            for (i = 2; i <= n; i++) {
                if (a[i] == "") { # was an @@
                    a[i] = "@"
                    if (a[i+1] == "")
                        i++
                }
            }
            print join(a, 1, n, SUBSEP) > curfile
        }
    }
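The @ handling can be tried in isolation. The following sketch wraps the same split-and-patch loop in a hypothetical unescape helper that reads lines from standard input. To stay self-contained it glues the pieces back together with a plain concatenation loop, which is assumed here to be equivalent to calling join with SUBSEP; the sample line is made up.

```shell
# Sketch only: unescape is a hypothetical helper that applies the
# @-stripping step to every input line.
unescape() {
    awk '{
        if (index($0, "@") == 0) {    # no @ at all: print unchanged
            print
            next
        }
        n = split($0, a, "@")
        for (i = 2; i <= n; i++) {
            if (a[i] == "") {         # empty piece: there was an @@
                a[i] = "@"
                if (a[i+1] == "")
                    i++
            }
        }
        # Stand-in for join(a, 1, n, SUBSEP): concatenate the pieces.
        out = ""
        for (i = 1; i <= n; i++)
            out = out a[i]
        print out
    }'
}

printf '%s\n' 'BEGIN @{ print "user@@host" @}' | unescape
```

For the sample input this prints `BEGIN { print "user@host" }`: the single @ characters in front of the braces disappear, and the doubled @@ becomes one literal @.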
An important thing to note is the use of the > redirection.
Output done with > only opens the file once; it stays open and subsequent output is appended to the file (see Redirecting Output of print and printf).
This makes it easy to mix program text and explanatory prose for the same
sample source file (as has been done here!) without any hassle. The file is
only closed when a new data file name is encountered or at the end of the
input file.
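This behavior of > inside a single awk run can be checked with a short experiment. The sketch below uses a hypothetical helper, redirect_demo, that writes two lines to the file named by its argument with two separate print statements; the file name and the text written are illustrative only.

```shell
# Sketch only: redirect_demo is a hypothetical helper showing that two
# print statements using > on the same file accumulate output.
redirect_demo() {
    awk -v out="$1" '
    BEGIN {
        print "BEGIN { print 1 }" > out   # first use opens the file
        print "END { print 2 }"   > out   # file is still open: appended
        close(out)
    }'
}

tmp=$(mktemp)
redirect_demo "$tmp"
cat "$tmp"
rm -f "$tmp"
```

Both lines end up in the file. The file is emptied only at the moment it is opened, so output is lost only if a file is closed and then opened again with > later in the same run.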
Finally, the function unexpected_eof prints an appropriate error message and then exits.
The END rule handles the final cleanup, closing the open file:
    function unexpected_eof()
    {
        printf("%s:%d: unexpected EOF or error\n",
            FILENAME, FNR) > "/dev/stderr"
        exit 1
    }

    END {
        if (curfile)
            close(curfile)
    }
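One detail worth seeing in action is that an exit statement executed inside a function still causes the END rule to run, so the cleanup happens even after an unexpected end of file. The sketch below is a cut-down illustration, not the full program: eof_demo is a hypothetical helper, the messages go to standard output rather than /dev/stderr so they are easy to inspect, and the input deliberately lacks an endfile line.

```shell
# Sketch only: eof_demo is a hypothetical helper that feeds a truncated
# file section to a reduced version of the extraction loop.
eof_demo() {
    printf '%s\n' '@c file demo.awk' 'BEGIN { }' | awk '
    function unexpected_eof()
    {
        print "unexpected EOF or error"
        exit 1
    }

    /^@c(omment)?[ \t]+file/ {
        for (;;) {
            if ((getline line) <= 0)
                unexpected_eof()
            if (line ~ /^@c(omment)?[ \t]+endfile/)
                break
        }
    }

    END { print "END rule ran" }'
}

eof_demo || echo "exit status: $?"
```

The run prints the error message, then the line from the END rule, and awk finishes with exit status 1.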