1. sed.js

GNU sed stream editor compiled to JavaScript

May be invoked with the following command-line options:

--version
Print out the version of sed that is being run and a copyright notice, then exit.
--help
Print a usage message briefly summarizing these command-line options and the bug-reporting address, then exit.
-n --quiet --silent
By default, sed prints out the pattern space at the end of each cycle through the script. These options disable this automatic printing, and sed only produces output when explicitly told to via the p command.
-e script --expression=script
Add the commands in script to the set of commands to be run while processing the input.
--posix
GNU sed includes several extensions to POSIX sed. In order to simplify writing portable scripts, this option disables all the extensions that this manual documents, including additional commands. Most of the extensions accept sed programs that are outside the syntax mandated by POSIX, but some of them (such as the behavior of the N command described in Reporting Bugs) actually violate the standard. If you want to disable only the latter kind of extension, you can set the POSIXLY_CORRECT variable to a non-empty value.
-r --regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps are those that egrep accepts; they can be clearer because they usually have fewer backslashes, but are a GNU extension and hence scripts that use them are not portable. See Extended regular expressions.

If no -e --expression options are given on the command-line, then the first non-option argument on the command line is taken to be the script to be executed.
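The options above can be tried with a short sketch like the following. It assumes a GNU-compatible sed is available on the command line as `sed` (how sed.js itself is launched is not shown here), and the sample input is made up for illustration.

```shell
# Hypothetical three-line sample input.
input=$(printf 'one\ntwo\nthree')

# -n suppresses automatic printing; only the explicit p on line 2 produces output.
quiet=$(printf '%s\n' "$input" | sed -n '2p')

# With no -e option, the first non-option argument is taken as the script.
bare=$(printf '%s\n' "$input" | sed 's/three/3/')

# -r switches to extended regexps, so + needs no backslash.
extended=$(printf 'aaa\n' | sed -r 's/a+/X/')

printf '%s\n' "$quiet" "$bare" "$extended"
```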

2. Programs

A sed program consists of one or more sed commands, passed in by one or more of the -e --expression options, or the first non-option argument if zero of these options are used.

Commands within a script can be separated by semicolons (;) or newlines (ASCII 10). Some commands, due to their syntax, cannot be followed by semicolons working as command separators and thus should be terminated with newlines or be placed at the end of a script. Commands can also be preceded with optional non-significant whitespace characters.

Each sed command consists of an optional address or address range, followed by a one-character command name and any additional command-specific code.

sed maintains two data buffers: the active pattern space, and the auxiliary hold space. Both are initially empty.

sed operates by performing the following cycle on each line of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space. Then commands are executed; each command can have an address associated to it: addresses are a kind of condition code, and a command is only executed if the condition is verified before the command is to be executed.

When the end of the script is reached, unless the -n option is in use, the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed. Then the next cycle starts for the next input line.

Unless special commands (like ‘D’) are used, the pattern space is deleted between two cycles. The hold space, on the other hand, keeps its data between cycles (see commands ‘h’, ‘H’, ‘x’, ‘g’, ‘G’ to move data between both buffers).
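As a rough illustration of the two buffers, the sketch below relies only on the behavior described above: the hold space starts empty and keeps its contents between cycles. It assumes a GNU-compatible `sed` on the command line; the input lines are hypothetical.

```shell
# G appends a newline plus the hold space to the pattern space. The hold space
# is initially empty, so every line is followed by a blank line.
spaced=$(printf 'a\nb\n' | sed 'G')

# The hold space survives between cycles: line 1 is saved with h, then
# appended to line 2 with G, so the two lines come out in reverse order.
swapped=$(printf 'a\nb\n' | sed -n '1h;2{G;p}')

printf '%s\n' "$spaced" "$swapped"
```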

2.1. Selecting lines

Addresses in a sed script can be in any of the following forms:

number
Specifying a line number will match only that line in the input.
first~step
This GNU extension matches every stepth line starting with line first. In particular, lines will be selected when there exists a non-negative n such that the current line-number equals first + (n * step). Thus, to select the odd-numbered lines, one would use 1~2; to pick every third line starting with the second, 2~3 would be used; to pick every fifth line starting with the tenth, use 10~5; and 50~0 is just an obscure way of saying 50.
$
This address matches the last line of input.
/regexp/
This will select any line which matches the regular expression regexp. If regexp itself includes any / characters, each must be escaped by a backslash (\).
\%regexp%
(The % may be replaced by any other single character.) This also matches the regular expression regexp, but allows one to use a different delimiter than /. This is particularly useful if the regexp itself contains a lot of slashes, since it avoids the tedious escaping of every /. If regexp itself includes any delimiter characters, each must be escaped by a backslash (\).
/regexp/I \%regexp%I
The I modifier to regular-expression matching is a GNU extension which causes the regexp to be matched in a case-insensitive manner.
/regexp/M \%regexp%M
The M modifier to regular-expression matching is a GNU sed extension which causes ^ and $ to match respectively (in addition to the normal behavior) the empty string after a newline, and the empty string before a newline. There are special character sequences (\` and \') which always match the beginning or the end of the buffer. M stands for multi-line.

If no addresses are given, then all lines are matched; if one address is given, then only lines matching that address are matched.
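A few of the single-address forms listed above, as a sketch. It assumes a GNU-compatible `sed` (the I modifier is a GNU extension), and the input lines are invented for the example.

```shell
# Hypothetical four-line sample input.
input=$(printf 'alpha\nusr/bin\nBeta\ngamma')

# A line number selects exactly that line.
third=$(printf '%s\n' "$input" | sed -n '3p')

# $ selects the last line of input.
last=$(printf '%s\n' "$input" | sed -n '$p')

# The I modifier makes the regexp case-insensitive.
nocase=$(printf '%s\n' "$input" | sed -n '/beta/Ip')

# \%regexp% uses % as the delimiter, so the / needs no escaping.
slash=$(printf '%s\n' "$input" | sed -n '\%usr/bin%p')

printf '%s\n' "$third" "$last" "$nocase" "$slash"
```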

An address range can be specified by specifying two addresses separated by a comma (,). An address range matches lines starting from where the first address matches, and continues until the second address matches (inclusively).

If the second address is a regexp, then checking for the ending match will start with the line following the line which matched the first address: a range will always span at least two lines (except of course if the input stream ends).

If the second address is a number less than (or equal to) the line matching the first address, then only the one line is matched.
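The range rules above can be checked with a small sketch such as this one. It assumes a GNU-compatible `sed`, and the sample lines are hypothetical.

```shell
# Hypothetical six-line sample input.
input=$(printf 'a\nSTART\nb\nEND\nc\nd')

# Numeric range: lines 2 through 4, inclusive.
numeric=$(printf '%s\n' "$input" | sed -n '2,4p')

# Regexp range: from the line matching START through the line matching END.
marked=$(printf '%s\n' "$input" | sed -n '/START/,/END/p')

# A second address that is a number not greater than the first: one line only.
single=$(printf '%s\n' "$input" | sed -n '3,1p')

# A regexp end address is first checked on the line after the start line,
# so the range spans two lines here even though line 1 matches /x/ itself.
twolines=$(printf 'x\nx\ny\n' | sed -n '/x/,/x/p')

printf '%s\n' "$numeric" "$marked" "$single" "$twolines"
```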

GNU sed also supports some special two-address forms; all these are GNU extensions:

0,/regexp/
A line number of 0 can be used in an address specification like 0,/regexp/ so that sed will try to match regexp in the first input line too. In other words, 0,/regexp/ is similar to 1,/regexp/, except that if addr2 matches the very first line of input the 0,/regexp/ form will consider it to end the range, whereas the 1,/regexp/ form will match the beginning of its range and hence make the range span up to the second occurrence of the regular expression.

Note that this is the only place where the 0 address makes sense; there is no 0-th line and commands which are given the 0 address in any other way will give an error.
addr1,+N
Matches addr1 and the N lines following addr1.
addr1,~N
Matches addr1 and the lines following addr1 until the next line whose input line number is a multiple of N.
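
The special two-address forms might be exercised as follows. This is a sketch that assumes GNU sed, since all three forms are GNU extensions; the inputs are hypothetical.

```shell
# 0,/regexp/ lets the regexp end the range on the very first line.
zero=$(printf 'x\ny\nx\nz\n' | sed -n '0,/x/p')

# 1,/regexp/ starts on line 1 and looks for the end from line 2 onward.
one=$(printf 'x\ny\nx\nz\n' | sed -n '1,/x/p')

# Hypothetical numbered input, one number per line.
nums=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# addr1,+N: line 2 and the 2 lines that follow it.
plus=$(printf '%s\n' "$nums" | sed -n '2,+2p')

# addr1,~N: line 5 up to the next line whose number is a multiple of 4.
tilde=$(printf '%s\n' "$nums" | sed -n '5,~4p')

printf '%s\n' "$zero" "$one" "$plus" "$tilde"
```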

Appending the ! character to the end of an address specification negates the sense of the match. That is, if the ! character follows an address range, then only lines which do not match the address range will be selected. This also works for singleton addresses, and, perhaps perversely, for the null address.
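Negation can be sketched like this, assuming a GNU-compatible `sed` and made-up input lines.

```shell
# Print every line except line 2.
not_two=$(printf 'a\nb\nc\n' | sed -n '2!p')

# Delete every line outside the range 2,3, keeping only lines 2 and 3.
keep_range=$(printf 'a\nb\nc\nd\n' | sed '2,3!d')

printf '%s\n' "$not_two" "$keep_range"
```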

2.2. Regular Expression Syntax

To know how to use sed, people should understand regular expressions (regexp for short). A regular expression is a pattern that is matched against a subject string from left to right. Most characters are ordinary: they stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern

The quick brown fox

matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of special characters, which do not stand for themselves but instead are interpreted in some special way. Here is a brief description of regular expression syntax as used in sed.

char
A single ordinary character matches itself.
*
Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by \, a ., a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by *; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this and portable scripts should instead use \* in these contexts.
\+
As *, but matches one or more. It is a GNU extension.
\?
As *, but only matches zero or one. It is a GNU extension.
\{i\}
As *, but matches exactly i sequences (i is a decimal integer; for portability, keep it between 0 and 255 inclusive).
\{i,j\}
Matches between i and j, inclusive, sequences.
\{i,\}
Matches more than or equal to i sequences.
\(regexp\)
Groups the inner regexp as a whole. This is used to:
- Apply postfix operators, like \(abcd\)*: this will search for zero or more whole sequences of ‘abcd’, while abcd* would search for ‘abc’ followed by zero or more occurrences of ‘d’. Note that support for \(abcd\)* is required by POSIX 1003.1-2001, but many non-GNU implementations do not support it and hence it is not universally portable.
- Use back references (see below).
.
Matches any character, including newline.
^
Matches the null string at beginning of the pattern space, i.e. what appears after the circumflex must appear at the beginning of the pattern space.

In most scripts, pattern space is initialized to the content of each line (see How sed works). So, it is a useful simplification to think of ^#include as matching only lines where ‘#include’ is the first thing on line—if there are spaces before, for example, the match fails. This simplification is valid as long as the original content of pattern space is not modified, for example with an s command.

^ acts as a special character only at the beginning of the regular expression or subexpression (that is, after \( or \|). Portable scripts should avoid ^ at the beginning of a subexpression, though, as POSIX allows implementations that treat ^ as an ordinary character in that context.
$
It is the same as ^, but refers to end of pattern space. $ also acts as a special character only at the end of the regular expression or subexpression (that is, before \) or \|), and its use at the end of a subexpression is not portable.
[list] [^list]
Matches any single character in list: for example, [aeiou] matches all vowels. A list may include sequences like char1-char2, which matches any character between (inclusive) char1 and char2.

A leading ^ reverses the meaning of list, so that it matches any single character not in list. To include ] in the list, make it the first character (after the ^ if needed), to include - in the list, make it the first or last; to include ^ put it after the first character.

The characters $, *, ., [, and \ are normally not special within list. For example, [\*] matches either ‘\’ or ‘*’, because the \ is not special here. However, strings like [.ch.], [=a=], and [:space:] are special within list and represent collating symbols, equivalence classes, and character classes, respectively, and [ is therefore special within list when it is followed by ., =, or :. Also, when not in POSIXLY_CORRECT mode, special escapes like \n and \t are recognized within list.
regexp1\|regexp2
Matches either regexp1 or regexp2. Use parentheses to use complex alternative regular expressions. The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. It is a GNU extension.
regexp1regexp2
Matches the concatenation of regexp1 and regexp2. Concatenation binds more tightly than \|, ^, and $, but less tightly than the other regular expression operators.
\digit
Matches the digit-th \(...\) parenthesized subexpression in the regular expression. This is called a back reference. Subexpressions are implicitly numbered by counting occurrences of \( left-to-right.
\n
Matches the newline character.
\char
Matches char, where char is one of $, *, ., [, \, or ^. Note that the only C-like backslash sequences that you can portably assume to be interpreted are \n and \\; in particular \t is not portable, and matches a ‘t’ under most implementations of sed, rather than a tab character.

Note that the regular expression matcher is greedy, i.e., matches are attempted from left to right and, if two or more matches are possible starting at the same character, it selects the longest.
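
Some of the operators described in this section, shown as a sketch against invented input. It assumes a GNU-compatible `sed`, since \+ is a GNU extension.

```shell
# \+ matches one or more; the matcher takes the longest run of a's.
plus=$(printf 'aaa bbb\n' | sed 's/a\+/X/')

# \(abcd\)* repeats the whole group, not just the final d.
group=$(printf 'abcdabcdd\n' | sed 's/\(abcd\)*/X/')

# \1 inside the regexp is a back reference: select lines with a doubled character.
doubled=$(printf 'hello\nworld\n' | sed -n '/\(.\)\1/p')

# Greedy matching: .* extends as far as it can before the final literal dot.
greedy=$(printf 'a.b.c\n' | sed 's/.*\./X/')

printf '%s\n' "$plus" "$group" "$doubled" "$greedy"
```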

Default Output

GIST | sed sends its results to the screen by default

Printing Lines

GIST | sed has printed each line twice now. This is because it automatically prints each line, and then we've told it to print explicitly with the p command.

GIST | We can clean up the results by passing the -n option to sed, which suppresses the automatic printing
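
The commands behind the three steps above are not reproduced in this text, so the following is only a sketch of what they could look like. It assumes a GNU-compatible `sed` and uses hypothetical two-line input.

```shell
# An empty script: sed just prints each line once (automatic printing).
default=$(printf 'a\nb\n' | sed '')

# p prints each line explicitly, in addition to the automatic print.
twice=$(printf 'a\nb\n' | sed 'p')

# -n turns off automatic printing, leaving only the explicit p.
once=$(printf 'a\nb\n' | sed -n 'p')

printf '%s\n' "$default" "$twice" "$once"
```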

Address Ranges

GIST | Let's modify the output by only having sed print the first line.

GIST | We've just given an address range to sed. If we give sed an address, it will only perform the commands that follow on those lines. In this example, we've told sed to print line 1 through line 5.

GIST | This will result in the same output, because we've told sed to start at line 1 and then operate on the next 4 lines as well.

GIST | If we want to print every other line, we can specify the interval after the ~ character. The following line will print every other line starting with line 1.
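
A sketch of the address forms walked through in this section, run against hypothetical input (seven lines named l1 to l7). It assumes GNU sed, because addr,+N and first~step are GNU extensions.

```shell
# Hypothetical seven-line sample input.
input=$(printf '%s\n' l1 l2 l3 l4 l5 l6 l7)

# Print only the first line.
first=$(printf '%s\n' "$input" | sed -n '1p')

# Print lines 1 through 5.
range=$(printf '%s\n' "$input" | sed -n '1,5p')

# Same lines: start at line 1 and include the next 4 lines.
offset=$(printf '%s\n' "$input" | sed -n '1,+4p')

# Every other line, starting with line 1.
odd=$(printf '%s\n' "$input" | sed -n '1~2p')

printf '%s\n' "$first" "$range" "$offset" "$odd"
```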

Deleting Text

GIST | We can easily perform text deletion where we previously were specifying text printing by changing the p command to the d command. We no longer need the -n option because with the delete command, sed will print everything that is not deleted, which will help us see what's going on. We can modify the last command from the previous section to make it delete every other line starting with the first. The result is that we should be given every line we were not given last time.
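
Deleting every other line could look like the sketch below. It assumes GNU sed (first~step is a GNU extension) and hypothetical input.

```shell
# Hypothetical six-line sample input.
input=$(printf '%s\n' l1 l2 l3 l4 l5 l6)

# Delete lines 1, 3, 5; without -n the remaining lines are printed automatically.
even=$(printf '%s\n' "$input" | sed '1~2d')

printf '%s\n' "$even"
```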

Substituting Text

GIST | Now let's substitute the expression "o" with "@"

GIST | To make sed replace every instance of "o" instead of just the first on each line, we can pass an optional g flag to the substitute command.

GIST | If we only wanted to change the second instance of "o" that sed finds on each line, then we could use the number 2 instead of the g.

GIST | If we only want to see which lines were substituted, we can use the -n option again to suppress automatic printing. We can then pass the p flag to the substitute command to print lines where substitution took place.

GIST | If we want the search process to ignore case, we can pass it the i flag.
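
The substitute flags discussed above, as a sketch on invented input. It assumes GNU sed; the i flag in particular is a GNU extension.

```shell
# Hypothetical two-line sample input.
input=$(printf 'foo boo\nOoh')

# No flag: only the first "o" on each line is replaced.
first=$(printf '%s\n' "$input" | sed 's/o/@/')

# g: every "o" on each line is replaced.
all=$(printf '%s\n' "$input" | sed 's/o/@/g')

# 2: only the second "o" on each line; "Ooh" has a single lowercase o, so it is unchanged.
second=$(printf '%s\n' "$input" | sed 's/o/@/2')

# -n plus the p flag: show only the lines where a substitution happened.
changed=$(printf '%s\n' "$input" | sed -n 's/o/@/2p')

# i: ignore case, so the capital "O" is now the first match on the second line.
nocase=$(printf '%s\n' "$input" | sed 's/o/@/i')

printf '%s\n' "$first" "$all" "$second" "$changed" "$nocase"
```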

Referencing Matched Text

GIST | If we wish to find more complex patterns with regular expressions, we have a number of different methods of referencing the matched pattern in the replacement text. For instance, to match from the beginning of the line to "on", we can use the following expression.

GIST | Since you don't know the exact phrase that will match in the search string, you can use the "&" character to represent the matched text in the replacement string. This example shows how to put parentheses around the matched text.

GIST | A more flexible way of referencing matched text is to use escaped parentheses to group sections of matched text. Every group of search text marked with parentheses can be referenced by an escaped reference number. For instance, the first parentheses group can be referenced with "\1", the second with "\2" and so on. In this example, we'll switch the first two words of each line.

GIST | As you can see, previous results are not perfect. For instance, the second line skips the first word because it has a character not listed in our character set. Similarly, it treated "they'll" as two words in the fifth line. Let's improve the regular expression to be more accurate.
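
The sample file used in this section is not included here, so the sketch below uses its own hypothetical lines to show "&" and numbered groups. It assumes a GNU-compatible `sed`, and the "improved" pattern is one possible choice rather than the original one.

```shell
# & stands for the whole match; .* is greedy, so the match runs to the last "on".
wrapped=$(printf 'the sun shone on me\n' | sed 's/^.*on/(&)/')

# Two groups of letters separated by a space, swapped with \2 \1.
swapped=$(printf 'hello world again\n' | sed 's/\([a-zA-Z]*\) \([a-zA-Z]*\)/\2 \1/')

# Treating a word as any run of non-space characters keeps "they'll" in one piece.
improved=$(printf "they'll be back\n" | sed 's/\([^ ][^ ]*\) \([^ ][^ ]*\)/\2 \1/')

printf '%s\n' "$wrapped" "$swapped" "$improved"
```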

Supplying Multiple Editing Sequences

GIST | We can string various commands to sed by using the "-e" option before each command.

GIST | Another approach to stringing commands together is using a semi-colon character (;) to separate distinct commands. This works the same as above, but the "-e" is not required.
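
Both ways of chaining commands, sketched on a made-up one-line input and assuming a GNU-compatible `sed`.

```shell
# One -e option per command.
with_e=$(printf 'a b\n' | sed -e 's/a/1/' -e 's/b/2/')

# The same two commands in a single script, separated by a semicolon.
with_semicolon=$(printf 'a b\n' | sed 's/a/1/;s/b/2/')

printf '%s\n' "$with_e" "$with_semicolon"
```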

Advanced Addressing

GIST | One of the advantages of sed's addressable commands is that regular expressions can be used as selection criteria. This means that we are not limited to operating on known line values, like we learned previously.

GIST | We can, instead, use regular expressions to match only lines that contain a certain pattern. We do this by placing our match pattern between two forward slashes (/) prior to giving the command strings.

GIST | This example demonstrates using regular expressions to generate addresses for other commands. This matches any blank lines (the start of a line followed immediately by the end of the line) and passes them to the delete command.

GIST | We can delete lines starting at a line that only contains the word "START" until a line reading "END" by issuing the following command.
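
Regular-expression addresses like the ones described in this section, as a sketch. The input is hypothetical and a GNU-compatible `sed` is assumed.

```shell
# Hypothetical sample input with a blank line and a START/END block.
input=$(printf 'keep\n\nSTART\ndrop\nEND\nalso keep')

# Print only lines that contain "keep".
matching=$(printf '%s\n' "$input" | sed -n '/keep/p')

# Delete blank lines: start of line immediately followed by end of line.
no_blank=$(printf '%s\n' "$input" | sed '/^$/d')

# Delete from the line reading exactly START through the line reading exactly END.
no_block=$(printf '%s\n' "$input" | sed '/^START$/,/^END$/d')

printf '%s\n' "$matching" "$no_blank" "$no_block"
```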

Using the Hold Buffer

One piece of functionality that increases sed's ability to perform multi-line aware edits is what is called the "hold buffer". The hold buffer is an area of temporary storage that can be modified by certain commands.

The presence of this extra buffer means that we can store lines while working on other lines, and then operate on each buffer as necessary.

The following are the commands that affect the holding buffer:

h: Copies the current pattern buffer (the line we've currently matched and are working on) into the holding buffer (this erases the previous contents of the hold buffer).
H: Appends the current pattern buffer to the end of the current holding buffer, separated by a new-line (\n) character.
g: Copies the current holding buffer into the current pattern buffer. The previous pattern buffer is erased.
G: Appends the current holding buffer to the end of the current pattern buffer, separated by a new-line (\n) character.
x: Swap the current pattern and holding buffers.

The contents of the holding buffer cannot be operated on until it is moved to the pattern buffer in one way or another.

GIST | This is a procedural example of how to join adjacent lines. (sed actually has a built-in command that would take care of a lot of this for us: the "N" command appends the next line to the current line. We are going to do things the hard way, though, for the sake of practice.)

The first thing to note is that the "-n" option is used to suppress automatic printing. Sed will only print when we specifically tell it to.

The first part of the script is "1~2h". The beginning is an address specification meaning to perform the subsequent operation on the first line, and then on every other line afterwards (each odd numbered line). The "h" part is the command to copy the matched line into the holding buffer.

The second half of the command is more complex. Again, it begins with an address specification. This time, it is referring to the even numbered lines (the opposite of the first command).

The rest of the command is enclosed in braces. This means that the rest of the commands will inherit the address that was just specified. Without the braces, only the "H" command would inherit the address, and the rest of the commands would be executed on every line.

The "H" command appends a new-line character, followed by the current pattern buffer, to the end of the current holding buffer.

The holding buffer (now an odd numbered line, followed by a new-line character, followed by an even numbered line) is then copied back into the pattern buffer (replacing the previous pattern buffer) with the "g" command.

Next, the new-line character is replaced with a space and the line is printed with the "p" command.
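
Putting the pieces of this walkthrough together, the script probably resembles the sketch below. The exact command is reconstructed from the description above, so treat it as an assumption; it needs GNU sed for the first~step addresses and uses hypothetical four-line input.

```shell
# Odd lines: copy into the hold buffer (h).
# Even lines: append to the hold buffer (H), copy it back (g),
# replace the embedded new-line with a space, and print (p).
joined=$(printf 'a\nb\nc\nd\n' | sed -n '1~2h;2~2{H;g;s/\n/ /;p}')

printf '%s\n' "$joined"
```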

GIST | If you are curious, using the "N" command, as we described above, would shorten this considerably. This command will produce the same results that we've just seen.
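
A sketch of the shorter "N" variant, again on hypothetical four-line input and assuming a GNU-compatible `sed`. An even number of lines is used so that every N has a next line to read.

```shell
# N appends the next input line to the pattern buffer, separated by a new-line;
# the substitution then turns that new-line into a space.
joined=$(printf 'a\nb\nc\nd\n' | sed 'N;s/\n/ /')

printf '%s\n' "$joined"
```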