Linux Text Processing: grep, awk, sed, sort, and Friends

The Unix Philosophy in One Pipeline

The Linux command line is built on a 50-year-old idea: small tools that do one thing well, connected by pipes that pass text streams between them. Master four utilities — grep, awk, sed, and sort — and you can transform almost any text data without writing a script. Add uniq, wc, cut, and tr for the supporting cast.

This article is a working reference for Linux text processing on the command line, with the option flags that matter and the pipeline patterns that come up in production work.

grep — Find Lines That Match

grep "ERROR" app.log                          # lines containing ERROR
grep -i "error" app.log                       # case-insensitive
grep -v "DEBUG" app.log                       # invert: lines NOT matching
grep -r "TODO" src/                           # recursive search in directory
grep -n "def main" *.py                       # show line numbers
grep -E "^(WARN|ERROR)" app.log               # extended regex (alternation)
grep -A 3 -B 1 "ERROR" app.log                # show 3 lines after, 1 before
grep -c "404" access.log                      # just the count
grep -l "TODO" *.py                           # filenames only
grep -L "TODO" *.py                           # filenames that DON'T match

For most modern work, prefer ripgrep (rg) if installed — it’s faster, respects .gitignore, and uses sane defaults. But every Linux system ships with grep, so know it cold.
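
As a rough mapping, assuming a stock rg install, the grep lines above translate along these lines (check rg --help for your version's exact defaults):

rg "ERROR" app.log                            # basic match; line numbers on by default in a terminal
rg -i "error" app.log                         # case-insensitive
rg -v "DEBUG" app.log                         # invert match
rg "TODO" src/                                # recursive, honors .gitignore
rg -l -t py "TODO"                            # filenames only, restricted to Python files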

awk — Field-Oriented Filtering

awk treats each input line as a record split into fields: $1 through $NF are the fields, $0 is the whole line, and NF is the number of fields. The default field separator is whitespace.

awk '{print $1}' access.log                   # first column
awk '{print $1, $7}' access.log               # IP and URL
awk -F: '{print $1}' /etc/passwd              # custom field separator
awk '$3 > 1000 {print $1}' data.txt           # filter by numeric condition
awk 'NR==1 || NR==10' file.txt                # rows 1 and 10
awk 'END {print NR}' file.txt                 # total line count
awk '/error/ {count++} END {print count}' log # count matching lines
awk -F'[,;]' '{print $2}' data.csv            # field separator as a regex: split on comma or semicolon

awk is a full programming language. For one-liners, the pattern condition { action } is the core. The default action is print $0; the default condition is true (every line).
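
A quick way to see those defaults in action; each line below behaves as its comment says (a minimal illustration):

awk 'NR <= 3' file.txt                        # no action given: defaults to {print $0} (same output as head -3)
awk 'NR <= 3 {print $0}' file.txt             # the fully explicit form of the line above
awk '{print}' file.txt                        # no condition given: runs on every line (like cat)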

sed — Stream Editor

sed 's/old/new/' file.txt                     # replace first occurrence per line
sed 's/old/new/g' file.txt                    # replace all occurrences
sed -i 's/old/new/g' file.txt                 # in-place edit (modifies file!)
sed -i.bak 's/old/new/g' file.txt              # in-place with backup
sed '/^#/d' config                            # delete comment lines
sed '5,10d' file.txt                          # delete lines 5-10
sed -n '5,10p' file.txt                       # print only lines 5-10
sed 's|/old/path|/new/path|g' file            # alternate delimiter (avoid escaping /)

Use sed -i with caution. It silently overwrites the file. Always test the substitution without -i first, OR use -i.bak so you have a backup if it goes wrong.
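
One cautious workflow, sketched against a hypothetical app.conf and assuming GNU sed:

sed 's/old/new/g' app.conf | head -20         # 1. preview the substitution on stdout
sed 's/old/new/g' app.conf | diff app.conf -  # 2. see exactly which lines would change
sed -i.bak 's/old/new/g' app.conf             # 3. apply, keeping app.conf.bak as a backup
diff app.conf.bak app.conf                    # 4. confirm the change, then delete the .bak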

sort and uniq

sort file.txt                                 # alphabetical
sort -n file.txt                              # numeric (so 100 comes after 9)
sort -r file.txt                              # reverse
sort -k2 file.txt                             # by second column
sort -k2 -n -r file.txt                       # second column, numeric, reverse
sort -t: -k3 -n /etc/passwd                   # delimiter colon, third field, numeric
sort -u file.txt                              # unique (also see uniq)
uniq file.txt                                 # remove adjacent duplicates
uniq -c file.txt                              # prefix each line with its occurrence count
uniq -d file.txt                              # show only duplicates
sort file.txt | uniq -c | sort -rn            # frequency-sort (the classic)

uniq only removes adjacent duplicates — you almost always want to sort first. The classic sort | uniq -c | sort -rn pipeline produces a frequency-ranked list, ubiquitous in log analysis.
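
A tiny demonstration of the adjacency rule, using printf as stand-in input:

printf 'a\nb\na\n' | uniq                     # prints a, b, a: the second "a" is not adjacent to the first
printf 'a\nb\na\n' | sort | uniq              # prints a, b: sort made the duplicates adjacent
printf 'a\nb\na\n' | sort | uniq -c           # prints 2 a, 1 b: the frequency pattern in miniature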

Counting and Slicing

wc -l file.txt                                # line count
wc -w file.txt                                # word count
wc -c file.txt                                # byte count
head -20 file.txt                             # first 20 lines (default 10)
tail -20 file.txt                             # last 20 lines
tail -f app.log                               # follow new appends (live)
tail -f app.log | grep ERROR                  # live filtered tail
cut -d: -f1 /etc/passwd                       # field 1, colon-delimited
cut -c1-10 file.txt                           # first 10 characters per line

tr — Character-Level Transforms

echo "hello" | tr 'a-z' 'A-Z'                 # uppercase
echo "a,b,c" | tr ',' '\n'                    # comma to newline
tr -d '\r' < dos.txt > unix.txt                # strip Windows CR characters
tr -s ' ' < file.txt                          # squeeze runs of spaces to one
tr -cd '[:print:]' < binary > ascii            # keep only printable chars (note: also strips newlines)

Putting It Together — Real Pipelines

Top 10 IPs hitting your nginx with 404s

grep " 404 " /var/log/nginx/access.log \
  | awk '{print $1}' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -10

Active SSH users sorted by login count

last \
  | awk '$1 != "" && $1 != "wtmp" {print $1}' \
  | sort \
  | uniq -c \
  | sort -rn

Find every file containing “TODO” modified in the last week

find . -mtime -7 -type f -exec grep -l "TODO" {} \;

Replace a config value across many files

find /etc/nginx -name "*.conf" \
  -exec sed -i.bak 's/listen 80;/listen 443 ssl;/g' {} \;

xargs — Feed Output as Arguments

find . -name "*.tmp" | xargs rm                # delete found files
find . -name "*.tmp" -print0 | xargs -0 rm    # null-separated for filenames with spaces
echo "file1 file2" | xargs touch              # create both files
ls *.log | xargs -I {} mv {} archive/         # placeholder substitution

Always pair find -print0 with xargs -0 when the input is a list of filenames. Filenames containing spaces, newlines, or other special characters break xargs's default whitespace-separated mode.
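
A sketch of the failure mode, with a hypothetical file whose name contains a space:

touch "old report.tmp"
find . -name "*.tmp" | xargs rm               # xargs splits on the space: rm gets "./old" and "report.tmp", both wrong
find . -name "*.tmp" -print0 | xargs -0 rm    # NUL-delimited: rm gets the single correct filename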

Common Pitfalls

  • grep regex vs literal. Use grep -F "1.2.3.4" for a fixed-string match (no regex interpretation of the dots). Otherwise each dot matches any character, so 1.2.3.4 also matches strings like 1a2b3c4 (see the sketch after this list).
  • sed -i destroys files silently. Test without -i first.
  • uniq needs sorted input. Pipe through sort first or it only removes adjacent duplicates.
  • awk fields reset on every line. Don’t expect $1 from one line to persist into the next.
  • tail -f doesn’t follow rotation. Use tail -F (capital) to handle log-rotated files.
  • xargs without -0 on filenames. One file with a space in the name and the whole pipeline misbehaves.
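
A quick illustration of the fixed-string pitfall, using echo as stand-in input:

echo "1a2b3c4" | grep "1.2.3.4"               # matches: each dot is a regex wildcard
echo "1a2b3c4" | grep -F "1.2.3.4"            # no match: the pattern is taken literally
echo "1.2.3.4" | grep -F "1.2.3.4"            # matches only the exact string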

Conclusion

Five compounding habits:

  1. Pipe everything. command | head -20 tests cheaply, then drop the head when you’re confident.
  2. Default to grep -i for log searches; letter case in logs varies unpredictably.
  3. Reach for awk when the data is column-oriented, sed for line-oriented edits, cut for the simplest field extraction.
  4. sort | uniq -c | sort -rn is your friend. Memorize it.
  5. Always test sed -i changes on a copy first, or use sed -i.bak.
