This page is primarily for researchers who work on the Linux/Unix command line. It first appeared on the lab website at Cornell University and is preserved here.
There is always more than one way to accomplish a computational task, but some solutions are better than others. For bioinformatics processing, we want solutions that are quick to implement and efficient enough to get the job done. You can always write a simple program, but for many tasks, Linux/Unix shell commands provide quick solutions without the need to explicitly open/parse files, set up data structures, etc. Instead, a one-liner command in your shell takes care of the overhead so that you write a small amount of code and move on.
The examples below illustrate some of the most useful shell commands for bioinformatics processing. The syntax is from bash, but most examples can easily be modified to work in other shells. This list is by no means exhaustive, and I hope to expand on it. Now, let’s look at some examples.
Update: see part 2 for further tricks.
Looping with for
Genomic datasets are often distributed with separate files for each chromosome, and often we need to perform the same operation on each one. To do this, we can use the shell’s for loop construct. If we had files called data_chr[chromosome].txt, where [chromosome] ranges from 1 to 22, and we wanted to run ./command on each of them, we would do:
for chr in {1..22}; do ./command data_chr$chr.txt; done
The above code loops over all the values between 1 and 22, setting the shell variable $chr to each value in succession. The syntax for specifying a range of numbers is {A..B}, which expands to the integers between A and B (note that both A < B and A > B are valid; if A > B, the numbers are assigned in descending order).
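For example, a descending range lists the chromosomes in reverse order:
for chr in {22..1}; do echo data_chr$chr.txt; done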
We can also use for to loop over file extensions. Suppose we had some PLINK format data called data.bed, data.bim, and data.fam. We could list these files individually by running:
for ext in bed bim fam; do ls data.$ext; done
Here the variable $ext (for extension) is set in succession to bed, bim, and fam. This particular example isn’t very useful, since we could have just typed ls data.* and seen that these three files exist. What is useful is being able to rename the base dataset with one line of code in the shell. If I wanted to rename these files human_data.bed, human_data.bim, and human_data.fam, I could write:
for ext in bed bim fam; do mv -i data.$ext human_data.$ext; done
For those who know bash well, there are some more elaborate ways of doing the above renaming, but for is easy to remember and does the job.
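For the curious, one such alternative uses bash’s pattern substitution, ${file/data/human_data}, which replaces the first occurrence of data in each file name (this sketch assumes the glob data.* matches only the three files we want to rename):
for file in data.*; do mv -i "$file" "${file/data/human_data}"; done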
Another useful for construct is to loop over groups of files. We can do the following to execute ./command on every file with a .txt extension in our current directory:
for file in *.txt; do ./command $file; done
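If ./command prints its results to stdout, a handy variation uses ${file%.txt} to strip the extension, so that each input gets its own output file (the .out extension here is just for illustration):
for file in *.txt; do ./command "$file" > "${file%.txt}.out"; done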
Extracting fields with awk
Many genomic file formats are text-based, with information about specific variables listed in columns in the file. Sometimes we need to extract only a specific variable (i.e., column or field) and feed it into some other command. We can do this with either awk or cut. We’ll start with awk. Suppose we wanted to know the physical positions of all the variants in the PLINK file data.bim. This is column number 4 in that file, with the first column being number 1 (i.e., awk and cut use one-based column indexes). To extract column number 4 from data.bim in awk, do:
awk '{print $4}' data.bim
This will print the positions to the terminal, which is usually not what we want, so we can print them to a file called data_positions.txt:
awk '{print $4}' data.bim > data_positions.txt
Or we can pipe the output to a program, such as ./command:
awk '{print $4}' data.bim | ./command
This will send the physical positions to stdin for ./command.
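To make the pipe concrete, we can substitute real commands for ./command; for example, this sorts the positions numerically and shows the first ten:
awk '{print $4}' data.bim | sort -n | head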
The syntax here may seem a bit cumbersome, but awk is very powerful: in fact, awk is a programming language. Everything inside ‘{ … }’ is the program, and it is run on every line of the input file. By default, awk splits each line into fields that are separated by white space (arbitrary numbers of spaces and/or tabs) and assigns them to special variables named $1, $2, $3, $4, … $N, where $N is the last field on the current line; it also stores the complete input line in $0. There are several built-in commands, including print. You can use print to produce lines with arbitrary text that includes the field values from the input file:
awk '{ print "SNP",$2,"is on chromosome",$1,"at position",$4 }' data.bim
If the first entry in the .bim file was for SNP rs16823551, with the first column (chromosome) being 1, and the fourth column being 3141616, this would print: “SNP rs16823551 is on chromosome 1 at position 3141616”, and similarly for every other line in the .bim file. Commas in the print command insert spaces by default, and are optional, so:
awk '{ print "chromosome" $1 }' data.bim
would print “chromosome1” for the .bim file line described above, with no space between the string and the $1 field value.
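The space that commas insert is awk’s output field separator, OFS, which we can change; for example, to print the SNP ids and positions separated by tabs instead:
awk 'BEGIN { OFS="\t" } { print $2,$4 }' data.bim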
We can also use awk to extract only lines with certain field values. For example, if we had a file called data.vcf and wanted to pull out lines with a FILTER value of “PASS”, knowing that the FILTER field is column 7 in a VCF file, we would do:
awk '$7 == "PASS" { print }' data.vcf
Here, we’ve put a boolean expression outside the curly braces. awk evaluates the boolean expression for each line and only executes the code in the curly braces on lines for which the expression is true. Note that a print command with no arguments simply prints the original line.
The boolean expressions that precede the code to run on each line can be arbitrarily complex. For example, if we wanted to only pull out lines from data.vcf with a FILTER value of “PASS” for chromosome 5, we would do:
awk '$1 == 5 && $7 == "PASS" { print }' data.vcf
Here, only lines with a chromosome (the first field in the VCF file) value of 5 and a FILTER value of “PASS” will be printed. Note that with no code at all, awk will simply print the lines for which the initial boolean expression is true, so an equivalent and shorter version of the previous command is:
awk '$1 == 5 && $7 == "PASS"' data.vcf
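Numeric comparisons work as well; for example, to additionally require a QUAL value (column 6 in a VCF file) above 30:
awk '$1 == 5 && $6 > 30 && $7 == "PASS"' data.vcf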
awk can be used with any delimiter, and the -F option specifies the delimiter. So, if we were dealing with a .csv (Comma Separated Values) file containing SNP ids in column 3 and we wanted to extract those ids, we would do:
awk -F, '{ print $3 }' data.csv
The -F, option tells awk to split fields on the ‘,’ character.
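The same option handles other delimiters; for example, for a tab-separated file (the name data.tsv is hypothetical):
awk -F'\t' '{ print $3 }' data.tsv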
Let’s finish the awk examples by extracting multiple fields from a file. Returning again to data.vcf, the first five columns of the file tell us about the variants in the dataset, with the remaining fields containing information about data quality and the data itself. If we wanted only the first five columns and nothing else, we could do the following:
awk '{ print $1,$2,$3,$4,$5 }' data.vcf
That’s a lot of $ signs and numbers and is a bit cumbersome. The good news is there’s a simpler way to do this.
Extracting fields with cut
The program cut can be used to extract ranges of fields from a file. So, continuing the above example, we could instead do:
cut -f1-5 data.vcf
That’s a lot shorter. The -f option to cut specifies the range of fields to extract and is fairly flexible. For example, if we wanted to pull out fields 1 through 5 and 10 through 20 of data.txt, we would do:
cut -f1-5,10-20 data.txt
Or if we wanted all the fields starting from column 6 to the end of each line, we would do:
cut -f6- data.txt
Note that cut assumes by default that fields are separated by a tab character. If the fields in data.txt were instead separated by a single space, we would do the following to get column 6 and higher:
cut -d' ' -f6- data.txt
This seems better than awk in every respect for simple column extraction, but there is a limitation: cut does not allow extraction of columns in a different order than they appear in the original file. The command cut -f5,2 data.txt does not print field 5 first; the fields come out in their original file order. Printing fields in a different order is possible in awk, however.
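For example, to print field 5 before field 2, we could do:
awk '{ print $5,$2 }' data.txt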
More advanced awk
As mentioned above, awk is very powerful, and we won’t go through most of its features, but let’s go through three more examples.
Suppose we have a file called data.txt and we want to break out the information corresponding to each chromosome into separate files. We might want to use a for loop to go through each chromosome number and a boolean expression in awk to pull out only the lines corresponding to that chromosome. Let’s begin with a program that doesn’t quite work and then fix it. If column 2 of data.txt contains the chromosome numbers, and we want separate files for each chromosome, we might think that the following would work (note again the default awk behavior of printing lines that match the given boolean expression):
for chr in {1..22}; do awk '$2 == $chr' data.txt > data_chr$chr.txt; done
This is almost correct, but the expression given to awk is '$2 == $chr', and awk doesn’t know anything about the shell variable $chr: the single quotes prevent the shell from expanding it, and awk instead treats chr as one of its own (uninitialized) variables. Instead of referencing a shell variable, we can explicitly assign variables for awk to use with the -v option:
for chr in {1..22}; do awk -v chr=$chr '$2 == chr' data.txt > data_chr$chr.txt; done
Not bad! We can provide awk with as many -v options as we want for all the variables we might care to assign:
for chr in {1..22}; do awk -v chr=$chr -v threshold=10 '$2 == chr && $4 > threshold' data.txt > data_chr$chr.txt; done
This breaks out each chromosome into a separate file and requires that the values in column 4 are greater than 10.
What if we have a file containing two columns with a count of “successes” and “attempts” for some process, each line collected from a different source, and we want to compute the average success rate over all the sources? We can use awk to sum each column and then print the average at the end using a special syntax. If the file data.txt has the successes and attempts in columns 2 and 3, respectively, we could do this:
awk 'BEGIN { total_successes = total_attempts = 0 } { total_successes += $2; total_attempts += $3 } END { print "Successes:", total_successes, "Attempts:", total_attempts, "Rate:", total_successes / total_attempts }' data.txt
awk executes the code in the BEGIN section before any lines of the input are processed, and the code in the END section after reading the entire input. The BEGIN section initializes two variables to 0, the main code section sums the columns on each line, and the END section prints the totals and the rate.
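In fact, awk initializes variables to 0 automatically, so the BEGIN section above is optional; for example, this computes the mean of column 2 with no BEGIN section at all (NR holds the number of records read, as discussed below):
awk '{ sum += $2 } END { print sum / NR }' data.txt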
The last example uses two of awk’s built-in variables. Sometimes we need to extract data but want to remove the first header line of a file, and, for files with variable numbers of fields, we may wish to pull out lines with some specific number of fields. If we have a file named data.txt and we want to omit the first line and only process lines with 4 fields, we can do the following:
awk 'NR > 1 && NF == 4' data.txt
The variable NR is the number of input records that have been processed: typically the current line number, since records are broken up by the end-of-line character by default. The variable NF is the number of fields in the current record (line); by default, fields are separated by white space. A complete list of awk’s built-in variables is available here.
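NF is also handy on its own: since $NF refers to the last field, we can print the final column of each line no matter how many fields it has:
awk '{ print $NF }' data.txt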
There is much more to awk and shell scripting, but this is a start.
More shell tricks are available in Shell tricks for one-liner bioinformatics: part 2.