This page is primarily for researchers who work using the Linux/Unix command line. It first appeared on the lab website at Cornell University and is preserved here.
This post gives more examples of GNU Linux/Unix shell commands applicable to bioinformatics processing. Here, we will go through examples of the tools grep, sort, uniq, wc, join, and combinations of these, including their use in performing set operations on two or more files. We also give a quick introduction to the screen utility. The examples below assume that the reader is familiar with part 1, which discusses for loops, awk, and cut. As before, we will use bash syntax; several examples are readily transferable to other shells.
Searching files with grep
Often data or code are contained in many files and we do not recall which file contains a string or pattern we are looking for. grep searches a file or files for a string or, more generally, for a regular expression. In the simplest case, we want to search for some string within a given file. For example, if we wanted to find the string “awk” in ~/.bash_history, we would do:
grep awk ~/.bash_history
The above would print out all lines in which the text “awk” occurs anywhere. If we instead wanted to search for “awk” in the files with names ending in .sh in the current directory, we would do:
grep awk *.sh
This would print all lines in which “awk” occurs, each preceded by the name of the file in which the line appears.
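When multiple files are searched, each matching line is prefixed with the file it came from, separated by a colon. With hypothetical scripts “filter.sh” and “counts.sh” in the current directory, the output might look something like:
filter.sh:awk '{print $2}' data.bim > ids.txt
counts.sh:awk 'END {print NR}' samples.ind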
grep has many features that are quite useful, especially its ability to search using regular expressions. This post does not cover regular expressions in detail, but they are very powerful and worth learning. We provide one example here. Suppose we ran a tool that produced many pieces of information about a dataset, with each line prefixed by the type of information that line contains. If that file is called “output” and we are currently interested in lines that begin with the string “TSTV”, we could use the following to print out only those lines:
grep "^TSTV" output
This is a regular expression: the ^ at the beginning of the search pattern indicates that the text that follows must match at the beginning of the string being searched. Since grep searches for matches within each successive line by default, this will match any line that starts with “TSTV”.
Sorting files with sort
The program sort sorts a file, as its name suggests. In the context of bioinformatics, sort can be used on its own to order a file based on a set of fields within it. For example, suppose the file “sites.txt” contains a set of sites that we wish to sort first by chromosome number and then, within each chromosome, by physical position. If column 2 contains the chromosome and column 3 the physical position, the following produces a file called “sites-sort.txt” that is sorted in this way:
sort -k 2n -k 3n sites.txt > sites-sort.txt
The -k option tells sort which column to sort by. By default, sort uses alphabetical sorting, in which the string “1000” comes before “11”; this is not what we want. We therefore use -k 2n, with the n indicating a numerical sort. If sort receives multiple -k options, it applies them in the given order. Thus, the above command sorts first on column 2 and then on column 3, both numerically.
Although bioinformatic data are often sorted before we analyze them, sort is essential to using uniq and join (each discussed below) and other shell tools such as comm.
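As a brief illustration of comm, given two already-sorted lists of marker ids (the filenames here are hypothetical), the following prints only the lines common to both files, since -1 and -2 suppress the lines unique to the first and second file, respectively:
comm -12 ids1.sorted.txt ids2.sorted.txt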
Getting unique lines and counting with uniq
By default, the uniq command prints the unique lines from sorted input. Run in this way, the command isn’t very useful for bioinformatics, but it has several options that are helpful. Suppose we have a file called “samples.ind” with a list of samples, and that column 3 gives the population of each sample. (This is the format of Eigenstrat .ind files.) If we want to get a count of the number of samples from each population, we can do:
awk '{print $3}' samples.ind | sort | uniq -c
Here, the -c option counts the number of copies of each unique line that is input to uniq, and we use awk to extract the populations, thereby getting counts of the number of samples from each. This example also shows piping, which was introduced in part 1, but here we conveniently pipe between multiple commands in order to avoid writing intermediate files.
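The output is the count followed by the population label, one line per population. For a hypothetical samples.ind, it might look something like:
  84 French
 103 Han
  59 Yoruba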
Set operations on files with uniq
Now suppose we have a set of markers in “data1.bim” and “data2.bim” with marker ids in column 2. We could determine the set of markers shared between these two files—assuming they use the same marker ids for the same variants—in the following way:
cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d > intersection
Here, we use cat to concatenate the two files, extract column 2 from this input, sort this column (the set of marker ids), and pass them to uniq -d. The -d option causes uniq to print a line only if it appears more than once. As long as the marker ids appear only once in each file, duplicates will only occur for markers present in both datasets.
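If we are unsure whether this assumption holds, we can check each file for duplicated marker ids; the following prints nothing when all the ids in “data1.bim” are unique:
awk '{print $2}' data1.bim | sort | uniq -d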
The above is an example of generating the intersection of the two marker sets. To perform set union, we can do the following:
cat data1.bim data2.bim | awk '{print $2}' | sort | uniq > union
We may wish to determine the set of markers that are in “data1.bim” but not in “data2.bim”; i.e., to perform set subtraction. We can do this using a two-step process: first compute the intersection, which we did above. With the intersection stored in a file called “intersection”, we can now determine the markers that are unique to “data1.bim” in the following way:
awk '{print $2}' data1.bim | cat - intersection | sort | uniq -u > data1-only
The above extracts the marker ids from “data1.bim”, concatenates them with the markers from the intersection, sorts these (necessary to use uniq), and then passes them to uniq -u. The -u option causes uniq to print only lines that appear exactly once. Since the marker ids in “intersection” are guaranteed to be in “data1.bim”, they will appear more than once in the input to uniq and will not be printed. The suppressed sites are exactly those that are also in “data2.bim”, so this produces the set subtraction we desire. (Note that for many shell utilities, ‘-’ represents stdin instead of a file, so the piped input from awk gets concatenated with “intersection” in the second part of the command.)
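By symmetry, the markers unique to “data2.bim” can be obtained the same way:
awk '{print $2}' data2.bim | cat - intersection | sort | uniq -u > data2-only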
Combining information across files using join
Consider a situation in which two files give information about the same set of samples. We may wish to combine these into one file. Here we can use join, which matches fields between two files, producing lines that combine all fields from the two lines that match between the input files.
This is best illustrated by example. Suppose we have files “BMI.txt” and “height.txt”, which both contain two fields: the sample id and the given phenotype (BMI and height for the two respective files). join requires that the input files be sorted on the field that is joined on, so we can do the following:
sort -k 1 BMI.txt > BMI.sort.txt
sort -k 1 height.txt > height.sort.txt
join -j 1 BMI.sort.txt height.sort.txt > BMI_height.txt
The -j 1 option tells join to find matching strings in field 1 in each of the input files. Note that -j 1 is the default, so this last line could also have been
join BMI.sort.txt height.sort.txt > BMI_height.txt
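To make the behavior concrete, if “BMI.sort.txt” contained the (hypothetical) line “sample1 24.3” and “height.sort.txt” contained “sample1 175”, the output file would contain the combined line:
sample1 24.3 175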
Often the fields that we wish join to match on are different between the two files. Consider a file “pheno.txt” that contains many fields (e.g., phenotypes), with the sample id in column 3. If we wish to create a new file that adds the BMI information to that contained in “pheno.txt”, we could do the following:
sort -k 3 pheno.txt > pheno.sort.txt
join -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_BMI.txt
The -1 3 -2 1 options tell join to identify matches using field 3 within the first supplied file (here, “pheno.sort.txt”) and field 1 within the second supplied file.
The resulting output shifts the field order somewhat: the matching field always (a) occurs only once even though it is present in both files, and (b) is field 1 in the output from join. Other than this shift of the match field, the fields from file 1 occur in their original order, followed by the fields from file 2 (omitting the match field) in their original order.
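For instance, if a line of “pheno.sort.txt” were (hypothetically) “1.2 3.4 sample1 5.6” and the matching line of “BMI.sort.txt” were “sample1 24.3”, the joined output line would be:
sample1 1.2 3.4 5.6 24.3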
By default, join prints only lines that have a matching field in both files. Thus, it performs an intersection on an arbitrary field, which can be very convenient. However, sometimes we wish to retain all the information from one of the files.
If, in the above example, “pheno.txt” contained information on more samples than appear in “BMI.txt”, and we wish to retain all the information in “pheno.txt”, we could do the following:
join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_all_BMI.txt
The -a # option tells join to print all lines from the indicated file number (1 or 2), including those that do not match any line from the other file. Note that any unmatched lines printed as a result of -a # will have fewer fields than the other lines. If we know how many fields occur on matched lines, we can use awk to fill in the missing values, as in the following adaptation of the above:
join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt | awk '{ if (NF < 10) { print $0,"NA" } else { print } }' > pheno_all_BMI2.txt
This assumes that there are 10 fields for lines that successfully join “BMI.sort.txt” with “pheno.sort.txt”. Recall from part 1 that the awk variable NF gives the number of fields on each line.
Reverting to the original line order after using join
Using join on files that have a specific order that must be maintained presents a problem. For example, a PLINK .fam file describes the order of samples in a corresponding .bed file, so we cannot reorder the .fam file. A workaround is to keep a field that lists the original line number and to numerically sort by this field once we finish modifying the lines.
For example, suppose “data.fam” is a PLINK .fam file that does not have phenotype information in column 6. If we have a separate file “case-control.txt” that contains the sample ids and the (1 or 2) case-control status of these samples, we can do the following (note that the sample ids in the PLINK .fam file are in column 2):
sort -k 1 case-control.txt > case-control.sort.txt
awk '{print $0,NR}' data.fam | sort -k 2 | join -1 2 -2 1 - case-control.sort.txt | sort -k 7n | awk '{print $2,$1,$3,$4,$5,$8}' > data-pheno.fam
The second line is long, but it does what we want. We first add a field with the line number (the NR variable in awk) to the .fam data; this gives us 7 fields. Next, we sort these lines on column 2, the field containing the sample id that we wish to join on. Then we run join, indicating that field 2 is to be matched from the lines coming from stdin (the ‘-’ file indicates stdin should be used for file 1), and matching on field 1 for “case-control.sort.txt”. This gives us all the information we need, but the lines are out of order compared to the original “data.fam” and have extra fields. We fix the order by numerically sorting on column 7, which holds the original line numbers, putting the lines back in their original order. Lastly, we pipe this to awk and reorder the columns. As noted above, join reorders the fields slightly, with field 1 being the matched field. In this case, then, field 1 is the sample id, which needs to be field 2 in the resulting .fam file, so we print field 2 first, which is the PLINK format family id, followed by field 1, the sample id. We know the family id has to be field 2 since the fields from each file maintain their order (with the exception of the match field). We can then print fields 3, 4, and 5 in the same order in the output. Field 6 in PLINK .fam files gives the case-control phenotype. This is the information we used join to get from “case-control.txt”, and it must be the last field: field 8, since the input before running join had 7 fields. And there we have it, a new .fam file with case-control status added from a separate file.
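Because join silently drops any sample that is missing from “case-control.txt”, a quick sanity check is to confirm that the new file has the same number of lines as the original (wc is discussed below):
wc -l data.fam data-pheno.fam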
Redirecting output in bash: process substitution
In many examples above, we used a separate sort command to produce a file that we then used with join. It is possible to sort both files on the fly using one line in bash, as we explain below, a convenience that allows us to avoid storing sorted files that we likely don’t need. This works using a syntax that resembles I/O redirection (i.e., < and >) and is called process substitution.
For the first join command given above, we first generated files called “BMI.sort.txt” and “height.sort.txt”. Assuming we don’t need these sorted files after the join command, what we really want is to give the output of both sort commands to join directly. This is possible in bash using the following command:
join -j 1 <(sort -k 1 BMI.txt) <(sort -k 1 height.txt) > BMI_height.txt
These process substitutions (the two sort commands given in the line above) can be used in contexts where a filename would otherwise appear. (In fact, bash creates a temporary file descriptor with the contents produced by the process within the <(...) construct.)
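The same idea applies when the two files must be sorted on different fields; for example, the earlier pheno/BMI join can be collapsed into a single line:
join -1 3 -2 1 <(sort -k 3 pheno.txt) <(sort -k 1 BMI.txt) > pheno_BMI.txt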
Counting with wc
The final example is of a simple command: wc
for word count. This simple program lists the number of lines, words (defined as a sequence of characters delimited by white space), and characters for an input file. Often line counts are useful in bioinformatics work, and wc -l
gives only the line count of the input and is slightly more efficient than using wc
alone.
Given PLINK format files “data.fam” and “data.bim”, a quick way to determine the number of samples and the number of markers contained in the dataset is to use the following two commands:
wc -l data.fam
wc -l data.bim
wc is also useful in combination with the set operations mentioned above, where we can pipe the output of these operations to wc -l to get counts in lieu of producing files when these are not needed.
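For example, to count the markers shared between the two datasets from earlier without writing any intermediate file:
cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d | wc -l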
Retaining shell sessions via screen
A useful utility in the context of shell-based work in Linux/UNIX is the program screen. screen has many features, but perhaps the most useful is its ability to keep interactive shells running even after a user has logged out. screen gives the ability to log into a remote system, do interactive work in a shell, and disconnect while retaining the shell session. Thus, if you’re connected to a computer and have started a long-running task, you can safely disconnect as long as that task is running within screen.
To begin, simply run screen on the command line. This will start a new shell (most likely bash) within screen. From here you can work as normal. Later, when you wish to disconnect while leaving the bash instance running, type Ctrl-a d. This will disconnect you from the screen instance while leaving it running in the background. When you return to the work, you can reconnect to this shell (assuming you have only one screen session running) by executing
screen -r
This will reattach the background screen session to your current terminal, and you can proceed with the work you were doing. A nice advantage of this is that you can start screen on one node, say a login node for a cluster, and then, within the screen session, connect to another machine. When you detach from the screen session, you will be back on the login node where screen is running, but the shell within that screen instance will still be connected to the other machine.
You can run as many screen instances as you wish, but reattaching when there are multiple sessions requires specifying which session you want. To list all the running screen sessions, type
screen -ls
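The output lists each session in pid.tty.host form along with its state; it might look something like the following (the session details here are illustrative):
There are screens on:
        30482.pts-0.login01     (Detached)
        29117.pts-2.login01     (Attached)
2 Sockets in /run/screen/S-username.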
In general, with multiple screen instances running, you can reattach a session with the command
screen -r -d [pid.tty.host]
where [pid.tty.host] is as listed by screen -ls. The -d option detaches the indicated screen session if it is attached to another terminal. The latter can happen when you are unexpectedly disconnected from the internet or if that session is attached on some other terminal (possibly on another computer).
While screen offers a number of features, these are the most basic. More information is given in the Screen User’s Manual. Note that within a running screen session, typing Ctrl-a begins a command to screen itself. Thus, Ctrl-a will not move your cursor to the beginning of the line as it does in a standard bash session (users who have a keyboard with a Home key can press that instead).