Shell tricks for one-liner bioinformatics: part 2

This page is primarily for researchers who work on the Linux/Unix command line. It first appeared on the lab website at Cornell University and is preserved here.


This post gives more examples of GNU Linux/Unix shell commands applicable to bioinformatics processing. Here, we will go through examples of the tools grep, sort, uniq, wc, join, and combinations of these, including their use in performing set operations on two or more files. We also give a quick introduction to the screen utility. The examples below assume that the reader is familiar with part 1, which discusses for loops, awk, and cut. As before, we will use bash syntax; several examples are readily transferable to other shells.

Searching files with grep

Often data or code are contained in many files and we do not recall which file contains a string or pattern we are looking for. grep searches a file or files for a string or, more generally, for a regular expression. In the simplest case, we want to search for some string within a given file. For example, if we wanted to find the string “awk” in ~/.bash_history, we would do:

grep awk ~/.bash_history

The above would print out all lines in which the text “awk” occurs anywhere. If we instead wanted to search for “awk” in the files with names ending in .sh in the current directory, we would do:

grep awk *.sh

This would print all lines in which “awk” occurs, each preceded by the name of the file in which the line appears.

grep has many features that are quite useful, especially its ability to search using regular expressions. This post does not include details of regular expressions, but they are very powerful and worth learning. We provide one example here. Suppose we ran a tool that produced many pieces of information about a dataset, with each line prefixed by the type of information that line contains. If that file is called “output” and we are currently interested in lines that begin with the string “TSTV”, we could use the following to print out only those lines:

grep "^TSTV" output

This is a regular expression: the ^ at the beginning of the search string indicates that the string that follows must match at the beginning of the string being searched. Since grep searches for matches within each successive line by default, this will match any line that starts with “TSTV”.
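
Another handy pattern is inverting the match with -v, which prints only lines that do not match. For instance, assuming a hypothetical VCF file called “data.vcf”, the following prints only the data records, skipping the header lines that begin with “#”:

grep -v "^#" data.vcf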

Sorting files with sort

The program sort sorts a file, as its name suggests. In the context of bioinformatics, sort can be used on its own to order a file based on a set of fields within it. For example, suppose the file “sites.txt” contains a set of sites that we wish to sort first by chromosome number and then, within each chromosome, by physical position. If column 2 contains the chromosome and column 3 the physical position, the following produces a file called “sites-sort.txt” that is sorted in this way:

sort -k 2n -k 3n sites.txt > sites-sort.txt

The -k option tells sort which column to sort by. By default, sort uses alphabetical sorting, in which the string “1000” comes before “11”; this is not our aim, so we use -k 2n, with the n indicating a numerical sort. When sort receives multiple -k options, it uses the later keys to break ties in the earlier ones. Thus, the above command sorts numerically on column 2 and, within each chromosome, numerically on column 3.
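
A quick way to see the difference between the two sorting modes is the following toy example, which uses printf to generate three lines and sorts them both ways:

printf "2\n11\n1000\n" | sort
printf "2\n11\n1000\n" | sort -n

The first command prints 1000, 11, 2 (alphabetical order), while the second prints 2, 11, 1000.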

Although bioinformatic data are often sorted before we analyze them, sort is essential for using uniq and join (each discussed below), as well as other shell tools such as comm.

Getting unique lines and counting with uniq

By default, the uniq command gets a list of the unique lines from sorted input. Run in this way, the command isn’t very useful for bioinformatics, but it has several options that are helpful. Suppose we have a file called “samples.ind” with a list of samples, and that column 3 gives the population of each sample. (This is the format of Eigenstrat .ind files.) If we want to get a count of the number of samples from each population, we can do:

awk '{print $3}' samples.ind | sort | uniq -c

Here, the -c option counts the number of copies of each unique line that is input to uniq, and we use awk to extract the populations, thereby getting counts of the number of samples from each. This example also shows piping, which was introduced in part 1, but here we conveniently pipe between multiple commands in order to avoid writing intermediate files.
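
With hypothetical population labels, the output might look like the following, where each count precedes its population (uniq -c pads the counts with leading spaces, and the lines appear in the sorted order of the labels):

      2 CEU
      5 San
     12 YRI

To list the most common populations first, we can append one more pipe; sort -rn sorts numerically in reverse:

awk '{print $3}' samples.ind | sort | uniq -c | sort -rn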

Set operations on files with uniq

Now suppose we have a set of markers in “data1.bim” and “data2.bim” with marker ids in column 2. We could determine the set of markers shared between these two files—assuming they use the same marker ids for the same variants—in the following way:

cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d > intersection

Here, we use cat to concatenate the two files, extract column 2 from this input, sort this column—the set of marker ids—and pass them to uniq -d. The -d option causes uniq to print a line only if the given line appears more than once. As long as the marker ids appear only once in each file, duplicates will only occur for markers in both datasets.
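
If a marker id might appear more than once within a single file, this approach can report false intersections. One safeguard, sketched below with intermediate files named “ids1” and “ids2” of our own choosing, is to deduplicate each file first using sort -u, which sorts and drops repeated lines:

awk '{print $2}' data1.bim | sort -u > ids1
awk '{print $2}' data2.bim | sort -u > ids2
cat ids1 ids2 | sort | uniq -d > intersection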

The above is an example of generating the intersection of the two marker sets. To perform set union, we can do the following:

cat data1.bim data2.bim | awk '{print $2}' | sort | uniq > union

We may wish to determine the set of markers that are in “data1.bim” but not in “data2.bim”; i.e., to perform set subtraction. We can do this using a two-step process: first compute the intersection, as we did above. With the intersection stored in a file called “intersection”, we can now determine the markers unique to “data1.bim” in the following way:

awk '{print $2}' data1.bim | cat - intersection | sort | uniq -u > data1-only

The above extracts the marker ids from “data1.bim”, concatenates them with the markers from the intersection, sorts these (necessary to use uniq), and then passes them to uniq -u. The -u option causes uniq to print only lines that appear once. Since the marker ids in “intersection” are guaranteed to be in “data1.bim”, they will appear more than once to uniq and will not be printed. The suppressed ids are exactly those also present in “data2.bim”, so this produces the set subtraction we desire. (Note that for many shell utilities, '-' represents stdin instead of a file, so the piped input from awk gets concatenated with “intersection” in the second part of the command.)
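
The comm utility mentioned earlier offers an alternative. Given the sorted, deduplicated id lists “ids1” and “ids2” from the sketch above, comm prints three columns (lines only in the first file, lines only in the second, and lines common to both); the -23 option suppresses the second and third columns, again yielding the markers unique to “data1.bim”:

comm -23 ids1 ids2 > data1-only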

Combining information across files using join

Consider a situation in which two files give information about the same set of samples. We may wish to combine these into one file. Here we can use join, which matches fields between two files, producing lines that combine all fields from the two lines that match between the input files.

This is best illustrated by example. Suppose we have files “BMI.txt” and “height.txt”, each containing two fields: the sample id and the given phenotype (BMI and height, respectively). join requires that the input files be sorted on the field being joined, so we can do the following:

sort -k 1 BMI.txt > BMI.sort.txt
sort -k 1 height.txt > height.sort.txt
join -j 1 BMI.sort.txt height.sort.txt > BMI_height.txt

The -j 1 option tells join to find matching strings in field 1 in each of the input files. Note that -j 1 is the default, so this last line could also have been

join BMI.sort.txt height.sort.txt > BMI_height.txt
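
To make the behavior concrete, suppose (hypothetically) that “BMI.sort.txt” contains the lines “sample1 23.1” and “sample2 27.4”, and “height.sort.txt” contains “sample1 170” and “sample2 182”. The command above then produces:

sample1 23.1 170
sample2 27.4 182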

Often the fields that we wish join to match on are different between the two files. Consider a file “pheno.txt” that contains many fields (e.g., phenotypes), with the sample id in column 3. If we wish to create a new file that adds the BMI information to that contained in “pheno.txt”, we could do the following:

sort -k 3 pheno.txt > pheno.sort.txt
join -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_BMI.txt

The -1 3 -2 1 options tell join to identify matches using field 3 within the first supplied file—here that’s “pheno.sort.txt”—and using field 1 within the second supplied file.

The resulting output shifts the field order somewhat: the matching field always (a) occurs once, even though it is present in both files, and (b) appears as field 1 in the output from join. Other than this shift of the match field, the other fields from file 1 occur in their original order, and these are followed by the fields from file 2 (omitting the match field) in their original order.
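
As a toy illustration of this ordering (with made-up field values), if “pheno.sort.txt” contains the line “45 1.73 sample1” (say, age, height, and sample id) and “BMI.sort.txt” contains “sample1 23.1”, the join command above outputs:

sample1 45 1.73 23.1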

By default, join prints only lines that have a matching field in both files. Thus, it performs an intersection on an arbitrary field, which can be very convenient in general. However, sometimes we wish to retain all the information from one of the files.

If, in the above example, “pheno.txt” contained information on more samples than appear in “BMI.txt”, and we wish to retain all the information in “pheno.txt”, we could do the following:

join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_all_BMI.txt

The -a # option tells join to print all lines from the indicated file number (1 or 2), including those that do not match any line from the other file. Note that any unmatched lines printed as a result of -a # will have fewer fields than the other lines. If we know how many fields occur on matched lines, we can use awk to fill in, as in the following adaptation of the above:

join -a 1 -1 3 -2 1 pheno.sort.txt BMI.sort.txt | awk '{ if (NF < 10) { print $0,"NA" } else { print } }' > pheno_all_BMI2.txt

This assumes that there are 10 fields for lines that successfully join “BMI.sort.txt” with “pheno.sort.txt”. Recall from part 1 that the awk variable NF gives the number of fields on each line.
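
GNU join can produce the same fill-in without the awk step: the -e option supplies a string to substitute for missing fields, and -o auto infers the output format from the first line. Both are GNU extensions, so this sketch assumes a reasonably recent GNU coreutils (the output name “pheno_all_BMI3.txt” is our own):

join -a 1 -e NA -o auto -1 3 -2 1 pheno.sort.txt BMI.sort.txt > pheno_all_BMI3.txt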

Reverting to the original line order after using join

Using join on files that have a specific order that must be maintained presents a problem. For example, a PLINK .fam file describes the order of samples in a corresponding .bed file, so we cannot reorder the .fam file. A workaround is to append a field containing the original line number and to sort numerically on that field once we finish modifying the lines.

For example, suppose “data.fam” is a PLINK .fam file that does not have phenotype information in column 6. If we have a separate file “case-control.txt” that contains the sample ids and the (1 or 2) case-control status of these samples, we can do the following (note that the sample ids in the PLINK .fam file are in column 2):

sort -k 1 case-control.txt > case-control.sort.txt
awk '{print $0,NR}' data.fam | sort -k 2 | join -1 2 -2 1 - case-control.sort.txt | sort -k 7n | awk '{print $2,$1,$3,$4,$5,$8}' > data-pheno.fam

The second line is long, but it does what we want. We first add a field with the line number (the NR variable in awk) to the .fam data; this gives us 7 fields. Next, we sort these lines on column 2, the field containing the sample id that we wish to join on. Then we run join, indicating that field 2 is to be matched from the lines coming from stdin (the '-' file indicates that stdin should be used for file 1), and matching on field 1 for “case-control.sort.txt”. This gives us all the information we need, but the lines are out of order compared to the original “data.fam” and have extra fields. We fix the order by numerically sorting on column 7, which holds the original line numbers, putting the lines back in their original order.

Lastly, we pipe this to awk and reorder the columns. As noted above, join shifts the fields slightly, with field 1 being the matched field. Here, field 1 is the sample id, which needs to be field 2 in the resulting .fam file, so we print field 2 first (the PLINK format family id), followed by field 1, the sample id. We know the family id must be field 2 in the join output, since the fields from each file maintain their order (with the exception of the match field). Fields 3, 4, and 5 keep their original positions, so we print them in the same order. Field 6 in a PLINK .fam file gives the case-control phenotype. This is the information we used join to get from “case-control.txt”, and it appears in the last field: field 8, since the input to join had 7 fields. And there we have it: a new .fam file with case-control status added from a separate file.
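
One caution: because join prints only matched lines by default, any sample missing from “case-control.txt” would be silently dropped, breaking the correspondence with the .bed file. It is worth confirming that the line counts of the input and output agree (wc -l, used here, is covered below):

wc -l data.fam data-pheno.fam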

Redirecting output in bash: process substitution

In many examples above, we used a separate sort command to produce a file that we then used with join. It is possible to sort both files on the fly using one line in bash, as we explain below, a convenience that allows us to avoid storing the sorted files that we likely don’t need. This works using a syntax that resembles I/O redirection (i.e., < and >) and is called process substitution.

For the first join command given above, we first generated files called “BMI.sort.txt” and “height.sort.txt”. Assuming we don’t need these sorted files after the join command, what we really want to do is to give the output of both these commands to join directly. This is possible in bash using the following command:

join -j 1 <(sort -k 1 BMI.txt) <(sort -k 1 height.txt) > BMI_height.txt

These process substitutions—the two sort commands given in the line above—can be used in contexts where a filename would otherwise appear. (In fact, bash connects the output of each process within the <(...) construct to a named pipe or /dev/fd path and substitutes that filename on the command line.)
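
Process substitution also streamlines the earlier set operations. For example, the marker intersection computed above with uniq -d can instead be written in one line using comm -12 (print only the lines common to both sorted inputs), with no intermediate files, again assuming the marker ids are unique within each file:

comm -12 <(awk '{print $2}' data1.bim | sort) <(awk '{print $2}' data2.bim | sort) > intersection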

Counting with wc

The final tool in our one-liner examples is a simple command: wc, for word count. This program lists the number of lines, words (defined as sequences of characters delimited by white space), and bytes in an input file. Line counts are often useful in bioinformatics work, and wc -l gives only the line count of the input, which is slightly more efficient than running wc alone.

Given PLINK format files “data.fam” and “data.bim”, a quick way to determine the number of samples and the number of markers contained in the dataset is to use the following two commands:

wc -l data.fam
wc -l data.bim

wc is also useful in combination with the set operations mentioned above, where we can pipe the output of these operations to wc -l to get counts in lieu of producing files when these are not needed.
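
For example, to report how many markers the two datasets from earlier share, without writing the intersection to a file:

cat data1.bim data2.bim | awk '{print $2}' | sort | uniq -d | wc -l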

Retaining shell sessions via screen

A useful utility in the context of shell-based work in Linux/UNIX is the program screen. screen has many features, but perhaps the most useful is its ability to keep interactive shells running even after a user has logged out. screen makes it possible to log into a remote system, do interactive work in a shell, and disconnect while retaining the shell session. Thus, if you are connected to a computer and have started a long-running task, you can safely disconnect so long as that task is running within screen.

To begin, simply run

screen

on the command line. This will start a new shell (most likely bash) within screen. From here you can work as normal. Later, when you wish to disconnect while leaving the shell running, type Ctrl-a d. This will disconnect you from the screen instance while leaving it running in the background. When you return, you can reconnect to this shell (assuming you have only one screen session running) by executing

screen -r

This will reattach the background screen session to your current terminal and you can proceed with the work you were doing. Note that a nice advantage of this is that you can start screen on one node, say a login node for a cluster, and then within the screen session, connect to another machine. When you detach from the screen session, you will be back on the login node where screen is running, but the shell within that screen instance will still be connected to the other machine.

You can run as many screen instances as you wish, but reattaching when there are multiple sessions requires specifying which session you want. To list all the running screen sessions, type

screen -ls

In general, with multiple screen instances running, you can reattach a session with the command

screen -r -d [pid.tty.host]

where [pid.tty.host] is the identifier listed by screen -ls. The -d option detaches the indicated screen session if it is currently attached to another terminal, which can happen if you were unexpectedly disconnected from the network or if the session is still attached in another terminal (possibly on another computer).
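
A convenient alternative to these identifiers is to name a session when creating it using the -S option; the session name “mapping” below is arbitrary. The first command starts a named session, and the second reattaches to it by name:

screen -S mapping
screen -r mapping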

While screen offers a number of features, these are the most basic. More information is given in the Screen User’s Manual. Note that within a running screen session, typing Ctrl-a begins a command to screen itself. Thus, Ctrl-a will not move your cursor to the beginning of the line as it does in a standard bash session (typing Ctrl-a followed by a sends a literal Ctrl-a, and users whose keyboard has a Home key can press that instead).