Shell scripts

Overview:

  • Teaching: 30 min
  • Exercises: 15 min

Questions

  • How can I save and re-use commands?

Objectives

  • Write a shell script that runs a command or series of commands for a fixed set of files.
  • Run a shell script from the command line.
  • Write a shell script that operates on a set of files defined by the user on the command line.
  • Create pipelines that include shell scripts you, and others, have written.

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.

Not only will writing shell scripts make your work faster — you won’t have to retype the same commands over and over again — it will also make it more accurate (fewer chances for typos) and more reproducible. If you come back to your work later (or if someone else finds your work and wants to build on it) you will be able to reproduce the same results simply by running your script, rather than having to remember or retype a long list of commands.

Let’s start by going back to proteins/ and creating a new file, middle.sh, which will become our shell script:

jupyter-user:$ cd ~/IntroShell/data/shell-lesson-data/exercise-data/proteins
jupyter-user:$ nano middle.sh

As we have seen, the command nano middle.sh opens the file middle.sh within the text editor nano (which runs within the shell). If the file does not exist, it will be created. We can use the text editor to edit the file directly by inserting the following line:

head -n 15 octane.pdb | tail -n 5

This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file octane.pdb. Remember, we are not running it as a command just yet: we are putting the commands in a file.

Then we save the file (Ctrl-O in nano), and exit the text editor (Ctrl-X in nano). Check that the directory proteins now contains a file called middle.sh.

Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:

jupyter-user:$ bash middle.sh
ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00

Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.

Information

We usually call programs like Microsoft Word or LibreOffice Writer “text editors”, but we need to be a bit more careful when it comes to programming. By default, Microsoft Word uses .docx files to store not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters and doesn’t mean anything to tools like head: they expect input files to contain nothing but the letters, digits, and punctuation on a standard computer keyboard. When editing programs, therefore, you must either use a plain text editor, or be careful to save files as plain text.

What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than typing the command out again in the shell and executing it with a new file name. Instead, let’s edit middle.sh and make it more versatile:

jupyter-user:$ nano middle.sh

Now, within nano, replace the text octane.pdb with the special variable called $1:

head -n 15 "$1" | tail -n 5

Inside a shell script, $1 means ‘the first argument on the command line’. We can now run our script like this:

jupyter-user:$ bash middle.sh octane.pdb
ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00

Or on a different file like this:

jupyter-user:$ bash middle.sh pentane.pdb
ATOM      9  H           1       1.324   0.350  -1.332  1.00  0.00
ATOM     10  H           1       1.271   1.378   0.122  1.00  0.00
ATOM     11  H           1      -0.074  -0.384   1.288  1.00  0.00
ATOM     12  H           1      -0.048  -1.362  -0.205  1.00  0.00
ATOM     13  H           1      -1.183   0.500  -1.412  1.00  0.00

Information

We surround $1 with double-quotes for the same reason that we put the loop variable inside double-quotes: the filename might contain spaces.
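
To see why the quotes matter, here is a minimal sketch. It assumes a scratch directory created with mktemp and a hypothetical file named "my notes.txt"; neither is part of the lesson data, and the two tiny scripts exist only for this comparison.

```shell
# Sketch: compare quoted and unquoted $1 with a file name containing a space.
# The scratch directory and the file "my notes.txt" are hypothetical.
workdir=$(mktemp -d)
printf 'line 1\nline 2\nline 3\n' > "$workdir/my notes.txt"

# Quoted: the whole name reaches head as a single argument.
printf 'head -n 2 "$1"\n' > "$workdir/quoted.sh"
# Unquoted: the shell splits the name at the space into two arguments.
printf 'head -n 2 $1\n' > "$workdir/unquoted.sh"

# Prints the first two lines of the file.
bash "$workdir/quoted.sh" "$workdir/my notes.txt"

# head looks for two files that do not exist and reports errors.
bash "$workdir/unquoted.sh" "$workdir/my notes.txt" || echo "unquoted version failed"
```

The unquoted script fails because head receives two separate names, neither of which exists.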

Additional Arguments

Currently, we need to edit middle.sh each time we want to adjust the range of lines that is returned.

Let’s fix that by configuring our script to use three command-line arguments instead. Each argument that we provide is accessible via the special variables $1, $2, and $3, which refer to the first, second, and third command-line arguments, respectively.

Knowing this, can you edit middle.sh so that we can use arguments to define the range of lines to be passed to head and tail respectively?

Solution
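
One possible solution is sketched below. In middle.sh itself, only the single pipeline line is needed; the scratch directory and the file numbers.txt are hypothetical stand-ins used here so the script can be tried without touching the lesson data.

```shell
# One possible solution (a sketch): middle.sh takes a file name, the last
# line to keep, and the number of lines to show.
workdir=$(mktemp -d)
printf 'head -n "$2" "$1" | tail -n "$3"\n' > "$workdir/middle.sh"

# Hypothetical test file containing the numbers 1 to 30, one per line.
seq 1 30 > "$workdir/numbers.txt"

# Take the first 15 lines, then the last 5 of those: lines 11 to 15.
bash "$workdir/middle.sh" "$workdir/numbers.txt" 15 5
```

With the lesson data, the equivalent call would be bash middle.sh pentane.pdb 15 5.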

This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:

jupyter-user:$ nano middle.sh

# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"

What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type:

jupyter-user:$ wc -l *.pdb | sort -n

because wc -l lists the number of lines in the files (recall that wc stands for ‘word count’, adding the -l option means ‘count lines’ instead) and sort -n sorts things numerically. We could put this in a file, but then it would only ever sort a list of .pdb files in the current directory.

If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use $1, $2, and so on because we don’t know how many files there are. Instead, we use the special variable $@, which means ‘all of the command-line arguments to the shell script’. We should also put $@ inside double-quotes to handle the case of arguments containing spaces ("$@" is special syntax and is equivalent to "$1" "$2" …).

Here's an example:

jupyter-user:$ nano sorted.sh

# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n

Now let’s try running it:

jupyter-user:$ bash sorted.sh *.pdb ../creatures/*.dat
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
163 ../creatures/basilisk.dat
163 ../creatures/minotaur.dat
163 ../creatures/unicorn.dat
596 total
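
The effect of quoting $@ can be checked with a small sketch. The helper functions count_args, show_quoted, and show_unquoted are hypothetical names introduced only for this illustration, and the sketch assumes the shell's default word splitting.

```shell
# Sketch: "$@" keeps each argument intact, even when it contains a space.
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() {
    echo "$#"
}

show_quoted() {
    count_args "$@"    # each original argument is passed on unchanged
}

show_unquoted() {
    count_args $@      # arguments are split again at spaces
}

show_quoted "my notes.txt" data.txt      # prints 2
show_unquoted "my notes.txt" data.txt    # prints 3
```

The same splitting would happen inside sorted.sh if the quotes around $@ were left out, so wc would be asked to count files that do not exist.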

History

Suppose we have just run a series of commands that did something useful, for example, commands that created a graph we’d like to use in a paper. We’d like to be able to re-create the graph later if we need to, so we want to save the commands in a file. Instead of typing them in again (and potentially getting them wrong), we can use the history command, which prints the most recent commands you have run to the terminal.

Try it now.

jupyter-user:$ history

Say our last few commands did something useful that we want to save. Let’s look at the last 5 entries:

jupyter-user:$ history | tail -n 5
297 bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
298 bash goodiff.sh stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
301 history | tail -n 5

We can then redirect this to a file, and with a small amount of work to remove the serial numbers and the last line we have an accurate record of the commands we used.

jupyter-user:$ history | tail -n 5 > recent.sh
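
The clean-up step could look something like the sketch below. The function name clean_history is hypothetical, and the sketch assumes each history entry is a serial number followed by the command, as in the listing above. Because history only has content in an interactive session, the example feeds in two sample lines instead.

```shell
# Sketch: drop the final line (the history command itself) and strip the
# leading serial numbers. clean_history is a hypothetical helper name.
clean_history() {
    sed '$d' | sed -E 's/^[[:space:]]*[0-9]+[[:space:]]*//'
}

# Sample input copied from the listing above. In an interactive session you
# would instead run: history | tail -n 5 | clean_history > recent.sh
printf '%s\n' \
    '300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png' \
    '301 history | tail -n 5' | clean_history
```

Only the ygraph command is printed, without its serial number.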

Nelle's Pipeline: Creating a Script

Nelle’s supervisor insisted that all her analytics must be reproducible. The easiest way to capture all the steps is in a script.

First we return to Nelle’s project directory:

jupyter-user:$ cd ../../north-pacific-gyre/

She creates a file using nano:

jupyter-user:$ nano do-stats.sh

containing the following:

# Calculate stats for data files.
for datafile in "$@"
do
    echo "$datafile"
    bash goostats.sh "$datafile" "stats-$datafile"
done

She saves this in a file called do-stats.sh so that she can now re-do the first stage of her analysis by typing:

jupyter-user:$ bash do-stats.sh NENE*A.txt NENE*B.txt

She can also do this:

jupyter-user:$ bash do-stats.sh NENE*A.txt NENE*B.txt | wc -l

so that the output is just the number of files processed rather than the names of the files that were processed.

Variables in shell scripts

In the proteins directory, imagine you have a shell script called script.sh containing the following commands:

head -n $2 $1
tail -n $3 $1

While you are in the proteins directory, you type the following command:

bash script.sh '*.pdb' 1 1

Which of the following outputs would you expect to see?

  1. All of the lines between the first and the last lines of each file ending in .pdb in the proteins directory
  2. The first and the last line of each file ending in .pdb in the proteins directory
  3. The first and the last line of each file in the proteins directory
  4. An error because of the quotes around *.pdb

Solution
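
The expected answer is option 2. The quotes around '*.pdb' stop the shell from expanding the wildcard before the script runs, so the script receives the literal text *.pdb as $1. Inside the script, $1 is not quoted, so the wildcard is expanded there and head and tail each see every .pdb file. The sketch below reproduces this with two small hypothetical files in a scratch directory rather than the lesson data.

```shell
# Sketch: reproduce the exercise with two small hypothetical .pdb files.
workdir=$(mktemp -d)
printf 'first a\nmiddle a\nlast a\n' > "$workdir/a.pdb"
printf 'first b\nmiddle b\nlast b\n' > "$workdir/b.pdb"
printf 'head -n $2 $1\ntail -n $3 $1\n' > "$workdir/script.sh"

# The quotes pass *.pdb to the script unexpanded; inside the script the
# unquoted $1 is then expanded to a.pdb b.pdb.
(cd "$workdir" && bash script.sh '*.pdb' 1 1)
```

The output contains the first and last line of each file, with a header naming the file before each one, and none of the middle lines.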

Longest file

Write a shell script called longest.sh that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. For example:

jupyter-user:$ bash longest.sh shell-lesson-data/exercise-data/proteins pdb

would print the name of the .pdb file in shell-lesson-data/exercise-data/proteins that has the most lines.

Feel free to test your script on another directory, e.g.:

jupyter-user:$ bash longest.sh shell-lesson-data/exercise-data/writing txt

Solution
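
One possible longest.sh is sketched below; other approaches would work too. It prints the line count alongside the file name. The scratch directory and the files short.txt and long.txt are hypothetical test data, not part of the lesson.

```shell
# One possible longest.sh (a sketch). Arguments:
#   $1  a directory name
#   $2  a filename extension, without the dot
workdir=$(mktemp -d)
cat > "$workdir/longest.sh" << 'EOF'
# Print the file in directory $1 with extension $2 that has the most lines.
# wc -l ends its output with a "total" line when given several files,
# so we keep the last two sorted lines and then drop the total.
wc -l "$1"/*."$2" | sort -n | tail -n 2 | head -n 1
EOF

# Hypothetical test data: short.txt has 2 lines, long.txt has 5.
seq 1 2 > "$workdir/short.txt"
seq 1 5 > "$workdir/long.txt"
bash "$workdir/longest.sh" "$workdir" txt
```

Here the script reports long.txt with its count of 5 lines.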

Key points

  • Save commands in files (usually called shell scripts) for re-use.
  • bash [filename] runs the commands saved in a file.
  • $@ refers to all of a shell script’s command-line arguments.
  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.
  • Place variables in quotes if the values might have spaces in them.