# Introduction to Linux and working on the command line

**Contents**

**1 First steps**   
  - [1.1 A brief history of Unix and Linux](#a-brief-history-of-unix-and-linux)  
  - [1.2 What is the Linux shell?](#what-is-the-linux-shell)  
  - [1.3 Basic concepts and definitions](#basic-concepts-and-definitions)  
  - [1.4 Connecting to remote computers](#connecting-to-remote-computers)  

**2 Basic commands**  
  - [2.1 Navigating the file system](#navigating-the-file-system)  
  - [2.2 Editing, inspecting, and searching within text files](#editing-inspecting-and-searching-within-text-files)  
  - [2.3 Search, replace, and write output to a new file](#search-replace-and-write-output-to-a-new-file)  
  - [2.4 Combining multiple commands into scripts](#combining-multiple-commands-into-scripts)  

**[3 Additional reading](#additional-reading)**

# 0 Learning goals

- Today we will learn how to work in command-line interaces of Unix based systems.
- We will teach you the basics of navigating the file system, viewing and searching large data files.
- You will also learn to chain commands together to make your work quicker and more reproducible.
- We will introduce ways on how to think about your data analysis with powerful command line tools. 

**A great way to learn and extend your knowledge about the command line is to search online for instructions and then try them yourself, experimenting as much as you can.**

The computer is our lab and we want to use it as efficiently as possible.

# 1 First steps

## 1.1 A brief history of Unix and Linux

Unix is an operating system first developed in the late 1960s at AT&T's Bell Labs by [Ken Thompson](https://en.wikipedia.org/wiki/Ken_Thompson), [Dennis Ritchie](https://en.wikipedia.org/wiki/Dennis_Ritchie), and others. It was designed to be portable (could be adapted quickly to different hardware), multi-tasking (can run multiple tasks simultaneously), and multi-user (yould be used by multiple people at the same time). Unix was written in a high-level programming language ([C](https://en.wikipedia.org/wiki/C_(programming_language))) which was a revolutionary concept at the time as, until then, operating systems were usually written in [assembly](https://en.wikipedia.org/wiki/Assembly_language) language. This made Unix easy to modify, expand, and port to other machines.

Linux is a Unix-like operating system that came into existence in the early 1990s when a Finnish student, [Linus Torvalds](https://en.wikipedia.org/wiki/Linus_Torvalds), started a project to create a new free operating system kernel. Unix variants at the time were all proprietary. Torvalds released the initial code on the internet and invited others to contribute. This collaborative, open-source approach allowed Linux to grow rapidly.

Linux is based on the principles and design of Unix however it is built from scratch by a community of developers worldwide, led by Torvalds. It is free to use and distribute, which has led to its widespread use across personal computers, servers, mobile devices, and more. Over time, groups of developers have packaged the Linux kernel with a variety of software to create complete operating systems, known as distributions (distros), like Ubuntu, Fedora, and Debian.

The similarities between Unix and Linux boil down to their shared design philosophies, use of a common command-line interface (CLI), and similar system architecture. However, they differ in their licensing, with Unix often being proprietary and Linux being free and open-source. Both operating systems have had a profound impact on the computing landscape, influencing the development of various systems and applications that we use today.


## 1.2 What is the Linux shell?

The Linux shell,  allows users to interact with the computer's operating system through text commands also refered to as a command-line interface (CLI). This might seem daunting to novices initially, especially in an era dominated by graphical user interfaces (GUIs) with mouse control. However, the shell is a powerful tool that offers precision, control, and a deeper understanding of how computers work. In its essence it is a programm expecting and executing commands, but it can alsoalso  do a lot more than that. 

The Unix operating system, the progenitor to Linux (see above), came with its own shell, the Bourne shell. With time, various other shells were developed, but the most popular in the Linux world is called `Bash`, which stands for 'Bourne Again SHell.' Bash is an enhanced version of the original Bourne shell, incorporating new features and improvements to make it more usable and powerful.

Why use the Linux shell? For starters, it's incredibly efficient for repetitive tasks. Complex operations that might require lots of dragging and clicking in a GUI can often be done with a single command. The shell also excels in scriptability; users can write scripts (essentially a list of commands) to automate a wide array of tasks.

The power of the shell comes from its ability to harness the capabilities of the Linux operating system, where even the most fundamental aspects like file management and software installation can be controlled with precision through shell commands. While graphical interfaces provide a user-friendly layer on top, the real machinery that performs the heavy lifting works under those graphical programs, is accessible almost  exclusively through the shell.

Learning to use the Linux shell can seem like learning a new language — because it is. But just as knowing the basics of a language helps in a foreign country, knowing basic shell commands is incredibly beneficial to navigate and utilize the full potential of Linux-based systems. As we make our first steps into this world, after a little practice, we will gain a level of control and efficiency never imagined possible.

## 1.3 Basic concepts and definitions

This introduction includes a large number of examples.
The \$ symbol in the examples below indicates what is called the [*command prompt*](https://en.wikipedia.org/wiki/Command-line_interface#Command_prompt), rather
than something you type. This may, on some systems be prefixed with additional text, or be formatted differently.\
You will see something like `symbiont@lichengenomics:\~\$` which refers to the user you are
logged in as, before the @ symbol, the name of the computer you are
using and the current directory, after the : symbol. The tilde \~
signifies your home directory (more on that in section 2.1 below).

To execute commands in the shell you type your command and hit "Enter"
(Return) to execute the command. The output will appear in the same
window below the command prompt.

The exercises are split into sections of text with explanations of what we are doing and so-called code blocks which contain the actual commands we will be running:

```
# This is an example code block. Typically you can copy and paste what is here and execute it by pressing <Return>.
```

## 1.2 Connecting to remote computers

### Theory: What is SSH?

SSH, or Secure Shell, is a network protocol that allows you to securely access a computer over an unsecured network. Imagine you're in your home, and you want to send a secret message to a friend in a house across the street. Rather than shouting the message out loud where anyone could hear, you use a special secure tunnel that only you and your friend can use. This is what SSH does for computers.

When you use SSH, it's like creating a secure tunnel for your data. You can run commands on a remote computer as if you were sitting right in front of it, even if it's on the other side of the world. This is particularly useful for large-scale data analysis, performing techical tasks, or transferring files securely. To connect to a remote computer with SSH, all you need is the SSH software on your own computer, the remote computer's address, and the appropriate access permissions. With these essentials, SSH ensures that your connection and data stay encrypted and safe from eavesdroppers.

Since we will be analyzing very large datasets, our own laptops and desktop computers typically do not have enought memory or CPU power. Hence we will be working on 
the University of Graz High-Performance Compute Cluster - [GSC](https://hpc-wiki.uni-graz.at/) which we access through SSH. Using our UGO Account and passwords you should be able to log in.
Open a terminal window on your system and type:

```
$ ssh <username>@gsc.uni-graz.at
<omitted long output here>
----------------------------------------------------------------------------------
Last login: Fri Apr  4 13:29:38 2025 from 142.55.237.189
username@IT010044: ~ $ 
```

Congrats, you have successfully executed your first command line program which should log you in to the Uni Graz cluster. Mind you that this only works from inside the University network or through a VPN connection.

# 2 Basic commands

```{important}
Before you proceed type (or copy-paste) the following text into your shell and hit Return (Enter). This will automatically download the input data for this course.
We will discuss later what exactly this command is doing.
Run this: `$ git clone https://github.com/reslp/linux-intro.git`
```


## 2.1 Navigating the file system

Once logged in, the first thing to do is to learn the basics of moving between
directories on your computer, checking where you are, checking what
files are present and having a quick look at them. Some of the
terminology is perhaps slightly new (*directories* rather than
*folders*) but using the right words will mean you are speaking the same
language as everyone else and make your life easier when Googling for
solutions.

### 2.1.1 See where you are and how to move between directories

At the command line you need to know "where you are" i.e. which
directory you have open and are working in. Question: If you
issue the command to 'list all files' which files will be listed?
Answer: Those in the folder where you are currently working, called the
*working directory*. 

The command to find out where you are
is `pwd` short for *print working directory*. 

*print* when working at the command line
means *display on the screen* rather than *write this to a piece of
paper or a file*.

Type the command to display your current directory now (and hit Enter)

You should see something like this, with your command on the line
beginning with the \$ prompt and the output on the line below

```
$ pwd
/usr/people/EDVZ/username
```

But maybe that isn't where you want to be, in which case you need to
`change directory` and the command for that
is `cd`.

```
$ cd linux-intro
$ cd data
$ pwd
/home/symbiont/linux-intro/data
```

You can see that the / symbol denotes levels of directories, so that
`data` directory is contained within the `linux-intro` directory
which is within the `symbiont` user home directory in the system's
`home` directory. These *file paths* can be long sometimes but they are
always explicit, which is a very good thing for reproducibility. There
is no excuse for trying to remember where the data was stored for your
analysis, here it is written out, and you will want to record this as
part of your experiment.

You can go up one level (to the directory containing your current
working directory) by using double dots (ensure there is always a space
between the `cd` command and the directory you wish to go to).

```
$ cd ..
$ pwd
/usr/people/linux-intro/
```

What happens if you use the cd command without telling the system where
you would like to change directories to? Try it. How can you find out
which directory you are now in?

```
$ cd
$ pwd
/usr/people/username
```

Using cd command on its own returns you to your user's home directory,
in this case `/usr/people/symbiont` from wherever you are.

The tilde symbol (**\~**) is shorthand for this home directory so
`/home/symbiont/linux-intro` and `~/linux-intro` refer to the same
directory, which saves a little typing.

The exact location of your home directory will depend on the flavor of Linux/Unix you are using.
On Apple MacOS, for example, it would be something like `/Users/username`.

Another very useful shortcut is `cd -` (dash) which takes you to the
previous directory that you were in. This is really useful when you need
to move directly between directories that are separated by several levels or that
have long names.

```
$ cd ~/linux-intro/data
$ pwd
/usr/people/username/linux-intro/data
$ cd
$ pwd
/usr/people/username
$ cd -
$ pwd
/usr/people/username/linux-intro/data
$ cd -
$ pwd
/usr/people/username
```

Make sure that you are actually typing this out for yourself rather than
just reading along. This *active learning* will really help it to stick
in memory, and come back when you need it, a bit like developing muscle
memory.

Above we were issuing commands one at a time; first `cd` then `pwd`. To
chain commands together on the same line separate them with a semicolon
ie `cd;pwd`.

Try the exercise above again but using semicolons. You should now only
need 4 commands not 8.

One of the underlying principles of UNIX type operating systems (also called the [UNIX philosophy](https://en.wikipedia.org/wiki/Unix_philosophy)) is that
every program should do one thing (but this very well) and that it should be easy to
join them together (usually with `;` or a pipe `|` described
later). Think of it like a set of lego blocks that we may put together so we
can build anything we want. Simple units, building
complex and impressive outcomes.

We have prepared example data to be used in the following exercises in a
directory on your computer - `linux-data/`.

**Excercise**:

Navigate to this directory and then confirm that you are in the right
place by printing the working directory path to your screen. Then go
back to your home directory before returning to the directory above yet 
again to practice your new skills. Test yourself first, but it's OK to
review different ways to do this again from the manual above. Looking
things up is not cheating.

#### 2.1.1.1 Ways to return home

There are 5 ways that you should now know to return to your home
directory. Try each to show that they work and discuss with others, 
most people only get 2 or 3.

<details>
  <summary><b>Solution:</b> Five ways to return home</summary>

  ```
  $ cd /usr/people/username
  $ cd 
  $ cd ~
  $ cd ..;cd ..
  $ cd -
  ```
</details>

### 2.1.2 Listing the contents of a directory with ls

A very common thing you will want to do is to display the contents of a
directory, i.e. list all the files. You can list the files (and
directories) in your working directory using the ls command. For this
exercise we will be using the raw data folder.

```{caution}
Spaces in file and directory names cause difficulties as the shell treats spaces as the end of a file name. When looking for `my file` it complains that it can't find `my`.
```

Look here:
```
$ cd my directory
-bash: cd: my: No such file or directory
```

Although this can be got around by using quotes `cd 'my file'` replacing
with underscores (e.g. `my_file`), hyphens (e.g. `my-file`), or
concatenating the words (e.g. `myfile`) are usually better ways to work.

```
$ ls
```

**Exercise:** List files in long format. This requires changing the *behavior* of the command and it introduces a new concept called *command-line flags*.

```{tip}
You can Google to find out what all the data listed means, or better use the built in manual (`man`) pages.
```

```
$ man ls
```

Hit the spacebar to advance through the pages. Typing q (quit) will get
you out of a man page. You can use man with any command, not just `ls`.

```
$ man man
```

**Googling is not cheating**, it is a great way to learn and is highly
recommended.

### 2.1.3 Copying and moving files with `cp` and `mv`

Once you know how to navigate between directories and list their
contents, the next common task is to copy or rename files. In biology we
very often want to keep an original data file unchanged while making a
working copy for practice or further analysis.

The command `cp` means **copy**. It takes a source file and creates a
second file with a new name or in a new location.

```
$ cd ~/linux-intro/data
$ cp scaffold.fas scaffold_copy.fas
$ ls
```

After this command both files exist: the original `scaffold.fas` and the
new copy `scaffold_copy.fas`.

The command `mv` means **move**. It is used both for moving a file to a
new directory and for renaming a file.

```
$ mv scaffold_copy.fas scaffold_practice.fas
$ ls
```

Here the file stayed in the same directory, so `mv` simply renamed it.
If we give a different destination path, the file is moved there:

```
$ mv scaffold_practice.fas ~/
$ ls ~
```

Now `scaffold_practice.fas` is no longer in `~/linux-intro/data`; it has
been moved to your home directory.

```{caution}
`cp` creates a second copy of a file. `mv` does not: it relocates or renames the existing file.
```

```{note}
To copy a whole directory and everything inside it you need the recursive flag `-r`, for example `cp -r fasta-to-combine fasta-to-combine-copy`.
```

**Exercise:** Copy the file `scaffold.fas` to a new file called
`scaffold_backup.fas`. Then rename that new file to
`scaffold_backup_renamed.fas`. Finally move it to your home directory.
Use `ls` to confirm at each step what happened.

### 2.1.4 Some things you will have noticed 

Firstly, you have to type very carefully, any typo 
will result in an error saying that the file or directory doesn't exist, e.g:

```
$ cd liinux-intro
bash: cd: liinux-intro: No such file or directory
```

A second thing you will have noticed is that some file
names are long, complex, and difficult to type without errors.

```{tip} 
You need to learn to use the *tab* key to autocomplete names.
```

Real command line gurus use the tab key extensively. If you start typing
the command and then hit tab the filename will be auto completed, or, if
you haven't typed enough yet to specify a single file (it could be one
of several beginning with the same letters) you will probably get a
beep, followed by a list of files or directories beginning with those
letters. It will also autocomplete the portion of the file or directory
name that is shared between them all and wait for you to type more and
hit tab again.

Try it now. Navigate to `linux-intro/data` and list the files
present. You should have explored using tab to autocomplete the
directory names at every level. If not quickly jump back using `cd -` and
try again.

## 2.2 Editing, inspecting, and searching within text files

### 2.2.1 Editing files

UNIX based systems provide several powerful utilities for editing and
inspecting files, either from the command line, or in a simple graphical
user interface. You can use the gedit program in Linux to open files for
viewing and editing in a graphical way, much like **Notepad** or
**BBEdit** in Windows or MacOS. Don't start doing this though, it
has limitations and is a waste of your time on this course, learn the
command line instead. Here we will use the simple text editor nano there
are many others beyond the scope of this tutorial. The strength of nano
lies in its simplicity. Help: [a beginners guide to nano](http://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/)

To view our **scaffold.fas** file in nano you should use the general
unix approach of program-name filename, and assuming we are still in the
correct directory:

```
$ nano scaffold.fas
```

Did you tab complete the name? If not exit using Ctrl+X and try again.
*Reinforce your skills*

**TASK**: To practice nano rename the sequence to `>fungal_scaffold` save
the changes using Ctrl-O (called writing-**O**ut), it will ask you if
you want to save to the same file, don't, give a new informative name
like:

```
scaffold_renamed.fas
```

```{tip}
To close nano you should use the Ctrl+X key combination (see
```

Ctrl key). You can find [nano helppages](http://mintaka.sdsu.edu/reu/nano.html) with a Google
search. You can also use `man nano`.

To close less (below) you should just type `q`. Using Ctrl+X or q will
generally close most UNIX programs, if either of those don't work you
can also use the Ctrl+C key combination to *kill* the program and return
to the command prompt.

```{note}
There are many more text editors for the terminal, which are much more powerful than nano. If you are interested in these check out: [vim](https://www.vim.org/) or [emacs](https://www.gnu.org/software/emacs/)
Theses editors have a steep learning curve but learning how to use them properly is highly rewarding.
```


### 2.2.2 Inspecting files

`nano` allows us to edit the file *in situ*, however, if we just want to
*inspect* the file there are several other UNIX tools which display a file's content on screen:

1.  `cat` to print the whole file to the screen

2.  `less` to print the file a page at a time to the screen

3.  `head` to print the first few lines of the file on screen

4.  `tail` to print the last few lines of the file on screen

This is immediately relevant for us since one of the problems with DNA sequence files is that they can be large -
several hundred megabytes to a few gigabytes is not uncommon. Viewing
these files can be difficult, as the files need to be loaded into
memory, and can therefore take a great deal of time for the text editor
to read from the disk. The `less`, `head` and `tail` commands are very
efficient for viewing large files such as these.

**TASK:** All these commands will be useful for you during this course.
You should now try using less, head, tail and cat to practice seeing the
text file `parmelia_sequences.fas` which can be found in
`~/linux-intro/data/raw_data/fasta/`. Are you using `ls` to see what is available and tab to
complete the filename? How can you step through the file a screen at a
time using less? Try Googling for the answer and demonstrate that it
works.

### 2.2.3 Searching within files

**Searching** within very large files however can be even more
troublesome, especially using the standard find functions in a text
editor, which aren't optimised for performing searches across very large
files. For this reason various tools have been created that allow users
to search within large files from the command line, and are highly
optimised for their function. One of the most useful utilities for
searching within a file is `grep` (**g**lobal **r**egular **e**xpression
**p**arser).

`grep` is very simple to use. At the command line you will need to type
the word grep, followed by the text you are searching for, followed by
where (the filename) to look for it. For example, to search for the word
RPB1in our `parmelia_sequences.fas` file we do the following:

```
$ grep "RPB1" parmelia_sequences.fas
```
This returns all the lines containing the word *RPB1*.

We can count how many of the sequences are RPB1 by using the *count*
flag (`-c`) with `grep` as follows:

```
$ grep -c "RPB1" parmelia_sequences.fas
```

### 2.2.4 How many sequences do I have?

A very common question to ask is:  *how many sequence records are in
this enormous fasta file?*

```{note}
FASTA files are a common text-based files to store nucleotide and amino acid sequences. Look [here](https://en.wikipedia.org/wiki/FASTA_format) for more details. 
```

You could of course search for all the
greater than `>` symbols, which is almost certainly the number of
records. However you should really search for all the lines **starting
with `>`** rather than the number of times it occurs, as [it is possible
for a fasta header to contain an internal \>](https://nsaunders.wordpress.com/2014/08/14/looking-for-in-all-the-wrong-places/)
. 'Line starts with' is represented by the \^ symbol.

### Task

Try to write a grep search to count the number of fasta header
lines. Google/ask for help. **Have you remembered the quotation marks
around the search phrase?** Unfortunately your solution will probably
delete the data file if you forget the quote marks! Why? Discuss

Search the two files `scaffold.fas` and `parmelia_sequences.fas` you have
already used to determine the number of sequence records. Discuss your solution with your neighbor.

```{tip}
you can use the up and down arrows to cycle through your
command history. If you find yourself typing the same command then try
pressing the up arrow until you reach the command you want. You can
always edit that command if you need to, perhaps using tab to
autocomplete a new file name. Use the down arrow to bring back more
recent commands, and eventually the command line will clear completely,
i.e. you are back to the present 'no command'. Type history to see a
list of all your previous commands or Ctrl-R to search them.
```

## 2.3 Search, replace, and write output to a new file

`grep` is an excellent tool for undertaking simple yet fast searches
within text files. But to search (and replace) within a text
file, or to redirect changes to a new file, we will need to use another tool (OK there are actually numerous ways of doing
this utilizing other tools but this manual will only deal with *simple*
examples with sed or python scripts).

### 2.3.1 sed the stream editor

`sed` works best when we need to deal with files as single lines, or rows
of text data. Since `sed` doesn't try to take the whole file into memory,
instead dealing with a line at a time, it has real advantages when files
are huge - as they often are for sequence data.

To search for and replace `RPB1` with `RPB_1` in our
`parmelia_sequences.fas` file we could do the following:

```
$ sed 's/RPB1/RPB_1/' < parmelia_sequences.fas > RPB1toRPB_1.fas
```

This will replace the single word `RPB1` we identified using grep with
the word `RPB_1`, but output these changes to the file
`RPB1toRPB_1.fas`, leaving the original file unchanged. The s within
the single quotes signifies this is a *substitution* command
and the / characters are delimiters that separate the text to search
for, and the text to replace it with. In UNIX based systems the \<
signifies an input, so we are taking input from our
`parmelia_sequences.fas` file and outputting (\>) to
`RPB1toRPB_1.fas`.

 Always give meaningful names to files and directories, even if that
makes them seem long. The person you are doing this for is *future you*
who will remember less than you think. So clear filenames is one of
the ways to make sense of the data, how it has been transformed, and to
help record a reproducible experiment. It is very useful to have a
filename like:

```
whitby-FDS12763-nematode18S-lenfiltered200bp-uniquespecies.fas
```
instead of

```
sequences_2.fas
```

Another reason the information-in-filename approach is very useful is
that it contains a lot of information you can use for analysis. If you
had 1000 files from separate sampling points, you could choose which
files to pull data from based on names like "whitby" or "FDS". If you
wanted to grab data just from enoplid nematodes from only the Whitby
samples you could find and list (con*cat*enate) those
with a search, and pipes `|` to string several jobs together. Below is an
example, these files don't exist here, but you are going to try it
yourself on files that do.

```
cat ~/allsamples/*whitby*.fas | grep enoplida | sort | uniq -c
```


### Task

Google, discuss, and ask until you know what this command
does. To help your searches the asterisks are called 'unix wildcards'
-why are they used? Think for a moment how much work this single line is
actually doing and how long it would take manually?

Above I suggested using the filename

```
whitby-FDS12763-nematode18S-lenfiltered200bp-uniquespecies.fas
```

It may seem an annoying amount of typing to write this much information
in filenames, but it isn't *you* who should be doing the writing, it's
your script. It is a different mindset, but really useful.


### 2.3.2 Additional useful commands: `cut`, `sort`, and `uniq` and how to combine them

There are many small UNIX tools that become very powerful when we join
them together. Three especially useful commands are `cut`, `sort`, and
`uniq`.

- `cut` extracts part of each line
- `sort` puts lines in order
- `uniq` collapses repeated neighbouring lines

To join commands together we use a **pipe**, written as `|`.
A pipe takes the output of the command on the left and sends it directly
into the command on the right. This lets us build small workflows from
simple commands.

For example:

```
$ grep '^>' parmelia_sequences.fas | cut -d ' ' -f 2
```

Here `grep` first finds all FASTA header lines, and the pipe sends those
lines to `cut`. Then `cut` extracts the second space-separated field
from each header line.

To demonstrate these commands we will work with the FASTA header lines
from `~/linux-intro/data/raw_data/fasta/parmelia_sequences.fas`.
First, let us display only the header lines:

```
$ cd ~/linux-intro/data/raw_data/fasta
$ grep '^>' parmelia_sequences.fas
```

Each of these header lines contains pieces of information in square
brackets such as `[gene=mcm7]` or `[protein=beta-tubulin]`.

The `cut` command can split each line at a chosen delimiter and keep
only one field. For example, if we split at ` ` (space) and keep field 2, we get
just the first annotation block from each header:

```
$ grep '^>' parmelia_sequences.fas | cut -d ' ' -f 2
```

Now we can sort these annotations and count how often each one occurs:

```
$ grep '^>' parmelia_sequences.fas | cut -d ' ' -f 2 | sort | uniq -c
```

You can read this command from left to right:

1. `grep '^>' parmelia_sequences.fas` finds all header lines
2. `cut -d ' ' -f 2` keeps only the second field
3. `sort` puts identical entries next to each other
4. `uniq -c` merges identical neighbouring lines and counts them

This tells us how many sequence headers contain each first annotation.
You should see entries such as `gene=mcm7`, `gene=RPB1`, and
`protein=beta-tubulin`.

```{note}
`uniq` only merges identical lines that are next to each other, which is why it is very often used after `sort`.
```

This is a good example of the UNIX philosophy again: each command does a
small job, but together they let us answer a useful biological
question.

### Task

Use `grep`, `cut`, `sort`, and `uniq` on
`parmelia_sequences.fas` to answer the following:

1. How many headers have `gene=mcm7` as their first annotation?
2. How many have `protein=beta-tubulin`?
3. Can you modify the command so that the most common annotation appears at the bottom of the list?

### 2.3.3 echo

A useful way to write to a text file is with echo. This will print to
the screen or a file. Here is a [short
introduction](http://www.computerhope.com/unix/uecho.htm)
to echo if you need it, although the next sections are fairly self
explanatory without it.

```{tip}
 Or use `man echo`
```

Try these commands

```
$ echo Hello world!
$ echo 'Hello world!' > greeting.txt
```

If the file `greeting.txt` does not exist it will be created. If it does
exist it will be overwritten. Check the file now exists (how?), then you
can use one of the commands above (cat maybe? Do you remember the
others?) to inspect the file you have just created. If you wish to
append text to a file rather than replace it you can use the \>\>
symbol:

```
$ echo 'Hello again world!' >> greeting.txt
```
Try this and check your success. Routing syntax (\>, \>\>) is general to
UNIX and can be used with other programs too. Imagine that you need to
add an extra fasta sequence to the end of a big sequence file, the
append symbol \>\> will be helpful. Do you remember that \< determined
the input source? Can you think of any situations in your work where
this UNIX command line approach could save an enormous amount of work in
manipulating files?

`echo` also allows us to format files correctly if we need newlines or
tabs inserted by using the `-e` flag. The tab symbol is \\t and newline
\\n.

Can you use these to better format `greeting.txt`?

Try to imagine what the following command will write & discuss with
others:

```
$ echo -e "column1\tcolumn2\nRNA\tDNA" > rna-dna-columns.txt
```

Check your success. These commands are useful when writing a lot of data
to a file programmatically, and when format such as having a defined
number of columns or lines is important, which is a *very*
common situation for bioinformatics work.  
**Remember this example.**   
You will use echo to create one of these files tomorrow.  

A common bioinformatics task is to concatenate a lot of individual
sequence files into one single file. This is very time consuming to do
in a GUI if you have more than a couple of files to open, copy, close,
open, paste. The task at the command line however *scales* easily from
1 to 1 million files. You already have all the skills to do this.


### Task

Go to the `linux-intro/data/fasta-to-combine` directory. Combine all 10
sequence files into a new file with an informative name. Do not add the
contents of the `readme.txt` file. Demonstrate your success.

Lastly `echo` can write file information like file names

```
$ echo *.fas > fasta-file-names.txt
```

This would write the name of every file in the current directory with a
`.fas` extension to a file called `fasta-file-names.txt` which is often very
useful when you need to record lots of output file information.

## 2.4 Combining multiple commands

### 2.4.1 Text-processing scripts

Often you will wish to do more complex tasks of manipulating text files.
These are best done with simple scripts and most bioinformaticians would
use a python script to do these sorts of tasks. Learning python is not
part of this tutorial (even though we run many python scripts) but there
are many free online courses if you wish to improve your knowledge (e.g.
[pythonforbiologists.com](http://pythonforbiologists.com/)
Google for many, many more).

You have a file `example-rna.fas` in the
`linux-intro/data/backtranscribe` directory which holds [fasta
format](https://en.wikipedia.org/wiki/FASTA_format)
RNA sequences. You need to change these sequences to DNA. You could
search and replace U with T using sed as above (try to write this
command for yourself). Unfortunately that will change every U to a T in
the sequence headers too. Instead a simple, but much more flexible and
intelligent, python script could be used,. This has been written for you
called `RNAtoDNA.py`.

### 2.4.2 Python scripts

### Task

Navigate to the correct directory and identify the python
script. Instructions are in the [Navigating the File
System](#navigating-the-file-system) section above if you
have forgotten.

Have a look at this `RNAtoDNA.py` file using your new command line skills
(see above if you have forgotten). If you don't understand everything, that's OK.
Have a quick guess what some parts might mean and then read on.

Comment lines begin with a hash \# symbol. These are just for humans to
read, they are ignored when the script is executed (run). Scripts with
lots of comments are much easier to understand and you should use them
yourself when you write or modify a script.

You are already familiar with [fasta format
files](https://en.wikipedia.org/wiki/FASTA_format). Given a fasta file containing RNA sequences, how can you convert them into DNA?  
Cou could probably write some
[pseudocode](https://en.wikipedia.org/wiki/Pseudocode)
quite quickly, and one version could look like this:

-   Open the input data file containing RNA sequence

-   Create an output file with a new name to save DNA sequence to

-   If input data line starts with \> its a header line

    -   write header line to output file

    -   move on, we're not changing headers

-   Otherwise its sequence data

    -   change all U \--\> T making it a DNA sequence

    -   write changed line to output file

-   Repeat until end of file, close files, claim success

Now read the python script again, is it more understandable? Some parts
may not be obvious, but it\'s generally like the pseudocode above.
Discuss the script with someone.

### Task
Create a new DNA fasta file from the file `transcripts.fas`
provided in this directory.

In order to run a program or script we specify the program to be run
(python) and the file to be executed (`RNA2DNA.py`).

```
$ python RNA2DNA.py
```

First figure out how to run the script to produce a DNA file. Next,
google how to rename files at the unix command line, and rename the
newly created file to `new-dna.fas` or something even more informative.
NB remember spaces in filenames cause troubles at the command line,
that's why dashes or underscores are commonly used.

```{note}
You have understood and run a python script in a unix shell, to
reformat nucleotide sequence data. Our work here is done, you are now a
bioinformatician, shake hands, welcome to the club!
```

### 2.4.3 Shell scripts- collecting together lots of commands

Similar to python scripts, the UNIX based shell allows us to execute
*shell scripts*, which usually have the .sh extension. Shell scripts
are a powerful way to link together lots of different commands and then
execute (run) them all at once. Below is a walk-through demonstrating a
shell script.

In our 'linux-intro/data' directory we have a shell script: `readmap_all.sh`

Its very easy to get lost or panic in the next paragraphs. *Don't panic!*
Just skim this section and ask someone. It\'s a deliberately complex
example, the point of this exercise is not that you read in detail,
understand exactly, and remember every detail. It is to give you an idea
that

By doing a `cat readmap_all.sh` we can view the content of this file on
screen.

The first line (`#!/bin/bash`) is what is called the *shebang* line and
points to the location of the shell program we wish to use when
executing the program. Any other **lines that begin with a hash (#) are
comment lines** and are ignored by the shell. Comment lines can help in
describing what each part does and are therefore very useful to remind
yourself, and others, what the script was intending to do.

The first command executed is `cd trimmed_reads` which changes the
directory to `trimmed_reads`. It then immediately executes `files=$(ls renamed\_\*R1\*)`.
This lists all the files which have `renamed_*R1*` in
their filenames and saves the list of filenames to the variable named
files. The stars sign (\*) is a placeholder. It means any character(s).
Next there is another cd command which makes the script jump back to the
previous directory. The command echo "Building index file:" outputs this
message to the screen. The next line is another comment . The command
`bowtie2-build ~/data/peltigera/Peltigera_membranacea/additional_data/Pmem_fungal_scaffolds_DNA_cleaned-up/Pmem_mycobiont_scaffold_1line.fasta ./02_bowtie/Pmem_fungal_index.build` calls the program `bowtie2-build`. The
`bowtie2` software is used to [align (map)](https://en.wikipedia.org/wiki/Sequence_alignment) short sequences, usually reads
to much longer sequences such as assemblies. This process is called *read
mapping*. The purpose of this can be manifold. For example read mapping
is needed if you want to dected SNPs or look at expression level
differences in RNASeq experiments. Of course there are many other
applications for read mappping.

The line `echo "building index done"` will inform us that the index file was successfully built. The next two lines create two additional variables: `samfile_base=pmem_readmap_` and `counter=0`. The `samfile_base` variable is used as a prefix for the new file which will be created during the rest of the shell script.

You may have already guessed that this script was written to automatize readmapping for many sequenced libraries against one reference genome. The actual mapping takes place in the next few file. Since we want to do this many times we use a for loop. This loop will repeat the commands between the do and done statement for a specific number of times. In this case it is done for the number of files in the `$files` variable which we created earlier. Inside the `for` loop , the first thing which is done here is to assign several new variables containing the file paths which should be mapped. Can you guess what the `sed` command in the script does? The file names `$d` and `$d2` are then displayed on screen with `echo`. In the following line the counter variable is increase by 1: `((++counter))`. After that a new variable is defined. Can you guess what this variable may contain?

The next command calls `bowtie2` to map the reads in the read files
against the reference: `bowtie2 -p 24 -q --phred33 --fr -x ./02_bowtie/Pmem_fungal_index.build -1 $name1 -2 $name2 -S ./02_bowtie/$samfile.sam`.
 It uses many of the variables created
earlier, so hopefully now it gets more clear why we need them. The next
command is another echo command which tells us that the bowtie command
is finished. `bowtie2` creates a socalled SAM file which contains
informations about the mapped reads such as the name of the read, the
sequence and the mapping coordinates. SAM files are text format files
and you could theoretically look at them with cat although this may not
be a good idea because SAM files can get very large. This is why we have
to compress them into a compressed binary format. Compressed SAM files
are called BAM files and we compress SAM files like this: `samtools view -bS ./02_bowtie/$samfile.sam | samtools sort - ./02_bowtie/$samfile`
and work only with the compressed files. The next line only has the done
command which indicates the end of the for loop. After the loop has
finished the last few lines of code will combine all just created BAM
files into a single file sort the file and creates an index file.

#### 2.4.3.1 The point of scripts - power, speed, reproducibility

I hope you can see that this shell script has done a lot of complex work
at once. Entering all these commands with the correct flags and file
locations from the command line directly would be very difficult, error
prone and would take a lot of time especially if you have to do this
several times. **Scripts aid reproducibility and save you work.**

Imagine that you were instead in a GUI environment in Windows, and had
to click buttons and type into text boxes to change analysis parameters.
Just to set the information contained in those few lines of shell script
described above would be much more work. What if you had to repeat 3
times, changing one parameter but setting all the others the same? That
would be easy with a shell script where you already know how to search
and replace text in a text file (like a script) from the command line,
but requires a lot of repetition in a GUI. What if you had to do this
1000 times across different combinations of parameters and write
informative output filenames to record which parameters were used?
Impossible in a GUI but straightforward with only a few basic scripting
skills. You could also give the script to a collaborator to run the same
analysis on their data. Or better yet add your analysis script to the
massive number already available online for all to use without
restriction.

```{note}
I hope you can see the reasons that bioinformaticians, and most modern
biologists working with lots of data, use the command line and scripts
rather than proprietary GUI programs.**
```

## 2.6 Tasks- review of command line skills

Use the skills you have learned to answer the questions below. The
information you need is in the sections above but this is 'open book',
you can use the internet just like a proper bioinformatician.

1.  How many `tef1` sequences are there in the `parmelia_sequences.fas`?

2.  What is the last sequence record? (No scrolling down please!)

3.  Search for `beta-tubulin` and replace it with Btub

4.  Demonstrate that you can (a) edit the file directly (b) save the
    edit as a new file

# 3 Additional reading

1.  UNIX Tutorial for Beginners
    [http://www.ee.surrey.ac.uk/Teaching/Unix/](http://www.ee.surrey.ac.uk/Teaching/Unix/)

2.  Command line history tricks
    [http://www.thegeekstuff.com/2008/08/15-examples-to-master-linux-command-line-history/](http://www.thegeekstuff.com/2008/08/15-examples-to-master-linux-command-line-history/)

3.  Software Carpentry Introduction to the unix shell [on
    YouTube](https://www.youtube.com/results?search_query=software+carpentry+September+2012+unix+shell)
    (great short videos)

4.  Unix and Perl Primer for Biologists
    [http://korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v3.0.pdf](http://korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v3.0.pdf)

5.  Bradnam and Korf. (2012) UNIX and Perl to the Rescue!: A Field Guide
    for the Life Sciences (and Other Data-rich Pursuits). ISBN-10:
    0521169828 ISBN-13: 978-0521169820
    [http://www.amazon.co.uk/gp/product/0521169828](http://www.amazon.co.uk/gp/product/0521169828)

6.  GREP
    [http://www.gnu.org/software/grep/manual/grep.html](http://www.gnu.org/software/grep/manual/grep.html)

7.  SED
    [http://www.gnu.org/software/sed/manual/sed.html](http://www.gnu.org/software/sed/manual/sed.html)

8.  Software Carpentry Introduction to programming in Python ([great
    short YouTube videos](https://www.youtube.com/results?search_query=software+carpentry+September+2012+python))

9.  Python for Biologists
    [http://pythonforbiologists.com](http://pythonforbiologists.com)

10. Python for non-programmers
    [https://wiki.python.org/moin/BeginnersGuide/NonProgrammers](https://wiki.python.org/moin/BeginnersGuide/NonProgrammers)