Beginner's Unix

Length of tutorial: 3hr
Difficulty: Moderate

Objectives

  • Introduce complete beginners to Unix
  • Explain the following:
      • General navigation on the command line
      • Finding things
      • Extracting information

When we are used to interacting with computers with a mouse (point and click), the notion of doing things on the command line can seem complex and counterintuitive. In bioinformatics, however, the difference in time between doing a task on the command line and doing it in a graphical interface can be on the order of weeks or even months.

Tutorial 1

Go to this link and sign up for this tutorial. It’s a really nice, interactive introduction to Linux:

Tutorial 2

You now need to log in to the server from your terminal:

For Mac users:

ssh username@dtp2017.climb.ac.uk

For Windows users:

You need to follow the PuTTY steps from earlier today!

Tutorial 2a

You can now have a go at finding things. Go to the link below and follow the tutorial, trying the commands in your terminal window (haiku.txt is in your home directory):

https://swcarpentry.github.io/shell-novice/07-find/

Once you get to the part about species, move on to the next tutorial.

P.S. If you have the inclination, the whole tutorial is great (go to https://swcarpentry.github.io/shell-novice).

Tutorial 2b

Now, with your newfound Linux skills, go to the exercises_grep_awk directory like this:

cd /opt/exercises_grep_awk/

Now find all the lines in the file data/exercise1_grep.txt that contain the word "start", and follow the instructions within the tutorial!
Clue: try using grep.
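
As a warm-up, here is a minimal sketch of how grep picks out lines. It runs on a small throwaway file whose contents are made up for illustration (they are not the exercise data), so you can compare the output with the input line by line:

```shell
# Create a small temporary file with made-up example lines.
demo_file=$(mktemp)
printf 'start here\nnothing to see\nRestart the job\nthe start of the end\n' > "$demo_file"

# Print every line containing the string "start" (case-sensitive).
matches=$(grep 'start' "$demo_file")
echo "$matches"

# -c counts the matching lines instead of printing them.
match_count=$(grep -c 'start' "$demo_file")
echo "$match_count"

# -w matches "start" only as a whole word, so "Restart" no longer counts.
word_count=$(grep -cw 'start' "$demo_file")
echo "$word_count"

# Tidy up the temporary file.
rm -f "$demo_file"
```

Note that a plain grep matches substrings, which is why "Restart the job" is reported unless you add -w.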

Once you have finished that exercise, try this:

head data/genes/chr8.gff

As you can see, it is a tab-separated file, which we could easily read in Excel or Calc.

The format specifications are defined here, but in short:

  • The first, fourth and fifth columns contain the chromosome name and coordinates
  • The second column describes the tool or resource that generated the annotation
  • The third column describes the type of feature (e.g. gene, transcript, exon, TF binding site, histone acetylation mark, etc.)
  • The ninth column contains several fields, separated by a semicolon
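
To make the column layout concrete, here is a small sketch that pulls those columns out of a single made-up record. The source name, coordinates, ID and symbol are all invented for illustration; only the column positions follow the description above.

```shell
# One made-up GFF-style record; fields are separated by tabs (\t).
gff_line=$(printf 'chr8\texample_source\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;symbol=EXAMPLE1')

# cut -f selects tab-separated columns: chromosome (1), start (4), end (5).
coords=$(printf '%s\n' "$gff_line" | cut -f 1,4,5)
echo "$coords"

# awk -F'\t' splits on tabs only; $3 is the feature type.
feature=$(printf '%s\n' "$gff_line" | awk -F'\t' '{print $3}')
echo "$feature"

# Take the ninth column and put each semicolon-separated field on its own line.
attributes=$(printf '%s\n' "$gff_line" | cut -f 9 | tr ';' '\n')
echo "$attributes"
```

Setting the field separator to a tab with -F'\t' is a cautious choice here: by default awk splits on any whitespace, which would break a ninth column that contains spaces.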

Can you print all the lines for features located between positions 5000000 and 10000000 on the chromosome?
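
If you are unsure where to begin, one possible approach is a numeric comparison in awk. The sketch below uses three invented records with the start and end in columns 2 and 3 (not the GFF layout), and assumes "between" means the whole feature lies inside the interval; you will need to adapt the column numbers and limits yourself.

```shell
# Three made-up tab-separated records: name, start, end (not course data).
records=$(printf 'featA\t100\t200\nfeatB\t450\t900\nfeatC\t1200\t1500\n')

# Keep records whose start ($2) is at least 400 and whose end ($3) is at most 1000.
# An awk pattern with no action prints every line for which the pattern is true.
in_range=$(printf '%s\n' "$records" | awk -F'\t' '$2 >= 400 && $3 <= 1000')
echo "$in_range"
```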

Try this command. What is it doing?

awk '{print $1, $5-$4, $9}' data/genes/chr8.gff | grep -v '^#' | head

Now try this command and figure out what it is doing:

awk '$9 ~ /symbol=MIR/ {print $0}' data/genes/chr8.gff

By modifying the last two commands, can you calculate the length of the gene POU5F1B?

ANSWER will appear after discussion