Length of tutorial: 3hr
Difficulty: Moderate
Objectives
- Introduce complete beginners to Unix
- Explain the following:
  - General navigation on the command line
  - Finding things
  - Extracting information
When we are used to interacting with computers with a mouse (point and click), doing things on the command line can seem complex and counterintuitive. In bioinformatics, however, the time saved by using the command line rather than a graphical interface can be on the order of weeks or even months.
Tutorial 1
Go to this link and sign up for this tutorial. It's a really nice, interactive introduction to Linux:
Tutorial 2
You now need to sign in to the server from your terminal (replace username with your own username):
For Mac users:
ssh username@dtp2017.climb.ac.uk
For Windows users:
You need to follow the PuTTY steps from earlier today!
Tutorial 2a
You can now have a go at finding things. Go to the link below and follow the tutorial, trying the commands in your terminal window (haiku.txt is in your home directory):
https://swcarpentry.github.io/shell-novice/07-find/
Once you get to the bit about species etc., move on to the next tutorial.
P.S. If you have the inclination, the whole tutorial is great (go to https://swcarpentry.github.io/shell-novice)
Tutorial 2b
Now, with your newfound Linux skills, go to the exercises_grep_awk directory like this:
cd /opt/exercises_grep_awk/
Now you have to find all lines in the file data/exercise1_grep.txt that contain the word start and follow the instructions within the tutorial!
Clue: try using grep
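If you have not used grep before, here is a minimal sketch of how it behaves. It runs on a small throwaway file created just for illustration (demo.txt and its contents are made up, not the exercise file), so you can try the options safely before applying them to data/exercise1_grep.txt.

```shell
# Build a small throwaway file (hypothetical content, not the exercise file).
tmpdir=$(mktemp -d)
printf 'start here\nnothing to see\nrestart the job\nthe end\n' > "$tmpdir/demo.txt"

# Print every line containing the string "start" (this also matches "restart").
grep 'start' "$tmpdir/demo.txt"

# -w matches "start" only as a whole word; -n prefixes each hit with its line number.
grep -wn 'start' "$tmpdir/demo.txt"

# -c counts the matching lines instead of printing them.
match_count=$(grep -c 'start' "$tmpdir/demo.txt")
echo "$match_count"
```

The first command prints two lines, the second only the line where start appears as a separate word. Which behaviour you want depends on what the exercise file asks for.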
Once you have finished that exercise, try this:
head data/genes/chr8.gff
As you can see, it is a tab-separated file, which we could easily open in Excel or Calc.
The format specifications are defined here, but in short:
- The first, fourth and fifth columns contain the chromosome name and coordinates
- The second column describes the tool or resource that generated the annotation
- The third column describes the type of feature (e.g. gene, transcript, exon, TF binding site, histone acetylation mark, etc.)
- The ninth column contains several fields, separated by a semicolon
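The column layout above can be sketched on a single made-up line. The values below (demo_tool, gene1, DEMO1 and the coordinates) are invented for illustration and are not taken from chr8.gff. The sketch assumes you want awk to split on tabs only, which is what -F '\t' does.

```shell
# One made-up, GFF-like line: nine tab-separated columns (not real annotation data).
demo_line=$(printf 'chr8\tdemo_tool\tgene\t100\t250\t.\t+\t.\tID=gene1;symbol=DEMO1')

# -F '\t' splits on tabs only, so spaces inside column 9 would not break the columns.
# NF is the number of fields awk found on the line.
printf '%s\n' "$demo_line" | awk -F '\t' '{print NF}'

# Chromosome, start, end and feature type (columns 1, 4, 5 and 3).
summary=$(printf '%s\n' "$demo_line" | awk -F '\t' '{print $1, $4, $5, $3}')
echo "$summary"

# Column 9 holds several fields separated by ";"; tr puts each one on its own line.
printf '%s\n' "$demo_line" | awk -F '\t' '{print $9}' | tr ';' '\n'
```

Without -F, awk splits on any run of spaces or tabs, which gives the same columns only as long as no column contains a space.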
Can you print all the lines for features located between positions 5000000 and 10000000?
Try this command. What is it doing?
awk '{print $1, $5-$4, $9}' data/genes/chr8.gff | grep -v '^#' | head
Now try this command, and figure out what it is doing:
awk '$9 ~ /symbol=MIR/ {print $0}' data/genes/chr8.gff
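If the general shape of an awk command is unfamiliar, it may help to know that awk programs are written as a pattern followed by an action in braces, and the action runs only on lines where the pattern is true. Here is a small sketch on made-up records (the names geneA, geneB, mirC and their coordinates are invented for illustration):

```shell
# Three made-up, tab-separated records: name, start, end (not real data).
demo=$(printf 'geneA\t10\t50\ngeneB\t200\t260\nmirC\t300\t320\n')

# A regular-expression pattern: print whole lines whose first column starts with "gene".
printf '%s\n' "$demo" | awk -F '\t' '$1 ~ /^gene/ {print $0}'

# A numeric pattern: print the name of records that start at or after position 100.
late=$(printf '%s\n' "$demo" | awk -F '\t' '$2 >= 100 {print $1}')
echo "$late"
```

Patterns can be combined with && (and) and || (or), and the action can print any expression built from the columns.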
By modifying the last two commands, can you calculate the length of the gene POU5F1B?
ANSWER will appear after discussion