Analysis

Complete Examples

Here we will show you complete examples of running the pipeline using test data that is included with the source code.

Note: Any time you see

$> command

It means you should type that command into your terminal (omitting the $> prompt itself)

All examples assume your current working directory is inside the cloned ngs_mapper git repository, i.e., the output of the following command ends with ngs_mapper:

$> pwd

For both examples below, as always when running the pipeline, you need to ensure your installation is activated:

$> . ~/.ngs_mapper/bin/activate

The data sets are located under ngs_mapper/tests/fixtures/functional

$> ls ngs_mapper/tests/fixtures/functional
780  780.conf  780.ref.fasta  947  947.conf  947.ref.fasta

Here you can see we have 2 data sets to play with.

  • 780 is an H3N2 data set
  • 947 is a Dengue 4 data set

You will notice that each data set consists of a directory (780, 947) and a matching .ref.fasta file (780.ref.fasta, 947.ref.fasta). The directory contains all the read files for that sample, while the .ref.fasta file is the reference to map to. You can ignore the .conf files; they are used by the automated tests.

Quick note about your files

In previous versions of the pipeline the names of your read files were used to identify which platform the reads came from.

Now the platform is determined from the format of the first sequence identifier in each read file.

You can read more about this here

Using runsample.py to run a single sample

Sometimes you just need to run a single sample. Here we will use runsample.py to run the 947 example data set and put the analysis into a directory called 947 in the current directory.

First, let’s see what options are available for the runsample.py script

$> runsample.py
usage: runsample.py [-h] [--config CONFIG] [-trim_qual TRIM_QUAL]
                    [-head_crop HEAD_CROP] [-minth MINTH] [--CN CN]
                    [-od OUTDIR]
                    readsdir reference prefix
runsample.py: error: too few arguments

What you can take from this is:

  • Anything inside a [] block is an optional argument with a default value that will be used if you do not specify it.
  • readsdir, reference, and prefix are required arguments that you MUST specify.

So the command to run the project with the fewest arguments would be as follows (don’t run this, it is just an example):

$> runsample.py ngs_mapper/tests/fixtures/functional/947 ngs_mapper/tests/fixtures/functional/947.ref.fasta 947

This will run the 947 data and map to the 947.ref.fasta reference. All output files will be prefixed with 947. Since we did not specify the -od argument, all the files from the pipeline get dumped into your current directory.

Most likely you will want to specify a separate directory to put all the 947 specific analysis files into. But how?

We can get extended help information, which also prints the defaults, from any script by using the --help option

$> runsample.py --help
usage: runsample.py [-h] [--config CONFIG] [-trim_qual TRIM_QUAL]
                    [-head_crop HEAD_CROP] [-minth MINTH] [--CN CN]
                    [-od OUTDIR]
                    readsdir reference prefix

Runs a single sample through the pipeline

positional arguments:
  readsdir              Directory that contains reads to be mapped
  reference             The path to the reference to map to
  prefix                The prefix to put before every output file generated.
                        Probably the samplename

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
                        Path to config.yaml file
  -trim_qual TRIM_QUAL  Quality threshold to trim[Default: 20]
  -head_crop HEAD_CROP  How many bases to crop off the beginning of the reads
                        after quality trimming[Default: 0]
  -minth MINTH          Minimum fraction of all remaining bases after
                        trimming/N calling that will trigger a base to be
                        called[Default: 0.8]
  --CN CN               Sets the CN tag inside of each read group to the value
                        specified.[Default: None]
  -od OUTDIR, --outdir OUTDIR
                        The output directory for all files to be put[Default:
                        /home/myusername/ngs_mapper]

You can see that --help gives us the same initial output as running runsample.py without any arguments, but also contains extended help for all of the arguments. The --help argument is available for all ngs_mapper scripts that end in .py. (If you find one that doesn’t, head over to Creating Issues and file a new Bug Report.)

You can see that the -od option’s default is our current directory. So if we want the analysis files for each sample to go into a specific directory, we can specify a different one. While we are at it, let’s try specifying some of the other optional arguments too.

Let’s tell runsample.py to put our analysis into a directory called 947 and also tell it to crop off 20 bases from the beginning of each read.

$> runsample.py -od 947 -head_crop 20 ngs_mapper/tests/fixtures/functional/947 ngs_mapper/tests/fixtures/functional/947.ref.fasta 947
2014-12-22 10:17:52,465 -- INFO -- runsample       --- Starting 947 ---
2014-12-22 10:21:28,526 -- INFO -- runsample       --- Finished 947 ---

You can see from the output that the sample started and finished. If there were errors, they would appear between those two lines, and you would need to consult the Help documentation.

So what analysis files were created? You can see them by listing the output directory:

$> ls -l 947
-rw-r--r--. 1 myusername users 36758279 Dec 22 10:19 947.bam
-rw-r--r--. 1 myusername users       96 Dec 22 10:19 947.bam.bai
-rw-r--r--. 1 myusername users    10869 Dec 22 10:21 947.bam.consensus.fasta
-rw-r--r--. 1 myusername users   269058 Dec 22 10:21 947.bam.qualdepth.json
-rw-r--r--. 1 myusername users   204502 Dec 22 10:21 947.bam.qualdepth.png
-rw-r--r--. 1 myusername users  1291367 Dec 22 10:20 947.bam.vcf
-rw-r--r--. 1 myusername users     2414 Dec 22 10:21 947.log
-rw-r--r--. 1 myusername users   307180 Dec 22 10:21 947.reads.png
-rw-r--r--. 1 myusername users    10840 Dec 22 10:17 947.ref.fasta
-rw-r--r--. 1 myusername users       10 Dec 22 10:18 947.ref.fasta.amb
-rw-r--r--. 1 myusername users       67 Dec 22 10:18 947.ref.fasta.ann
-rw-r--r--. 1 myusername users    10744 Dec 22 10:18 947.ref.fasta.bwt
-rw-r--r--. 1 myusername users     2664 Dec 22 10:18 947.ref.fasta.pac
-rw-r--r--. 1 myusername users     5376 Dec 22 10:18 947.ref.fasta.sa
-rw-r--r--. 1 myusername users     2770 Dec 22 10:21 947.std.log
-rw-r--r--. 1 myusername users    17219 Dec 22 10:18 bwa.log
-rw-r--r--. 1 myusername users      380 Dec 22 10:20 flagstats.txt
-rw-r--r--. 1 myusername users      249 Dec 22 10:21 graphsample.log
-rw-r--r--. 1 myusername users   137212 Dec 22 10:19 pipeline.log
drwxr-xr-x. 2 myusername users     4096 Dec 22 10:21 qualdepth
drwxr-xr-x. 2 myusername users     4096 Dec 22 10:18 trimmed_reads
drwxr-xr-x. 2 myusername users     4096 Dec 22 10:17 trim_stats

You can view information about each of the output files on the runsample output directory documentation page

If you have IGV installed, an easy way to quickly view your bam file from the command line is:

igv.sh -g 947/947.ref.fasta 947/947.bam

Using runsamplesheet.sh to run multiple samples in parallel

runsamplesheet.sh is a wrapper script that makes it easier to run runsample.py on many samples at once.

First create a Samplesheet, then run it as follows:

$> runsamplesheet.sh /path/to/NGSData/ReadsBySample samplesheet.tsv

So let’s run the 947 and 780 samples as our example.

  1. Make a directory for all of our analysis to go into

    $> mkdir -p tutorial
    $> cd tutorial
    
  2. Create a new file called samplesheet.tsv and put the following in it (you can use gedit samplesheet.tsv to edit/save the file):

    947 ../ngs_mapper/tests/fixtures/functional/947.ref.fasta
    780 ../ngs_mapper/tests/fixtures/functional/780.ref.fasta
    
  3. Run your samplesheet with runsamplesheet.sh

    $> runsamplesheet.sh ../ngs_mapper/tests/fixtures/functional samplesheet.tsv
    2014-12-22 12:30:25,381 -- INFO -- runsample       --- Starting 780 ---
    2014-12-22 12:30:25,381 -- INFO -- runsample       --- Starting 947 ---
    2014-12-22 12:30:50,834 -- INFO -- runsample       --- Finished 780 ---
    2014-12-22 12:34:08,523 -- INFO -- runsample       --- Finished 947 ---
    1.82user 0.05system 0:01.01elapsed 185%CPU (0avgtext+0avgdata 242912maxresident)k
    0inputs+728outputs (1major+26371minor)pagefaults 0swaps
    5.02user 0.11system 0:04.03elapsed 127%CPU (0avgtext+0avgdata 981104maxresident)k
    0inputs+3160outputs (1major+77772minor)pagefaults 0swaps
    2014-12-22 12:34:19,843 -- WARNING -- graph_times     Projects/780 ran in only 25 seconds
    2014-12-22 12:34:19,843 -- INFO -- graph_times     Plotting all projects inside of Projects
    
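Steps 1 and 2 above can also be scripted. This is only a sketch: it assumes the samplesheet columns are separated by real tab characters, as the .tsv extension suggests; using printf makes the tabs explicit, since some editors silently convert tabs to spaces.

```shell
# Step 1: make the analysis directory (you would then cd into it)
mkdir -p tutorial

# Step 2: write the tab-separated samplesheet; \t guarantees real tabs
printf '947\t../ngs_mapper/tests/fixtures/functional/947.ref.fasta\n' >  tutorial/samplesheet.tsv
printf '780\t../ngs_mapper/tests/fixtures/functional/780.ref.fasta\n' >> tutorial/samplesheet.tsv
```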

You can see that the pipeline ran both of our samples in parallel. The pipeline determines how many CPU cores your system has and runs that many samples concurrently.
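The number of samples run concurrently follows that detected core count. If you are curious what the count is on your machine, nproc (a general GNU coreutils command, not something the pipeline itself provides) is a quick way to check:

```shell
# Print the number of CPU cores available to this process;
# the pipeline will run roughly this many samples at once.
nproc
```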

You can then view all of the resulting output files and directories that were created

$> ls -l
total 1184
-rw-r--r--. 1 myusername users   2101 Dec 22 12:34 graphsample.log
-rw-r--r--. 1 myusername users  50794 Dec 22 12:34 MapUnmapReads.png
-rw-r--r--. 1 myusername users 756139 Dec 22 12:34 pipeline.log
-rw-r--r--. 1 myusername users  34857 Dec 22 12:34 PipelineTimes.png
drwxr-xr-x. 4 myusername users   4096 Dec 22 12:34 Projects
-rw-r--r--. 1 myusername users 292764 Dec 22 12:34 QualDepth.pdf
-rw-r--r--. 1 myusername users  52064 Dec 22 12:34 SampleCoverage.png
-rw-r--r--. 1 myusername users    122 Dec 22 12:28 samplesheet.tsv
drwxr-xr-x. 2 myusername users   4096 Dec 22 12:34 vcf_consensus

You can see what each of these files means by heading over to the runsamplesheet.sh documentation

Changing defaults for pipeline stages

If you want to change any of the settings of any of the pipeline stages, you will need to create a config.yaml file and supply it to runsample.py using the -c option. You can read more about creating and editing the config on the config.yaml page

Rerunning Samples

Rerunning samples is very similar to just running samples.

  1. Copy and edit the existing Samplesheet and comment out or delete the samples you do not want to rerun.

  2. Run the runsamplesheet.sh script on the modified samplesheet
    • Note: As of right now, you have to manually remove the existing project directories for the samples you want to rerun.
  3. Regenerate graphics for all samples
    • The -norecreate flag tells it not to recreate the qualdepth.json file for each sample, which is very time consuming. The rerun samples will already have recreated their qualdepth.json files when runsample.py was run on them.

      graphs.sh -norecreate
      
  4. You should not have to rerun consensuses.sh, as it only symlinks the files
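The rerun steps above can be sketched for a single sample. This sketch assumes the tutorial samplesheet from the previous section, and that a leading # comments out a samplesheet line (a conventional assumption, not confirmed by this page); the actual pipeline invocations are left as comments since they need your NGS data in place.

```shell
# A copy of the tutorial samplesheet (yours will differ)
printf '947\t../ngs_mapper/tests/fixtures/functional/947.ref.fasta\n' >  samplesheet.tsv
printf '780\t../ngs_mapper/tests/fixtures/functional/780.ref.fasta\n' >> samplesheet.tsv

# Step 1: comment out the samples you do NOT want to rerun (here, keep only 947)
sed 's/^780/#780/' samplesheet.tsv > rerun.tsv

# Manually remove the existing project directory for the sample being rerun,
# since the pipeline will not do this for you
rm -rf Projects/947

# Steps 2 and 3 (not run here; they require your NGS data to be present):
#   runsamplesheet.sh ../ngs_mapper/tests/fixtures/functional rerun.tsv
#   graphs.sh -norecreate
```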

Temporary Directories/Files

The pipeline initially creates a temporary analysis directory for each sample that you run with runsample.py. By default this directory is created inside your system’s configured temporary directory (most likely /tmp). Being able to change this location is especially useful if your /tmp partition is not very large, or if you have a custom temporary partition on a very fast drive, such as a solid state drive, that you would rather use.

If you use a custom temporary directory, it is important that you create it first, as it will not be created for you (/tmp itself, of course, already exists).

You can control what directory this is by utilizing the TMPDIR environmental variable as follows:

mkdir -p /path/to/custom/tmpdir
export TMPDIR=/path/to/custom/tmpdir
SAMPLE=samplename
runsample.py /path/to/NGSData/ReadsBySample/${SAMPLE} /path/to/reference ${SAMPLE} -od Projects/${SAMPLE}
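As a sanity check before launching a run, you can verify that the custom directory actually exists and is writable. The path below is hypothetical; substitute your own.

```shell
# Create the custom temporary directory first -- the pipeline will not
TMP_CANDIDATE=/tmp/ngs_mapper_tmpdir   # hypothetical path for illustration
mkdir -p "$TMP_CANDIDATE"
export TMPDIR="$TMP_CANDIDATE"

# Confirm it exists and is writable before running runsample.py
test -d "$TMPDIR" && test -w "$TMPDIR" && echo "TMPDIR ready: $TMPDIR"
```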