Examples
========
Video clips
-----------
These short clips should give just a feel for how *ASCIIGenome* works in interactive
mode.
.. raw:: html
This example loads a bam file and shows the read track at the bottom
and the coverage track on top. After loading the bam file, the follow operations
are executed at the command prompt:
* Go to specified region.
* Zoom in.
* Repeatedly move forward by half a window size.
* Zoom out.
* Filter reads using samtools-like syntax.
----
.. raw:: html
Here there are two bigWig files of ChIP-Seq profiles. The corresponding peak regions
are separately loaded. The following operations are executed at the command prompt:
* Change track height to 12 lines each.
* Add two *narrowPeak* files, captured with the glob :code:`*.narrowPeak.gz`.
* Re-order tracks.
* Zoom out.
* Hide title lines.
* Change limits of the Y-axes to be from 0 to maximum of all tracks.
* Move to the next bed feature.
* Show help for the :code:`ylim` command.
Why the command line
--------------------
At first it may be nonsensical to use a viewer without graphical interface. In
addition, typed commands involve a steeper learning curve and more frustration.
However, the command line interface has some great benefits which, at least in
part, explain why most bioinformaticians prefer R, python and Unix tools over
Excel and Galaxy (which are great, by the way, just like IGV is great). Some advantages
of the command line interface over the GUI:
* Streamline repetitive tasks.
* Finer control of the commands.
* Self-documented and therefore reproducible.
This example should illustrate these points. For a start, most of my data files,
especially the big ones, live on the institutes's computer cluster or on our
group server. Almost nothing is stored on my workstation. Consequently pretty
much all the work I do is via Unix commands, bash and the familiar samtools, bedtools,
etc. Popping up a GUI is often a disturbance of the workflow so having a
visualisation tool with the same interface (*i.e.* command line) as these tools
is more natural.
Now, say we want to visualise the following files::
ts058_TS10-PEO1-Pt-A2.tdf
ts059_TS10-PEO1-Pt-A4.tdf
ts060_TS11-PEO1-DMF-A13.tdf
ts061_TS11-PEO1-Pt-A12.tdf
ts062_TS11-PEO1-Pt-A7.tdf
ts063_TS11-PEO4-DMF-A5.tdf
ts064_TS11-PEO4-Pt-A2.tdf
ts065_TS11-PEO4-Pt-A4.tdf
ts069_PEO1-DMF-A5.tdf
ts070_PEO1-Pt-A18.tdf
ts071_PEO1-Pt-A2.tdf
Loading all these files by clinking one by one through a GUI can be annoying
especially if they are in different directories, not counting the time spent to
pop up the GUI and scroll through the relevant menus. With *ASCIIGenome* you can
probably just do one of the following::
ASCIIGenome ts0{58..71}*.tdf
ASCIIGenome *PEO1*.tdf *PEO4*.tdf
ASCIIGenome `find . -name '*PEO*.tdf'` ## If files are in different subdirs
ASCIIGenome ts058_TS10-PEO1-Pt-A2.tdf ts059_TS10-PEO1-Pt-A4.tdf
The command line you have used can be copied in your documentation for reference and it can
be used again by copying and pasting it to the terminal.
Once these files have been loaded you may want to order them to have the PEO1
tracks before the PEO4s. This is just::
orderTracks PEO1 PEO4
Similarly, settings can be changed without the need of scrolling through
menu options, for example::
colorTrack blue PEO1 <- Turn blue the PEO1 tracks
trackHeight 10 DMF <- Make 10 lines high the tracks matching DMF
Furthermore, the commands issued at the prompt can be scrolled with the UP and DOWN arrow
keys. So if we want to change the colour of the PEO1 tracks again we just need to press UP two times,
bring back the :code:`colorTrack blue PEO1` command, and edit it as required.
We can put this together in a single command which, again, can go to the documentation::
ASCIIGenome -x 'orderTracks PEO1 PEO4 && colorTrack blue PEO1 && trackHeight 10 DMF' ts0{58..71}*.tdf
Open and browse
---------------
Open some peak and bigWig files from
`ENCODE `_.
.. note:: Opening remote files is a little slow (IGV seems equally slow). You might also need to start Java with the option `-Djava.net.useSystemProxies=true` (see also `issue#6 `_)
::
encode=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs
ASCIIGenome \
$encode/wgEncodeSydhTfbsGm10847NfkbTnfaIggrabPk.narrowPeak.gz \
$encode/wgEncodeSydhTfbsGm10847NfkbTnfaIggrabSig.bigWig \
$encode/wgEncodeSydhTfbsGm12892Pol2IggmusPk.narrowPeak.gz \
$encode/wgEncodeSydhTfbsGm12892Pol2IggmusSig.bigWig
Find the first feature on the first file, then change colour of one of the tracks. Reset y axes to
span 0 to 50, finally save as pdf to default file name::
[h] for help: setGenome hg19
[h] for help: next #1
[h] for help: colorTrack pink wgEncodeSydhTfbsGm12892Pol2IggmusSig
[h] for help: ylim 0 50
[h] for help: save %r.pdf
Result on terminal screen should look like this:
.. image:: screenshots/chr1_996137-1003137.png
The screenshot is saved to *chr1_996137-1003137.pdf*, note that the variable :code:`%r` is expanded to the genomic coordinates.
Finding & filtering stuff
-------------------------
Once started, :code:`ASCIIGenome` makes it easy to browse the genome. The picture below shows the distribution of transcripts on chromosome 36 of *Leishmania major*. It is clearly visible how transcripts in *Leishmania* tend to be grouped in blocks transcribed from the same direction (blue: forward strand, pink: reverse strand). Note how overlapping features are stacked on top of each other.
This screenshot has been produced by first loading the *L. major* GTF file::
ASCIIGenome ftp://ftp.ensemblgenomes.org/pub/release-31/protists/gtf/leishmania_major/Leishmania_major.ASM272v2.84.gtf.gz
At the command prompt issue the following commands::
[h] for help: goto 36:1-2682151
[h] for help: grep -i transcript
[h] for help: trackHeight 100
.. image:: screenshots/leishmania_transcripts.png
Now return to the start of the chromosome and find the first feature containing *LmjF.36.TRNAGLN.01*,
print it to screen::
[h] for help: 1
[h] for help: find LmjF.36.TRNAGLN.01
[h] for help: print
Now showing:
.. image:: screenshots/leishmania_find.png
.. _Batch-processing:
Advanced filtering with `awk`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From version 1.5.0, the `awk `_ program is available within ASCIIGenome to filter interval features and SAM records.
See the command reference for *awk* for more detials. This an example of using awk to filter reads containing deletions, *i.e.* where the cigar string contains
the character *D*::
awk '$6 ~ "D"' *.bam
The syntax ``$6`` selects the 6th column, the one containing the cigar string, the operator ``~``
returns true if the cigar string contains the string ``D``. As usual, the last positional argument
applies the command to the tracks matching the given regex.
The internal function ``get()`` returns the value of the additional tags in SAM, GFF/GTF, or VCF files.
For example, to select only alignment containing mismatches, *i.e.* where the NM tag is greater than zero, use::
awk 'get("NM") > 0'
Filter VCF records where tag ``CalledBy`` contains "WCMC". Note the ``~``
operator to perform refular expression matching::
awk 'get("CalledBy") ~ "WCMC"'
or where ``CalledBy`` does not contain ``WC``::
awk 'get("CalledBy") !~ "WCMC"'
If ``CalledBy`` contains a list of values you can access them with the second
parament of ``get(tag, index)`` indicating the index of the value to be returned, 1-based.
*E.g.*, return records where the second element of ``CalledBy`` equals
``BCM``::
awk 'get("CalledBy", 2) == "BCM"'
When working with long reads, it is often useful to filter for aligmments longer than a cutoff::
awk 'getAlnLen() > 2000'
Batch and non-interactive mode
------------------------------
*ASCIIGenome* can be integrated in a script to be executed without direct human
intervention. For example, a simple bash script may contain the following
commands::
#!/bin/bash
## Find ChIP-Seq peaks
CHIP=ChIP.bam
macs2 callpeak -t $CHIP -c input.bam -n out
## Output pdf in a "control" region for later visual inspection
ASCIIGenome -ni -r chr1:1000000-1020000 -x "save ${CHIP}.ctrl.pdf" \
$CHIP input.bam out_peaks.narrowPeak > /dev/null
In this script, a ChIP-Seq sample is first analysed to find peaks against an
input control. ChIP, input and output from the peak caller are then loaded in
*ASCIIGenome* to visualize a region of interest. *ASCIIGenome* will save the
image in pdf file named after the ChIP sample and exit. An investigator can
later inspect the pdf figure to assess the quality of the ChIP or to check
whether a peak has been detected.
The example above can easily be extended to several regions to be visualised in
batch for one or more tracks. For example, you have a list of ChIP-Seq peaks or
RNA-Seq genes and you want to see the coverage profiles together with an
annotation file. :code:`ASCIIGenome` allows batch processing via the
:code:`--batchFile/-b` option.
This script iterates through the intervals in *peaks.bed*. For each interval, it displays two
bigWig, a gtf file and the peak file itself. Each interval is zoomed out 3 times and the screenshot
saved as pdf to :code:`/tmp/peak.%r.pdf`, where `%r` is a special variable expanded to the current
coordinates as `chrom_start-end`.::
ASCIIGenome -b peaks.bed \
-x 'zo 3 && save /tmp/peak.%r.pdf' \
chipseq.bigwig \
input.bigwig \
gencode_genes.gtf \
peaks.bed > /dev/null
To save all the screenshots in a single pdf use the >> operator in the *save* command, *e.g.* :code:`save >> myScreenshots.pdf`.
Finding sequence motifs
-----------------------
The reference fasta sequence can be searched for sequence motifs specified via regular expressions
or via `IUPAC notation `_.
This example is from `Biostars `_. We want to find matches of
the motif TATAWAA near gene ENSG00000168487.
First load the reference sequence and a (remote) annotation file::
ASCIIGenome -fa Homo_sapiens.GRCh38.dna.chromosome.8.fa \
ftp://ftp.ensembl.org/pub/release-86/gff3/homo_sapiens/Homo_sapiens.GRCh38.86.chromosome.8.gff3.gz
Then at the command prompt issue these commands::
find ENSG00000168487
grep -i \tgene\t.*ENSG00000168487 gff3
seqRegex -iupac TATAWAA
zo 8
print seqRegex
print seqRegex > matches.bed
save matches.png
Explained: Find the gene ENSG00000168487, for clarity only show the "gene" feature (:code:`grep...`).
Then search the motif TATAWAA interpreted as iupac notation; zoom out *x* times (e.g. 8 times) to see some
matches in the sequence.
The matches here are shown on screen with :code:`print seqRegex` and then saved to file with :code:`print seqRegex > matches.bed`. Finally save a picture as png, shown here:
.. image:: screenshots/matches.png