too-many-peaks
Walkthrough
Table of Contents
This is an instructional example of using too-many-peaks
meant to demonstrate
typical usage.
For more information about too-many-peaks
and too-many-cells
:
See https://github.com/GregorySchwartz/too-many-cells for latest version.
Install too-many-peaks
Install too-many-cells
too-many-peaks
is an extension included in too-many-cells
, so you should
follow instructions on https://gregoryschwartz.github.io/too-many-cells/ for
details on too-many-cells
. First, clone the too-many-cells
repository.
git clone https://github.com/GregorySchwartz/too-many-cells.git
Enter the folder and install with nix.
cd ./too-many-cells
nix-env -f default.nix -i too-many-cells
Adding to path (with stack installation)
If using stack, the resulting binary will install to ~/.local/bin
. Add to
$PATH
so you can invoke the command from anywhere!
export PATH=$HOME/.local/bin:$PATH
This command will only work in the current shell. To permanently add to path,
add the previous line to ~/.bashrc
or ~/.profile
.
Testing the installation
Test to see if the installation worked when in path:
too-many-cells -h
too-many-cells, Gregory W. Schwartz. Clusters and analyzes single cell data. Usage: too-many-cells (make-tree | interactive | differential | diversity | paths | classify | peaks | motifs | matrix-output) Available options: -h,--help Show this help text Available commands: make-tree interactive differential diversity paths classify peaks motifs matrix-output
Here, seeing too-many-cells peaks
denotes the correct version of
too-many-cells
to use too-many-peaks
.
Data download
Download peripheral blood mononuclear cells
We are going to see if we can identify T cells in a collection of peripheral blood mononuclear cells. We'll need data from 10x, specifically https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_pbmc_5k as a quick, illustrative example here. We can use the peak matrix or make our own from the fragment file, so let's do the latter:
# Make the data directory mkdir -p data/pbmc # Enter the directory cd ./data/pbmc # Download the data wget "https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_pbmc_5k/atac_v1_pbmc_5k_fragments.tsv.gz"
Tree creation with too-many-peaks
Initial tree creation
We now have everything we need for initial runs with too-many-peaks
! Let's begin
by building a tree (ignore printf
throughout this document, they are just
reporting the resulting file). All of the capabilities of too-many-cells
are
available to us, so check out the too-many-cells
workshop for more details on
coloring, labels, and joining together multiple data sets.
too-many-cells make-tree \ --matrix-path ./data/pbmc/atac_v1_pbmc_5k_fragments.tsv.gz \ --filter-thresholds "(1000, 1)" \ --binwidth 5000 \ --lsa 50 \ --normalization NoneNorm \ --output out \ --matrix-output mat \ > clusters.csv printf "./out/dendrogram.svg"
The initial tree is built! It tells us the tree structure and the number of
cells in each leaf. Here, we used --binwidth
to specify the width in bp of the
genome to define features. We used --filter-thresholds
to specify that each
cell must have 1000 fragments and each bin must have at least 1 fragment.
Importantly, we could also use --cell-whitelist-file
to specify exact cells we
want to include (as a list of barcodes) as Cell Ranger will output a peak matrix
with barcodes believed to be cells, but for now we will just use these simple
thresholds. Also of importance are blacklist regions which can be ignored
through --blacklist-regions-file
using a file like
http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg19-human/Anshul_Hg19UltraHighSignalArtifactRegions.bed.gz.
Lastly, --lsa 50
speeds things up by using dimensionality reduction on the
matrix and also helps filter features, but needs --normalization NoneNorm
to
make sure we aren't transforming the data in unexpected ways.
Importantly, we also specified --matrix-output
for a folder to output the
generated matrix in (calculated from the fragment file). This step will make
subsequent steps faster as we can just input that matrix path.
Finding peaks
Pruning the tree
Now we want to define more precise features for each leaf, so we will trade in
our bins for peaks. Before that, however, many peak finding algorithms require a
minimum of ~200 cells to assign peaks, so we will first prune the tree. While we
could have specified a minimum cell value in our initial run, it's good to have
the entire tree just in case we want to prune differently. Note:
We use --prior
from now on so we don't need to calculate the tree all over
again. This argument makes things much faster!
too-many-cells make-tree \ --prior out \ --min-size 200 \ --output out_min_200 \ > clusters_min_200.csv printf "./out_min_200/dendrogram.svg"
Peak calling
Now we are ready to call our peaks using MACS2 (or any other program specified
by --peak-call-command
).
MACS2 needs the chromosome sizes for our data, so we will quickly fetch those.
cd ./data wget "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes" printf "./data/hg19.chrom.sizes"
Now we can find peaks for every node!
too-many-cells peaks \ --prior out_min_200 \ --fragments-path ./data/pbmc/atac_v1_pbmc_5k_fragments.tsv.gz \ --all-nodes \ --genome ./data/hg19.chrom.sizes \ --bedgraph \ -o out_min_200_peaks \ +RTS -N6
By default, this command will find peaks for all leaves as this is a
time-consuming process. --all-nodes
tells too-many-peaks
to find peaks for
every node, including internal ones. We can specify individual nodes with
=--peak-node
, but we need to also use --all-nodes
if that node is internal.
If we had labels, we can find peaks for specific subsets of cells in nodes using
--peak-node-labels
as well!
--bedgraph
tells the program to output bdg files for easy plotting of tracks,
and +RTS -N6
, specified at the end of the command, tells too-many-peaks
how
many cores to use.
The specified output folder will contain many files, such as:
File | Description |
---|---|
out_min_200_peaks/cluster_fragments |
fragments.tsv.gz files for each node. |
out_min_200_peaks/cluster_bedgraphs |
Bedgraphs and bigwigs if specified using --bedgraph for track visualization uses. |
out_min_200_peaks/cluster_peaks/union.bdg |
Merged peaks across all requested nodes in bedgraph format. |
out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz |
Merged peaks across all requested nodes in fragments.tsv.gz format. |
out_min_200_peaks/cluster_peaks/ |
Folder containing merged peaks across nodes and peaks for each individual node in each folder. |
From here, all of the normal too-many-cells
processes work on the tree with
the union_fragments.tsv.gz
file. For instance, to see chr2:87011729-87035519
accessibility on the tree, we can do:
too-many-cells make-tree \ --prior out_min_200 \ -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \ --draw-leaf "DrawItem (DrawContinuous [\"chr2:87011729-87035519\"])" \ --normalization "LogCPMNorm 2" \ --custom-region "chr2:87011729-87035519" \ --draw-mark "MarkModularity" \ --output out_min_200 \ > clusters_min_200.csv
For a faster process, you may first want to create a smaller matrix with the regions of interest, then plot those regions in different ways by using the output matrix rather than all of the fragments:
too-many-cells matrix-output \ -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \ --normalization NoneNorm \ --custom-region "chr2:87011729-87035519" \ --mat-output mini_mat
For more information about the capabilities of visualization and differential expression, check out https://gregoryschwartz.github.io/too-many-cells/!
Identify NK cells
Now that we have a base tree with higher resolution peaks, we can now try
searching for known cell populations such as NK cells. While we can use the
classify
entry point of too-many-cells
to link bulk reference data with
single-cell data, we will use basic known markers to exemplify the visualization
features of the tree. Here, we will only focus on the NKG7
region for NK
cells. So, let's look at what that accessibility looks like on the tree at that
region, making sure to overlay node numbers for easy reference! For maximum
resolution, we'll use the full tree rather than the pruned tree.
too-many-cells make-tree \ --prior out \ -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \ --draw-leaf "DrawItem (DrawContinuous [\"chr19:51874860-51875969\"])" \ --custom-region "chr19:51874860-51875969" \ --draw-mark "MarkModularity" \ --dendrogram-output "NKG7.svg" \ --draw-node-number \ --draw-scale-saturation 10 \ --output out \ > clusters.csv printf "./out/NKG7.svg"
Here, --custom-region
tells too-many-peaks
to create a new feature within
that specific region. We can also use the original fragments to see the
accessibility on the tree before peak finding and filtering.
too-many-cells make-tree \ --prior out \ -m ./data/pbmc/atac_v1_pbmc_5k_fragments.tsv.gz \ --draw-leaf "DrawItem (DrawContinuous [\"chr19:51874860-51875969\"])" \ --custom-region "chr19:51874860-51875969" \ --draw-mark "MarkModularity" \ --dendrogram-output "NKG7_raw.svg" \ --draw-node-number \ --draw-scale-saturation 10 \ --output out \ > clusters.csv printf "./out/NKG7_raw.svg"
Based on the coloring and the node number overlay,
there seems to be a high level of accessibility within node 85.
To further investigate, let's see what the differential accessibility is between
node 85 and the rest of the tree (seeing more than the top 100 features and without
using edgeR
for scATAC-seq):
too-many-cells differential \ -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \ --prior out \ --nodes "([118,1], [85])" \ --normalization "TotalNorm" \ --top-n 1000000000 \ > ./out/NK_vs_other.csv printf "./out/NK_vs_other.csv"
These results are liable to change with the inclusion of
--blacklist-regions-file
, which should filter out unwanted regions (as noted
above). We can also see just our specific region:
too-many-cells differential \ -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \ --prior out \ --nodes "([118,1], [85])" \ --normalization "TotalNorm" \ --custom-region "chr19:51874860-51875969" \ > ./out/NK_vs_other_NKG7.csv printf "./out/NK_vs_other_NKG7.csv"
As expected, there's some difference at the NKG7
locus. Now we can see what
motifs may be enriched in this differential.
Motifs
too-many-peaks
can also identify motifs from differential expression analyses
using tools such as MEME and HOMER. For instance, with HOMER's
findMotifsGenome.pl
in your path, you can use the input from
too-many-peaks
's differential accessibility output from
too-many-cells differential
that we just calculated to find enriched motifs
(getting rid of infinity fold changes, or "divide by zero"):
cat ./out/NK_vs_other.csv | csvsql --query "SELECT * FROM stdin WHERE qVal < 0.05 AND log2FC > 0" | grep -v inf | grep -v Infinity > tmp.csv too-many-cells motifs \ --diff-file tmp.csv \ --motif-genome hg19 \ --top-n 100000000 \ -o homer_out \ --motif-genome-command "findMotifsGenome.pl %s %s %s"
This command outputs motifs in the homer_out
directory found using
findMotifsGenome.pl
on the significant and positive differentially accessible
sites from ./out/BLANK_vs_BLANK.csv
which was output from our too-many-cells
differential
run, (so we set --top-n
to a high number to include all sites
instead of just the top sites).
These kinds of analyses and more are all available using too-many-peaks
, which
makes full use of the too-many-cells
suite of tools so be sure to check it
out!