too-many-peaks Walkthrough

Table of Contents

This is an instructional example of using too-many-peaks meant to demonstrate typical usage.

For more information about too-many-peaks and too-many-cells:

Website

See https://github.com/GregorySchwartz/too-many-cells for latest version.

Install too-many-peaks

Install too-many-cells

too-many-peaks is an extension included in too-many-cells, so you should follow instructions on https://gregoryschwartz.github.io/too-many-cells/ for details on too-many-cells. First, clone the too-many-cells repository.

git clone https://github.com/GregorySchwartz/too-many-cells.git

Enter the folder and install with nix.

cd ./too-many-cells
nix-env -f default.nix -i too-many-cells

Adding to path (with stack installation)

If using stack, the resulting binary will install to ~/.local/bin. Add to $PATH so you can invoke the command from anywhere!

export PATH=$HOME/.local/bin:$PATH

This command will only work in the current shell. To permanently add to path, add the previous line to ~/.bashrc or ~/.profile.

Testing the installation

Test to see if the installation worked when in path:

too-many-cells -h
too-many-cells, Gregory W. Schwartz. Clusters and analyzes single cell data.

Usage: too-many-cells (make-tree | interactive | differential | diversity |
                      paths | classify | peaks | motifs | matrix-output)

Available options:
  -h,--help                Show this help text

Available commands:
  make-tree                
  interactive              
  differential             
  diversity                
  paths                    
  classify                 
  peaks                    
  motifs                   
  matrix-output 

Here, seeing too-many-cells peaks denotes the correct version of too-many-cells to use too-many-peaks.

Data download

Download peripheral blood mononuclear cells

We are going to see if we can identify T cells in a collection of peripheral blood mononuclear cells. We'll need data from 10x, specifically https://support.10xgenomics.com/single-cell-atac/datasets/1.2.0/atac_v1_pbmc_5k as a quick, illustrative example here. We can use the peak matrix or make our own from the fragment file, so let's do the latter:

# Make the data directory
mkdir -p data/pbmc

# Enter the directory
cd ./data/pbmc

# Download the data
wget "https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_pbmc_5k/atac_v1_pbmc_5k_fragments.tsv.gz"

Tree creation with too-many-peaks

Initial tree creation

We now have everything we need for initial runs with too-many-peaks! Let's begin by building a tree (ignore printf throughout this document, they are just reporting the resulting file). All of the capabilities of too-many-cells are available to us, so check out the too-many-cells workshop for more details on coloring, labels, and joining together multiple data sets.

too-many-cells make-tree \
  --matrix-path ./data/pbmc/atac_v1_pbmc_5k_fragments.tsv.gz \
  --filter-thresholds "(1000, 1)" \
  --binwidth 5000 \
  --lsa 50 \
  --normalization NoneNorm \
  --output out \
  --matrix-output mat \
  > clusters.csv

printf "./out/dendrogram.svg"

Sorry, your browser does not support SVG.

The initial tree is built! It tells us the tree structure and the number of cells in each leaf. Here, we used --binwidth to specify the width in bp of the genome to define features. We used --filter-thresholds to specify that each cell must have 1000 fragments and each bin must have at least 1 fragment. Importantly, we could also use --cell-whitelist-file to specify exact cells we want to include (as a list of barcodes) as Cell Ranger will output a peak matrix with barcodes believed to be cells, but for now we will just use these simple thresholds. Also of importance are blacklist regions which can be ignored through --blacklist-regions-file using a file like http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg19-human/Anshul_Hg19UltraHighSignalArtifactRegions.bed.gz. Lastly, --lsa 50 speeds things up by using dimensionality reduction on the matrix and also helps filter features, but needs --normalization NoneNorm to make sure we aren't transforming the data in unexpected ways.

Importantly, we also specified --matrix-output for a folder to output the generated matrix in (calculated from the fragment file). This step will make subsequent steps faster as we can just input that matrix path.

Finding peaks

Pruning the tree

Now we want to define more precise features for each leaf, so we will trade in our bins for peaks. Before that, however, many peak finding algorithms require a minimum of ~200 cells to assign peaks, so we will first prune the tree. While we could have specified a minimum cell value in our initial run, it's good to have the entire tree just in case we want to prune differently. Note: We use --prior from now on so we don't need to calculate the tree all over again. This argument makes things much faster!

too-many-cells make-tree \
  --prior out \
  --min-size 200 \
  --output out_min_200 \
  > clusters_min_200.csv

printf "./out_min_200/dendrogram.svg"

Sorry, your browser does not support SVG.

Peak calling

Now we are ready to call our peaks using MACS2 (or any other program specified by --peak-call-command). MACS2 needs the chromosome sizes for our data, so we will quickly fetch those.

cd ./data
wget "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes"
printf "./data/hg19.chrom.sizes"

./data/hg19.chrom.sizes

Now we can find peaks for every node!

too-many-cells peaks \
  --prior out_min_200 \
  --fragments-path ./data/pbmc/atac_v1_pbmc_5k_fragments.tsv.gz \
  --all-nodes \
  --genome ./data/hg19.chrom.sizes \
  --bedgraph \
  -o out_min_200_peaks \
  +RTS -N6

By default, this command will find peaks for all leaves as this is a time-consuming process. --all-nodes tells too-many-peaks to find peaks for every node, including internal ones. We can specify individual nodes with =--peak-node, but we need to also use --all-nodes if that node is internal. If we had labels, we can find peaks for specific subsets of cells in nodes using --peak-node-labels as well!

--bedgraph tells the program to output bdg files for easy plotting of tracks, and +RTS -N6, specified at the end of the command, tells too-many-peaks how many cores to use.

The specified output folder will contain many files, such as:

File Description
out_min_200_peaks/cluster_fragments fragments.tsv.gz files for each node.
out_min_200_peaks/cluster_bedgraphs Bedgraphs and bigwigs if specified using --bedgraph for track visualization uses.
out_min_200_peaks/cluster_peaks/union.bdg Merged peaks across all requested nodes in bedgraph format.
out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz Merged peaks across all requested nodes in fragments.tsv.gz format.
out_min_200_peaks/cluster_peaks/ Folder containing merged peaks across nodes and peaks for each individual node in each folder.

From here, all of the normal too-many-cells processes work on the tree with the union_fragments.tsv.gz file. For instance, to see chr2:87011729-87035519 accessibility on the tree, we can do:

too-many-cells make-tree \
  --prior out_min_200 \
  -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \
  --draw-leaf "DrawItem (DrawContinuous [\"chr2:87011729-87035519\"])" \
  --normalization "LogCPMNorm 2" \
  --custom-region "chr2:87011729-87035519" \
  --draw-mark "MarkModularity" \
  --output out_min_200 \
  > clusters_min_200.csv

For a faster process, you may first want to create a smaller matrix with the regions of interest, then plot those regions in different ways by using the output matrix rather than all of the fragments:

too-many-cells matrix-output \
  -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \
  --normalization NoneNorm \
  --custom-region "chr2:87011729-87035519" \
  --mat-output mini_mat

For more information about the capabilities of visualization and differential expression, check out https://gregoryschwartz.github.io/too-many-cells/!

Identify NK cells

Now that we have a base tree with higher resolution peaks, we can now try searching for known cell populations such as NK cells. While we can use the classify entry point of too-many-cells to link bulk reference data with single-cell data, we will use basic known markers to exemplify the visualization features of the tree. Here, we will only focus on the NKG7 region for NK cells. So, let's look at what that accessibility looks like on the tree at that region, making sure to overlay node numbers for easy reference! For maximum resolution, we'll use the full tree rather than the pruned tree.

too-many-cells make-tree \
  --prior out \
  -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \
  --draw-leaf "DrawItem (DrawContinuous [\"chr19:51874860-51875969\"])" \
  --custom-region "chr19:51874860-51875969" \
  --draw-mark "MarkModularity" \
  --dendrogram-output "NKG7.svg" \
  --draw-node-number \
  --draw-scale-saturation 10 \
  --output out \
  > clusters.csv

printf "./out/NKG7.svg"

Sorry, your browser does not support SVG.

Here, --custom-region tells too-many-peaks to create a new feature within that specific region. We can also use the original fragments to see the accessibility on the tree before peak finding and filtering.

too-many-cells make-tree \
  --prior out \
  -m ./data/pbmc/atac_v1_pbmc_5k_fragments.tsv.gz \
  --draw-leaf "DrawItem (DrawContinuous [\"chr19:51874860-51875969\"])" \
  --custom-region "chr19:51874860-51875969" \
  --draw-mark "MarkModularity" \
  --dendrogram-output "NKG7_raw.svg" \
  --draw-node-number \
  --draw-scale-saturation 10 \
  --output out \
  > clusters.csv

printf "./out/NKG7_raw.svg"

Sorry, your browser does not support SVG.

Based on the coloring and the node number overlay, there seems to be a high level of accessibility within node 85. To further investigate, let's see what the differential accessibility is between node 85 and the rest of the tree (seeing more than the top 100 features and without using edgeR for scATAC-seq):

too-many-cells differential \
  -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \
  --prior out \
  --nodes "([118,1], [85])" \
  --normalization "TotalNorm" \
  --top-n 1000000000 \
  > ./out/NK_vs_other.csv

printf "./out/NK_vs_other.csv"

./out/NK_vs_other.csv

These results are liable to change with the inclusion of --blacklist-regions-file, which should filter out unwanted regions (as noted above). We can also see just our specific region:

too-many-cells differential \
  -m ./out_min_200_peaks/cluster_peaks/union_fragments.tsv.gz \
  --prior out \
  --nodes "([118,1], [85])" \
  --normalization "TotalNorm" \
  --custom-region "chr19:51874860-51875969" \
  > ./out/NK_vs_other_NKG7.csv

printf "./out/NK_vs_other_NKG7.csv"

./out/NK_vs_other_NKG7.csv

As expected, there's some difference at the NKG7 locus. Now we can see what motifs may be enriched in this differential.

Motifs

too-many-peaks can also identify motifs from differential expression analyses using tools such as MEME and HOMER. For instance, with HOMER's findMotifsGenome.pl in your path, you can use the input from too-many-peaks's differential accessibility output from too-many-cells differential that we just calculated to find enriched motifs (getting rid of infinity fold changes, or "divide by zero"):

cat ./out/NK_vs_other.csv | csvsql --query "SELECT * FROM stdin WHERE qVal < 0.05 AND log2FC > 0" | grep -v inf | grep -v Infinity > tmp.csv

too-many-cells motifs \
  --diff-file tmp.csv \
  --motif-genome hg19 \
  --top-n 100000000 \
  -o homer_out \
  --motif-genome-command "findMotifsGenome.pl %s %s %s"

This command outputs motifs in the homer_out directory found using findMotifsGenome.pl on the significant and positive differentially accessible sites from ./out/BLANK_vs_BLANK.csv which was output from our too-many-cells differential run, (so we set --top-n to a high number to include all sites instead of just the top sites).

These kinds of analyses and more are all available using too-many-peaks, which makes full use of the too-many-cells suite of tools so be sure to check it out!

Author: Gregory W. Schwartz

Validate