Ranges
Contents
Ranges#
chame
brings ranges operations the scverse ecosystem.
Ranges functionality is currently an experimental feature with its current implementation based on the bioframe library.
In the context of genomics, genomic ranges are defined using coordinates for a specific chromosome and refer to contiguous DNA regions. Some formal introduction into the topic can be found in Lawrence et al., 2013 as well as in the vignettes of the GenomicRanges Bioconductor package. Typical operations on genomic ranges are also illustrated in the bedtools documentation.
Prepare data#
[1]:
import numpy as np
import pandas as pd
import mudata
Load ATAC modality from a multimodal object:
[2]:
from pathlib import Path
data = Path("../../../muon-tutorials/data")
[3]:
adata = mudata.read(str(data / "brain3k_processed.h5mu/atac"))
mdata = mudata.read(str(data / "brain3k_processed.h5mu"))
Import chame
:
[4]:
import chame as ch
Filter by genomic ranges#
Original data size:
[5]:
adata.shape
[5]:
(2821, 134028)
Typically AnnData/MuData objects are subsetted (or sliced) along the feature dimension using column indices or column names:
[6]:
adata[:,["chr1:817893-818459", "chr1:818576-819294", "chr1:827064-827944"]]
[6]:
View of AnnData object with n_obs × n_vars = 2821 × 3
obs: 'n_genes_by_counts', 'total_counts', 'nucleosome_signal', 'tss_score', 'leiden', 'celltype'
var: 'gene_ids', 'feature_types', 'genome', 'interval', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'atac', 'celltype_colors', 'files', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'counts', 'lognorm'
obsp: 'connectivities', 'distances'
Sometimes features are defined in some coordinate system such as a linear genome sequence — for instance in assays that measure chromatin accessibility, transcription factor occupancy, DNA methylation. In fact, genes also have a property defining their location in the DNA, even though that’s something frequently ignored in transcriptomics analysis pipelines.
Subsetting AnnData
In these cases, it is natural to subset the object based on a set of coordinates. This is the interface that chame
implements:
[7]:
ch.pp.filter_var_by_ranges(adata, "chr1:900000-1000000").shape
[7]:
(2821, 12)
This functionality relies on genomic intervals stored in the .var
table. Intervals can be defined as
a single column, e.g.
"interval"
(default, split into three columns during subsetting), ora set of three columns, e.g.
("chrom", "start", "end")
(used by default).
Subsetting works on multiple genomic regions as well:
[8]:
ch.pp.filter_var_by_ranges(adata, [f"chr{c}:900000-1000000" for c in range(22)]).shape
[8]:
(2821, 145)
When a collection of genomic ranges is provided, it is likely to be represented as a NumPy array or a Pandas DataFrame, e.g.
[9]:
query_ranges = pd.DataFrame({
"chrom": ["chr1", "chr1", "chr2"],
"start": [1_000_000, 2_000_000, 3_000_000],
"end": [1_100_000, 2_100_000, 3_100_000],
})
query_ranges
[9]:
chrom | start | end | |
---|---|---|---|
0 | chr1 | 1000000 | 1100000 |
1 | chr1 | 2000000 | 2100000 |
2 | chr2 | 3000000 | 3100000 |
[10]:
ch.pp.filter_var_by_ranges(adata, ranges=query_ranges).shape
[10]:
(2821, 36)
Subsetting MuData
Subsetting using genomic ranges works in the same way for MuData objects.
Original MuData size:
[11]:
mdata.shape
[11]:
(2821, 170629)
Subsetting with a single genomic range:
[12]:
ch.pp.filter_var_by_ranges(mdata, "chr1:900000-1000000").shape
[12]:
(2821, 23)
Subsetting with a set of genomic ranges:
[13]:
ch.pp.filter_var_by_ranges(mdata, [f"chr{c}:900000-1000000" for c in range(22)]).shape
[13]:
(2821, 207)
Overlap threshold#
We can also limit the subsetted features to the ones that fully lie in the regions in the query.
In the current example it means that accessibility peaks should be fully covered by the range provided (min_var_coverage=1.0
).
[14]:
ch.pp.filter_var_by_ranges(adata, f"chr1:{1_000_000}-{1_100_000}", min_var_coverage=1.0).var.loc[:,["interval"]].head()
[14]:
interval | |
---|---|
chr1:1001576-1002424 | chr1:1001576-1002424 |
chr1:1013015-1013931 | chr1:1013015-1013931 |
chr1:1019259-1020060 | chr1:1019259-1020060 |
chr1:1024711-1025625 | chr1:1024711-1025625 |
chr1:1030994-1031908 | chr1:1030994-1031908 |
Compare this to the subsetting with no min_var_coverage
threshold:
[15]:
ch.pp.filter_var_by_ranges(adata, f"chr1:{1_000_000}-{1_100_000}").var.loc[:,["interval"]].head()
[15]:
interval | |
---|---|
chr1:999930-1000654 | chr1:999930-1000654 |
chr1:1001576-1002424 | chr1:1001576-1002424 |
chr1:1013015-1013931 | chr1:1013015-1013931 |
chr1:1019259-1020060 | chr1:1019259-1020060 |
chr1:1024711-1025625 | chr1:1024711-1025625 |
Only a fraction of the first feature’s range here is covered by the provided interval:
[16]:
(1_000_654 - 1_000_000) / (1_000_654 - 999_930)
[16]:
0.9033149171270718
Providing min_var_coverage
below this value, e.g. 90%
will keep this feature in the ouptut:
[17]:
ch.pp.filter_var_by_ranges(adata, f"chr1:{1_000_000}-{1_100_000}", min_var_coverage=0.90).var.loc[:,["interval"]].head()
[17]:
interval | |
---|---|
chr1:999930-1000654 | chr1:999930-1000654 |
chr1:1001576-1002424 | chr1:1001576-1002424 |
chr1:1013015-1013931 | chr1:1013015-1013931 |
chr1:1019259-1020060 | chr1:1019259-1020060 |
chr1:1024711-1025625 | chr1:1024711-1025625 |