Ranges#

chame brings ranges operations the scverse ecosystem.

Ranges functionality is currently an experimental feature with its current implementation based on the bioframe library.

In the context of genomics, genomic ranges are defined using coordinates for a specific chromosome and refer to contiguous DNA regions. Some formal introduction into the topic can be found in Lawrence et al., 2013 as well as in the vignettes of the GenomicRanges Bioconductor package. Typical operations on genomic ranges are also illustrated in the bedtools documentation.

Prepare data#

[1]:
import numpy as np
import pandas as pd

import mudata

Load ATAC modality from a multimodal object:

[2]:
from pathlib import Path
data = Path("../../../muon-tutorials/data")
[3]:
adata = mudata.read(str(data / "brain3k_processed.h5mu/atac"))
mdata = mudata.read(str(data / "brain3k_processed.h5mu"))

Import chame:

[4]:
import chame as ch

Filter by genomic ranges#

Original data size:

[5]:
adata.shape
[5]:
(2821, 134028)

Typically AnnData/MuData objects are subsetted (or sliced) along the feature dimension using column indices or column names:

[6]:
adata[:,["chr1:817893-818459", "chr1:818576-819294", "chr1:827064-827944"]]
[6]:
View of AnnData object with n_obs × n_vars = 2821 × 3
    obs: 'n_genes_by_counts', 'total_counts', 'nucleosome_signal', 'tss_score', 'leiden', 'celltype'
    var: 'gene_ids', 'feature_types', 'genome', 'interval', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'atac', 'celltype_colors', 'files', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'lognorm'
    obsp: 'connectivities', 'distances'

Sometimes features are defined in some coordinate system such as a linear genome sequence — for instance in assays that measure chromatin accessibility, transcription factor occupancy, DNA methylation. In fact, genes also have a property defining their location in the DNA, even though that’s something frequently ignored in transcriptomics analysis pipelines.

Subsetting AnnData

In these cases, it is natural to subset the object based on a set of coordinates. This is the interface that chame implements:

[7]:
ch.pp.filter_var_by_ranges(adata, "chr1:900000-1000000").shape
[7]:
(2821, 12)

This functionality relies on genomic intervals stored in the .var table. Intervals can be defined as

  • a single column, e.g. "interval" (default, split into three columns during subsetting), or

  • a set of three columns, e.g. ("chrom", "start", "end") (used by default).

Subsetting works on multiple genomic regions as well:

[8]:
ch.pp.filter_var_by_ranges(adata, [f"chr{c}:900000-1000000" for c in range(22)]).shape
[8]:
(2821, 145)

When a collection of genomic ranges is provided, it is likely to be represented as a NumPy array or a Pandas DataFrame, e.g.

[9]:
query_ranges = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [1_000_000, 2_000_000, 3_000_000],
    "end": [1_100_000, 2_100_000, 3_100_000],
})
query_ranges
[9]:
chrom start end
0 chr1 1000000 1100000
1 chr1 2000000 2100000
2 chr2 3000000 3100000
[10]:
ch.pp.filter_var_by_ranges(adata, ranges=query_ranges).shape
[10]:
(2821, 36)

Subsetting MuData

Subsetting using genomic ranges works in the same way for MuData objects.

Original MuData size:

[11]:
mdata.shape
[11]:
(2821, 170629)

Subsetting with a single genomic range:

[12]:
ch.pp.filter_var_by_ranges(mdata, "chr1:900000-1000000").shape
[12]:
(2821, 23)

Subsetting with a set of genomic ranges:

[13]:
ch.pp.filter_var_by_ranges(mdata, [f"chr{c}:900000-1000000" for c in range(22)]).shape
[13]:
(2821, 207)

Overlap threshold#

We can also limit the subsetted features to the ones that fully lie in the regions in the query.

In the current example it means that accessibility peaks should be fully covered by the range provided (min_var_coverage=1.0).

[14]:
ch.pp.filter_var_by_ranges(adata, f"chr1:{1_000_000}-{1_100_000}", min_var_coverage=1.0).var.loc[:,["interval"]].head()
[14]:
interval
chr1:1001576-1002424 chr1:1001576-1002424
chr1:1013015-1013931 chr1:1013015-1013931
chr1:1019259-1020060 chr1:1019259-1020060
chr1:1024711-1025625 chr1:1024711-1025625
chr1:1030994-1031908 chr1:1030994-1031908

Compare this to the subsetting with no min_var_coverage threshold:

[15]:
ch.pp.filter_var_by_ranges(adata, f"chr1:{1_000_000}-{1_100_000}").var.loc[:,["interval"]].head()
[15]:
interval
chr1:999930-1000654 chr1:999930-1000654
chr1:1001576-1002424 chr1:1001576-1002424
chr1:1013015-1013931 chr1:1013015-1013931
chr1:1019259-1020060 chr1:1019259-1020060
chr1:1024711-1025625 chr1:1024711-1025625

Only a fraction of the first feature’s range here is covered by the provided interval:

[16]:
(1_000_654 - 1_000_000) / (1_000_654 - 999_930)
[16]:
0.9033149171270718

Providing min_var_coverage below this value, e.g. 90% will keep this feature in the ouptut:

[17]:
ch.pp.filter_var_by_ranges(adata, f"chr1:{1_000_000}-{1_100_000}", min_var_coverage=0.90).var.loc[:,["interval"]].head()
[17]:
interval
chr1:999930-1000654 chr1:999930-1000654
chr1:1001576-1002424 chr1:1001576-1002424
chr1:1013015-1013931 chr1:1013015-1013931
chr1:1019259-1020060 chr1:1019259-1020060
chr1:1024711-1025625 chr1:1024711-1025625