10x Genomics I/O#

Data from the scATAC-seq assay can be easily loaded with chame.

[1]:
from chame.io import read_10x

Download data#

chame has a built-in datasets module to donwload some datasets such as 10k PBMCs profiled with scATAC-seq.

Original dataset is available here.

[2]:
from chame.data.datasets import pbmc10k_10x_v2
pbmc10k_10x_v2.download(path="data/")
(1/8) File filtered_peak_bc_matrix.h5 exists and checksum is validated
(2/8) File fragments.tsv.gz exists and checksum is validated
(3/8) File fragments.tsv.gz.tbi exists and checksum is validated
(4/8) File peaks.bed exists and checksum is validated
(5/8) File peak_annotation.tsv exists and checksum is validated
(6/8) File peak_motif_mapping.bed exists and checksum is validated
(7/8) File filtered_tf_bc_matrix.h5 exists and checksum is validated
(8/8) File singlecell.csv exists and checksum is validated

Reading chromatin accessibility data from 10x Genomics files#

Load data from the downloaded directory. By default, the dataset is loaded into an AnnData object:

[3]:
adata = read_10x("data/pbmc10k_10x_v2/")
adata
[3]:
AnnData object with n_obs × n_vars = 10273 × 164487
    var: 'gene_ids', 'feature_types', 'genome', 'Chromosome', 'Start', 'End'
    uns: 'summary', 'atac', 'files'
    obsm: 'tf'

Peak counts#

Count matrix cells x peaks is accessible via the .X attribute:

[4]:
adata.X
[4]:
<10273x164487 sparse matrix of type '<class 'numpy.float32'>'
        with 106888535 stored elements in Compressed Sparse Row format>

Feature information#

Information about individual peaks is accessible via the .var attribute:

[5]:
adata.var.head()
[5]:
gene_ids feature_types genome Chromosome Start End
chr1:9779-10664 chr1:9779-10664 Peaks GRCh38 chr1 9779 10664
chr1:180669-181170 chr1:180669-181170 Peaks GRCh38 chr1 180669 181170
chr1:181245-181570 chr1:181245-181570 Peaks GRCh38 chr1 181245 181570
chr1:184013-184896 chr1:184013-184896 Peaks GRCh38 chr1 184013 184896
chr1:191222-192099 chr1:191222-192099 Peaks GRCh38 chr1 191222 192099

Peak information in .var can be used to construct a PyRanges object on the fly:

[6]:
import pyranges
pyranges.PyRanges(adata.var)
[6]:
+--------------------------+-----------------+------------+--------------+-----------+-----------+
| gene_ids                 | feature_types   | genome     | Chromosome   | Start     | End       |
| (object)                 | (object)        | (object)   | (category)   | (int32)   | (int32)   |
|--------------------------+-----------------+------------+--------------+-----------+-----------|
| GL000194.1:55723-56585   | Peaks           | GRCh38     | GL000194.1   | 55723     | 56585     |
| GL000194.1:58195-58977   | Peaks           | GRCh38     | GL000194.1   | 58195     | 58977     |
| GL000194.1:100995-101880 | Peaks           | GRCh38     | GL000194.1   | 100995    | 101880    |
| GL000194.1:110708-111523 | Peaks           | GRCh38     | GL000194.1   | 110708    | 111523    |
| ...                      | ...             | ...        | ...          | ...       | ...       |
| chrY:26670692-26671520   | Peaks           | GRCh38     | chrY         | 26670692  | 26671520  |
| chrY:56734361-56735235   | Peaks           | GRCh38     | chrY         | 56734361  | 56735235  |
| chrY:56833820-56834655   | Peaks           | GRCh38     | chrY         | 56833820  | 56834655  |
| chrY:56836486-56837375   | Peaks           | GRCh38     | chrY         | 56836486  | 56837375  |
+--------------------------+-----------------+------------+--------------+-----------+-----------+
Unstranded PyRanges object has 164,487 rows and 6 columns from 37 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

Fragments and peak annotation#

chame detects some default files including peak_annotation.tsv and fragments.tsv.gz:

[7]:
print(
    adata.uns["atac"].keys(),
    adata.uns['files'],
)
dict_keys(['peak_annotation', 'peak_motifs_mapping']) {'fragments': 'data/pbmc10k_10x_v2/fragments.tsv.gz'}

From the peak-motif mapping we can construct a binary peak-motif table:

[8]:
import pandas as pd
pd.get_dummies(
    adata.uns["atac"]["peak_motifs_mapping"].Motif
).head(3)
[8]:
ALX3_MA0634.1 ARGFX_MA1463.1 ARNT2_MA1464.1 ARNT::HIF1A_MA0259.1 ASCL1(var.2)_MA1631.1 ASCL1_MA1100.2 ATF2_MA1632.1 ATF3_MA0605.2 ATF4_MA0833.2 ATF6_MA1466.1 ... ZNF740_MA0753.2 ZNF75D_MA1601.1 ZSCAN29_MA1602.1 ZSCAN4_MA1155.1 Zfx_MA0146.2 Zic1::Zic2_MA1628.1 Zic2_MA1629.1 Znf281_MA1630.1 Znf423_MA0116.1 mix-a_MA0621.1
Peak
chr1:5982069-5982943 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
chr1:5982069-5982943 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
chr1:5982069-5982943 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 746 columns

Summary statistics#

chame also loads summary statistics for the dataset when available:

[9]:
adata.uns["summary"]
[9]:
{'Sample ID': '10k_pbmc_ATACv2_nextgem_Chromium_X',
 'Genome': 'GRCh38',
 'Pipeline version': 'cellranger-atac-2.1.0',
 'Estimated number of cells': 10273,
 'Confidently mapped read pairs': 0.931,
 'Estimated bulk library complexity': 369740731.2,
 'Fraction of all fragments in cells': 0.933,
 'Fraction of all fragments that pass all filters and overlap called peaks': 0.3067,
 'Fraction of genome in peaks': 0.0451,
 'Fraction of high-quality fragments in cells': 0.9574,
 'Fraction of high-quality fragments overlapping TSS': 0.4601,
 'Fraction of high-quality fragments overlapping peaks': 0.6586,
 'Fraction of transposition events in peaks in cells': 0.6253,
 'Fragments flanking a single nucleosome': 0.4298,
 'Fragments in nucleosome-free regions': 0.4638,
 'Mean raw read pairs per cell': 45448.7244,
 'Median high-quality fragments per cell': 20046.0,
 'Non-nuclear read pairs': 0.0013,
 'Number of peaks': 164487,
 'Percent duplicates': 0.4551,
 'Q30 bases in barcode': 0.9005,
 'Q30 bases in read 1': 0.9536,
 'Q30 bases in read 2': 0.944,
 'Q30 bases in sample index i1': 0.9172,
 'Sequenced read pairs': 466894746,
 'Sequencing saturation': 0.6143,
 'TSS enrichment score': 10.0456,
 'Unmapped read pairs': 0.0097,
 'Valid barcodes': 0.9634}