10x Genomics I/O
Contents
10x Genomics I/O#
Data from the scATAC-seq assay can be easily loaded with chame.
[1]:
from chame.io import read_10x
Download data#
chame has a built-in datasets module to donwload some datasets such as 10k PBMCs profiled with scATAC-seq.
Original dataset is available here.
[2]:
from chame.data.datasets import pbmc10k_10x_v2
pbmc10k_10x_v2.download(path="data/")
(1/8) File filtered_peak_bc_matrix.h5 exists and checksum is validated
(2/8) File fragments.tsv.gz exists and checksum is validated
(3/8) File fragments.tsv.gz.tbi exists and checksum is validated
(4/8) File peaks.bed exists and checksum is validated
(5/8) File peak_annotation.tsv exists and checksum is validated
(6/8) File peak_motif_mapping.bed exists and checksum is validated
(7/8) File filtered_tf_bc_matrix.h5 exists and checksum is validated
(8/8) File singlecell.csv exists and checksum is validated
Reading chromatin accessibility data from 10x Genomics files#
Load data from the downloaded directory. By default, the dataset is loaded into an AnnData object:
[3]:
adata = read_10x("data/pbmc10k_10x_v2/")
adata
[3]:
AnnData object with n_obs × n_vars = 10273 × 164487
    var: 'gene_ids', 'feature_types', 'genome', 'Chromosome', 'Start', 'End'
    uns: 'summary', 'atac', 'files'
    obsm: 'tf'
Peak counts#
Count matrix cells x peaks is accessible via the .X attribute:
[4]:
adata.X
[4]:
<10273x164487 sparse matrix of type '<class 'numpy.float32'>'
        with 106888535 stored elements in Compressed Sparse Row format>
Feature information#
Information about individual peaks is accessible via the .var attribute:
[5]:
adata.var.head()
[5]:
| gene_ids | feature_types | genome | Chromosome | Start | End | |
|---|---|---|---|---|---|---|
| chr1:9779-10664 | chr1:9779-10664 | Peaks | GRCh38 | chr1 | 9779 | 10664 | 
| chr1:180669-181170 | chr1:180669-181170 | Peaks | GRCh38 | chr1 | 180669 | 181170 | 
| chr1:181245-181570 | chr1:181245-181570 | Peaks | GRCh38 | chr1 | 181245 | 181570 | 
| chr1:184013-184896 | chr1:184013-184896 | Peaks | GRCh38 | chr1 | 184013 | 184896 | 
| chr1:191222-192099 | chr1:191222-192099 | Peaks | GRCh38 | chr1 | 191222 | 192099 | 
Peak information in .var can be used to construct a PyRanges object on the fly:
[6]:
import pyranges
pyranges.PyRanges(adata.var)
[6]:
+--------------------------+-----------------+------------+--------------+-----------+-----------+
| gene_ids                 | feature_types   | genome     | Chromosome   | Start     | End       |
| (object)                 | (object)        | (object)   | (category)   | (int32)   | (int32)   |
|--------------------------+-----------------+------------+--------------+-----------+-----------|
| GL000194.1:55723-56585   | Peaks           | GRCh38     | GL000194.1   | 55723     | 56585     |
| GL000194.1:58195-58977   | Peaks           | GRCh38     | GL000194.1   | 58195     | 58977     |
| GL000194.1:100995-101880 | Peaks           | GRCh38     | GL000194.1   | 100995    | 101880    |
| GL000194.1:110708-111523 | Peaks           | GRCh38     | GL000194.1   | 110708    | 111523    |
| ...                      | ...             | ...        | ...          | ...       | ...       |
| chrY:26670692-26671520   | Peaks           | GRCh38     | chrY         | 26670692  | 26671520  |
| chrY:56734361-56735235   | Peaks           | GRCh38     | chrY         | 56734361  | 56735235  |
| chrY:56833820-56834655   | Peaks           | GRCh38     | chrY         | 56833820  | 56834655  |
| chrY:56836486-56837375   | Peaks           | GRCh38     | chrY         | 56836486  | 56837375  |
+--------------------------+-----------------+------------+--------------+-----------+-----------+
Unstranded PyRanges object has 164,487 rows and 6 columns from 37 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
Fragments and peak annotation#
chame detects some default files including peak_annotation.tsv and fragments.tsv.gz:
[7]:
print(
    adata.uns["atac"].keys(),
    adata.uns['files'],
)
dict_keys(['peak_annotation', 'peak_motifs_mapping']) {'fragments': 'data/pbmc10k_10x_v2/fragments.tsv.gz'}
From the peak-motif mapping we can construct a binary peak-motif table:
[8]:
import pandas as pd
pd.get_dummies(
    adata.uns["atac"]["peak_motifs_mapping"].Motif
).head(3)
[8]:
| ALX3_MA0634.1 | ARGFX_MA1463.1 | ARNT2_MA1464.1 | ARNT::HIF1A_MA0259.1 | ASCL1(var.2)_MA1631.1 | ASCL1_MA1100.2 | ATF2_MA1632.1 | ATF3_MA0605.2 | ATF4_MA0833.2 | ATF6_MA1466.1 | ... | ZNF740_MA0753.2 | ZNF75D_MA1601.1 | ZSCAN29_MA1602.1 | ZSCAN4_MA1155.1 | Zfx_MA0146.2 | Zic1::Zic2_MA1628.1 | Zic2_MA1629.1 | Znf281_MA1630.1 | Znf423_MA0116.1 | mix-a_MA0621.1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Peak | |||||||||||||||||||||
| chr1:5982069-5982943 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| chr1:5982069-5982943 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| chr1:5982069-5982943 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
3 rows × 746 columns
Summary statistics#
chame also loads summary statistics for the dataset when available:
[9]:
adata.uns["summary"]
[9]:
{'Sample ID': '10k_pbmc_ATACv2_nextgem_Chromium_X',
 'Genome': 'GRCh38',
 'Pipeline version': 'cellranger-atac-2.1.0',
 'Estimated number of cells': 10273,
 'Confidently mapped read pairs': 0.931,
 'Estimated bulk library complexity': 369740731.2,
 'Fraction of all fragments in cells': 0.933,
 'Fraction of all fragments that pass all filters and overlap called peaks': 0.3067,
 'Fraction of genome in peaks': 0.0451,
 'Fraction of high-quality fragments in cells': 0.9574,
 'Fraction of high-quality fragments overlapping TSS': 0.4601,
 'Fraction of high-quality fragments overlapping peaks': 0.6586,
 'Fraction of transposition events in peaks in cells': 0.6253,
 'Fragments flanking a single nucleosome': 0.4298,
 'Fragments in nucleosome-free regions': 0.4638,
 'Mean raw read pairs per cell': 45448.7244,
 'Median high-quality fragments per cell': 20046.0,
 'Non-nuclear read pairs': 0.0013,
 'Number of peaks': 164487,
 'Percent duplicates': 0.4551,
 'Q30 bases in barcode': 0.9005,
 'Q30 bases in read 1': 0.9536,
 'Q30 bases in read 2': 0.944,
 'Q30 bases in sample index i1': 0.9172,
 'Sequenced read pairs': 466894746,
 'Sequencing saturation': 0.6143,
 'TSS enrichment score': 10.0456,
 'Unmapped read pairs': 0.0097,
 'Valid barcodes': 0.9634}