10x Genomics I/O
Contents
10x Genomics I/O#
Data from the scATAC-seq assay can be easily loaded with chame
.
[1]:
from chame.io import read_10x
Download data#
chame
has a built-in datasets
module to donwload some datasets such as 10k PBMCs profiled with scATAC-seq.
Original dataset is available here.
[2]:
from chame.data.datasets import pbmc10k_10x_v2
pbmc10k_10x_v2.download(path="data/")
(1/8) File filtered_peak_bc_matrix.h5 exists and checksum is validated
(2/8) File fragments.tsv.gz exists and checksum is validated
(3/8) File fragments.tsv.gz.tbi exists and checksum is validated
(4/8) File peaks.bed exists and checksum is validated
(5/8) File peak_annotation.tsv exists and checksum is validated
(6/8) File peak_motif_mapping.bed exists and checksum is validated
(7/8) File filtered_tf_bc_matrix.h5 exists and checksum is validated
(8/8) File singlecell.csv exists and checksum is validated
Reading chromatin accessibility data from 10x Genomics files#
Load data from the downloaded directory. By default, the dataset is loaded into an AnnData object:
[3]:
adata = read_10x("data/pbmc10k_10x_v2/")
adata
[3]:
AnnData object with n_obs × n_vars = 10273 × 164487
var: 'gene_ids', 'feature_types', 'genome', 'Chromosome', 'Start', 'End'
uns: 'summary', 'atac', 'files'
obsm: 'tf'
Peak counts#
Count matrix cells x peaks
is accessible via the .X
attribute:
[4]:
adata.X
[4]:
<10273x164487 sparse matrix of type '<class 'numpy.float32'>'
with 106888535 stored elements in Compressed Sparse Row format>
Feature information#
Information about individual peaks is accessible via the .var
attribute:
[5]:
adata.var.head()
[5]:
gene_ids | feature_types | genome | Chromosome | Start | End | |
---|---|---|---|---|---|---|
chr1:9779-10664 | chr1:9779-10664 | Peaks | GRCh38 | chr1 | 9779 | 10664 |
chr1:180669-181170 | chr1:180669-181170 | Peaks | GRCh38 | chr1 | 180669 | 181170 |
chr1:181245-181570 | chr1:181245-181570 | Peaks | GRCh38 | chr1 | 181245 | 181570 |
chr1:184013-184896 | chr1:184013-184896 | Peaks | GRCh38 | chr1 | 184013 | 184896 |
chr1:191222-192099 | chr1:191222-192099 | Peaks | GRCh38 | chr1 | 191222 | 192099 |
Peak information in .var
can be used to construct a PyRanges object on the fly:
[6]:
import pyranges
pyranges.PyRanges(adata.var)
[6]:
+--------------------------+-----------------+------------+--------------+-----------+-----------+
| gene_ids | feature_types | genome | Chromosome | Start | End |
| (object) | (object) | (object) | (category) | (int32) | (int32) |
|--------------------------+-----------------+------------+--------------+-----------+-----------|
| GL000194.1:55723-56585 | Peaks | GRCh38 | GL000194.1 | 55723 | 56585 |
| GL000194.1:58195-58977 | Peaks | GRCh38 | GL000194.1 | 58195 | 58977 |
| GL000194.1:100995-101880 | Peaks | GRCh38 | GL000194.1 | 100995 | 101880 |
| GL000194.1:110708-111523 | Peaks | GRCh38 | GL000194.1 | 110708 | 111523 |
| ... | ... | ... | ... | ... | ... |
| chrY:26670692-26671520 | Peaks | GRCh38 | chrY | 26670692 | 26671520 |
| chrY:56734361-56735235 | Peaks | GRCh38 | chrY | 56734361 | 56735235 |
| chrY:56833820-56834655 | Peaks | GRCh38 | chrY | 56833820 | 56834655 |
| chrY:56836486-56837375 | Peaks | GRCh38 | chrY | 56836486 | 56837375 |
+--------------------------+-----------------+------------+--------------+-----------+-----------+
Unstranded PyRanges object has 164,487 rows and 6 columns from 37 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
Fragments and peak annotation#
chame
detects some default files including peak_annotation.tsv
and fragments.tsv.gz
:
[7]:
print(
adata.uns["atac"].keys(),
adata.uns['files'],
)
dict_keys(['peak_annotation', 'peak_motifs_mapping']) {'fragments': 'data/pbmc10k_10x_v2/fragments.tsv.gz'}
From the peak-motif mapping we can construct a binary peak-motif table:
[8]:
import pandas as pd
pd.get_dummies(
adata.uns["atac"]["peak_motifs_mapping"].Motif
).head(3)
[8]:
ALX3_MA0634.1 | ARGFX_MA1463.1 | ARNT2_MA1464.1 | ARNT::HIF1A_MA0259.1 | ASCL1(var.2)_MA1631.1 | ASCL1_MA1100.2 | ATF2_MA1632.1 | ATF3_MA0605.2 | ATF4_MA0833.2 | ATF6_MA1466.1 | ... | ZNF740_MA0753.2 | ZNF75D_MA1601.1 | ZSCAN29_MA1602.1 | ZSCAN4_MA1155.1 | Zfx_MA0146.2 | Zic1::Zic2_MA1628.1 | Zic2_MA1629.1 | Znf281_MA1630.1 | Znf423_MA0116.1 | mix-a_MA0621.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Peak | |||||||||||||||||||||
chr1:5982069-5982943 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1:5982069-5982943 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
chr1:5982069-5982943 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 746 columns
Summary statistics#
chame
also loads summary statistics for the dataset when available:
[9]:
adata.uns["summary"]
[9]:
{'Sample ID': '10k_pbmc_ATACv2_nextgem_Chromium_X',
'Genome': 'GRCh38',
'Pipeline version': 'cellranger-atac-2.1.0',
'Estimated number of cells': 10273,
'Confidently mapped read pairs': 0.931,
'Estimated bulk library complexity': 369740731.2,
'Fraction of all fragments in cells': 0.933,
'Fraction of all fragments that pass all filters and overlap called peaks': 0.3067,
'Fraction of genome in peaks': 0.0451,
'Fraction of high-quality fragments in cells': 0.9574,
'Fraction of high-quality fragments overlapping TSS': 0.4601,
'Fraction of high-quality fragments overlapping peaks': 0.6586,
'Fraction of transposition events in peaks in cells': 0.6253,
'Fragments flanking a single nucleosome': 0.4298,
'Fragments in nucleosome-free regions': 0.4638,
'Mean raw read pairs per cell': 45448.7244,
'Median high-quality fragments per cell': 20046.0,
'Non-nuclear read pairs': 0.0013,
'Number of peaks': 164487,
'Percent duplicates': 0.4551,
'Q30 bases in barcode': 0.9005,
'Q30 bases in read 1': 0.9536,
'Q30 bases in read 2': 0.944,
'Q30 bases in sample index i1': 0.9172,
'Sequenced read pairs': 466894746,
'Sequencing saturation': 0.6143,
'TSS enrichment score': 10.0456,
'Unmapped read pairs': 0.0097,
'Valid barcodes': 0.9634}