10x Genomics I/O#

Data from the scATAC-seq assay can be easily loaded with chame.

[1]:

from chame.io import read_10x

Download data#

chame has a built-in datasets module to donwload some datasets such as 10k PBMCs profiled with scATAC-seq.

Original dataset is available here.

[2]:

from chame.data.datasets import pbmc10k_10x_v2
pbmc10k_10x_v2.download(path="data/")

(1/8) File filtered_peak_bc_matrix.h5 exists and checksum is validated
(2/8) File fragments.tsv.gz exists and checksum is validated
(3/8) File fragments.tsv.gz.tbi exists and checksum is validated
(4/8) File peaks.bed exists and checksum is validated
(5/8) File peak_annotation.tsv exists and checksum is validated
(6/8) File peak_motif_mapping.bed exists and checksum is validated
(7/8) File filtered_tf_bc_matrix.h5 exists and checksum is validated
(8/8) File singlecell.csv exists and checksum is validated

Reading chromatin accessibility data from 10x Genomics files#

Load data from the downloaded directory. By default, the dataset is loaded into an AnnData object:

[3]:

adata = read_10x("data/pbmc10k_10x_v2/")
adata

[3]:

AnnData object with n_obs × n_vars = 10273 × 164487
    var: 'gene_ids', 'feature_types', 'genome', 'Chromosome', 'Start', 'End'
    uns: 'summary', 'atac', 'files'
    obsm: 'tf'

Peak counts#

Count matrix cells x peaks is accessible via the .X attribute:

[4]:

adata.X

[4]:

<10273x164487 sparse matrix of type '<class 'numpy.float32'>'
        with 106888535 stored elements in Compressed Sparse Row format>

Feature information#

Information about individual peaks is accessible via the .var attribute:

[5]:

adata.var.head()

[5]:

	gene_ids	feature_types	genome	Chromosome	Start	End
chr1:9779-10664	chr1:9779-10664	Peaks	GRCh38	chr1	9779	10664
chr1:180669-181170	chr1:180669-181170	Peaks	GRCh38	chr1	180669	181170
chr1:181245-181570	chr1:181245-181570	Peaks	GRCh38	chr1	181245	181570
chr1:184013-184896	chr1:184013-184896	Peaks	GRCh38	chr1	184013	184896
chr1:191222-192099	chr1:191222-192099	Peaks	GRCh38	chr1	191222	192099

Peak information in .var can be used to construct a PyRanges object on the fly:

[6]:

import pyranges
pyranges.PyRanges(adata.var)

[6]:

+--------------------------+-----------------+------------+--------------+-----------+-----------+
| gene_ids                 | feature_types   | genome     | Chromosome   | Start     | End       |
| (object)                 | (object)        | (object)   | (category)   | (int32)   | (int32)   |
|--------------------------+-----------------+------------+--------------+-----------+-----------|
| GL000194.1:55723-56585   | Peaks           | GRCh38     | GL000194.1   | 55723     | 56585     |
| GL000194.1:58195-58977   | Peaks           | GRCh38     | GL000194.1   | 58195     | 58977     |
| GL000194.1:100995-101880 | Peaks           | GRCh38     | GL000194.1   | 100995    | 101880    |
| GL000194.1:110708-111523 | Peaks           | GRCh38     | GL000194.1   | 110708    | 111523    |
| ...                      | ...             | ...        | ...          | ...       | ...       |
| chrY:26670692-26671520   | Peaks           | GRCh38     | chrY         | 26670692  | 26671520  |
| chrY:56734361-56735235   | Peaks           | GRCh38     | chrY         | 56734361  | 56735235  |
| chrY:56833820-56834655   | Peaks           | GRCh38     | chrY         | 56833820  | 56834655  |
| chrY:56836486-56837375   | Peaks           | GRCh38     | chrY         | 56836486  | 56837375  |
+--------------------------+-----------------+------------+--------------+-----------+-----------+
Unstranded PyRanges object has 164,487 rows and 6 columns from 37 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

Fragments and peak annotation#

chame detects some default files including peak_annotation.tsv and fragments.tsv.gz:

[7]:

print(
    adata.uns["atac"].keys(),
    adata.uns['files'],
)

dict_keys(['peak_annotation', 'peak_motifs_mapping']) {'fragments': 'data/pbmc10k_10x_v2/fragments.tsv.gz'}

From the peak-motif mapping we can construct a binary peak-motif table:

[8]:

import pandas as pd
pd.get_dummies(
    adata.uns["atac"]["peak_motifs_mapping"].Motif
).head(3)

[8]:

	ALX3_MA0634.1	ARGFX_MA1463.1	ARNT2_MA1464.1	ARNT::HIF1A_MA0259.1	ASCL1(var.2)_MA1631.1	ASCL1_MA1100.2	ATF2_MA1632.1	ATF3_MA0605.2	ATF4_MA0833.2	ATF6_MA1466.1	...	ZNF740_MA0753.2	ZNF75D_MA1601.1	ZSCAN29_MA1602.1	ZSCAN4_MA1155.1	Zfx_MA0146.2	Zic1::Zic2_MA1628.1	Zic2_MA1629.1	Znf281_MA1630.1	Znf423_MA0116.1	mix-a_MA0621.1
Peak
chr1:5982069-5982943	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
chr1:5982069-5982943	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
chr1:5982069-5982943	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

3 rows × 746 columns

Summary statistics#

chame also loads summary statistics for the dataset when available:

[9]:

adata.uns["summary"]

[9]:

{'Sample ID': '10k_pbmc_ATACv2_nextgem_Chromium_X',
 'Genome': 'GRCh38',
 'Pipeline version': 'cellranger-atac-2.1.0',
 'Estimated number of cells': 10273,
 'Confidently mapped read pairs': 0.931,
 'Estimated bulk library complexity': 369740731.2,
 'Fraction of all fragments in cells': 0.933,
 'Fraction of all fragments that pass all filters and overlap called peaks': 0.3067,
 'Fraction of genome in peaks': 0.0451,
 'Fraction of high-quality fragments in cells': 0.9574,
 'Fraction of high-quality fragments overlapping TSS': 0.4601,
 'Fraction of high-quality fragments overlapping peaks': 0.6586,
 'Fraction of transposition events in peaks in cells': 0.6253,
 'Fragments flanking a single nucleosome': 0.4298,
 'Fragments in nucleosome-free regions': 0.4638,
 'Mean raw read pairs per cell': 45448.7244,
 'Median high-quality fragments per cell': 20046.0,
 'Non-nuclear read pairs': 0.0013,
 'Number of peaks': 164487,
 'Percent duplicates': 0.4551,
 'Q30 bases in barcode': 0.9005,
 'Q30 bases in read 1': 0.9536,
 'Q30 bases in read 2': 0.944,
 'Q30 bases in sample index i1': 0.9172,
 'Sequenced read pairs': 466894746,
 'Sequencing saturation': 0.6143,
 'TSS enrichment score': 10.0456,
 'Unmapped read pairs': 0.0097,
 'Valid barcodes': 0.9634}

10x Genomics I/O

Contents