{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 10x Genomics I/O" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data from the [scATAC-seq](https://www.10xgenomics.com/products/single-cell-atac) assay can be easily loaded with `chame`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from chame.io import read_10x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`chame` has a built-in `datasets` module to donwload some datasets such as 10k PBMCs profiled with scATAC-seq.\n", "\n", "Original dataset is available [here](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-x-2-standard)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1/8) File filtered_peak_bc_matrix.h5 exists and checksum is validated\n", "(2/8) File fragments.tsv.gz exists and checksum is validated\n", "(3/8) File fragments.tsv.gz.tbi exists and checksum is validated\n", "(4/8) File peaks.bed exists and checksum is validated\n", "(5/8) File peak_annotation.tsv exists and checksum is validated\n", "(6/8) File peak_motif_mapping.bed exists and checksum is validated\n", "(7/8) File filtered_tf_bc_matrix.h5 exists and checksum is validated\n", "(8/8) File singlecell.csv exists and checksum is validated\n" ] } ], "source": [ "from chame.data.datasets import pbmc10k_10x_v2\n", "pbmc10k_10x_v2.download(path=\"data/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading chromatin accessibility data from 10x Genomics files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load data from the downloaded directory. By default, the dataset is loaded into [an AnnData object](https://github.com/scverse/anndata):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 10273 × 164487\n", " var: 'gene_ids', 'feature_types', 'genome', 'Chromosome', 'Start', 'End'\n", " uns: 'summary', 'atac', 'files'\n", " obsm: 'tf'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata = read_10x(\"data/pbmc10k_10x_v2/\")\n", "adata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Peak counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count matrix `cells x peaks` is accessible via the `.X` attribute:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<10273x164487 sparse matrix of type ''\n", "\twith 106888535 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Feature information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Information about individual peaks is accessible via the `.var` attribute:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gene_idsfeature_typesgenomeChromosomeStartEnd
chr1:9779-10664chr1:9779-10664PeaksGRCh38chr1977910664
chr1:180669-181170chr1:180669-181170PeaksGRCh38chr1180669181170
chr1:181245-181570chr1:181245-181570PeaksGRCh38chr1181245181570
chr1:184013-184896chr1:184013-184896PeaksGRCh38chr1184013184896
chr1:191222-192099chr1:191222-192099PeaksGRCh38chr1191222192099
\n", "
" ], "text/plain": [ " gene_ids feature_types genome Chromosome \\\n", "chr1:9779-10664 chr1:9779-10664 Peaks GRCh38 chr1 \n", "chr1:180669-181170 chr1:180669-181170 Peaks GRCh38 chr1 \n", "chr1:181245-181570 chr1:181245-181570 Peaks GRCh38 chr1 \n", "chr1:184013-184896 chr1:184013-184896 Peaks GRCh38 chr1 \n", "chr1:191222-192099 chr1:191222-192099 Peaks GRCh38 chr1 \n", "\n", " Start End \n", "chr1:9779-10664 9779 10664 \n", "chr1:180669-181170 180669 181170 \n", "chr1:181245-181570 181245 181570 \n", "chr1:184013-184896 184013 184896 \n", "chr1:191222-192099 191222 192099 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.var.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Peak information in `.var` can be used to construct a [PyRanges](https://github.com/biocore-ntnu/pyranges) object on the fly:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+--------------------------+-----------------+------------+--------------+-----------+-----------+\n", "| gene_ids | feature_types | genome | Chromosome | Start | End |\n", "| (object) | (object) | (object) | (category) | (int32) | (int32) |\n", "|--------------------------+-----------------+------------+--------------+-----------+-----------|\n", "| GL000194.1:55723-56585 | Peaks | GRCh38 | GL000194.1 | 55723 | 56585 |\n", "| GL000194.1:58195-58977 | Peaks | GRCh38 | GL000194.1 | 58195 | 58977 |\n", "| GL000194.1:100995-101880 | Peaks | GRCh38 | GL000194.1 | 100995 | 101880 |\n", "| GL000194.1:110708-111523 | Peaks | GRCh38 | GL000194.1 | 110708 | 111523 |\n", "| ... | ... | ... | ... | ... | ... |\n", "| chrY:26670692-26671520 | Peaks | GRCh38 | chrY | 26670692 | 26671520 |\n", "| chrY:56734361-56735235 | Peaks | GRCh38 | chrY | 56734361 | 56735235 |\n", "| chrY:56833820-56834655 | Peaks | GRCh38 | chrY | 56833820 | 56834655 |\n", "| chrY:56836486-56837375 | Peaks | GRCh38 | chrY | 56836486 | 56837375 |\n", "+--------------------------+-----------------+------------+--------------+-----------+-----------+\n", "Unstranded PyRanges object has 164,487 rows and 6 columns from 37 chromosomes.\n", "For printing, the PyRanges was sorted on Chromosome." ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyranges\n", "pyranges.PyRanges(adata.var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Fragments and peak annotation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`chame` detects some default files including `peak_annotation.tsv` and `fragments.tsv.gz`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['peak_annotation', 'peak_motifs_mapping']) {'fragments': 'data/pbmc10k_10x_v2/fragments.tsv.gz'}\n" ] } ], "source": [ "print(\n", " adata.uns[\"atac\"].keys(),\n", " adata.uns['files'],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the peak-motif mapping we can construct a binary peak-motif table:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ALX3_MA0634.1ARGFX_MA1463.1ARNT2_MA1464.1ARNT::HIF1A_MA0259.1ASCL1(var.2)_MA1631.1ASCL1_MA1100.2ATF2_MA1632.1ATF3_MA0605.2ATF4_MA0833.2ATF6_MA1466.1...ZNF740_MA0753.2ZNF75D_MA1601.1ZSCAN29_MA1602.1ZSCAN4_MA1155.1Zfx_MA0146.2Zic1::Zic2_MA1628.1Zic2_MA1629.1Znf281_MA1630.1Znf423_MA0116.1mix-a_MA0621.1
Peak
chr1:5982069-59829430000000000...0000000000
chr1:5982069-59829430000000000...0000000000
chr1:5982069-59829430000000000...0000000000
\n", "

3 rows × 746 columns

\n", "
" ], "text/plain": [ " ALX3_MA0634.1 ARGFX_MA1463.1 ARNT2_MA1464.1 \\\n", "Peak \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "\n", " ARNT::HIF1A_MA0259.1 ASCL1(var.2)_MA1631.1 \\\n", "Peak \n", "chr1:5982069-5982943 0 0 \n", "chr1:5982069-5982943 0 0 \n", "chr1:5982069-5982943 0 0 \n", "\n", " ASCL1_MA1100.2 ATF2_MA1632.1 ATF3_MA0605.2 \\\n", "Peak \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "\n", " ATF4_MA0833.2 ATF6_MA1466.1 ... ZNF740_MA0753.2 \\\n", "Peak ... \n", "chr1:5982069-5982943 0 0 ... 0 \n", "chr1:5982069-5982943 0 0 ... 0 \n", "chr1:5982069-5982943 0 0 ... 0 \n", "\n", " ZNF75D_MA1601.1 ZSCAN29_MA1602.1 ZSCAN4_MA1155.1 \\\n", "Peak \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "\n", " Zfx_MA0146.2 Zic1::Zic2_MA1628.1 Zic2_MA1629.1 \\\n", "Peak \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "\n", " Znf281_MA1630.1 Znf423_MA0116.1 mix-a_MA0621.1 \n", "Peak \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "chr1:5982069-5982943 0 0 0 \n", "\n", "[3 rows x 746 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.get_dummies(\n", " adata.uns[\"atac\"][\"peak_motifs_mapping\"].Motif\n", ").head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Summary statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`chame` also loads summary statistics for the dataset when available:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Sample ID': '10k_pbmc_ATACv2_nextgem_Chromium_X',\n", " 'Genome': 'GRCh38',\n", " 'Pipeline version': 'cellranger-atac-2.1.0',\n", " 'Estimated number of cells': 10273,\n", " 'Confidently mapped read pairs': 0.931,\n", " 'Estimated bulk library complexity': 369740731.2,\n", " 'Fraction of all fragments in cells': 0.933,\n", " 'Fraction of all fragments that pass all filters and overlap called peaks': 0.3067,\n", " 'Fraction of genome in peaks': 0.0451,\n", " 'Fraction of high-quality fragments in cells': 0.9574,\n", " 'Fraction of high-quality fragments overlapping TSS': 0.4601,\n", " 'Fraction of high-quality fragments overlapping peaks': 0.6586,\n", " 'Fraction of transposition events in peaks in cells': 0.6253,\n", " 'Fragments flanking a single nucleosome': 0.4298,\n", " 'Fragments in nucleosome-free regions': 0.4638,\n", " 'Mean raw read pairs per cell': 45448.7244,\n", " 'Median high-quality fragments per cell': 20046.0,\n", " 'Non-nuclear read pairs': 0.0013,\n", " 'Number of peaks': 164487,\n", " 'Percent duplicates': 0.4551,\n", " 'Q30 bases in barcode': 0.9005,\n", " 'Q30 bases in read 1': 0.9536,\n", " 'Q30 bases in read 2': 0.944,\n", " 'Q30 bases in sample index i1': 0.9172,\n", " 'Sequenced read pairs': 466894746,\n", " 'Sequencing saturation': 0.6143,\n", " 'TSS enrichment score': 10.0456,\n", " 'Unmapped read pairs': 0.0097,\n", " 'Valid barcodes': 0.9634}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.uns[\"summary\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }