{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 10x Genomics I/O"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data from the [scATAC-seq](https://www.10xgenomics.com/products/single-cell-atac) assay can be easily loaded with `chame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from chame.io import read_10x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`chame` has a built-in `datasets` module to donwload some datasets such as 10k PBMCs profiled with scATAC-seq.\n",
    "\n",
    "Original dataset is available [here](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-x-2-standard)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1/8) File filtered_peak_bc_matrix.h5 exists and checksum is validated\n",
      "(2/8) File fragments.tsv.gz exists and checksum is validated\n",
      "(3/8) File fragments.tsv.gz.tbi exists and checksum is validated\n",
      "(4/8) File peaks.bed exists and checksum is validated\n",
      "(5/8) File peak_annotation.tsv exists and checksum is validated\n",
      "(6/8) File peak_motif_mapping.bed exists and checksum is validated\n",
      "(7/8) File filtered_tf_bc_matrix.h5 exists and checksum is validated\n",
      "(8/8) File singlecell.csv exists and checksum is validated\n"
     ]
    }
   ],
   "source": [
    "from chame.data.datasets import pbmc10k_10x_v2\n",
    "pbmc10k_10x_v2.download(path=\"data/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading chromatin accessibility data from 10x Genomics files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load data from the downloaded directory. By default, the dataset is loaded into [an AnnData object](https://github.com/scverse/anndata):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 10273 × 164487\n",
       "    var: 'gene_ids', 'feature_types', 'genome', 'Chromosome', 'Start', 'End'\n",
       "    uns: 'summary', 'atac', 'files'\n",
       "    obsm: 'tf'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata = read_10x(\"data/pbmc10k_10x_v2/\")\n",
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Peak counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Count matrix `cells x peaks` is accessible via the `.X` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<10273x164487 sparse matrix of type '<class 'numpy.float32'>'\n",
       "\twith 106888535 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Feature information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Information about individual peaks is accessible via the `.var` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gene_ids</th>\n",
       "      <th>feature_types</th>\n",
       "      <th>genome</th>\n",
       "      <th>Chromosome</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>chr1:9779-10664</th>\n",
       "      <td>chr1:9779-10664</td>\n",
       "      <td>Peaks</td>\n",
       "      <td>GRCh38</td>\n",
       "      <td>chr1</td>\n",
       "      <td>9779</td>\n",
       "      <td>10664</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>chr1:180669-181170</th>\n",
       "      <td>chr1:180669-181170</td>\n",
       "      <td>Peaks</td>\n",
       "      <td>GRCh38</td>\n",
       "      <td>chr1</td>\n",
       "      <td>180669</td>\n",
       "      <td>181170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>chr1:181245-181570</th>\n",
       "      <td>chr1:181245-181570</td>\n",
       "      <td>Peaks</td>\n",
       "      <td>GRCh38</td>\n",
       "      <td>chr1</td>\n",
       "      <td>181245</td>\n",
       "      <td>181570</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>chr1:184013-184896</th>\n",
       "      <td>chr1:184013-184896</td>\n",
       "      <td>Peaks</td>\n",
       "      <td>GRCh38</td>\n",
       "      <td>chr1</td>\n",
       "      <td>184013</td>\n",
       "      <td>184896</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>chr1:191222-192099</th>\n",
       "      <td>chr1:191222-192099</td>\n",
       "      <td>Peaks</td>\n",
       "      <td>GRCh38</td>\n",
       "      <td>chr1</td>\n",
       "      <td>191222</td>\n",
       "      <td>192099</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              gene_ids feature_types  genome Chromosome  \\\n",
       "chr1:9779-10664        chr1:9779-10664         Peaks  GRCh38       chr1   \n",
       "chr1:180669-181170  chr1:180669-181170         Peaks  GRCh38       chr1   \n",
       "chr1:181245-181570  chr1:181245-181570         Peaks  GRCh38       chr1   \n",
       "chr1:184013-184896  chr1:184013-184896         Peaks  GRCh38       chr1   \n",
       "chr1:191222-192099  chr1:191222-192099         Peaks  GRCh38       chr1   \n",
       "\n",
       "                     Start     End  \n",
       "chr1:9779-10664       9779   10664  \n",
       "chr1:180669-181170  180669  181170  \n",
       "chr1:181245-181570  181245  181570  \n",
       "chr1:184013-184896  184013  184896  \n",
       "chr1:191222-192099  191222  192099  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.var.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Peak information in `.var` can be used to construct a [PyRanges](https://github.com/biocore-ntnu/pyranges) object on the fly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "+--------------------------+-----------------+------------+--------------+-----------+-----------+\n",
       "| gene_ids                 | feature_types   | genome     | Chromosome   | Start     | End       |\n",
       "| (object)                 | (object)        | (object)   | (category)   | (int32)   | (int32)   |\n",
       "|--------------------------+-----------------+------------+--------------+-----------+-----------|\n",
       "| GL000194.1:55723-56585   | Peaks           | GRCh38     | GL000194.1   | 55723     | 56585     |\n",
       "| GL000194.1:58195-58977   | Peaks           | GRCh38     | GL000194.1   | 58195     | 58977     |\n",
       "| GL000194.1:100995-101880 | Peaks           | GRCh38     | GL000194.1   | 100995    | 101880    |\n",
       "| GL000194.1:110708-111523 | Peaks           | GRCh38     | GL000194.1   | 110708    | 111523    |\n",
       "| ...                      | ...             | ...        | ...          | ...       | ...       |\n",
       "| chrY:26670692-26671520   | Peaks           | GRCh38     | chrY         | 26670692  | 26671520  |\n",
       "| chrY:56734361-56735235   | Peaks           | GRCh38     | chrY         | 56734361  | 56735235  |\n",
       "| chrY:56833820-56834655   | Peaks           | GRCh38     | chrY         | 56833820  | 56834655  |\n",
       "| chrY:56836486-56837375   | Peaks           | GRCh38     | chrY         | 56836486  | 56837375  |\n",
       "+--------------------------+-----------------+------------+--------------+-----------+-----------+\n",
       "Unstranded PyRanges object has 164,487 rows and 6 columns from 37 chromosomes.\n",
       "For printing, the PyRanges was sorted on Chromosome."
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pyranges\n",
    "pyranges.PyRanges(adata.var)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Fragments and peak annotation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`chame` detects some default files including `peak_annotation.tsv` and `fragments.tsv.gz`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['peak_annotation', 'peak_motifs_mapping']) {'fragments': 'data/pbmc10k_10x_v2/fragments.tsv.gz'}\n"
     ]
    }
   ],
   "source": [
    "print(\n",
    "    adata.uns[\"atac\"].keys(),\n",
    "    adata.uns['files'],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the peak-motif mapping we can construct a binary peak-motif table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ALX3_MA0634.1</th>\n",
       "      <th>ARGFX_MA1463.1</th>\n",
       "      <th>ARNT2_MA1464.1</th>\n",
       "      <th>ARNT::HIF1A_MA0259.1</th>\n",
       "      <th>ASCL1(var.2)_MA1631.1</th>\n",
       "      <th>ASCL1_MA1100.2</th>\n",
       "      <th>ATF2_MA1632.1</th>\n",
       "      <th>ATF3_MA0605.2</th>\n",
       "      <th>ATF4_MA0833.2</th>\n",
       "      <th>ATF6_MA1466.1</th>\n",
       "      <th>...</th>\n",
       "      <th>ZNF740_MA0753.2</th>\n",
       "      <th>ZNF75D_MA1601.1</th>\n",
       "      <th>ZSCAN29_MA1602.1</th>\n",
       "      <th>ZSCAN4_MA1155.1</th>\n",
       "      <th>Zfx_MA0146.2</th>\n",
       "      <th>Zic1::Zic2_MA1628.1</th>\n",
       "      <th>Zic2_MA1629.1</th>\n",
       "      <th>Znf281_MA1630.1</th>\n",
       "      <th>Znf423_MA0116.1</th>\n",
       "      <th>mix-a_MA0621.1</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Peak</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>chr1:5982069-5982943</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>chr1:5982069-5982943</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>chr1:5982069-5982943</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 746 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                      ALX3_MA0634.1  ARGFX_MA1463.1  ARNT2_MA1464.1  \\\n",
       "Peak                                                                  \n",
       "chr1:5982069-5982943              0               0               0   \n",
       "chr1:5982069-5982943              0               0               0   \n",
       "chr1:5982069-5982943              0               0               0   \n",
       "\n",
       "                      ARNT::HIF1A_MA0259.1  ASCL1(var.2)_MA1631.1  \\\n",
       "Peak                                                                \n",
       "chr1:5982069-5982943                     0                      0   \n",
       "chr1:5982069-5982943                     0                      0   \n",
       "chr1:5982069-5982943                     0                      0   \n",
       "\n",
       "                      ASCL1_MA1100.2  ATF2_MA1632.1  ATF3_MA0605.2  \\\n",
       "Peak                                                                 \n",
       "chr1:5982069-5982943               0              0              0   \n",
       "chr1:5982069-5982943               0              0              0   \n",
       "chr1:5982069-5982943               0              0              0   \n",
       "\n",
       "                      ATF4_MA0833.2  ATF6_MA1466.1  ...  ZNF740_MA0753.2  \\\n",
       "Peak                                                ...                    \n",
       "chr1:5982069-5982943              0              0  ...                0   \n",
       "chr1:5982069-5982943              0              0  ...                0   \n",
       "chr1:5982069-5982943              0              0  ...                0   \n",
       "\n",
       "                      ZNF75D_MA1601.1  ZSCAN29_MA1602.1  ZSCAN4_MA1155.1  \\\n",
       "Peak                                                                       \n",
       "chr1:5982069-5982943                0                 0                0   \n",
       "chr1:5982069-5982943                0                 0                0   \n",
       "chr1:5982069-5982943                0                 0                0   \n",
       "\n",
       "                      Zfx_MA0146.2  Zic1::Zic2_MA1628.1  Zic2_MA1629.1  \\\n",
       "Peak                                                                     \n",
       "chr1:5982069-5982943             0                    0              0   \n",
       "chr1:5982069-5982943             0                    0              0   \n",
       "chr1:5982069-5982943             0                    0              0   \n",
       "\n",
       "                      Znf281_MA1630.1  Znf423_MA0116.1  mix-a_MA0621.1  \n",
       "Peak                                                                    \n",
       "chr1:5982069-5982943                0                0               0  \n",
       "chr1:5982069-5982943                0                0               0  \n",
       "chr1:5982069-5982943                0                0               0  \n",
       "\n",
       "[3 rows x 746 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pd.get_dummies(\n",
    "    adata.uns[\"atac\"][\"peak_motifs_mapping\"].Motif\n",
    ").head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Summary statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`chame` also loads summary statistics for the dataset when available:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Sample ID': '10k_pbmc_ATACv2_nextgem_Chromium_X',\n",
       " 'Genome': 'GRCh38',\n",
       " 'Pipeline version': 'cellranger-atac-2.1.0',\n",
       " 'Estimated number of cells': 10273,\n",
       " 'Confidently mapped read pairs': 0.931,\n",
       " 'Estimated bulk library complexity': 369740731.2,\n",
       " 'Fraction of all fragments in cells': 0.933,\n",
       " 'Fraction of all fragments that pass all filters and overlap called peaks': 0.3067,\n",
       " 'Fraction of genome in peaks': 0.0451,\n",
       " 'Fraction of high-quality fragments in cells': 0.9574,\n",
       " 'Fraction of high-quality fragments overlapping TSS': 0.4601,\n",
       " 'Fraction of high-quality fragments overlapping peaks': 0.6586,\n",
       " 'Fraction of transposition events in peaks in cells': 0.6253,\n",
       " 'Fragments flanking a single nucleosome': 0.4298,\n",
       " 'Fragments in nucleosome-free regions': 0.4638,\n",
       " 'Mean raw read pairs per cell': 45448.7244,\n",
       " 'Median high-quality fragments per cell': 20046.0,\n",
       " 'Non-nuclear read pairs': 0.0013,\n",
       " 'Number of peaks': 164487,\n",
       " 'Percent duplicates': 0.4551,\n",
       " 'Q30 bases in barcode': 0.9005,\n",
       " 'Q30 bases in read 1': 0.9536,\n",
       " 'Q30 bases in read 2': 0.944,\n",
       " 'Q30 bases in sample index i1': 0.9172,\n",
       " 'Sequenced read pairs': 466894746,\n",
       " 'Sequencing saturation': 0.6143,\n",
       " 'TSS enrichment score': 10.0456,\n",
       " 'Unmapped read pairs': 0.0097,\n",
       " 'Valid barcodes': 0.9634}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.uns[\"summary\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}