ArchR I/O#

ArchR is an R package for scATAC-seq data analysis (Granja et al., 2021).

chame comes with a reader for ArchR’s Arrow files:

[1]:
from chame.io import read_arrow

ArchR: Creating Arrow Files#

For resproducibility, we will follow ArchR’s own tutorial on creating Arrow files.

Here, we use rpy2 for R calls, and it is assumed that the corresponding R environemnt is configured and has ArchR installed.

In practice, one would most likely use existing Arrow files. So feel free to skip this section!

[2]:
import os
os.chdir("../../")
if not os.path.exists("archr"):
    os.mkdir("archr")
os.chdir("archr/")
[3]:
from rpy2.robjects.packages import importr

archr = importr("ArchR")
R[write to console]:
                                                   / |
                                                 /    \
            .                                  /      |.
            \\\                              /        |.
              \\\                          /           `|.
                \\\                      /              |.
                  \                    /                |\
                  \\#####\           /                  ||
                ==###########>      /                   ||
                 \\##==......\    /                     ||
            ______ =       =|__ /__                     ||      \\\
        ,--' ,----`-,__ ___/'  --,-`-===================##========>
       \               '        ##_______ _____ ,--,__,=##,__   ///
        ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
        -,____,---'       \\####\\________________,--\\_##,/
           ___      .______        ______  __    __  .______
          /   \     |   _  \      /      ||  |  |  | |   _  \
         /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |
        /  /_\  \   |      /     |  |     |   __   | |      /
       /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
      /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|


R[write to console]: Setting default number of Parallel threads to 4.

R[write to console]: Setting addArchRVerbose = FALSE

[4]:
archr.addArchRGenome("hg19");
archr.addArchRThreads(threads = 6);
R[write to console]: Setting default genome to Hg19.

R[write to console]: Setting default number of Parallel threads to 6.

[5]:
inputFiles = archr.getTutorialData("Hematopoiesis")

arrowFiles = archr.createArrowFiles(
  inputFiles=inputFiles,
  sampleNames=inputFiles.names,
  minTSS=4,
  filterFrags=1000,
  addTileMat=True,
  addGeneScoreMat=True,
)

Reading matrices from Arrow files#

Tiles#

With a matrix specified, an AnnData object will be returned. For tiles, it would have dimensions n_cells by n_tiles:

[6]:
tiles = read_arrow("scATAC_PBMC_R1.arrow", matrix="tiles")
tiles
[6]:
AnnData object with n_obs × n_vars = 2453 × 6072620
    obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
    var: 'Chromosome', 'idx', 'Start', 'End'
    uns: 'params'

As displayed, this AnnData object has cell metadata included together with the information about features (tiles):

[7]:
tiles.obs.head(2)
[7]:
BlacklistRatio CellNames Completed Date NucleosomeRatio PassQC PromoterRatio ReadsInBlacklist ReadsInPromoter ReadsInTSS Sample TSSEnrichment nDiFrags nFrags nMonoFrags nMultiFrags
CCGTGAGAGACCATAA-1 0.010204 CCGTGAGAGACCATAA-1 Finished 2022-06-10 3.578472 1.0 0.064401 1861.0 11746.0 2926.0 scATAC_PBMC_R1 6.652 47936.0 91194.0 19918.0 23340.0
TGCACCTTCGGCTATA-1 0.008097 TGCACCTTCGGCTATA-1 Finished 2022-06-10 4.809345 1.0 0.042707 1351.0 7126.0 1571.0 scATAC_PBMC_R1 4.699 46671.0 83428.0 14361.0 22396.0
[8]:
tiles.var.head()
[8]:
Chromosome idx Start End
chr1:0-500 chr1 1 0 500
chr1:500-1000 chr1 2 500 1000
chr1:1000-1500 chr1 3 1000 1500
chr1:1500-2000 chr1 4 1500 2000
chr1:2000-2500 chr1 5 2000 2500

↑ As AnnData currently requires string indices for both cells and features, tiles here are automatically named by chame according to their coordinates.

Notably, chame strives to return data frames that can be readily used to create PyRanges objects:

[9]:
from pyranges import PyRanges
pr = PyRanges(tiles.var)

pr["chr1", 0:4200]
[9]:
+--------------+-----------+-----------+-----------+
| Chromosome   | idx       | Start     | End       |
| (category)   | (int32)   | (int32)   | (int32)   |
|--------------+-----------+-----------+-----------|
| chr1         | 1         | 0         | 500       |
| chr1         | 2         | 500       | 1000      |
| chr1         | 3         | 1000      | 1500      |
| chr1         | 4         | 1500      | 2000      |
| ...          | ...       | ...       | ...       |
| chr1         | 6         | 2500      | 3000      |
| chr1         | 7         | 3000      | 3500      |
| chr1         | 8         | 3500      | 4000      |
| chr1         | 9         | 4000      | 4500      |
+--------------+-----------+-----------+-----------+
Unstranded PyRanges object has 9 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

Params for the TileMatrix from the Arrow file are carefully preserved and put into .uns:

[10]:
tiles.uns["params"][:3]
[10]:
[{'seqnames': 'chr1', 'length': 249250621, 'tileSize': 500.0},
 {'seqnames': 'chr2', 'length': 243199373, 'tileSize': 500.0},
 {'seqnames': 'chr3', 'length': 198022430, 'tileSize': 500.0}]

Gene scores#

For gene scores, variables will be respectively genes:

[11]:
gene_scores = read_arrow("scATAC_PBMC_R1.arrow", matrix="gene_scores")
gene_scores
[11]:
AnnData object with n_obs × n_vars = 2453 × 23127
    obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
    var: 'seqnames', 'start', 'end', 'strand', 'name', 'idx'
    uns: 'params'

Params for the GeneScoresMatrix from the Arrow file are also put into .uns:

[12]:
gene_scores.uns["params"][:3]
[12]:
[{'extendUpstream': 1000.0,
  'extendDownstream': 1000.0,
  'geneUpstream': 1000.0,
  'geneDownstream': 1000.0,
  'scaleTo': 10000.0,
  'tileSize': 500.0,
  'ceiling': 4.0,
  'geneModel': b'exp(-abs(x)/5000) + exp(-1)'},
 {'extendUpstream': 100000.0,
  'extendDownstream': 100000.0,
  'geneUpstream': 100000.0,
  'geneDownstream': 100000.0,
  'scaleTo': 10000.0,
  'tileSize': 500.0,
  'ceiling': 4.0,
  'geneModel': b'exp(-abs(x)/5000) + exp(-1)'}]

All matrices#

If no matrix is specified, all available matrices are read into a MuData object by default:

[13]:
mdata = read_arrow("scATAC_PBMC_R1.arrow")
mdata
[13]:
MuData object with n_obs × n_vars = 2453 × 6095747
  obs:      'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
  var:      'idx'
  2 modalities
    tiles:  2453 x 6072620
      var:  'Chromosome', 'idx', 'Start', 'End'
      uns:  'params'
    gene_scores:    2453 x 23127
      var:  'seqnames', 'start', 'end', 'strand', 'name', 'idx'
      uns:  'params'