ArchR I/O#

ArchR is an R package for scATAC-seq data analysis (Granja et al., 2021).

chame comes with a reader for ArchR’s Arrow files:

[1]:

from chame.io import read_arrow

ArchR: Creating Arrow Files#

For resproducibility, we will follow ArchR’s own tutorial on creating Arrow files.

Here, we use rpy2 for R calls, and it is assumed that the corresponding R environemnt is configured and has ArchR installed.

In practice, one would most likely use existing Arrow files. So feel free to skip this section!

[2]:

import os
os.chdir("../../")
if not os.path.exists("archr"):
    os.mkdir("archr")
os.chdir("archr/")

[3]:

from rpy2.robjects.packages import importr

archr = importr("ArchR")

R[write to console]:
                                                   / |
                                                 /    \
            .                                  /      |.
            \\\                              /        |.
              \\\                          /           `|.
                \\\                      /              |.
                  \                    /                |\
                  \\#####\           /                  ||
                ==###########>      /                   ||
                 \\##==......\    /                     ||
            ______ =       =|__ /__                     ||      \\\
        ,--' ,----`-,__ ___/'  --,-`-===================##========>
       \               '        ##_______ _____ ,--,__,=##,__   ///
        ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
        -,____,---'       \\####\\________________,--\\_##,/
           ___      .______        ______  __    __  .______
          /   \     |   _  \      /      ||  |  |  | |   _  \
         /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |
        /  /_\  \   |      /     |  |     |   __   | |      /
       /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
      /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|


R[write to console]: Setting default number of Parallel threads to 4.

R[write to console]: Setting addArchRVerbose = FALSE

[4]:

archr.addArchRGenome("hg19");
archr.addArchRThreads(threads = 6);

R[write to console]: Setting default genome to Hg19.

R[write to console]: Setting default number of Parallel threads to 6.

[5]:

inputFiles = archr.getTutorialData("Hematopoiesis")

arrowFiles = archr.createArrowFiles(
  inputFiles=inputFiles,
  sampleNames=inputFiles.names,
  minTSS=4,
  filterFrags=1000,
  addTileMat=True,
  addGeneScoreMat=True,
)

Reading matrices from Arrow files#

Tiles#

With a matrix specified, an AnnData object will be returned. For tiles, it would have dimensions n_cells by n_tiles:

[6]:

tiles = read_arrow("scATAC_PBMC_R1.arrow", matrix="tiles")
tiles

[6]:

AnnData object with n_obs × n_vars = 2453 × 6072620
    obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
    var: 'Chromosome', 'idx', 'Start', 'End'
    uns: 'params'

As displayed, this AnnData object has cell metadata included together with the information about features (tiles):

[7]:

tiles.obs.head(2)

[7]:

	BlacklistRatio	CellNames	Completed	Date	NucleosomeRatio	PassQC	PromoterRatio	ReadsInBlacklist	ReadsInPromoter	ReadsInTSS	Sample	TSSEnrichment	nDiFrags	nFrags	nMonoFrags	nMultiFrags
CCGTGAGAGACCATAA-1	0.010204	CCGTGAGAGACCATAA-1	Finished	2022-06-10	3.578472	1.0	0.064401	1861.0	11746.0	2926.0	scATAC_PBMC_R1	6.652	47936.0	91194.0	19918.0	23340.0
TGCACCTTCGGCTATA-1	0.008097	TGCACCTTCGGCTATA-1	Finished	2022-06-10	4.809345	1.0	0.042707	1351.0	7126.0	1571.0	scATAC_PBMC_R1	4.699	46671.0	83428.0	14361.0	22396.0

[8]:

tiles.var.head()

[8]:

	Chromosome	idx	Start	End
chr1:0-500	chr1	1	0	500
chr1:500-1000	chr1	2	500	1000
chr1:1000-1500	chr1	3	1000	1500
chr1:1500-2000	chr1	4	1500	2000
chr1:2000-2500	chr1	5	2000	2500

↑ As AnnData currently requires string indices for both cells and features, tiles here are automatically named by chame according to their coordinates.

Notably, chame strives to return data frames that can be readily used to create PyRanges objects:

[9]:

from pyranges import PyRanges
pr = PyRanges(tiles.var)

pr["chr1", 0:4200]

[9]:

+--------------+-----------+-----------+-----------+
| Chromosome   | idx       | Start     | End       |
| (category)   | (int32)   | (int32)   | (int32)   |
|--------------+-----------+-----------+-----------|
| chr1         | 1         | 0         | 500       |
| chr1         | 2         | 500       | 1000      |
| chr1         | 3         | 1000      | 1500      |
| chr1         | 4         | 1500      | 2000      |
| ...          | ...       | ...       | ...       |
| chr1         | 6         | 2500      | 3000      |
| chr1         | 7         | 3000      | 3500      |
| chr1         | 8         | 3500      | 4000      |
| chr1         | 9         | 4000      | 4500      |
+--------------+-----------+-----------+-----------+
Unstranded PyRanges object has 9 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

Params for the TileMatrix from the Arrow file are carefully preserved and put into .uns:

[10]:

tiles.uns["params"][:3]

[10]:

[{'seqnames': 'chr1', 'length': 249250621, 'tileSize': 500.0},
 {'seqnames': 'chr2', 'length': 243199373, 'tileSize': 500.0},
 {'seqnames': 'chr3', 'length': 198022430, 'tileSize': 500.0}]

Gene scores#

For gene scores, variables will be respectively genes:

[11]:

gene_scores = read_arrow("scATAC_PBMC_R1.arrow", matrix="gene_scores")
gene_scores

[11]:

AnnData object with n_obs × n_vars = 2453 × 23127
    obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
    var: 'seqnames', 'start', 'end', 'strand', 'name', 'idx'
    uns: 'params'

Params for the GeneScoresMatrix from the Arrow file are also put into .uns:

[12]:

gene_scores.uns["params"][:3]

[12]:

[{'extendUpstream': 1000.0,
  'extendDownstream': 1000.0,
  'geneUpstream': 1000.0,
  'geneDownstream': 1000.0,
  'scaleTo': 10000.0,
  'tileSize': 500.0,
  'ceiling': 4.0,
  'geneModel': b'exp(-abs(x)/5000) + exp(-1)'},
 {'extendUpstream': 100000.0,
  'extendDownstream': 100000.0,
  'geneUpstream': 100000.0,
  'geneDownstream': 100000.0,
  'scaleTo': 10000.0,
  'tileSize': 500.0,
  'ceiling': 4.0,
  'geneModel': b'exp(-abs(x)/5000) + exp(-1)'}]

All matrices#

If no matrix is specified, all available matrices are read into a MuData object by default:

[13]:

mdata = read_arrow("scATAC_PBMC_R1.arrow")
mdata

[13]:

MuData object with n_obs × n_vars = 2453 × 6095747
  obs:      'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
  var:      'idx'
  2 modalities
    tiles:  2453 x 6072620
      var:  'Chromosome', 'idx', 'Start', 'End'
      uns:  'params'
    gene_scores:    2453 x 23127
      var:  'seqnames', 'start', 'end', 'strand', 'name', 'idx'
      uns:  'params'

ArchR I/O

Contents