ArchR I/O
Contents
ArchR I/O#
ArchR is an R package for scATAC-seq data analysis (Granja et al., 2021).
chame
comes with a reader for ArchR’s Arrow files:
[1]:
from chame.io import read_arrow
ArchR: Creating Arrow Files#
For resproducibility, we will follow ArchR’s own tutorial on creating Arrow files.
Here, we use rpy2
for R calls, and it is assumed that the corresponding R environemnt is configured and has ArchR installed.
In practice, one would most likely use existing Arrow files. So feel free to skip this section!
[2]:
import os
os.chdir("../../")
if not os.path.exists("archr"):
os.mkdir("archr")
os.chdir("archr/")
[3]:
from rpy2.robjects.packages import importr
archr = importr("ArchR")
R[write to console]:
/ |
/ \
. / |.
\\\ / |.
\\\ / `|.
\\\ / |.
\ / |\
\\#####\ / ||
==###########> / ||
\\##==......\ / ||
______ = =|__ /__ || \\\
,--' ,----`-,__ ___/' --,-`-===================##========>
\ ' ##_______ _____ ,--,__,=##,__ ///
, __== ___,-,__,--'#' ===' `-' | ##,-/
-,____,---' \\####\\________________,--\\_##,/
___ .______ ______ __ __ .______
/ \ | _ \ / || | | | | _ \
/ ^ \ | |_) | | ,----'| |__| | | |_) |
/ /_\ \ | / | | | __ | | /
/ _____ \ | |\ \\___ | `----.| | | | | |\ \\___.
/__/ \__\ | _| `._____| \______||__| |__| | _| `._____|
R[write to console]: Setting default number of Parallel threads to 4.
R[write to console]: Setting addArchRVerbose = FALSE
[4]:
archr.addArchRGenome("hg19");
archr.addArchRThreads(threads = 6);
R[write to console]: Setting default genome to Hg19.
R[write to console]: Setting default number of Parallel threads to 6.
[5]:
inputFiles = archr.getTutorialData("Hematopoiesis")
arrowFiles = archr.createArrowFiles(
inputFiles=inputFiles,
sampleNames=inputFiles.names,
minTSS=4,
filterFrags=1000,
addTileMat=True,
addGeneScoreMat=True,
)
Reading matrices from Arrow files#
Tiles#
With a matrix specified, an AnnData object will be returned. For tiles, it would have dimensions n_cells by n_tiles:
[6]:
tiles = read_arrow("scATAC_PBMC_R1.arrow", matrix="tiles")
tiles
[6]:
AnnData object with n_obs × n_vars = 2453 × 6072620
obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
var: 'Chromosome', 'idx', 'Start', 'End'
uns: 'params'
As displayed, this AnnData object has cell metadata included together with the information about features (tiles):
[7]:
tiles.obs.head(2)
[7]:
BlacklistRatio | CellNames | Completed | Date | NucleosomeRatio | PassQC | PromoterRatio | ReadsInBlacklist | ReadsInPromoter | ReadsInTSS | Sample | TSSEnrichment | nDiFrags | nFrags | nMonoFrags | nMultiFrags | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CCGTGAGAGACCATAA-1 | 0.010204 | CCGTGAGAGACCATAA-1 | Finished | 2022-06-10 | 3.578472 | 1.0 | 0.064401 | 1861.0 | 11746.0 | 2926.0 | scATAC_PBMC_R1 | 6.652 | 47936.0 | 91194.0 | 19918.0 | 23340.0 |
TGCACCTTCGGCTATA-1 | 0.008097 | TGCACCTTCGGCTATA-1 | Finished | 2022-06-10 | 4.809345 | 1.0 | 0.042707 | 1351.0 | 7126.0 | 1571.0 | scATAC_PBMC_R1 | 4.699 | 46671.0 | 83428.0 | 14361.0 | 22396.0 |
[8]:
tiles.var.head()
[8]:
Chromosome | idx | Start | End | |
---|---|---|---|---|
chr1:0-500 | chr1 | 1 | 0 | 500 |
chr1:500-1000 | chr1 | 2 | 500 | 1000 |
chr1:1000-1500 | chr1 | 3 | 1000 | 1500 |
chr1:1500-2000 | chr1 | 4 | 1500 | 2000 |
chr1:2000-2500 | chr1 | 5 | 2000 | 2500 |
↑ As AnnData currently requires string indices for both cells and features, tiles here are automatically named by chame
according to their coordinates.
Notably, chame
strives to return data frames that can be readily used to create PyRanges objects:
[9]:
from pyranges import PyRanges
pr = PyRanges(tiles.var)
pr["chr1", 0:4200]
[9]:
+--------------+-----------+-----------+-----------+
| Chromosome | idx | Start | End |
| (category) | (int32) | (int32) | (int32) |
|--------------+-----------+-----------+-----------|
| chr1 | 1 | 0 | 500 |
| chr1 | 2 | 500 | 1000 |
| chr1 | 3 | 1000 | 1500 |
| chr1 | 4 | 1500 | 2000 |
| ... | ... | ... | ... |
| chr1 | 6 | 2500 | 3000 |
| chr1 | 7 | 3000 | 3500 |
| chr1 | 8 | 3500 | 4000 |
| chr1 | 9 | 4000 | 4500 |
+--------------+-----------+-----------+-----------+
Unstranded PyRanges object has 9 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
Params for the TileMatrix from the Arrow file are carefully preserved and put into .uns
:
[10]:
tiles.uns["params"][:3]
[10]:
[{'seqnames': 'chr1', 'length': 249250621, 'tileSize': 500.0},
{'seqnames': 'chr2', 'length': 243199373, 'tileSize': 500.0},
{'seqnames': 'chr3', 'length': 198022430, 'tileSize': 500.0}]
Gene scores#
For gene scores, variables will be respectively genes:
[11]:
gene_scores = read_arrow("scATAC_PBMC_R1.arrow", matrix="gene_scores")
gene_scores
[11]:
AnnData object with n_obs × n_vars = 2453 × 23127
obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags'
var: 'seqnames', 'start', 'end', 'strand', 'name', 'idx'
uns: 'params'
Params for the GeneScoresMatrix from the Arrow file are also put into .uns
:
[12]:
gene_scores.uns["params"][:3]
[12]:
[{'extendUpstream': 1000.0,
'extendDownstream': 1000.0,
'geneUpstream': 1000.0,
'geneDownstream': 1000.0,
'scaleTo': 10000.0,
'tileSize': 500.0,
'ceiling': 4.0,
'geneModel': b'exp(-abs(x)/5000) + exp(-1)'},
{'extendUpstream': 100000.0,
'extendDownstream': 100000.0,
'geneUpstream': 100000.0,
'geneDownstream': 100000.0,
'scaleTo': 10000.0,
'tileSize': 500.0,
'ceiling': 4.0,
'geneModel': b'exp(-abs(x)/5000) + exp(-1)'}]
All matrices#
If no matrix is specified, all available matrices are read into a MuData object by default:
[13]:
mdata = read_arrow("scATAC_PBMC_R1.arrow")
mdata
[13]:
MuData object with n_obs × n_vars = 2453 × 6095747 obs: 'BlacklistRatio', 'CellNames', 'Completed', 'Date', 'NucleosomeRatio', 'PassQC', 'PromoterRatio', 'ReadsInBlacklist', 'ReadsInPromoter', 'ReadsInTSS', 'Sample', 'TSSEnrichment', 'nDiFrags', 'nFrags', 'nMonoFrags', 'nMultiFrags' var: 'idx' 2 modalities tiles: 2453 x 6072620 var: 'Chromosome', 'idx', 'Start', 'End' uns: 'params' gene_scores: 2453 x 23127 var: 'seqnames', 'start', 'end', 'strand', 'name', 'idx' uns: 'params'