[1]:
from pathlib import Path
import json
import pyarrow.parquet as pq
[2]:
import mudata
mudata.set_options(pull_on_update=False)
[2]:
<mudata._core.config.set_options at 0x35f684a40>
Parquet-based serialisation#
The anndata library implements multiple ways to serialise the data, including the default HDF5-based H5AD files as well as Zarr files. The same interfaces are available in the mudata library — for H5MU files and Zarr files.
While largely following the same format specification (AnnData, MuData), different serialisation formats can provide benefits in various applications and scenarios.
pqdata implements a similar serialisation strategy, which is quite natural and can be beneficial for memory and space efficiency as well as for integration with certain analytical tools. It is based on Parquet files, and hence the format is referred to here as pqdata files.
For a sneak peek into the usage and the benefits of Parquet files for analytics, you might check the following references:
Why parquet files are my preferred API for bulk open data (January 2023)
Extracting, converting, and querying data in local files using clickhouse-local (January 2023)
Reflecting on Apache Arrow in 2022 (December 2022)
Holy Duck! Fast Analysis with DuckDB + Pyarrow (April 2022)
SQL Query on MinIO (February 2022)
SQL Query on Parquet Files with DataFusion (January 2022)
DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB (December 2021)
Querying Parquet with Precision using DuckDB (June 2021)
Import writers:
[3]:
from pqdata import write_anndata, write_mudata
… and readers:
[4]:
from pqdata import read_anndata, read_mudata
First, let’s download a multimodal dataset in the H5MU format:
[5]:
data = Path("data")
[6]:
import mudatasets
mdata = mudatasets.load("pbmc5k_citeseq", files=["minipbcite.h5mu"], data_dir=data, backed=False)
■ File minipbcite.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/minipbcite.h5mu
■ Checksum is validated (md5) for minipbcite.h5mu
■ Loading minipbcite.h5mu...
Writing#
Write the MuData object:
[7]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata"
write_mudata(mdata, file)
Notice that existing data is not overwritten by this writer by default:
[8]:
try:
    # need to provide overwrite=True to overwrite
    write_mudata(mdata, file)
except FileExistsError as e:
    print(e)
[Errno 17] File exists: 'data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata'
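To replace the existing file, the writer can be called with the flag from the comment above:

# overwrite an existing .pqdata directory explicitly
write_mudata(mdata, file, overwrite=True)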
File size#
We can also notice the smaller file size:
[9]:
! du -sh data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
9.0M data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
[10]:
! du -sh data/pbmc5k_citeseq/minipbcite.h5mu
16M data/pbmc5k_citeseq/minipbcite.h5mu
On larger datasets, the difference will be more noticeable. Moreover, other compression strategies can be used to get even smaller files:
[11]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata_zstd.pqdata"
write_mudata(mdata, file, compression="zstd")
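As a quick sanity check, the resulting directory sizes can also be compared from Python. A minimal sketch using only pathlib, with the two output paths from the cells above:

def dir_size_mb(path):
    # sum the sizes of all files under a .pqdata directory
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1024**2

for name in ["pbmc5k_citeseq_mudata.pqdata", "pbmc5k_citeseq_mudata_zstd.pqdata"]:
    p = data / "pbmc5k_citeseq" / name
    print(f"{name}: {dir_size_mb(p):.1f}M")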
Sometimes it’s easier to share the data as a single file, and archiving and compression will make the volume even smaller:
[12]:
! tar -czf minipbcite.tar.gz data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
[13]:
! du -sh minipbcite.tar.gz
7.9M minipbcite.tar.gz
.pqdata structure#
We can see how the dataset is stored on disk:
[14]:
! tree data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
├── mod
│ ├── prot
│ │ ├── X.parquet
│ │ ├── layers
│ │ │ └── counts.parquet
│ │ ├── obs.parquet
│ │ ├── obsm
│ │ │ ├── X_pca.parquet
│ │ │ └── X_umap.parquet
│ │ ├── obsp
│ │ │ ├── connectivities.parquet
│ │ │ └── distances.parquet
│ │ ├── uns
│ │ │ └── pca
│ │ │ ├── variance.parquet
│ │ │ └── variance_ratio.parquet
│ │ ├── uns.json
│ │ ├── var.parquet
│ │ └── varm
│ │ └── PCs.parquet
│ └── rna
│ ├── X.parquet
│ ├── obs.parquet
│ ├── obsm
│ │ ├── X_pca.parquet
│ │ └── X_umap.parquet
│ ├── obsp
│ │ ├── connectivities.parquet
│ │ └── distances.parquet
│ ├── uns
│ │ ├── celltype_colors.parquet
│ │ ├── leiden_colors.parquet
│ │ ├── pca
│ │ │ ├── variance.parquet
│ │ │ └── variance_ratio.parquet
│ │ └── rank_genes_groups
│ │ ├── logfoldchanges.parquet
│ │ ├── names.parquet
│ │ ├── pvals.parquet
│ │ ├── pvals_adj.parquet
│ │ └── scores.parquet
│ ├── uns.json
│ ├── var.parquet
│ └── varm
│ └── PCs.parquet
├── obs.parquet
├── obsm
│ ├── X_mofa.parquet
│ ├── X_mofa_umap.parquet
│ ├── X_umap.parquet
│ └── X_wnn_umap.parquet
├── obsmap
│ ├── prot.parquet
│ └── rna.parquet
├── obsp
│ ├── connectivities.parquet
│ ├── distances.parquet
│ ├── wnn_connectivities.parquet
│ └── wnn_distances.parquet
├── pqdata.json
├── var.parquet
├── varm
│ └── LFs.parquet
└── varmap
├── prot.parquet
└── rna.parquet
21 directories, 46 files
Parquet files#
High-dimensional objects that were NumPy arrays or pandas DataFrames in memory are now stored in individual Parquet files. Moreover, serialising these objects largely relies on pyarrow.
[15]:
pq.read_table(file / "obs.parquet").to_pandas().head()
[15]:
|  | louvain | leiden | leiden_wnn | celltype |
|---|---|---|---|---|
| CAGCCAGGTCTCGACG-1 | 1 | 1 | 0 | CD4+ naïve T |
| TTCTTCCTCTCGGTAA-1 | 1 | 1 | 0 | CD4+ naïve T |
| CGGGTCAAGAGAGGTA-1 | 1 | 1 | 1 | CD4+ naïve T |
| TACCCGTCATAATCCG-1 | 1 | 1 | 1 | CD4+ naïve T |
| TGGGTTAGTGAATTAG-1 | 2 | 2 | 1 | CD4+ naïve T |
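Because each component is a standalone Parquet file, its schema can be inspected without loading any data; pq.read_schema only touches the file footer:

# column names and Arrow types, read from the Parquet metadata alone
pq.read_schema(file / "obs.parquet")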
Embeddings, e.g. UMAP:
[16]:
pq.read_table(file / "obsm/X_umap.parquet")
[16]:
pyarrow.Table
1: float
2: float
----
1: [[9.66083,10.850063,12.345492,12.538155,12.528112,...,-1.9723414,-1.5232953,-1.7778809,-1.7384567,-1.5827981]]
2: [[5.779709,6.7745605,5.6794906,6.2592025,4.833545,...,-2.152379,-2.1579309,-2.3050048,-2.199118,-2.308012]]
[17]:
pq.read_table(file / "obsm/X_umap.parquet").to_pandas().to_numpy()[:10]
[17]:
array([[ 9.66083 , 5.779709 ],
[10.850063 , 6.7745605],
[12.345492 , 5.6794906],
[12.538155 , 6.2592025],
[12.528112 , 4.833545 ],
[13.152369 , 5.566309 ],
[ 9.910739 , 9.146889 ],
[ 8.418465 , 5.2574263],
[12.064896 , 5.3718014],
[11.305964 , -7.5169115]], dtype=float32)
The dense matrix with normalised values contains protein names (.mod['prot'].var_names) as column names:
[18]:
pq.read_table(file / "mod/prot/X.parquet").to_pandas().tail()
[18]:
|  | CD3_TotalSeqB | CD4_TotalSeqB | CD8a_TotalSeqB | CD11b_TotalSeqB | CD14_TotalSeqB | CD15_TotalSeqB | CD16_TotalSeqB | CD19_TotalSeqB | CD20_TotalSeqB | CD25_TotalSeqB | ... | CD86_TotalSeqB | CD127_TotalSeqB | CD137_TotalSeqB | CD197_TotalSeqB | CD274_TotalSeqB | CD278_TotalSeqB | CD335_TotalSeqB | PD-1_TotalSeqB | HLA-DR_TotalSeqB | TIGIT_TotalSeqB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 406 | -1.075961 | 1.245624 | 0.258909 | 2.032457 | 0.591571 | 1.429440 | 0.137301 | 2.361204 | 1.693235 | 1.284448 | ... | 4.252487 | 0.327570 | 1.309700 | 0.107906 | 0.710895 | 0.206146 | 2.295595 | 2.710157 | 13.345463 | 2.335479 |
| 407 | -0.038257 | 8.728498 | 2.102803 | 1.870541 | 0.782583 | 2.021527 | 2.680047 | 1.295704 | 1.716122 | 1.112039 | ... | 1.883393 | 0.701129 | 2.403748 | 2.701393 | 1.573005 | 3.247963 | 2.305770 | 2.435769 | 13.624533 | 2.287940 |
| 408 | 0.254669 | 10.289223 | 1.933862 | 2.539970 | 0.168840 | 2.607727 | 2.538013 | 0.113553 | 3.241333 | 2.223520 | ... | 6.140669 | 1.847788 | 1.193516 | 0.819007 | 1.382885 | 0.913976 | 2.043625 | 0.751728 | 14.481518 | 2.834265 |
| 409 | 0.243213 | 13.866776 | 0.072664 | 1.043179 | -0.051139 | 4.473774 | 1.771171 | 1.584222 | 4.893187 | 4.736353 | ... | 8.189964 | 2.338869 | 2.845908 | 1.140788 | -0.275009 | 2.504928 | 4.931875 | 3.187832 | 19.352989 | 0.957393 |
| 410 | 2.687545 | 14.526915 | 3.441904 | 1.239545 | 2.281008 | 2.358840 | 1.231699 | 1.176513 | 3.691090 | 1.004248 | ... | 11.924853 | 1.308427 | 0.306832 | 3.636075 | 1.498553 | 3.189132 | 0.258857 | 4.084857 | 18.031036 | 2.995944 |

5 rows × 29 columns
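Since the protein names are the column names, a subset of proteins can be read without loading the rest of the matrix. A small sketch using the columns argument of pq.read_table, with protein names taken from the table above:

# read only two protein columns from the dense matrix
pq.read_table(
    file / "mod/prot/X.parquet",
    columns=["CD3_TotalSeqB", "CD4_TotalSeqB"],
).to_pandas().head()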
The sparse count matrix is stored in the COO format:
[19]:
pq.read_table(file / "mod/prot/layers/counts.parquet").to_pandas().tail()
[19]:
|  | row | col | data |
|---|---|---|---|
| 11098 | 410 | 23 | 2.0 |
| 11099 | 410 | 24 | 6.0 |
| 11100 | 410 | 26 | 6.0 |
| 11101 | 410 | 27 | 888.0 |
| 11102 | 410 | 28 | 3.0 |
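If a scipy sparse matrix is needed outside of the readers shown below, the COO triplets can be reassembled by hand. A minimal sketch, assuming the 411 × 29 shape of the prot modality in this dataset:

from scipy.sparse import coo_matrix

triplets = pq.read_table(file / "mod/prot/layers/counts.parquet").to_pandas()
# rebuild the sparse matrix from its (row, col, data) triplets;
# the shape is taken from the prot modality (411 cells × 29 proteins)
counts = coo_matrix(
    (triplets["data"], (triplets["row"], triplets["col"])), shape=(411, 29)
)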
For example, we can count proteins with a simple SQL query:
[20]:
import duckdb
duckdb.query(f'''
SELECT COUNT(*)
FROM '{file}/mod/prot/var.parquet'
''').fetchall()
[20]:
[(29,)]
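The same approach extends to richer queries over the metadata; for example, counting cells per cell type directly from obs.parquet (the celltype column was shown above):

duckdb.query(f'''
SELECT celltype, COUNT(*) AS n_cells
FROM '{file}/obs.parquet'
GROUP BY celltype
ORDER BY n_cells DESC
''').fetchall()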
If you want to read more about the Python API of DuckDB, you can find an intro here.
JSON files#
Simpler objects such as scalar values are serialised into JSON files.
[21]:
with open(file / "mod/rna/uns.json") as uns_file:
uns = json.load(uns_file)
print(json.dumps(uns, indent=4))
{
"hvg": {
"flavor": "seurat"
},
"leiden": {
"params": {
"n_iterations": -1,
"random_state": 0,
"resolution": 0.75
}
},
"neighbors": {
"connectivities_key": "connectivities",
"distances_key": "distances",
"params": {
"method": "umap",
"metric": "euclidean",
"n_neighbors": 15,
"random_state": 0
}
},
"pca": {
"params": {
"use_highly_variable": true,
"zero_center": true
}
},
"rank_genes_groups": {
"params": {
"corr_method": "benjamini-hochberg",
"groupby": "leiden",
"method": "t-test_overestim_var",
"reference": "rest",
"use_raw": true
}
},
"umap": {
"params": {
"a": 0.5830300205483709,
"b": 1.334166992455648,
"random_state": 11
}
}
}
Reading#
The AnnData reader will return an in-memory AnnData object, and the MuData reader will return an in-memory MuData object:
[22]:
read_mudata(file)
[22]:
MuData object with n_obs × n_vars = 411 × 56
  obs:  'louvain', 'leiden', 'leiden_wnn', 'celltype'
  var:  'feature_types', 'gene_ids', 'highly_variable'
  obsm: 'X_wnn_umap', 'X_umap', 'X_mofa_umap', 'X_mofa'
  varm: 'LFs'
  obsp: 'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'
  2 modalities
    prot: 411 x 29
      var:    'gene_ids', 'feature_types', 'highly_variable'
      uns:    'neighbors', 'pca', 'umap'
      obsm:   'X_pca', 'X_umap'
      varm:   'PCs'
      layers: 'counts'
      obsp:   'connectivities', 'distances'
    rna: 411 x 27
      obs:  'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:  'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:  'hvg', 'leiden', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'celltype_colors', 'leiden_colors'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'connectivities', 'distances'
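Since each modality directory inside the .pqdata folder follows the AnnData layout shown earlier, it is plausible that a single modality can be loaded on its own with read_anndata; this is a sketch based on the on-disk structure rather than on documented behaviour:

# assumption: read_anndata accepts any AnnData-shaped .pqdata directory,
# such as the rna modality inside the MuData container
adata = read_anndata(file / "mod/rna")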
File reading speed#
Read speeds are not compromised and are actually improved:
[23]:
import mudata
[24]:
%%timeit
mudata.read("data/pbmc5k_citeseq/minipbcite.h5mu")
93.9 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[25]:
%%timeit
read_mudata(file)
45.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Together, these methods in pqdata make the Parquet-based serialisation compatible with the whole scverse ecosystem while also providing benefits for out-of-memory analytics on large datasets.