Low-level access to .pqdata


Akin to the h5py and zarr libraries, pqdata implements a simple lower-level interface for accessing the contents of .pqdata directories.

Some imports first:

[1]:
import numpy as np
from pathlib import Path
from pyarrow import parquet as pa
[2]:
import pqdata

Prepare the data:

[3]:
data = Path("data")
[4]:
import mudatasets
import mudata
mudata.set_options(pull_on_update=False)
mdata = mudatasets.load("pbmc5k_citeseq", files=["minipbcite.h5mu"], data_dir=data, backed=False)
■ File minipbcite.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/minipbcite.h5mu
■ Checksum is validated (md5) for minipbcite.h5mu
■ Loading minipbcite.h5mu...
[5]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata"
pqdata.write_mudata(mdata, file)

open()

open() is a simple entry point:

[6]:
f = pqdata.open(file)
f
[6]:
ParquetStorage(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata)

“Opening” the file is fast as it doesn’t do much apart from remembering the location it is pointed at:

[7]:
f.path
[7]:
PosixPath('data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata')

open() returns an object that can be traversed with dictionary-style indexing.

E.g. individual tables can be reached:

[8]:
f["mod"]["rna"]["X"]
[8]:
ParquetArray(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna/X.parquet): shape (411,27), type (NKG7:float, KLRC2:float, GNLY:float, IGHM:float, IRF8:float, CD8B:float, CD79A:float, CD14:float, MS4A7:float, CCL5:float, FCGR3A:float, IL4R:float, IGHD:float, S100A8:float, LYZ:float, CD8A:float, FOXP3:float, IL2RA:float, TCL1A:float, TCF4:float, ITGAM:float, TRAC:float, IL7R:float, CST3:float, ITGB1:float, MS4A1:float, KLF4:float)

as well as collections:

[9]:
list(f["mod"]["rna"]["uns"])
[9]:
['umap',
 'leiden',
 'celltype_colors',
 'leiden_colors',
 'pca',
 'neighbors',
 'hvg',
 'rank_genes_groups']

Note that simple structures and scalars stored in JSON files are read into memory during traversal:

[10]:
f["mod"]["rna"]["uns"]["pca"]["params"]
[10]:
{'use_highly_variable': True, 'zero_center': True}
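
The traversal behaviour described above (opening only remembers a location, collections and tables stay lazy, JSON entries are loaded on access) can be illustrated with a minimal sketch. The `LazyNode` class below is hypothetical and is not pqdata's implementation; the directory layout it builds is likewise an assumption made for the demonstration, and the `.parquet` file is an empty placeholder that is never read.

```python
import json
import tempfile
from pathlib import Path


class LazyNode:
    """Hypothetical stand-in for a storage handle: it only remembers a path."""

    def __init__(self, path):
        self.path = Path(path)  # nothing is read from disk here

    def __getitem__(self, key):
        directory = self.path / key
        json_file = self.path / f"{key}.json"
        table_file = self.path / f"{key}.parquet"
        if directory.is_dir():
            return LazyNode(directory)  # collections stay lazy
        if json_file.is_file():
            # small structures and scalars are loaded eagerly
            return json.loads(json_file.read_text())
        if table_file.is_file():
            return LazyNode(table_file)  # tables are returned as path handles
        raise KeyError(key)

    def __iter__(self):
        # list child names without their file extensions
        return iter(sorted(p.stem for p in self.path.iterdir()))

    def __repr__(self):
        return f"LazyNode({self.path})"


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny, made-up hierarchy to traverse.
    pca = Path(tmp) / "uns" / "pca"
    pca.mkdir(parents=True)
    (pca / "params.json").write_text(json.dumps({"zero_center": True}))
    (Path(tmp) / "X.parquet").touch()  # empty placeholder, never opened

    root = LazyNode(tmp)
    params = root["uns"]["pca"]["params"]  # a plain dict, read from JSON
    x_handle = root["X"]                   # a handle, no table data loaded
    children = list(root)
```

In this sketch only the JSON branch touches file contents, which is why constructing and indexing handles stays cheap regardless of table size.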

Lightweight objects store both the file system path of the data (.path) and its location within the .pqdata hierarchy (.root and .name):

[11]:
x = f["mod"]["rna"]["X"]  # reuse the handle; nested double quotes in f-strings need Python 3.12+
print("f['mod']['rna']['X']")
print(f"  root: {x.root}")
print(f"  name: {x.name}")
print(f"  path: {x.path}")
f['mod']['rna']['X']
  root: /rna
  name: /rna/X
  path: data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna/X.parquet
[12]:
table = pa.read_table(f["mod"]["rna"]["obsm"]["X_umap"].path)
umap_embedding = table.to_pandas().to_numpy()

Generally, the original type of the object that was used to make the table is stored in .schema.metadata:

[13]:
table.schema.metadata
[13]:
{b'array': b'{"shape": [411, 2], "class": {"module": "numpy", "name": "ndarray"}}'}
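
As a small illustration of how this metadata can be interpreted, the sketch below decodes the entry shown in the output above using only the standard library. The `describe_array` helper is hypothetical and not part of pqdata. It assumes the `b'array'` key and JSON layout seen in this one example; other object types may be described differently.

```python
import json

# Metadata literal copied from the output above.
# Parquet schema metadata stores both keys and values as bytes.
metadata = {
    b"array": b'{"shape": [411, 2], "class": {"module": "numpy", "name": "ndarray"}}'
}


def describe_array(metadata):
    """Return (shape, qualified class name) from the b'array' entry (hypothetical helper)."""
    info = json.loads(metadata[b"array"])  # json.loads accepts bytes directly
    cls = info["class"]
    return tuple(info["shape"]), f'{cls["module"]}.{cls["name"]}'


shape, class_name = describe_array(metadata)
print(shape, class_name)
```

The recovered shape and class name are the information a reader needs to turn the flat table back into an object of the original type.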

open() can also work on modalities embedded inside multimodal containers:

[14]:
rna = pqdata.open(f["mod"]["rna"].path)
rna
[14]:
ParquetStorage(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna)
[15]:
table = pa.read_table(rna["obsm"]["X_umap"].path)
umap_embedding = table.to_pandas().to_numpy()

More generally, open() works on any part of the file hierarchy:

[16]:
rna_obsm = pqdata.open(rna["obsm"].path)
rna_obsm
[16]:
ParquetStorage(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna/obsm)
[17]:
table = pa.read_table(rna_obsm["X_umap"].path)
umap_embedding = table.to_pandas().to_numpy()

read_elem()

Libraries like anndata store custom encoding types with the schema defined by the specification and provide the read_elem() interface to access individual objects like arrays, matrices, tables, etc.

With pqdata, such objects are stored as tables in Parquet files*, and the original object class is preserved in the metadata. There is therefore no need to define custom encoding types and schemas, since the Parquet format handles this.

* simpler entities like scalars are stored in JSON files

[18]:
from pqdata.core import read_elem
[19]:
rna_umap = read_elem(rna_obsm["X_umap"])
assert np.array_equal(umap_embedding, rna_umap)