接着上文继续学习 Anndata 的相关语法, 参考网页
AnnData
is specifically designed for matrix-like data. By this we mean that we have n
observations, each of which can be represented as d
-dimensional vectors, where each dimension corresponds to a variable or feature. Both the rows and columns of this matrix are special in the sense that they are indexed.
For instance, in scRNA-seq data, each row corresponds to a cell with a barcode, and each column corresponds to a gene with a gene id. Furthermore, for each cell and each gene we might have additional metadata, like (1) donor information for each cell, or (2) alternative gene symbols for each gene.
# Initializing AnnData
1 | counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32) |
# Adding aligned metadata
1 | ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs)) |
# Observation / variable-level matrices
We might also have metadata at either level that has many dimensions to it, such as a UMAP
embedding of the data. For this type of metadata, AnnData has the .obsm/.varm
attributes. We use keys to identify the different matrices we insert. The restriction of .obsm/.varm
are that .obsm
matrices must length equal to the number of observations as .n_obs and .varm
matrices must length equal to .n_vars
. They can each independently have different number of dimensions.
1 | adata.obsm["X_umap"] = np.random.normal(0,1,size=(adata.n_obs,2)) |
A few more notes about .obsm/.varm
The “array-like” metadata can originate from a Pandas DataFrame, scipy sparse matrix, or numpy dense array.
When using scanpy, their values (columns) are not easily plotted, where instead items from .obs are easily plotted on, e.g., UMAP plots.
# Unstructured metadata
AnnData has .uns
, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.
1 | adata.uns["random"] = [1, 2, 3] |
# Layers
Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let’s log transform the original data and store it in a layer:
# Conversion to DataFrame
We can also ask AnnData to return us a DataFrame from one of the layers:
1 | adata.to_df(layer="log_transformed") |
We see that the .obs_names/.var_names are used in the creation of this Pandas object.
# Writing the results to disk
AnnData comes with its own persistent HDF5-based file format: h5ad. If string columns with small number of categories aren’t yet categoricals, AnnData will auto-transform to categoricals.
1 | adata.write('my_results.h5ad', compression="gzip") |
# Views and copies
Imagine that the observations come from instruments characterizing 10 readouts in a multi-year study with samples taken from different subjects at different sites. We’d typically get that information in some format and then store it in a DataFrame:
1 | obs_meta = pd.DataFrame({ |
This is how we join the readout data with the metadata. Of course, the first argument of the following call for X could also just be a DataFrame.
1 | adata = ad.AnnData(adata.X, obs=obs_meta, var=adata.var) |
1 | # View |
# Partial reading of large data
If a single .h5ad
is very large, you can partially read it into memory by using backed mode:
1 | adata = ad.read('my_results.h5ad', backed='r') |
If you do this, you’ll need to remember that the AnnData object has an open connection to the file used for reading:
1 | adata.filename |
As we’re using it in read-only mode, we can’t damage anything. To proceed with this tutorial, we still need to explicitly close it:
1 | adata.file.close() |
As usual, you should rather use with statements to avoid dangling open files (up-coming feature).
Manipulating the object on disk is possible, but experimental for sparse data. Hence, we leave it out of this tutorial.
上面包含了一些非常基础的操作,考虑到时间关系和后面的内容有些难,我先去学习一下 scanpy 的操作流程