Why subsample?
In Single Cell Portal, all Cluster files with more than 1000 individual cells are subsampled immediately after being ingested into the portal database, and the subsampled data is stored alongside the full-resolution data. This improves visualization performance: attempting to render hundreds of thousands (or millions) of individual points in a browser window can take a while, depending on connection speed and the resources available on your computer (RAM, CPU/video processing power, etc.). Additionally, if millions of points were actually plotted on your screen, you would only see a small fraction of them, as most points would be plotted over by other points.
To show cluster-based visualizations more quickly, cluster files are subsampled at several thresholds (1000, 10000, 20000, and 100000 points), depending on the number of cells in the original cluster file. Users see a fast initial plot and, if greater resolution is desired, can load higher thresholds or visualize all points by selecting "All Cells" from the "Subsampling threshold" dropdown menu.
How are cluster files subsampled?
Cluster files are subsampled randomly within provided metadata values to ensure that all groups of cells annotated with that metadata are present after subsampling. To achieve this, there are three required elements for subsampling:
- a cluster file,
- an annotation from the study's metadata file, or from the cluster file itself (a map of cells to individual annotation values),
- a total number of cells to sample (the threshold)
For any combination of cluster file, metadata (annotation), and threshold, the subsampling methodology is the same:
1. Dividing cells into groups
All of the cells (and their associated coordinates) are separated into distinct populations based on the annotation. For instance, if the metadata is called "organ" and has 4 distinct values (brain, heart, lung, and liver), all of the cells - along with their x, y, and z coordinates - are separated into those 4 groups. If the annotation is a continuous value (indicated as numeric in the cluster file), the values are ordered and separated into 20 different windows. In both cases, this ensures the entire range of possible values is represented when sampling from those groups into the subsampled bins.
2. Computing the sampling size
Once the cells/points are in groups, we can determine the number of cells we want to sample from each group population for the subsampled bin. This is simply the sampling threshold divided by the number of groups. From our earlier example of "organ", if we are using a threshold of 1000, this works out to 1000 cells / 4 groups or 250 cells in each bin. In the case of the continuous values, the sampling size would be 1000 cells / 20 windows or 50 cells in each bin.
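The arithmetic above amounts to an integer division; a minimal sketch (the function name is illustrative, not from the portal's source):

```python
def initial_sampling_size(threshold, num_groups):
    """Cells to draw per bin: the threshold split evenly across the groups."""
    return threshold // num_groups

# "organ" example: a 1000-cell threshold over 4 groups -> 250 cells per bin.
assert initial_sampling_size(1000, 4) == 250
# Continuous annotation: 20 windows -> 50 cells per bin.
assert initial_sampling_size(1000, 20) == 50
```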
3. Sorting groups by size
Before subsampling begins, the last step is to sort the bins by group size, from the smallest to the largest group. This ensures that smaller groups are well represented, and determines the order in which each group is sampled. For instance, in our example, let us say that the cluster file has 2000 total cells, and the original "organ" metadata divides the cells as follows:
- Brain: 500 cells
- Heart: 400 cells
- Lung: 100 cells
- Liver: 1000 cells
Here the order of sampling would be (from the smallest to the largest group): Lung, Heart, Brain, and Liver. For continuous values, the windows are identical in size when the total number of cells is evenly divisible by 20 (the number of windows); otherwise, one window is slightly larger, as it contains the remainder of the division.
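Using the example sizes above, determining the sampling order is a sort on group size (a sketch, not the portal's actual code):

```python
# Sizes from the "organ" example; sorting ascending by group size
# gives the order in which groups are sampled.
group_sizes = {"Brain": 500, "Heart": 400, "Lung": 100, "Liver": 1000}
sampling_order = sorted(group_sizes, key=group_sizes.get)
# smallest to largest: ['Lung', 'Heart', 'Brain', 'Liver']
```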
4. Sample groups into bins
With this information, we can begin sampling cells from groups into bins. This is achieved by shuffling the cells/coordinates within a group and then selecting cells up to the requested sampling size. If the requested sampling size is larger than the entire group being sampled, then the entire group of cells is used.
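The shuffle-and-select step for a single group might look like the following sketch, using NumPy (which the pipeline leverages); the function name and signature are assumptions for illustration:

```python
import numpy as np

def sample_group(cells, size, rng):
    """Shuffle a group, then take up to `size` cells; a group smaller than
    `size` is used in its entirety."""
    shuffled = rng.permutation(np.asarray(cells))
    return shuffled[:size].tolist()

rng = np.random.default_rng(42)
picked = sample_group([f"c{i}" for i in range(10)], 4, rng)  # 4 of 10 cells
whole = sample_group(["a", "b"], 4, rng)                     # whole group used
```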
5. Recompute sampling size and repeat
Some groups may contain fewer cells than the number required to sample evenly across all bins. To compensate for the difference between a group's actual size and the number of cells needed for its bin, we sample from the smallest group to the largest and, if needed, sample more cells from later (larger) groups to make sure we hit the exact threshold requested. Specifically, after each bin is sampled, we calculate how many cells are still needed to reach the threshold and distribute that number evenly over the remaining bins. In our current example, the first group (Lung) had only 100 cells, while the requested bin size was 250. We therefore sampled the full group and now have 900 cells remaining to sample with only 3 groups left, making the new sampling size 300 cells per bin (900 cells / 3 remaining groups). The next group is then sampled, and we recompute and repeat until we reach the last group (for which we simply sample the number of cells left to meet the threshold). In the end, our example is sampled as follows:
- Lung: 100 cells, sampling size: 250 (1000 / 4), 100 sampled, 900 left
- Heart: 400 cells, sampling size: 300 (900 / 3), 300 sampled, 600 left
- Brain: 500 cells, sampling size: 300 (600 / 2), 300 sampled, 300 left
- Liver: 1000 cells, sampling size: 300 (300 / 1), 300 sampled, 0 left
Total subsampled cells: 1000
This process allows the portal to compensate dynamically for the actual size of cell groups, distributing the remaining sampling over the remaining bins in real time while ensuring we sample exactly the threshold requested.
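The full loop described in steps 1-5 can be sketched as follows. This is an illustrative reimplementation under the assumptions above, not the portal's actual `subsample.py` code; function and variable names are invented for the example:

```python
import random

def subsample(groups, threshold, seed=0):
    """Sample groups smallest-first, recomputing the per-bin sampling size
    after each group so the total hits the threshold exactly."""
    rng = random.Random(seed)
    sampled = {}
    remaining = threshold
    ordered = sorted(groups.items(), key=lambda kv: len(kv[1]))
    for i, (name, cells) in enumerate(ordered):
        bins_left = len(ordered) - i
        bin_size = remaining // bins_left        # recomputed sampling size
        shuffled = list(cells)
        rng.shuffle(shuffled)
        sampled[name] = shuffled[:min(bin_size, len(cells))]
        remaining -= len(sampled[name])
    return sampled

groups = {
    "Brain": [f"b{i}" for i in range(500)],
    "Heart": [f"h{i}" for i in range(400)],
    "Lung":  [f"l{i}" for i in range(100)],
    "Liver": [f"v{i}" for i in range(1000)],
}
result = subsample(groups, 1000)
# Matches the worked example: Lung 100, Heart 300, Brain 300, Liver 300.
```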
Implementation notes
In the case where the metadata being viewed comes from the study's metadata file, the metadata values themselves do not need to be subsampled: the data is already stored in key-value form, which makes it easy to look up the values for the subsampled cells of a cluster file at render time.
However, if the metadata being viewed is from the cluster file directly (e.g. extra columns of annotation values after the coordinates), these values are also sampled along with the cells and coordinates. This is due to how clustering data is represented in the portal database (arrays of values like coordinates, cells, or annotations, rather than individual values).
Source code
The source code used for subsampling is part of the Single Cell Portal Ingest Pipeline, which is Python-based and deployed via a Docker container launched through the Google Pipelines API once a file has been uploaded to the portal. Subsampling methods can be found in the subsample.py module, which leverages the NumPy and pandas libraries to efficiently bin and sample data.