Zhuang laboratory mouse brain example
Overview
The goal of this example is to run merfish3d-analysis on an existing 2D MERFISH dataset generated by the Zhuang laboratory at Harvard. They graciously deposited their nearly raw data on the Brain Image Library (BIL), and we can adapt the fully featured functionality of our package to re-process their data. This is a 22-bit MERFISH experiment that has been pre-registered across rounds within each tile.
Preliminaries
You need to make sure you have a working Python environment with merfish3d-analysis properly installed, Baysor properly installed, and the Zhuang laboratory mouse brain 2D MERFISH dataset downloaded. The dataset is ~760 GB, and you will need roughly another 700 GB of temporary space to create the qi2labDataStore structure we use to perform tile registration, global registration, segmentation, pixel decoding, filtering, and cell assignment. We are only going to analyze one of their samples, specifically mouse1_sample1_raw.
- merfish3d-analysis
- Baysor
- Raw 2D MERFISH data. Download the additional_files/ folder, the mouse1_sample1_raw/ folder, and processed_data/spots_mouse1sample1.csv.
Downloading the data
All of the required code to process this data is in the examples/zhuang_lab folder of the merfish3d-analysis GitHub repo. After downloading the data from BIL, you should have the following data structure on disk:
/mop
├── mouse1_sample1_raw/
│   ├── aligned_images1.tif
│   ├── aligned_images2.tif
│   ├── ...
│   └── aligned_images447.tif
├── additional_files/
│   ├── codebook.csv
│   ├── data_organization_raw.csv
│   ├── microscope.json
│   └── fov_positions/
│       └── mouse1_sample1.txt
└── processed_data/
    └── spots_mouse1sample1.csv
Processing non-qi2lab data
Because this data is generated by a different custom microscope design with a unique microscope control package and is already pre-processed, we have to manually write most of the data conversion steps. Please see the DataStore page for more information on the key parameters that are required to create a qi2labDataStore.
The data in this BIL dataset consists of six small sections of mouse brain, all separated on the slide. Looking at the additional_files/fov_positions/mouse1_sample1.txt file, we can find where there are discontinuous jumps in the stage positions. Those jumps indicate the stage moved to a new brain section; for example, lines 58-61 in the file show such a jump. So we know tiles 1 through 59 are the first continuous tissue slice area.
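The jump detection is simple to sketch. Here is a minimal version, assuming the positions file parses into an (n_tiles, 2) array of micron-scale stage coordinates; the function name and threshold are illustrative, not part of the package:

```python
import numpy as np

def find_section_breaks(positions: np.ndarray, max_step_um: float = 1000.0) -> np.ndarray:
    """Return indices of tiles that start a new tissue section.

    positions: (n_tiles, 2) array of stage coordinates in microns.
    max_step_um: illustrative threshold -- neighboring tiles within one
    section are separated by roughly one field of view, while jumps to a
    new section are much larger.
    """
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return np.where(steps > max_step_um)[0] + 1

# toy example: two sections separated by a large stage jump
positions = np.array(
    [[0.0, 0.0], [200.0, 0.0], [400.0, 0.0], [10000.0, 8000.0], [10200.0, 8000.0]]
)
print(find_section_breaks(positions))  # [3]
```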
A key issue with this data is that the stage direction and camera orientation are flipped, which can be quite confusing when trying to figure out how the tiles are spatially related to each other. The first step is to load the first two tiles with the known spatial transformations by setting line 167 in 00a_test_image_orientation.py to n_tiles=2. This will generate two plots, showing the result if we interpret the positions in the additional_files/fov_positions/mouse1_sample1.txt file as yx or xy order.
From the overlap area, it appears that the xy order interpretation (on the right) is correct. To check more thoroughly, we can set line 167 in 00a_test_image_orientation.py to n_tiles=59. This will plot the entire first tissue slice.
The check with the first tissue slice on the slide confirms that we need to interpret the stage positions in xy order, while we know Python interprets the data in the tif files as yx order. To account for this and correctly register the data, we have to swap the xy axes. This can be accomplished by swapping the last two axes of the tif, or by swapping the xy positions in the array generated from additional_files/fov_positions/mouse1_sample1.txt.
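A minimal sketch of both options, assuming the tile data loads as a (z, y, x) array and the positions file parses into an (n_tiles, 2) array (the stand-in arrays below are for illustration only):

```python
import numpy as np

# a tile as Python reads it from the tif: axes are (z, y, x)
tile = np.zeros((7, 64, 32), dtype=np.uint16)

# option 1: swap the last two axes of the image data to match the stage's xy order
tile_swapped = np.swapaxes(tile, -2, -1)
print(tile_swapped.shape)  # (7, 32, 64)

# option 2: leave the images alone and instead swap the columns of the
# stage-position array parsed from mouse1_sample1.txt (here a stand-in array)
stage_positions_xy = np.array([[100.0, 50.0], [300.0, 50.0]])
stage_positions_yx = stage_positions_xy[:, ::-1]
```

Either option works; the important thing is to apply exactly one of them, since doing both cancels out.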
Another difference in this data is that the spacing between z-planes is 1.5 microns, quite a bit larger than the Nyquist sampling of ~0.3 microns given the objective NA. It does not make sense to perform 3D deconvolution or decoding at such coarse sampling, so each z-plane is deconvolved and decoded independently of the surrounding planes. At the end, we collapse decoded transcripts that show up in adjacent z-planes into the transcript with the largest brightness.
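The collapse step can be sketched as follows. The spot table, column names, and xy tolerance are hypothetical stand-ins for the package's actual data structures; the idea is to group runs of adjacent z-planes for the same gene at approximately the same xy position and keep the brightest member of each run:

```python
import pandas as pd

# hypothetical decoded-spot table; column names are stand-ins, not the
# package's actual schema
spots = pd.DataFrame({
    "gene": ["Slc17a7", "Slc17a7", "Slc17a7", "Gad2"],
    "z_plane": [3, 4, 5, 7],
    "x": [10.1, 10.2, 10.1, 50.0],
    "y": [20.0, 20.1, 20.0, 60.0],
    "brightness": [800.0, 1200.0, 900.0, 500.0],
})

def collapse_z_duplicates(df: pd.DataFrame, xy_tol: float = 0.5) -> pd.DataFrame:
    """Keep the brightest spot from each run of adjacent z-planes."""
    df = df.copy()
    # bin xy so spots within ~xy_tol land in the same bin
    df["xb"] = (df["x"] / xy_tol).round().astype(int)
    df["yb"] = (df["y"] / xy_tol).round().astype(int)
    df = df.sort_values(["gene", "xb", "yb", "z_plane"])
    # a new run starts whenever the z-plane gap within a gene/bin exceeds 1
    new_run = df.groupby(["gene", "xb", "yb"])["z_plane"].diff().fillna(2).gt(1)
    df["run"] = new_run.cumsum()
    keep = df.groupby("run")["brightness"].idxmax()
    return df.loc[keep].drop(columns=["xb", "yb", "run"])

collapsed = collapse_z_duplicates(spots)
print(len(collapsed))  # 2: the brightest Slc17a7 spot and the lone Gad2 spot
```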
Finally, much of the metadata we need (refractive index, numerical aperture, wavelengths, camera specifications, etc.) is only easily available via the publication, Spatially resolved cell atlas of the mouse primary motor cortex by MERFISH. We have noted in the conversion script where we had to look up these values or find values from equipment suppliers.
While we have already created the data conversion code for this example, please reach out with questions if the process is not clear.
Processing steps
For each of the Python files in the examples/zhuang_lab directory, you will need to scroll to the bottom and replace the path with the correct path. For example, in 01_convert_to_datastore.py you'll want to change this section:
if __name__ == "__main__":
    root_path = Path(r"/path/to/download/mop")
    baysor_binary_path = Path(
        r"/path/to/Baysor/bin/baysor/bin/./baysor"
    )
    baysor_options_path = Path(
        r"/path/to/merfish3d-analysis/examples/zhuang_lab/zhuang_mouse.toml"
    )
    julia_threads = 20  # replace with number of threads to use.
    hot_pixel_image_path = None

    convert_data(
        root_path=root_path,
        baysor_binary_path=baysor_binary_path,
        baysor_options_path=baysor_options_path,
        julia_threads=julia_threads,
        hot_pixel_image_path=hot_pixel_image_path,
    )
For all of the files in processing_code, you'll set the root_path to root_path = Path(r"/path/to/download/mop"). The package automatically places the datastore within that directory, at /path/to/download/mop/qi2labdatastore.
Once that is done, you can run 01_convert_to_datastore.py and 02_register_and_deconvolve.py without any interaction. Depending on your computing hardware, you should expect ~12 hours for 01_convert_to_datastore.py and about a week for 02_register_and_deconvolve.py. The second file is extremely compute intensive, because it performs 2D deconvolution on each z-plane of the 3-channel z-stacks at every tile, registers the polyDT channel back to the first round, runs U-FISH on the MERFISH bits in each tile, and finally performs global registration of all polyDT tiles in the first round. On a 12-core/24-thread AMD CPU workstation with 128 GB RAM and an Nvidia RTX 3090 24 GB GPU, this step takes about 8 days to complete for all 441 tiles of shape [40,7,2048,2048] in the dataset.
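To illustrate the per-plane deconvolution idea (this is not the package's GPU implementation, and the PSF here is a toy Gaussian), a minimal Richardson-Lucy loop applied independently to each z-plane might look like:

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy_2d(image: np.ndarray, psf: np.ndarray, n_iter: int = 20) -> np.ndarray:
    """Minimal CPU Richardson-Lucy deconvolution of one plane; illustration only."""
    psf = psf / psf.sum()
    psf_mirror = psf[::-1, ::-1]
    estimate = np.full_like(image, image.mean(), dtype=float)
    for _ in range(n_iter):
        blurred = fftconvolve(estimate, psf, mode="same")
        ratio = image / np.clip(blurred, 1e-12, None)
        estimate *= fftconvolve(ratio, psf_mirror, mode="same")
    return estimate

def deconvolve_stack(stack: np.ndarray, psf_2d: np.ndarray, n_iter: int = 20) -> np.ndarray:
    # each z-plane is treated as independent because of the 1.5 micron z spacing
    return np.stack([richardson_lucy_2d(plane, psf_2d, n_iter) for plane in stack])

# toy Gaussian PSF and random stack
yy, xx = np.mgrid[-3:4, -3:4]
psf = np.exp(-(xx**2 + yy**2) / 2.0)
stack = np.random.default_rng(0).random((4, 32, 32))
deconvolved = deconvolve_stack(stack, psf, n_iter=5)
print(deconvolved.shape)  # (4, 32, 32)
```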
Once 02_register_and_deconvolve.py is finished, you will need to create the correct cellpose settings. We have found that initially performing cellpose segmentation on a downsampled, maximum-Z-projected polyDT image is sufficient to seed Baysor for segmentation refinement.
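As a sketch of how that seeding image can be built (stride-based downsampling is used for brevity; an anti-aliased resize would be preferable in practice, and this is not the package's exact code):

```python
import numpy as np

def make_seed_image(polydt_stack: np.ndarray, downsample: int = 4) -> np.ndarray:
    """Maximum-project a (z, y, x) polyDT stack, then downsample in xy."""
    projection = polydt_stack.max(axis=0)
    return projection[::downsample, ::downsample]

# toy stack standing in for one registered polyDT tile
stack = np.random.default_rng(1).random((40, 256, 256)).astype(np.float32)
seed = make_seed_image(stack, downsample=4)
print(seed.shape)  # (64, 64)
```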
If the standard "cyto3" model does not work for your data, you may need to retrain the cellpose model and pass that to the code. Please see the API documentation for how to perform that step. Given satisfactory cellpose settings, fill them in at the bottom of 03_cell_segmentation.py:
if __name__ == "__main__":
    root_path = Path(r"/path/to/download/mop")

    cellpose_parameters = {
        'normalization': [0.5, 95.0],
        'blur_kernel_size': 2.0,
        'flow_threshold': 0.4,
        'cellprob_threshold': 0.0,
        'diameter': 36.57
    }

    run_cellpose(root_path, cellpose_parameters)
and then run 03_cell_segmentation.py. This should take only an hour or so to generate the initial 2D segmentation guess. Our package writes the cell outlines in the ImageJ ROI file format, in global coordinates.
Next, you'll run 04_pixeldecode_and_baysor.py to first optimize the pixel-based decoding parameters on a subset of tiles, then perform pixel-based decoding for all tiles, filter the data to limit false positives, remove overlapping spots in adjacent spatial tiles, and finally re-segment the cell boundaries in 3D using Baysor. This step should take ~0.5-1 week, depending on your hard disk and GPU configuration.
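The overlap-removal step can be sketched with a KD-tree over global coordinates. The radius, names, and keep-first policy below are illustrative; the package's actual criteria may differ:

```python
import numpy as np
from scipy.spatial import cKDTree

def drop_overlap_duplicates(
    coords: np.ndarray, tile_ids: np.ndarray, radius: float = 0.5
) -> np.ndarray:
    """Boolean mask keeping one copy of spots decoded in overlapping tiles.

    coords: (n, 2) global xy positions; tile_ids: (n,) source tile per spot.
    """
    keep = np.ones(len(coords), dtype=bool)
    for i, j in sorted(cKDTree(coords).query_pairs(radius)):
        # only collapse pairs that come from different tiles
        if tile_ids[i] != tile_ids[j] and keep[i] and keep[j]:
            keep[max(i, j)] = False
    return keep

# two tiles decode the same transcript near (0, 0); the third spot is unique
coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
tile_ids = np.array([0, 1, 1])
mask = drop_overlap_duplicates(coords, tile_ids)
print(mask)  # [ True False  True]
```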