How To

The sabatini-datajoint-pipeline uses DataJoint for Python.

The workflow is based on relational principles and makes it simple to keep track of the data provenance and to query the data. If you are new to DataJoint, we recommend getting started by learning about the principles and foundations that make up DataJoint. More information can be found in the DataJoint documentation.

We can run the workflow using the provided docker containers (for more information, see Worker Deployment (WSL)), or we can run it locally using the provided Jupyter notebooks. These notebooks provide a good starting point and can be modified to fit your needs. Just remember to check that your kernel is set to the sabatini-datajoint kernel.

Setting up your data directories

Your data /Inbox directory structure will need to be set up like the following:

Subject1
├── Session1
│   ├── Imaging
│   │   └── scan0
│   │       ├── 00001.tif
│   │       ├── 00002.tif
│   │       └── …
│   ├── Photometry
│   │   ├── timeseries*.mat; data*.mat; .tdt
│   │   └── .toml
│   ├── Behavior
│   │   ├── .toml
│   │   └── .parquet, .csv
│   ├── Ephys
│   │   └── .bin, .lf, .meta
│   ├── dlc_projects
│   │   └── PROJECT_PATH
│   └── dlc_behavior_videos
│       └── .avi
└── Session2
    └── …

Note that the Subject is the top level directory, and all other data types are nested. You do not need to have all data types for each session.
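Before inserting anything into the database, it can help to confirm what is actually on disk. The following is a minimal sketch, not part of the pipeline: `summarize_inbox` and `DATA_TYPES` are hypothetical names, and the sketch assumes that every sub-directory of a subject folder is a session.

```python
import tempfile
from pathlib import Path

# Data-type folder names taken from the layout above.
DATA_TYPES = ("Imaging", "Photometry", "Behavior", "Ephys")


def summarize_inbox(inbox):
    """Map 'Subject/Session' to the data-type folders found inside it."""
    summary = {}
    for subject_dir in sorted(p for p in Path(inbox).iterdir() if p.is_dir()):
        # Assumption: every sub-directory of a subject is a session.
        for session_dir in sorted(p for p in subject_dir.iterdir() if p.is_dir()):
            found = [name for name in DATA_TYPES if (session_dir / name).is_dir()]
            summary[f"{subject_dir.name}/{session_dir.name}"] = found
    return summary


# Demo on a throwaway directory; point `inbox` at your real /Inbox instead.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    (inbox / "Subject1" / "Session1" / "Photometry").mkdir(parents=True)
    (inbox / "Subject1" / "Session1" / "Behavior").mkdir(parents=True)
    (inbox / "Subject1" / "Session2" / "Ephys").mkdir(parents=True)
    demo_summary = summarize_inbox(inbox)

print(demo_summary)
```

Sessions that list no data types are not an error here, since not every data type is required for each session.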

Inserting into the Subject and Session tables

To initiate the workflow, we will first need to populate the Subject, Session, and SessionDirectory tables. There are two ways to do this:

We can use a jupyter notebook/CLI/iPython terminal
We can use the DataJoint LabBook GUI.

Let’s start with the jupyter notebook/CLI/iPython terminal.

import os
if os.path.basename(os.getcwd()) == "notebooks": os.chdir("..")
import datajoint as dj
dj.config.load('dj_local_conf.json')
dj.conn()

from __future__ import annotations
import datajoint as dj
import pandas as pd
import numpy as np
import warnings
from pathlib import Path

from element_interface.utils import find_full_path
from workflow import db_prefix
from workflow.pipeline import session, subject, lab, reference
from workflow.utils.paths import get_raw_root_data_dir

subject.Subject.insert1(dict(subject='Subject1', 
                             sex='M', 
                             subject_birth_date='2021-10-01', 
                             subject_description='notes'))


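# The subject must match the entry inserted into subject.Subject above.
session_key = dict(subject='Subject1', session_id=1,
                   session_datetime='2021-10-07 12:00:00')

session.Session.insert1(session_key)

# session_dir mirrors the Subject/Session folders in your /Inbox directory.
session.SessionDirectory.insert1(dict(subject=session_key['subject'], session_id=session_key['session_id'], 
                                      session_dir='Subject1/Session1'))
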

Importantly, your Subject, Session, and SessionDirectory structure will need to match the directory structure in your /Inbox directory.
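One way to guard against a mismatch is to build the rows from the folder names and check that the folder exists before inserting. This is a minimal sketch under that assumption: `build_session_rows` is a hypothetical helper, not a pipeline function, and it only prepares dictionaries without touching the database.

```python
import tempfile
from pathlib import Path


def build_session_rows(inbox, subject, session_folder, session_id, session_datetime):
    """Return (session_row, session_dir_row) after checking the folder exists."""
    relative_dir = Path(subject) / session_folder
    if not (Path(inbox) / relative_dir).is_dir():
        raise FileNotFoundError(f"{relative_dir.as_posix()} does not exist under {inbox}")
    session_row = dict(subject=subject, session_id=session_id,
                       session_datetime=session_datetime)
    session_dir_row = dict(subject=subject, session_id=session_id,
                           session_dir=relative_dir.as_posix())
    return session_row, session_dir_row


# Demo on a throwaway directory standing in for /Inbox.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Subject1" / "Session1").mkdir(parents=True)
    session_row, session_dir_row = build_session_rows(
        tmp, "Subject1", "Session1", 1, "2021-10-07 12:00:00")
    try:
        build_session_rows(tmp, "Subject1", "Session9", 2, "2021-10-08 12:00:00")
        mismatch_detected = False
    except FileNotFoundError:
        mismatch_detected = True

print(session_dir_row)
```

The two returned dictionaries can then be passed to session.Session.insert1 and session.SessionDirectory.insert1 as in the snippet above.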

We can also do this through the DataJoint LabBook GUI.

Go to the DataJoint LabBook GUI and login with your credentials.
Now, you will be able to view all the schemas available to you on the left hand side.
Navigate to the sabatini_dj_subject schema and click on the Subject table.
Click Insert and fill out the form.

Then, navigate to the sabatini_dj_session schema and click on the Session table.
Click Insert and fill out the form.

Lastly, within the sabatini_dj_session schema, click on the SessionDirectory table and fill out the form.

After successful insertion of the Subject, Session, and SessionDirectory tables, we can then proceed with operating the rest of the pipeline.

Inserting into the Reference tables

The Reference tables are used to store information about the mouse (e.g. allele, zygosity) and surgical information (e.g. implants, viral injections). Importantly, each table uses the Subject_ID as the primary key, so keep track of your Subject_IDs.

There are two ways to do this:

Through the DataJoint LabBook GUI
Through our provided Python GUIs, which run locally.

Here, we will cover how to insert the data through our python provided GUIs. The GUIs will automatically login to the database using the credentials you provided in the dj_local_conf.json file.

To launch the GUIs, you will need to activate the sabatini-datajoint environment and run the following command:

python .\TOML-metafile-scripts\launch.py

A launcher window will then prompt you to start the GUI of your choice.

Make your selection then proceed to the appropriate section below.

The mouse reference table

A GUI will pop up.

Fill out the form and click Insert. If successful, you will see a message that says Inserted subject information for: YOUR_SUBJECT_NAME in the command window.

If you need to insert more than one mouse, click Insert another subject and repeat the process. Once finished, you can close the GUI by selecting Quit.

The virus reference table

A GUI will pop up.

The Virus table is a reference table that stores information about the virus used for the experiment and treats each hemisphere as a separate entry. For example, if you injected a virus into the left hemisphere and the right hemisphere, you will fill out the appropriate information then select Insert to Right Hemisphere and Insert to Left Hemisphere.

If successful, you will see Inserted to Right Hemisphere or Inserted to Left Hemisphere in the command window.

If you need to insert more than one surgery, click Insert another viral injection and repeat the process. Once finished, you can close the GUI by selecting Quit.

You can also save/load defaults for your experiment by selecting Save Defaults and Load Defaults. This will save your coordinates for repeated use.

The implant reference table

A GUI will pop up.

The Implant table is a reference table that stores information about the implant used for the experiment and treats each hemisphere as a separate entry. For example, if you implanted a fiber into the left hemisphere and the right hemisphere, you will fill out the appropriate information then select Insert to Right Hemisphere and Insert to Left Hemisphere.

If successful, you will see Right Implantation inserted into Implantation table or Left implantation inserted into Implantation table in the command window.

If you need to insert more than one surgery, click Insert another implant and repeat the process. Once finished, you can close the GUI by selecting Quit.

You can also save/load defaults for your experiment by selecting Save Defaults and Load Defaults. This will save your coordinates for repeated use.

Photometry and PhotometrySync pipeline

The photometry pipeline is designed to process photometry data from the Sabatini lab from various data acquisition streams (e.g. labjack or TDT). The pipeline is designed to be modular, so that you can run the pipeline on any combination of data types. For example, you can run the pipeline on only the photometry data, or you can run the pipeline on the photometry data and the behavior data to sync the photometry signal.

Input data

You will need a photometry timeseries collected from a labjack (e.g. matlab) or TDT system. You will also need to fill out meta information within the .toml file. The Creating a .toml file for photometry processing section explains how to do this.

Matlab/Labjack data naming conventions:

There are two processing streams for the matlab input data. The first is raw/unprocessed data, and the second is processed (demodulated) data.

To enter the raw/unprocessed data into the pipeline, the data must be named in the following format: data*.mat. To enter the processed data into the pipeline, the data must be named in the following format: *timeseries*.mat.

The .toml file must be named in the following format: *.toml.

Importantly, the transform field in the .toml file must be set to transform = spectrogram for the matlab data.
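The naming rules above can be checked before dropping files into the Photometry folder. The sketch below is illustrative only: `classify_photometry_file` is a hypothetical helper, and testing the processed pattern before the raw one (so that a name matching both counts as processed) is an assumption, not documented pipeline behavior.

```python
from fnmatch import fnmatchcase


def classify_photometry_file(filename):
    """Classify a file name using the matlab/labjack naming conventions."""
    if fnmatchcase(filename, "*.toml"):
        return "meta"
    # Assumption: a name matching both patterns is treated as processed.
    if fnmatchcase(filename, "*timeseries*.mat"):
        return "processed"
    if fnmatchcase(filename, "data*.mat"):
        return "raw"
    return "unrecognized"


for name in ["data001.mat", "mouse_timeseries_2.mat", "session.toml", "notes.txt"]:
    print(name, "->", classify_photometry_file(name))
```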

TDT data naming conventions:

To enter the TDT data into the pipeline, the data must have all of the associated TDT files *.t* and must also have a .toml file associated with it.

The TDT processing stream can handle two different transform types: transform = spectrogram or transform = hilbert.

transform = hilbert: uses Celia Beron’s processing pipeline.

transform = spectrogram: uses a python version of Bernardo Sabatini’s processing pipeline.

Behavior data naming conventions:

The behavior data must be named in the following format: *.parquet (for transform = spectrogram) or *.csv (for transform = hilbert).
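The transform and file-type rules from the naming conventions above can be collected in one lookup. This is a sketch of those rules only: the source labels "matlab" and "tdt" and the function name are hypothetical and do not come from the pipeline code.

```python
# Transforms each acquisition stream supports, per the conventions above.
ALLOWED_TRANSFORMS = {
    "matlab": {"spectrogram"},
    "tdt": {"spectrogram", "hilbert"},
}

# Behavior file extension expected for each transform.
BEHAVIOR_EXTENSION = {"spectrogram": ".parquet", "hilbert": ".csv"}


def expected_behavior_extension(source, transform):
    """Return the behavior file extension for a source/transform pair."""
    if source not in ALLOWED_TRANSFORMS:
        raise ValueError(f"unknown source: {source!r}")
    if transform not in ALLOWED_TRANSFORMS[source]:
        raise ValueError(f"transform {transform!r} is not supported for {source!r}")
    return BEHAVIOR_EXTENSION[transform]


print(expected_behavior_extension("tdt", "hilbert"))
```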

You will also need a metadata file associated with the behavior data. The metadata file must be named in the following format: *.toml. Here, you can also pass an extra parameter called final_z = true/false if you would like to apply a final z-score to the photometry data. Importantly, you must set the behavior_offset field in the .toml file to indicate the time offset between the photometry data and the behavior data.

The behavior data must have an event table with a time field and event field.
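A quick sanity check of the behavior metadata can catch a missing behavior_offset before processing. The sketch below works on an already-parsed dictionary (for example, what `tomli.load` returns for a .toml opened in binary mode). It assumes behavior_offset and final_z are top-level keys, which may not match how your file nests them; `check_behavior_meta` is a hypothetical helper.

```python
def check_behavior_meta(meta):
    """Return a list of problems found in the parsed behavior metadata."""
    problems = []
    offset = meta.get("behavior_offset")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(offset, bool) or not isinstance(offset, (int, float)):
        problems.append("behavior_offset must be set to a number")
    # final_z is optional, but must be a boolean when present.
    if "final_z" in meta and not isinstance(meta["final_z"], bool):
        problems.append("final_z must be true or false")
    return problems


print(check_behavior_meta({"behavior_offset": 1.5, "final_z": True}))
print(check_behavior_meta({"final_z": "yes"}))
```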

Running the photometry pipeline

Once you have inserted into the Subject, Session, and SessionDirectory tables and you have the appropriate files in place, you can run the photometry pipeline by bringing up the standard_worker docker container, as detailed in Worker Deployment (WSL). It will automatically detect and process the new data and populate the Photometry table.

You can also run the pipeline manually by running the following:

import os
if os.path.basename(os.getcwd()) == "notebooks": os.chdir("..")
import datajoint as dj
dj.config.load('dj_local_conf.json')
dj.conn()

from __future__ import annotations
import datajoint as dj
import pandas as pd
import numpy as np
import warnings
from pathlib import Path
import tomli
import tdt
from copy import deepcopy
import scipy.io as spio
from scipy import signal
from scipy.signal import blackman
from scipy.fft import fft, ifft, rfft

from element_interface.utils import find_full_path
from workflow import db_prefix
from workflow.pipeline import session, subject, lab, reference, photometry
from workflow.utils.paths import get_raw_root_data_dir
import workflow.utils.photometry_preprocessing as pp
from workflow.utils import demodulation

session_key = (session.Session() & "subject='subject'").fetch1("KEY")
session_key

sd_key = dict(session_key, session_dir = r'Photometry/subject/session1')
sd_key

photometry.FiberPhotometry.populate(sd_key)

If you are using the docker container, the pipeline will automatically search and process the behavior data as well. If you are manually running the pipeline, you will need to populate the SyncedPhotometry table by running the following:

import os
if os.path.basename(os.getcwd()) == "notebooks": os.chdir("..")
import datajoint as dj
dj.config.load('dj_local_conf.json')
dj.conn()

from __future__ import annotations
import datajoint as dj
import pandas as pd
import numpy as np
import warnings
from pathlib import Path
import tomli
import tdt
from copy import deepcopy
import scipy.io as spio
from scipy import signal
from scipy.signal import blackman
from scipy.fft import fft, ifft, rfft

from element_interface.utils import find_full_path
from workflow import db_prefix
from workflow.pipeline import session, subject, lab, reference, photometry
from workflow.utils.paths import get_raw_root_data_dir
import workflow.utils.photometry_preprocessing as pp
from workflow.utils import demodulation

session_key = (session.Session() & "subject='subject'").fetch1("KEY")
session_key

photometry.FiberPhotometrySynced.populate(session_key)

Class hierarchy and inheritance

@schema
class SensorProtein(dj.Lookup):
    definition = """            
    sensor_protein_name : varchar(16)  # (e.g., GCaMP, dLight, etc)
    """


@schema
class LightSource(dj.Lookup):
    definition = """
    light_source_name   : varchar(16)
    """
    contents = zip(["Plexon LED", "Laser"])


@schema
class ExcitationWavelength(dj.Lookup):
    definition = """
    excitation_wavelength   : smallint  # (nm)
    """


@schema
class EmissionColor(dj.Lookup):
    definition = """
    emission_color     : varchar(10) 
    ---
    wavelength=null    : smallint  # (nm)
    """

@schema
class CarrierFrequency(dj.Lookup):
    definition = """
    carrier_frequency     : smallint 
    ---
    wavelength=null    : smallint  # (nm)
    """


@schema
class FiberPhotometry(dj.Imported):
    definition = """
    -> session.Session
    ---
    -> [nullable] LightSource
    raw_sample_rate         : float         # sample rate of the raw data (in Hz) 
    beh_synch_signal=null   : longblob      # signal for behavioral synchronization from raw data
    """

    class Fiber(dj.Part):
        definition = """ 
        -> master
        fiber_id            : tinyint unsigned
        -> reference.Hemisphere
        ---
        notes=''             : varchar(1000)  
        """

    class DemodulatedTrace(dj.Part):
        definition = """ # demodulated photometry traces
        -> master.Fiber
        trace_name          : varchar(8)  # (e.g., raw, detrend)
        -> EmissionColor
        ---
        -> [nullable] SensorProtein          
        -> [nullable] ExcitationWavelength
        -> [nullable] CarrierFrequency
        demod_sample_rate   : float       # sample rate of the demodulated data (in Hz) 
        trace               : longblob    # demodulated photometry traces
        """


@schema
class FiberPhotometrySynced(dj.Imported):
    definition = """
    -> FiberPhotometry
    ---
    timestamps   : longblob
    time_offset  : float     # time offset to synchronize the photometry traces to the master clock (in second)  
    sample_rate  : float     # target downsample rate of synced data (in Hz) 
    """

    class SyncedTrace(dj.Part):
        definition = """ # demodulated photometry traces
        -> master
        -> FiberPhotometry.Fiber
        trace_name          : varchar(8)  # (e.g., raw, detrend)
        -> EmissionColor
        ---
        trace      : longblob  
        """

Creating a .toml file for photometry processing

To help create a .toml file, we have provided a python GUI that will guide you through the proper creation of the file. You can find the GUI in the TOML-metafile-scripts directory.

To start, open a terminal, activate your sabatini-datajoint environment, and run:

python ./TOML-metafile-scripts/launch.py

Select Make a TOML. A GUI will pop up and you will be able to fill out the relevant information. This will create the proper formatting for the .toml file and is advantageous if you need to edit it in the future.

Importantly, you will need to “insert” the proper information into the “right” and “left” hemisphere fields. The TOML will be created with the proper formatting for the pipeline to process the data, and it handles the two hemispheres separately.

Once you have filled out the GUI, hit the Save button. You will then be prompted to save the .toml file in the directory containing your photometry data.

Behavior pipeline

The behavior pipeline is designed to process behavior data that has been preprocessed into a .parquet or .csv file. It is built using Element Event and therefore requires some formatting of the data.

Input data

You will need an eventTable with a time, type, and trial field.

A trialTable with at minimum block and session_position fields.

A blockTable describing block attributes, with at minimum firstTrial (or start_trial) and lastTrial (or end_trial).

Have events that may not be considered within a trial? Not to worry, you can add a column in eventTable called inTrial and set it to 0 or 1.
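The eventTable requirements above can be checked on a .csv export before running the pipeline. This is a minimal sketch using only the standard library: `check_event_table` is a hypothetical helper, and a .parquet file would need a parquet reader instead.

```python
import csv
import io

# Required eventTable fields, as listed above.
REQUIRED_EVENT_COLUMNS = ("time", "type", "trial")


def check_event_table(csv_text):
    """Return a list of problems found in an eventTable given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_EVENT_COLUMNS if c not in header]
    if "inTrial" in header:
        # Data rows are numbered from 1, not counting the header line.
        bad_rows = [i for i, row in enumerate(reader, start=1)
                    if row["inTrial"] not in ("0", "1")]
        if bad_rows:
            problems.append(f"inTrial must be 0 or 1 (rows {bad_rows})")
    return problems


good_csv = "time,type,trial,inTrial\n0.5,lick,1,1\n1.2,reward,1,0\n"
bad_csv = "time,type,inTrial\n0.5,lick,1\n1.2,reward,2\n"
print(check_event_table(good_csv))
print(check_event_table(bad_csv))
```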

Running the behavior pipeline

Once you have inserted into the Subject, Session, and SessionDirectory tables and you have the appropriate files in place, you can run the behavior pipeline by bringing up the standard_worker docker container, as detailed in Worker Deployment (WSL). It will automatically detect and process the new data and populate the Event table.

You can also run the pipeline manually by running the following:

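# Assumes `session` and `ingestion` are already imported from the workflow package
# (the import path for `ingestion` is not shown in this guide).
session_key = (session.Session() & "subject='subject'").fetch1("KEY")
session_key

sd_key = dict(session_key, session_dir = r'Photometry/subject/session1')
sd_key

ingestion.BehaviorIngestion.populate(sd_key)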

Ephys pipeline

The ephys pipeline is designed to process Neuropixels data acquired with SpikeGLX. It runs the data through Kilosort2.5 and uses ecephys for post-processing. Currently, we have two workflows for processing the data: a docker container or a manual pipeline through the provided Jupyter notebook.

Input data

You will need all of the output files from SpikeGLX: .ap.bin, .lf.bin, .ap.meta, and .lf.meta. You can also use data that you have pre-processed through CatGT.
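To confirm that a session's Ephys folder holds the full set of SpikeGLX outputs, a small check like the following may help. It is a sketch only: `missing_spikeglx_files` is a hypothetical helper, and the file names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path

# SpikeGLX output suffixes listed above.
SPIKEGLX_SUFFIXES = (".ap.bin", ".lf.bin", ".ap.meta", ".lf.meta")


def missing_spikeglx_files(ephys_dir):
    """Return the required suffixes with no matching file under ephys_dir."""
    names = [p.name for p in Path(ephys_dir).rglob("*") if p.is_file()]
    return [s for s in SPIKEGLX_SUFFIXES if not any(n.endswith(s) for n in names)]


# Demo on a throwaway folder with one file deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    for suffix in (".ap.bin", ".ap.meta", ".lf.bin"):
        (Path(tmp) / f"example_run{suffix}").touch()
    demo_missing = missing_spikeglx_files(tmp)

print(demo_missing)
```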

Running the ephys pipeline through the docker container

Once you have inserted the Subject, Session, and SessionDirectory tables and you have the appropriate files in place, you can then proceed with running the ephys pipeline by simply upping the spike_sorting_local_worker docker container detailed in Worker Deployment (WSL). It will automatically detect the new data and process it and populate the EphysRecording, CuratedClustering, WaveformSet, and LFP tables.

Running the ephys pipeline manually

We have provided an ephys jupyter notebook that will guide you through the ephys pipeline. Importantly, you will have to configure your spike sorter of choice and the paths to the data in the notebook.

Ephys jupyter notebook.

Table organization

The following tables will be populated after running the ephys pipeline:

ephys.EphysRecording(), ephys.CuratedClustering(), ephys.WaveformSet(), ephys.LFP()

Calcium imaging pipeline

The calcium imaging pipeline is designed to process calcium imaging data through Suite2P. It will automatically populate the /Outbox directory with the processed data.

Input data

You will need all of the .tif files output from ScanImage.

Running the calcium imaging pipeline

Once you have inserted the Subject, Session, and SessionDirectory tables and you have the appropriate files in place, you will need to first up the standard_worker docker container detailed in Worker Deployment (WSL). Then, you will need to up the calcium_imaging_worker docker container.

A simple way to run the pipeline is to run the provided Calcium Imaging Jupyter Notebook.

Table organization

The calcium imaging processing pipeline will populate the imaging table.

DeepLabCut pipeline

The DeepLabCut pipeline is designed to process and annotate videos through DeepLabCut. We have updated the workflow so that you can run DeepLabCut from beginning to end through the provided jupyter notebook.

Input data

Once you have created your project_folder, it is important that you place it in /Inbox/dlc_projects/PROJECT_PATH. You will also need to have the videos you would like to process organized in the following format: /Inbox/Subject/dlc_behavior_videos/*.avi.

Running the DeepLabCut pipeline

This is a manual pipeline. You will need to run the provided DeepLabCut notebook (https://github.com/bernardosabatinilab/sabatini-datajoint-pipeline/blob/5b157f564b1989107c2dd495b2bbf5d7a88d2f8b/notebooks/dlc.ipynb) and edit all of the relevant information and paths in it.

Table organization

The DeepLabCut processing pipeline will populate the model and train tables.

General pipeline architecture

For any questions regarding the pipeline architecture, the whole pipeline can be visualized on our GitHub page.

For pipelines that were designed using DataJoint Elements (e.g. Event, Ephys, Calcium Imaging, DLC), more information can be found in the DataJoint Elements documentation.