
RNA seq analysis

Instructions for performing differential expression analysis from .fastq.gz files (for Mac): by Igor Dolgalev (NYU)

 

Getting ready: you only need to do these steps once

  1. Request access to BigPurple, NYULMC’s High Performance Computing (HPC) cluster

    a. Email hpc_admins@nyumc.org with the subject line “HPC Account Request.”

    Include the following information in the body of the email:

    i. the new user’s full name

    ii. the new user’s Kerberos ID (usually the user’s initials or part of the user’s name followed by a number of digits)

    iii. the new user’s email address (to receive important announcements via our mailing list)

    iv. the new user’s department or center and division or section

    v. the name of the new user’s principal investigator

    vi. the title of the project for which HPC resources are to be used

    vii. a brief description of the research being conducted by the new user

    viii. a brief description of the HPC needs, such as data storage, computing power, data transfers, remote access (from outside the MCIT network), and so on

  2. Download Cyberduck (https://cyberduck.io/download/) - you will use this to easily navigate between directories and upload/download files

  3. Create a Github account (https://github.com)

  4. Once you’ve been granted BigPurple access, open Cyberduck and click the “Open Connection” button in the top left corner

  5. Enter the following information (the username and password are your Kerberos ID and password):

  6. Once the connection is open, I recommend creating a bookmark for /gpfs/home/<Kerberos ID>, as you will want to navigate to this directory easily

 

 

Creating directories for your RNA-seq files and downloading the Seq-N-Slide (SNS) pipeline

  1. Download all the .fastq.gz files to your hard drive

  2. In Cyberduck, make a new directory (folder) in /gpfs/scratch/<Kerberos ID> called “fastq”. This is where all your raw fastq.gz files will live, but you will not make any changes inside this directory during RNA-seq analysis

  3. Using Cyberduck, drag and drop the .fastq.gz files from your hard drive to the /gpfs/scratch/<Kerberos ID>/fastq directory. This will take some time to complete. Once finished, you can remove the .fastq.gz files from your hard drive.

  4. Using Cyberduck, navigate to /gpfs/home/<Kerberos ID> and create a new folder called “<date>_RNA_seq_analysis”. This is where all the outputs from the seq analysis will be deposited.

  5. Now you must download all the Seq-N-Slide scripts and place them in your /gpfs/home/<Kerberos ID>/<date>_RNA_seq_analysis directory:

    a. Open Terminal, and type the following:

    ssh <Kerberos ID>@bigpurple.nyumc.org

    b. Enter your password when prompted, then press Return – you should now see the BigPurple logo in your Terminal window

    c. In the Terminal window, type

    module avail git

    and press Return. This shows which versions of the Git module are available to load; it should return something like “git/2.17.0”

    d. In the Terminal window, type

    module load git/2.17.0

    and press Return. The Git module is now loaded, which will allow you to download all the SNS scripts from github.com

    e. In the Terminal window, navigate to your SNS directory by typing

    cd /gpfs/home/<Kerberos ID>/<date>_RNA_seq_analysis

    and pressing Return. You will stay in this directory for the following steps unless specified.

    f. Next, type the following and press Return to download the SNS scripts:

    git clone --depth 1 https://github.com/igordot/sns

    (Note: you may be prompted to provide your GitHub login credentials for this step)
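If you would rather create the two directories from Terminal than from Cyberduck, the sketch below shows one way to do it once you are logged in to BigPurple. The function name make_rnaseq_dirs and the GPFS_ROOT variable are made up for this example (they are not part of SNS), and the Kerberos ID and date in the usage comment are placeholders.

```shell
# Hypothetical helper (not part of SNS): create the fastq and analysis
# directories described above and print the analysis directory path.
# GPFS_ROOT is an illustrative variable; it defaults to /gpfs.
make_rnaseq_dirs() (
    kerberos_id="$1"
    run_date="$2"
    root="${GPFS_ROOT:-/gpfs}"
    fastq_dir="$root/scratch/$kerberos_id/fastq"
    project_dir="$root/home/$kerberos_id/${run_date}_RNA_seq_analysis"
    # mkdir -p creates missing parent directories and does nothing if they exist
    mkdir -p "$fastq_dir" "$project_dir" && printf '%s\n' "$project_dir"
)

# Example (placeholder ID and date):
#   cd "$(make_rnaseq_dirs abc01 2024-01-15)"
```

Uploading the .fastq.gz files and cloning the SNS scripts then work the same way as in the steps above.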

 

Running the SNS pipeline on your RNA-seq data

  1. Create a sample sheet of your files. In Terminal, navigate to your SNS directory (see 5.e. above for help). Then, type the following command and press Return:

    sns/gather-fastqs /gpfs/scratch/<Kerberos ID>/fastq

    This tells the SNS pipeline to search your fastq directory for your raw sequence files and generates a spreadsheet of your samples. It will also pair your R1 and R2 mates if you performed paired-end sequencing. At this point, you can refresh Cyberduck and navigate to your SNS directory, where you should see a file called “samples.fastq-raw.csv”. You can double-click this to download it and view it in Excel.

  2. Next, you must specify the reference genome (usually the Drosophila R6 release), by typing the following command into your Terminal window and pressing Return:

    sns/generate-settings dm6

    Here, “dm6” refers to the Drosophila melanogaster release 6 reference genome. You could similarly use “hg19” for human RNA-seq data, “mm10” for mouse, etc.

  3. Here is the actual step where you perform trimming and alignment to the reference genome! Type the following command into your Terminal window and press Return:

    sns/run rna-star

    This is a very compute-heavy process, so the SNS pipeline sends this job to the BigPurple cluster to execute. For my 3 samples × 3 replicates, this took BigPurple a couple of hours. To check on the status of your job, type the following into your Terminal window and press Return:

    watch squeue -u <Kerberos ID>

    Your job(s) will be listed as Pending (PD), Running (R) or Completing (CG); jobs that have finished no longer appear in the list. The window will refresh every 2 seconds. To exit this window, press Control+C.

  4. To check for potential errors, type the following command into your Terminal window and press Return:

    grep "ERROR:" logs-sbatch/*

    If there are no errors, you’re now ready to perform differential expression analysis.

  5. (optional) You can also download your .bam and .bam.bai (and .bw) files at this point and load them into the IGV browser to look at the distribution of reads across your genes of interest. (Note: you may need to refresh your Cyberduck window to see these new files)
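If you have many samples, the job list and the grep output from steps 3 and 4 can get long. The sketch below wraps both checks in two small shell functions. The function names are invented for this example and are not part of SNS or Slurm; the squeue options in the usage comment (-h for no header, -o "%t" for the short state code) are standard Slurm options, and using $USER as the Kerberos ID is an assumption.

```shell
# Hypothetical helper: count jobs per Slurm state code.
# Reads one state code per line (e.g. PD, R) from standard input.
count_job_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Hypothetical helper: print the names of log files that contain "ERROR:".
# Exits with status 0 only when no file matches. A missing or empty
# directory also counts as "no errors", so check the path you pass in.
logs_with_errors() {
    # grep -l prints matching file names; "!" inverts grep's exit status
    ! grep -l "ERROR:" "${1:-logs-sbatch}"/* 2>/dev/null
}

# Example usage from your SNS directory:
#   squeue -h -u "$USER" -o "%t" | count_job_states
#   logs_with_errors logs-sbatch && echo "no errors found"
```

Treat this as a convenience on top of the commands above, not a replacement for reading the log files when something fails.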

 

Performing differential expression analysis

  1. In your SNS directory, you should notice a file named samples.groups.csv. Download this file, define your group names and re-upload it to the same directory by dragging and dropping into Cyberduck. Be sure to remove the original samples.groups.csv from your SNS directory!

  2. Now it’s time to perform the Differential Expression analysis. In your SNS directory, type the following command into your Terminal window and press Return:

    sns/run rna-star-groups-dge

    This only takes a few minutes. The output files (heatmaps, PCA plots, tables of differentially expressed genes, etc.) will be placed in a subfolder in your SNS directory. Refresh Cyberduck to see this folder and view its contents. One of the outputs should be a deseq2.dds.rds file (i.e., an R object) that you can use to custom-make graphs in R if you want.

  3. Download all the relevant output files to your hard drive.

    For more information about Seq-N-Slide, go to https://github.com/igordot/sns
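One thing that can go wrong in step 1 above is re-uploading samples.groups.csv with a sample that still has no group. The sketch below is a quick check you could run in your SNS directory before starting the differential expression step. It assumes the file has two comma-separated columns (sample, then group), that any header line starts with "#", and that an unassigned group is either empty or "NA". These are assumptions about the file layout, so open your own file and confirm before relying on it. The function name check_groups is made up for this example.

```shell
# Hypothetical helper (not part of SNS): list samples without a group.
# Assumed layout: sample,group per line; header lines start with "#".
# Exits with status 0 only if every sample has a group.
check_groups() (
    groups_file="$1"
    # Remove carriage returns that spreadsheet programs may add, skip
    # header and blank lines, then print samples whose group is empty or "NA".
    tr -d '\r' < "$groups_file" |
        awk -F',' '!/^#/ && NF > 0 && ($2 == "" || $2 == "NA") { print $1; bad = 1 }
                   END { exit bad }'
)

# Example usage from your SNS directory:
#   check_groups samples.groups.csv && echo "all samples have a group"
```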