This R Markdown document provides the analytical workflow required to process and interpret the sequencing data generated during the wet lab.
The structure, format, and assessment criteria of the written report and oral presentation are defined in the official course guides. This document focuses on the analytical components that generate the data and figures used in those assessments.
This practical supports the course Learning Outcomes as follows:
Part A is conducted using the sequencing data
generated from your own wet lab experiment.
The objective is to directly compare how different physical DNA
treatments (Control, Vortex, Needle, Temperature stress) influence
sequencing performance metrics, including read length distribution,
sequencing yield, and base quality before and after filtering.
This section establishes the causal link between laboratory manipulation
and sequencing outcomes.
Parts B and C extend the analysis to mapping,
structural variant detection, and biological interpretation.
Because full mapping and variant calling workflows on complete raw
datasets are computationally intensive and require extended runtime on
HPC infrastructure, a pre-processed and region-restricted dataset
derived from previous sequencing runs is used for these sections.
This ensures that all students can complete the advanced analytical workflow within the allocated practical session while working with high-quality, reproducible example data suitable for interpretation in IGV and downstream biological analysis.
Together, Part A and Parts B–C form an analytical
progression:
from experimental manipulation and quality assessment (wet-to-sequencing
outcomes) to genomic localization and biological interpretation
(sequencing-to-function).
Part A: Using the read quality values before and after filtering in the shared table (in Part A), examine the effect of different DNA treatment methods on read quality and length.
Part C: Select one structural variant locus from the table provided (in Part C) and investigate it in IGV. Report the genomic coordinates, variant type and size, genotype comparison between the 2024 and 2025 samples, whether the variant overlaps any genes, and briefly discuss the potential functional consequence for the affected gene.
To quantitatively evaluate how different physical DNA treatments (Control, Vortex, Needle shearing, and Temperature extreme) influence long-read sequencing performance by comparing read length distribution, sequencing yield, and base quality metrics before and after quality filtering.
The FASTQ files analyzed in this section originate from four treatments performed during the wet laboratory session:
Following DNA extraction, concentrations were quantified using Qubit
fluorometry.
Samples were normalized to equal DNA mass per barcode, barcoded
individually, pooled, and sequenced using Oxford Nanopore
technology.
In your terminal/command prompt, log in to Sigma2:
ssh yourID@saga.sigma2.no
So that your session survives a dropped connection, create a virtual tmux
terminal
tmux new -s EUK
and run an interactive job for the class:
srun \
--account=nn9987k \
--partition=normal \
--gres=localscratch:50G \
--cpus-per-task 4 \
--nodes 1 \
--mem=10G \
--time=02:00:00 \
--pty bash \
-i
Let’s make a directory for analysis and enter it.
cd /cluster/projects/nn9987k/$USER
mkdir analysis # make the "analysis" directory
cd analysis # enter the "analysis" directory
Now, you will inspect the fastq file from your experiment, which contains Nanopore read information.
Before proceeding with the analysis, review the metadata corresponding to your assigned group.
Please identify:
You can find the complete overview of samples and barcodes in the shared spreadsheet
Confirm that the FASTQ file you are working with matches your assigned barcode and treatment condition before running NanoPlot.
All the data is now here:
/cluster/projects/nn9987k/BIO326-2025/data_euk_2026/merged_fastq
Concatenated fastq files are named according to barcode NB:
NB01.fq.gz NB02.fq.gz NB03.fq.gz NB04.fq.gz NB05.fq.gz NB06.fq.gz NB07.fq.gz NB08.fq.gz NB09.fq.gz NB10.fq.gz NB11.fq.gz NB12.fq.gz NB13.fq.gz NB14.fq.gz NB15.fq.gz NB16.fq.gz NB17.fq.gz NB18.fq.gz NB19.fq.gz NB20.fq.gz
Open your FASTQ file and examine its structure.
zcat YourSample.fq.gz | head -20
FASTQ format uses 4 lines per read: 1. Header 2. Sequence 3. “+” 4. Quality string
FASTQ files contain both nucleotide sequences and per-base quality scores
zcat YourSample.fq.gz | wc -l
FASTQ uses 4 lines per read.
You can calculate the number of reads with:
LINES=$(zcat YourSample.fq.gz | wc -l)
echo $((LINES / 4))
The total number of reads reflects sequencing yield. This will later be compared between treatments and after filtering.
zcat YourSample.fq.gz | sed -n 4p | cut -c 1-5
zcat YourSample.fq.gz | sed -n 4p | cut -c 21-25
Quality scores are encoded using Phred+33. Quality may vary along the read due to sequencing chemistry.
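As a quick illustration of Phred+33 decoding (a standalone sketch using a made-up quality string, not part of the cluster workflow), each quality character's ASCII code minus 33 is its Q score:

```shell
# Decode a made-up Phred+33 quality string into numeric Q scores.
printf '%s\n' '!!)5?I' | awk '
  BEGIN { for (j = 33; j <= 126; j++) ord[sprintf("%c", j)] = j }  # ASCII lookup table
  {
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      printf "%s=%d%s", c, ord[c] - 33, (i < length($0) ? " " : "\n")
    }
  }'
# → !=0 !=0 )=8 5=20 ?=30 I=40
```

For example, `I` decodes to Q40 (99.99% base-call accuracy) and `5` to Q20 (99%).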
zcat YourSample.fq.gz | awk 'NR%4==2 && length($0) > 10000 {print; exit}'
Long reads are a key advantage of Nanopore sequencing. Read length distribution will later be compared across treatments.
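Before running NanoPlot, you can get a rough feel for the length distribution yourself. This sketch summarizes a tiny made-up FASTQ supplied as a here-doc; in practice you would pipe in `zcat YourSample.fq.gz` instead:

```shell
# Read count, total yield, mean and max length (sequence = every 2nd of 4 lines).
awk 'NR % 4 == 2 { n++; total += length($0); if (length($0) > max) max = length($0) }
     END { printf "reads=%d yield=%d mean=%.1f max=%d\n", n, total, total / n, max }' <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTAC
+
IIIIII
EOF
```

For the toy data above this prints `reads=2 yield=16 mean=8.0 max=10`.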
The original FASTQ files may contain low-quality reads. In this step, we will use NanoPlot to inspect the quality and length of each read.
Make a SLURM script like the one below to run the quality check on your sample file, then submit it.
Review: how to make a SLURM script and run it with sbatch
#!/bin/bash
#SBATCH --job-name=Nanoplot # sensible name for the job
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=nanoplot_before
#SBATCH --account=nn9987k
#SBATCH --time=20:00
##Activate conda environment
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "Working with this $CONDA_PREFIX environment ..."
## run nanoplot
NanoPlot -t 8 --fastq /cluster/projects/nn9987k/BIO326-2025/day1/1-A.fq.gz --plots dot --no_supplementary --no_static --N50 -p before_filter_A1
Run the script by:
sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/Nanoplot.before.euk.SLURM.sh <SAMPLE>
NB! Remember to use your sample ID (for example NB01) as the <SAMPLE> argument.
NanoPlot will generate result files whose names start with “before_filter_”. Let’s look into them…
# Taking too long?
cp /cluster/projects/nn9987k/BIO326-2025/day1/before_filter_A1NanoPlot-report.html before_filter_A1NanoPlot-report.html
Open “before_filter_A1NanoPlot-report.html” on your local computer. Review: File transfer between Sigma2 and your computer
Together, these metrics capture the trade-off between yield and quality before and after filtering.
Now you will filter the obtained reads by length and read quality.
#!/bin/bash
#SBATCH --job-name=Nanofilt # sensible name for the job
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --output=nanofilt
#SBATCH --account=nn9987k
#SBATCH --time=20:00
##Activate conda environment
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "Working with this $CONDA_PREFIX environment ..."
## run nanofilt
gunzip -c /cluster/projects/nn9987k/BIO326-2025/day1/1-A.fq.gz | NanoFilt -q 12 -l 1000 | gzip > cleaned_A1.fq.gz
To run:
sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/NanoFilt.euk.SLURM.sh <SAMPLE>
NB! Remember to use your sample ID for example: NB01
This script activates a Conda environment (your toolkit, pre-assembled by Arturo) on the computing cluster:
module load StdEnv
Loads the standard software environment.
module load Miniconda3/23.10.0-1
Loads Miniconda, a lightweight Conda distribution.
source ${EBROOTMINICONDA3}/bin/activate
Activates the base Miniconda installation.
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
Activates the project-specific Conda environment.
echo "Working with this $CONDA_PREFIX environment ..."
Prints the path of the currently active Conda environment.
-l, Filter on a minimum read length
-q, Filter on a minimum average read quality score
In this case, we are removing reads lower than quality score 12 and shorter than 1000 bases.
If you are ambitious, please adjust the filtering parameters and see how they change the result.
(In that case, do not forget to name the result files differently.)
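Conceptually, the length half of the filter just keeps each 4-line record whose sequence line is long enough. Below is a simplified sketch on a toy inline FASTQ (NanoFilt's `-q` cutoff additionally averages per-base error probabilities, which this sketch omits):

```shell
# Keep only reads whose sequence is at least MINLEN bases (toy data inline).
awk -v MINLEN=8 '
  { rec[NR % 4] = $0 }                     # buffer the current 4-line record
  NR % 4 == 0 {                            # rec[0] now holds the quality line
    if (length(rec[2]) >= MINLEN)          # rec[2] is the sequence line
      print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
  }' <<'EOF'
@long_read
ACGTACGTACGT
+
IIIIIIIIIIII
@short_read
ACGT
+
IIII
EOF
```

Only the `@long_read` record survives the filter.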
Run Nanoplot again on the cleaned sequences, and compare the quality metrics before and after the filtering.
#!/bin/bash
#SBATCH --job-name=Nanoplot # sensible name for the job
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=nanoplot_after
#SBATCH --account=nn9987k
#SBATCH --time=20:00
##Activate conda environment
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "Working with this $CONDA_PREFIX environment ..."
## run nanoplot
NanoPlot -t 8 --fastq cleaned_A1.fq.gz --plots dot --no_supplementary --no_static --N50 -p after_filter_A1
To run:
sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/Nanoplot.cleaned.euk.SLURM.sh <SAMPLE>
NB! Remember to use your sample ID for example: NB01
Open the “after_filter_…NanoPlot-report.html” report for your sample on your local computer.
Do you see differences in the read length and quality distributions before and after filtering? Discuss within and between groups.
Your mission
After running NanoPlot on your dataset (before and after filtering), extract the following summary statistics and enter them into the shared spreadsheet:
Record these values for both:
Use the NanoPlot summary report (.html or .txt) to retrieve the
metrics.
Ensure that you enter values corresponding to your assigned sample and
treatment.
The purpose of this activity is to compare sequencing yield and quality before and after filtering across treatments.
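When filling in the spreadsheet, the percentage of reads retained after filtering is a handy derived metric for comparing treatments (the counts below are made up for illustration):

```shell
# Percent of reads retained after filtering, from before/after read counts.
before_reads=120000   # hypothetical pre-filter count
after_reads=95000     # hypothetical post-filter count
awk -v b="$before_reads" -v a="$after_reads" \
    'BEGIN { printf "reads retained: %.1f%%\n", 100 * a / b }'
# → reads retained: 79.2%
```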
To understand the core logic of read mapping and structural variant calling by running (or inspecting) a minimal long-read pipeline:
1) map reads to a reference genome,
2) generate a sorted/indexed BAM file,
3) call structural variants (SVs) into a VCF file,
4) inspect variant records and interpret key fields (SVTYPE, SVLEN, GT).
#!/bin/bash
#SBATCH --job-name=Sniffles
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=sniffles_%j.out
#SBATCH --account=nn9987k
#SBATCH --time=20:00
########################################
# 🧬 Usage check
########################################
if [ -z "$1" ]; then
echo "❌ ERROR: You must provide an OUTPUT DIRECTORY!"
echo "👉 Usage: sbatch sniffles.sh /path/to/output_directory"
exit 1
fi
OUTDIR=$1
echo "📂 Output directory: $OUTDIR"
# Create directory if it doesn't exist
mkdir -p "$OUTDIR"
########################################
# 🐍 Activate conda environment
########################################
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "🐍 Using conda environment: $CONDA_PREFIX"
echo "🖥️ CPUs: $SLURM_CPUS_PER_TASK"
echo "--------------------------------------"
########################################
# 🧭 Define input files (you can later make these arguments too)
########################################
REF=/cluster/projects/nn9987k/BIO326-2025/day2/Bos_taurus_chr13.fa
READS=/cluster/projects/nn9987k/BIO326-2025/day2/2025_chr13_teaching_20kb.fastq.gz
PREFIX=2025_chr13
########################################
# 🧬 Mapping
########################################
echo "🧬 Mapping with minimap2..."
minimap2 -t $SLURM_CPUS_PER_TASK -ax map-ont \
"$REF" "$READS" \
> ${OUTDIR}/${PREFIX}.sam
echo "✅ Mapping done"
########################################
# 🔄 SAM ➜ BAM
########################################
echo "🔄 Converting to BAM..."
samtools view -@ $SLURM_CPUS_PER_TASK -S -b \
${OUTDIR}/${PREFIX}.sam > ${OUTDIR}/${PREFIX}_temp.bam
echo "📊 Sorting BAM..."
samtools sort -@ $SLURM_CPUS_PER_TASK \
${OUTDIR}/${PREFIX}_temp.bam \
-o ${OUTDIR}/${PREFIX}.bam
echo "📌 Indexing BAM..."
samtools index ${OUTDIR}/${PREFIX}.bam
########################################
# 🧪 Variant Calling
########################################
echo "🧬 Calling structural variants with Sniffles..."
sniffles \
--input ${OUTDIR}/${PREFIX}.bam \
--vcf ${OUTDIR}/${PREFIX}.vcf \
--threads $SLURM_CPUS_PER_TASK
echo "🎉 Variant calling finished!"
echo "📄 VCF file: ${OUTDIR}/${PREFIX}.vcf"
########################################
# 🧹 Cleanup
########################################
echo "🧹 Cleaning temporary files..."
rm -f ${OUTDIR}/${PREFIX}.sam
rm -f ${OUTDIR}/${PREFIX}_temp.bam
echo "✨ Pipeline completed successfully!"
Run:
cd /cluster/projects/nn9987k/$USER && mkdir -p sniffles && cd sniffles && sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/sniffles.SLURM.sh /cluster/projects/nn9987k/$USER/sniffles
Workflow Overview

| Command | Purpose |
|---|---|
| `minimap2 ... > 2025_chr13.sam` | Maps long reads to the reference genome and outputs SAM format. |
| `samtools view ... > 2025_chr13_temp.bam` | Converts SAM to compressed BAM format. |
| `samtools sort ... -o 2025_chr13.bam` | Sorts alignments by coordinate for downstream compatibility. |
| `samtools index ... 2025_chr13.bam` | Creates a BAM index for random genomic access. |
| `sniffles ... --vcf 2025_chr13.vcf` | Detects structural variants and outputs a VCF file. |
Now you have the variant file!
If the job is taking too long, copy the pre-computed VCF into your directory:
cp /cluster/projects/nn9987k/BIO326-2025/data_euk_2026/2025_chr13.vcf 2025_chr13.vcf
Look inside the VCF:
# INFO field
grep '^##' 2025_chr13.vcf | tail -n 20
| Field | Description |
|---|---|
| `MOSAIC` | Structural variation classified as putative mosaic |
| `SVLEN` | Length of the structural variation |
| `SVTYPE` | Type of structural variation (INS, DEL, BND, etc.) |
| `CHR2` | Mate chromosome for BND structural variants |
| `SUPPORT` | Number of reads supporting the structural variation |
| `SUPPORT_INLINE` | Reads supporting INS/DEL without split alignments |
| `SUPPORT_SA` | Reads supporting DEL via supplementary alignments |
| `SUPPORT_LONG` | Soft-clipped reads supporting long insertions |
| `END` | End position of the structural variation |
| `STDEV_POS` | Standard deviation of the variant start position |
| `STDEV_LEN` | Standard deviation of the variant length |
| `COVERAGE` | Coverage at upstream, start, center, end, downstream positions |
| `STRAND` | Strand orientation of supporting reads |
| `AC` | Allele count across all samples |
| `SUPP_VEC` | Per-sample support vector |
| `CONSENSUS_SUPPORT` | Reads supporting the insertion consensus sequence |
| `RNAMES` | Names of supporting reads (if enabled) |
| `VAF` | Variant Allele Fraction |
| `NM` | Mean mismatch rate of supporting reads |
| `PHASE` | Phasing information derived from supporting reads |
# variants
grep -v '^##' 2025_chr13.vcf | more
| Field | Meaning |
|---|---|
| `SVTYPE` | Structural variant type (INS = insertion, DEL = deletion) |
| `SVLEN` | Variant length (negative values indicate deletions) |
| `SUPPORT` | Number of reads supporting the variant |
| `VAF` | Variant allele fraction |
| `GT` | Genotype (0/1 = heterozygous, 1/1 = homozygous alternate) |
| `DR` | Reads supporting the reference allele |
| `DV` | Reads supporting the variant allele |
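To pull these fields out of a record programmatically, you can split the INFO and sample columns. The record below is a simplified illustration built from the Sniffles2.INS.1F6MC locus in Part C (the SUPPORT, DR, and DV values are hypothetical):

```shell
# Extract SVTYPE and SVLEN from INFO (col 8) and GT from the sample column (col 10).
line=$(printf '13\t46710350\tSniffles2.INS.1F6MC\tN\t<INS>\t60\tPASS\tSVTYPE=INS;SVLEN=51;SUPPORT=14\tGT:DR:DV\t0/1:10:14')
echo "$line" | awk -F'\t' '{
  n = split($8, info, ";")
  for (i = 1; i <= n; i++) {
    split(info[i], kv, "=")
    if (kv[1] == "SVTYPE" || kv[1] == "SVLEN") printf "%s=%s ", kv[1], kv[2]
  }
  split($10, fmt, ":")      # sample column, ordered as the FORMAT column GT:DR:DV
  printf "GT=%s\n", fmt[1]
}'
# → SVTYPE=INS SVLEN=51 GT=0/1
```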
Now you have variants! Let’s see which genes are affected by the variants… in Part C!
In this practical, you will explore high-confidence structural variants (SVs) detected from long-read sequencing data in two samples (2024 and 2025).
The dataset has been reduced to chromosome 13 around selected SVs to ensure fast loading in the Integrative Genomics Viewer (IGV).
You will:
- Visualize structural variants in IGV
- Compare genotypes between two samples
- Interpret read-level evidence
- Examine gene overlap
- Investigate biological function
| File Type | Role in IGV |
|---|---|
| FASTA | Defines coordinate system |
| BAM | Shows aligned reads (evidence) |
| VCF | Shows called variants (summary) |
| GTF | Shows gene structure (context) |
The following files are provided:

- Bos_taurus_chr13.fa
- joint_chr13_teaching_20kb.vcf.gz
- 2024_chr13_teaching_20kb.bam
- 2025_chr13_teaching_20kb.bam
- Bos_taurus_chr13.gtf.gz

Below are the structural variants detected on chromosome 13.
Choose a locus you like and investigate it in IGV.
| Chrom | Start | End | ID | Type | Length | Gene |
|---|---|---|---|---|---|---|
| 13 | 1720605 | 1721780 | Sniffles2.DEL.2FMC | DEL | -1174 | ENSBTAG00000008338 |
| 13 | 8238127 | 8238128 | Sniffles2.INS.86MC | INS | 3404 | ENSBTAG00000034441 |
| 13 | 40007605 | 40007755 | Sniffles2.DEL.1B4MC | DEL | -149 | ENSBTAG00000014178 |
| 13 | 46710350 | 46710351 | Sniffles2.INS.1F6MC | INS | 51 | ENSBTAG00000006531 |
| 13 | 47767130 | 47767178 | Sniffles2.DEL.1FEMC | DEL | -47 | ENSBTAG00000008293 |
| 13 | 74258437 | 74258438 | Sniffles2.INS.2ECMC | INS | 114 | ENSBTAG00000017328 |
| 13 | 75795925 | 75795926 | Sniffles2.BND.2F6MC | BND | NA | ENSBTAG00000013114 |
| 13 | 77829615 | 77829701 | Sniffles2.DEL.309MC | DEL | -85 | ENSBTAG00000012626 |
Enter the coordinates of your chosen locus (e.g., 13:46710350) into the IGV search bar.

Select one structural variant locus from the table provided and investigate it in IGV. Report the genomic coordinates, variant type and size, genotype comparison between the 2024 and 2025 samples, whether the variant overlaps any genes, and briefly discuss the potential functional consequence for the affected gene.
Please also see the IGV help tab to understand each symbol: https://igv.org/doc/webapp/#UserGuide/
Example locus1
Sniffles2.INS.1F6MC — 13:46710350–46710351 — 51 bp insertion — ENSBTAG00000006531 (DIP2C)
This locus corresponds to a 51 bp insertion at position 46,710,350 on chromosome 13.
In IGV, the 2024 sample shows continuous read alignment across the site with no consistent breakpoint or inserted sequence, indicating a reference genotype at this position.
In contrast, the 2025 sample displays multiple independent reads containing the same inserted sequence at the identical coordinate, with split alignment patterns and clear insertion labeling in IGV, supporting the presence of the insertion allele.
According to the gene annotation track, the variant lies within the ENSBTAG00000006531 locus, corresponding to DIP2C. If located within an exon, a 51 bp insertion could alter the coding sequence while preserving the reading frame, potentially modifying protein structure.
Example locus2
Sniffles2.BND.2F6MC — 13:75795925–75795926 — BND — ENSBTAG00000013114
At this locus, compare the two samples directly.
In the 2024 sample, read coverage is uniform, mismatches are scattered, and there is no clustering of discordant alignments, indicating absence of a structural rearrangement.
In 2025, focus on the central region where the reads form a dense, colorful block. The vertical colored lines indicate many mismatches concentrated in the same short interval. Multiple reads show discordant alignment in the same position. When many reads change relative to the reference at the exact same coordinates, this is not random sequencing error.
This pattern suggests an inversion or a highly divergent sequence present in 2025 but absent in 2024.
The locus overlaps ENSBTAG00000013114. A breakend event within a gene may disrupt coding sequence continuity or regulatory structure, depending on whether the breakpoint falls within an exon or intron.
That’s it! You are now an expert in genome analysis!
If you are interested in a similar bioinformatics or genomics project for your master’s thesis, please contact our lab: Saitou Lab
(1) “Shouldn’t we map each dataset separately?”
In principle, we could. However, each dataset corresponds to roughly one twentieth of a PromethION flow cell. A typical PromethION flow cell produces on the order of ~40 Gb of data; dividing that by 20 gives approximately 2 Gb per subset.
For a mammalian genome of ~3 Gb, this corresponds to only ~0.7× coverage. In practical terms, this means that only small, scattered portions of the genome are sequenced, while most of the genome is not covered at all. With such sparse coverage, it is generally not possible to make reliable biological inferences. So mapping each subset separately is technically possible if you are curious, but the coverage is too low to draw meaningful conclusions.
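The arithmetic above is simply total sequenced bases divided by genome size; a back-of-envelope sketch using the numbers from the text:

```shell
# Back-of-envelope coverage = (flow cell yield / number of subsets) / genome size.
awk -v yield_gb=40 -v subsets=20 -v genome_gb=3 \
    'BEGIN { per = yield_gb / subsets
             printf "per-subset=%.1f Gb coverage=%.2fx\n", per, per / genome_gb }'
```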
(2) “Why are we using last year’s data?”
To perform proper variant calling, we would need to combine all ~20 subsets and map the merged dataset to the reference genome. However, this requires substantial computational time and additional processing steps. Within the limited time available in class, running a full variant-calling pipeline from raw data to final calls is not realistic. Therefore, we use a stable, pre-prepared dataset so that we can focus on understanding the workflow and its logic (your data will likely be processed later and shown to next year’s students).
(3) “Can we perform de novo assembly instead?”
For a mammalian genome of ~3 Gb, assembly requires much higher coverage. Long-read assembly typically benefits from tens-fold coverage, for example on the order of ~50× depending on the goals (around 5 flow cells).
Using data from a single flow cell, some regions of the genome are not sequenced at all, or some regions are covered by only a single read. Assembly algorithms rely on overlapping reads to connect sequences into longer contiguous segments. When there are too few overlaps, reads cannot be reliably linked together, and large gaps remain. There is not enough information to reconstruct long stretches of the genome.
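How coverage translates into gaps can be sketched with the Lander–Waterman approximation, under which the fraction of the genome with zero coverage is roughly e^(−c) at mean coverage c (an idealized model that assumes uniformly random read placement):

```shell
# Expected uncovered fraction of the genome at mean coverage c (Lander-Waterman).
for c in 0.7 5 30; do
  awk -v c="$c" 'BEGIN { printf "coverage %sx -> ~%.3g%% of the genome uncovered\n", c, 100 * exp(-c) }'
done
```

At ~0.7× roughly half the genome receives no reads at all, while at assembly-grade coverage the expected uncovered fraction is negligible.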
We will instead perform assembly using the bacterial dataset (PROK part), where the genome is much smaller and the same sequencing output provides sufficient coverage.