This R Markdown document provides the analytical workflow required to process and interpret the sequencing data generated during the wet lab.
The structure, format, and assessment criteria of the written report and oral presentation are defined in the official course guides. This document focuses on the analytical components that generate the data and figures used in those assessments.
This practical supports the course Learning Outcomes as follows:
Part A is conducted using the sequencing data
generated from your own wet lab experiment.
The objective is to directly compare how different physical DNA
treatments (Control, Vortex, Needle, Temperature stress) influence
sequencing performance metrics, including read length distribution,
sequencing yield, and base quality before and after filtering.
This section establishes the causal link between laboratory manipulation
and sequencing outcomes.
Parts B and C extend the analysis to mapping,
structural variant detection, and biological interpretation.
Because full mapping and variant calling workflows on complete raw
datasets are computationally intensive and require extended runtime on
HPC infrastructure, a pre-processed and region-restricted dataset
derived from previous sequencing runs is used for these sections.
This ensures that all students can complete the advanced analytical workflow within the allocated practical session while working with high-quality, reproducible example data suitable for interpretation in IGV and downstream biological analysis.
Together, Part A and Parts B–C form an analytical
progression:
from experimental manipulation and quality assessment (wet-to-sequencing
outcomes) to genomic localization and biological interpretation
(sequencing-to-function).
Part A: Using the read quality values before and after filtering in the shared table (in Part A), examine the effect of different DNA treatment methods on read quality and length.
Part C: Select one structural variant locus from the table provided (in Part C) and investigate it in IGV. Report the genomic coordinates, variant type and size, genotype comparison between the 2024 and 2025 samples, whether the variant overlaps any genes, and briefly discuss the potential functional consequence for the affected gene.
To quantitatively evaluate how different physical DNA treatments (Control, Vortex, Needle shearing, and Temperature extreme) influence long-read sequencing performance by comparing read length distribution, sequencing yield, and base quality metrics before and after quality filtering.
The FASTQ files analyzed in this section originate from four treatments performed during the wet laboratory session:
Following DNA extraction, concentrations were quantified using Qubit
fluorometry.
Samples were normalized to equal DNA mass per barcode, barcoded
individually, pooled, and sequenced using Oxford Nanopore
technology.
In your terminal/command prompt, log in to Sigma2:
ssh yourID@saga.sigma2.no
So that your session survives a dropped connection, create a virtual tmux
terminal
tmux new -s EUK
and run an interactive job for the class:
srun \
--account=nn9987k \
--partition=normal \
--gres=localscratch:50G \
--cpus-per-task 4 \
--nodes 1 \
--mem=10G \
--time=02:00:00 \
--pty bash \
-i
Let’s make a directory for analysis and enter it.
cd /cluster/projects/nn9987k/$USER
mkdir analysis # make the "analysis" directory
cd analysis # enter the "analysis" directory
Now, you will inspect the fastq file from your experiment, which contains Nanopore read information.
Before proceeding with the analysis, review the metadata corresponding to your assigned group.
Please identify:
You can find the complete overview of samples and barcodes in the shared spreadsheet
Confirm that the FASTQ file you are working with matches your assigned barcode and treatment condition before running NanoPlot.
All the data is now here:
/cluster/projects/nn9987k/BIO326-2025/data_euk_2026/merged_fastq
Concatenated fastq files are named according to barcode NB:
NB01.fq.gz NB02.fq.gz NB03.fq.gz NB04.fq.gz NB05.fq.gz NB06.fq.gz NB07.fq.gz NB08.fq.gz NB09.fq.gz NB10.fq.gz NB11.fq.gz NB12.fq.gz NB13.fq.gz NB14.fq.gz NB15.fq.gz NB16.fq.gz NB17.fq.gz NB18.fq.gz NB19.fq.gz NB20.fq.gz
Open your FASTQ file and examine its structure.
zcat YourSample.fq.gz | head -20
FASTQ format uses 4 lines per read: 1. Header 2. Sequence 3. “+” 4. Quality string
FASTQ files contain both nucleotide sequences and per-base quality scores
zcat YourSample.fq.gz | wc -l
FASTQ uses 4 lines per read.
You can calculate the number of reads with:
LINES=$(zcat YourSample.fq.gz | wc -l)
echo $((LINES / 4))
The total number of reads reflects sequencing yield. This will later be compared between treatments and after filtering.
zcat YourSample.fq.gz | sed -n 4p | cut -c 1-5
zcat YourSample.fq.gz | sed -n 4p | cut -c 21-25
Quality scores are encoded using Phred+33. Quality may vary along the read due to sequencing chemistry.
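As a quick illustration of Phred+33 decoding (a standalone sketch using a made-up quality string, not part of the cluster workflow), each quality character's ASCII code minus 33 is its Q score:

```shell
# Decode a made-up Phred+33 quality string into numeric Q scores.
printf '%s\n' '!!)5?I' | awk '
  BEGIN { for (j = 33; j <= 126; j++) ord[sprintf("%c", j)] = j }  # ASCII lookup table
  {
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      printf "%s=%d%s", c, ord[c] - 33, (i < length($0) ? " " : "\n")
    }
  }'
# → !=0 !=0 )=8 5=20 ?=30 I=40
```

For example, `I` decodes to Q40 (99.99% base-call accuracy) and `5` to Q20 (99%).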
zcat YourSample.fq.gz | awk 'NR%4==2 && length($0) > 10000 {print; exit}'
Long reads are a key advantage of Nanopore sequencing. Read length distribution will later be compared across treatments.
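Before running NanoPlot, you can get a rough feel for the length distribution yourself. This sketch summarizes a tiny made-up FASTQ supplied as a here-doc; in practice you would pipe in `zcat YourSample.fq.gz` instead:

```shell
# Read count, total yield, mean and max length (sequence = every 2nd of 4 lines).
awk 'NR % 4 == 2 { n++; total += length($0); if (length($0) > max) max = length($0) }
     END { printf "reads=%d yield=%d mean=%.1f max=%d\n", n, total, total / n, max }' <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTAC
+
IIIIII
EOF
```

For the toy data above this prints `reads=2 yield=16 mean=8.0 max=10`.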
The original FASTQ files may contain low-quality reads. In this step, we will use NanoPlot to inspect the quality and length of each read.
Make a SLURM script like the one below to run the quality check on your sample file, then submit it.
Review: how to make a SLURM script and run it with sbatch
#!/bin/bash
#SBATCH --job-name=Nanoplot # sensible name for the job
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=nanoplot_before
#SBATCH --account=nn9987k
#SBATCH --time=20:00
##Activate conda environment
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "Working with this $CONDA_PREFIX environment ..."
## run nanoplot
NanoPlot -t 8 --fastq /cluster/projects/nn9987k/BIO326-2025/day1/1-A.fq.gz --plots dot --no_supplementary --no_static --N50 -p before_filter_A1
Run the script by:
sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/Nanoplot.before.euk.SLURM.sh <SAMPLE>
NB! Remember to use your sample ID (for example NB01) as the <SAMPLE> argument.
NanoPlot will generate result files whose names start with “before_filter_”. Let’s look into them…
# Taking too long?
cp /cluster/projects/nn9987k/BIO326-2025/day1/before_filter_A1NanoPlot-report.html before_filter_A1NanoPlot-report.html
Open “before_filter_A1NanoPlot-report.html” on your local computer. Review: File transfer between Sigma2 and your computer
Together, these metrics capture the trade-off between yield and quality before and after filtering.
Now you will filter the obtained reads by length and read quality.
#!/bin/bash
#SBATCH --job-name=Nanofilt # sensible name for the job
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --output=nanofilt
#SBATCH --account=nn9987k
#SBATCH --time=20:00
##Activate conda environment
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "Working with this $CONDA_PREFIX environment ..."
## run nanofilt
gunzip -c /cluster/projects/nn9987k/BIO326-2025/day1/1-A.fq.gz | NanoFilt -q 12 -l 1000 | gzip > cleaned_A1.fq.gz
To run:
sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/NanoFilt.euk.SLURM.sh <SAMPLE>
NB! Remember to use your sample ID for example: NB01
This script activates a Conda environment (your toolkit, pre-assembled by Arturo) on the computing cluster:
module load StdEnv
Loads the standard software environment.
module load Miniconda3/23.10.0-1
Loads Miniconda, a lightweight Conda distribution.
source ${EBROOTMINICONDA3}/bin/activate
Activates the base Miniconda installation.
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
Activates the project-specific Conda environment.
echo "Working with this $CONDA_PREFIX environment ..."
Prints the path of the currently active Conda environment.
-l, Filter on a minimum read length
-q, Filter on a minimum average read quality score
In this case, we are removing reads lower than quality score 12 and shorter than 1000 bases.
If you are ambitious, please adjust the filtering parameters and see how they change the result.
(In that case, do not forget to name the result files differently.)
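Conceptually, the length half of the filter just keeps each 4-line record whose sequence line is long enough. Below is a simplified sketch on a toy inline FASTQ (NanoFilt's `-q` cutoff additionally averages per-base error probabilities, which this sketch omits):

```shell
# Keep only reads whose sequence is at least MINLEN bases (toy data inline).
awk -v MINLEN=8 '
  { rec[NR % 4] = $0 }                     # buffer the current 4-line record
  NR % 4 == 0 {                            # rec[0] now holds the quality line
    if (length(rec[2]) >= MINLEN)          # rec[2] is the sequence line
      print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
  }' <<'EOF'
@long_read
ACGTACGTACGT
+
IIIIIIIIIIII
@short_read
ACGT
+
IIII
EOF
```

Only the `@long_read` record survives the filter.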
Run Nanoplot again on the cleaned sequences, and compare the quality metrics before and after the filtering.
#!/bin/bash
#SBATCH --job-name=Nanoplot # sensible name for the job
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=nanoplot_after
#SBATCH --account=nn9987k
#SBATCH --time=20:00
##Activate conda environment
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "Working with this $CONDA_PREFIX environment ..."
## run nanoplot
NanoPlot -t 8 --fastq cleaned_A1.fq.gz --plots dot --no_supplementary --no_static --N50 -p after_filter_A1
To run:
sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/Nanoplot.cleaned.euk.SLURM.sh <SAMPLE>
NB! Remember to use your sample ID for example: NB01
Open the “after_filter_…NanoPlot-report.html” report for your sample on your local computer.
Do you see differences in the read length and quality distributions before and after filtering? Discuss within and between groups.
Your mission
After running NanoPlot on your dataset (before and after filtering), extract the following summary statistics and enter them into the shared spreadsheet:
Record these values for both:
Use the NanoPlot summary report (.html or .txt) to retrieve the
metrics.
Ensure that you enter values corresponding to your assigned sample and
treatment.
The purpose of this activity is to compare sequencing yield and quality before and after filtering across treatments.
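When filling in the spreadsheet, the percentage of reads retained after filtering is a handy derived metric for comparing treatments (the counts below are made up for illustration):

```shell
# Percent of reads retained after filtering, from before/after read counts.
before_reads=120000   # hypothetical pre-filter count
after_reads=95000     # hypothetical post-filter count
awk -v b="$before_reads" -v a="$after_reads" \
    'BEGIN { printf "reads retained: %.1f%%\n", 100 * a / b }'
# → reads retained: 79.2%
```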
To understand the core logic of read mapping and structural variant calling by running (or inspecting) a minimal long-read pipeline:
1) map reads to a reference genome,
2) generate a sorted/indexed BAM file,
3) call structural variants (SVs) into a VCF file,
4) inspect variant records and interpret key fields (SVTYPE, SVLEN, GT).
#!/bin/bash
#SBATCH --job-name=Sniffles
#SBATCH --mem=12G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=sniffles_%j.out
#SBATCH --account=nn9987k
#SBATCH --time=20:00
########################################
# 🧬 Usage check
########################################
if [ -z "$1" ]; then
echo "❌ ERROR: You must provide an OUTPUT DIRECTORY!"
echo "👉 Usage: sbatch sniffles.sh /path/to/output_directory"
exit 1
fi
OUTDIR=$1
echo "📂 Output directory: $OUTDIR"
# Create directory if it doesn't exist
mkdir -p "$OUTDIR"
########################################
# 🐍 Activate conda environment
########################################
module load StdEnv
module load Miniconda3/23.10.0-1
source ${EBROOTMINICONDA3}/bin/activate
conda activate /cluster/projects/nn9987k/.share/conda_environments/EUK_DRY
echo "🐍 Using conda environment: $CONDA_PREFIX"
echo "🖥️ CPUs: $SLURM_CPUS_PER_TASK"
echo "--------------------------------------"
########################################
# 🧭 Define input files (you can later make these arguments too)
########################################
REF=/cluster/projects/nn9987k/BIO326-2025/day2/Bos_taurus_chr13.fa
READS=/cluster/projects/nn9987k/BIO326-2025/day2/2025_chr13_teaching_20kb.fastq.gz
PREFIX=2025_chr13
########################################
# 🧬 Mapping
########################################
echo "🧬 Mapping with minimap2..."
minimap2 -t $SLURM_CPUS_PER_TASK -ax map-ont \
"$REF" "$READS" \
> ${OUTDIR}/${PREFIX}.sam
echo "✅ Mapping done"
########################################
# 🔄 SAM ➜ BAM
########################################
echo "🔄 Converting to BAM..."
samtools view -@ $SLURM_CPUS_PER_TASK -S -b \
${OUTDIR}/${PREFIX}.sam > ${OUTDIR}/${PREFIX}_temp.bam
echo "📊 Sorting BAM..."
samtools sort -@ $SLURM_CPUS_PER_TASK \
${OUTDIR}/${PREFIX}_temp.bam \
-o ${OUTDIR}/${PREFIX}.bam
echo "📌 Indexing BAM..."
samtools index ${OUTDIR}/${PREFIX}.bam
########################################
# 🧪 Variant Calling
########################################
echo "🧬 Calling structural variants with Sniffles..."
sniffles \
--input ${OUTDIR}/${PREFIX}.bam \
--vcf ${OUTDIR}/${PREFIX}.vcf \
--threads $SLURM_CPUS_PER_TASK
echo "🎉 Variant calling finished!"
echo "📄 VCF file: ${OUTDIR}/${PREFIX}.vcf"
########################################
# 🧹 Cleanup
########################################
echo "🧹 Cleaning temporary files..."
rm -f ${OUTDIR}/${PREFIX}.sam
rm -f ${OUTDIR}/${PREFIX}_temp.bam
echo "✨ Pipeline completed successfully!"
Run:
cd /cluster/projects/nn9987k/$USER && mkdir -p sniffles && cd sniffles && sbatch /cluster/projects/nn9987k/BIO326-2025/scripts_euk_2026/sniffles.SLURM.sh /cluster/projects/nn9987k/$USER/sniffles
Workflow Overview

| Command | Purpose |
|---|---|
| `minimap2 ... > 2025_chr13.sam` | Maps long reads to the reference genome and outputs SAM format. |
| `samtools view ... > 2025_chr13_temp.bam` | Converts SAM to compressed BAM format. |
| `samtools sort ... -o 2025_chr13.bam` | Sorts alignments by coordinate for downstream compatibility. |
| `samtools index ... 2025_chr13.bam` | Creates a BAM index for random genomic access. |
| `sniffles ... --vcf 2025_chr13.vcf` | Detects structural variants and outputs a VCF file. |
Now you have the variant file!
If the job is taking too long, copy the pre-computed VCF into your directory:
cp /cluster/projects/nn9987k/BIO326-2025/data_euk_2026/2025_chr13.vcf 2025_chr13.vcf
Look inside the VCF:
# INFO field
grep '^##' 2025_chr13.vcf | tail -n 20
| Field | Description |
|---|---|
| `MOSAIC` | Structural variation classified as putative mosaic |
| `SVLEN` | Length of the structural variation |
| `SVTYPE` | Type of structural variation (INS, DEL, BND, etc.) |
| `CHR2` | Mate chromosome for BND structural variants |
| `SUPPORT` | Number of reads supporting the structural variation |
| `SUPPORT_INLINE` | Reads supporting INS/DEL without split alignments |
| `SUPPORT_SA` | Reads supporting DEL via supplementary alignments |
| `SUPPORT_LONG` | Soft-clipped reads supporting long insertions |
| `END` | End position of the structural variation |
| `STDEV_POS` | Standard deviation of the variant start position |
| `STDEV_LEN` | Standard deviation of the variant length |
| `COVERAGE` | Coverage at upstream, start, center, end, downstream positions |
| `STRAND` | Strand orientation of supporting reads |
| `AC` | Allele count across all samples |
| `SUPP_VEC` | Per-sample support vector |
| `CONSENSUS_SUPPORT` | Reads supporting the insertion consensus sequence |
| `RNAMES` | Names of supporting reads (if enabled) |
| `VAF` | Variant Allele Fraction |
| `NM` | Mean mismatch rate of supporting reads |
| `PHASE` | Phasing information derived from supporting reads |
# variants
grep -v '^##' 2025_chr13.vcf | more
| Field | Meaning |
|---|---|
| `SVTYPE` | Structural variant type (INS = insertion, DEL = deletion) |
| `SVLEN` | Variant length (negative values indicate deletions) |
| `SUPPORT` | Number of reads supporting the variant |
| `VAF` | Variant allele fraction |
| `GT` | Genotype (0/1 = heterozygous, 1/1 = homozygous alternate) |
| `DR` | Reads supporting the reference allele |
| `DV` | Reads supporting the variant allele |
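To pull these fields out of a record programmatically, you can split the INFO and sample columns. The record below is a simplified illustration built from the Sniffles2.INS.1F6MC locus in Part C (the SUPPORT, DR, and DV values are hypothetical):

```shell
# Extract SVTYPE and SVLEN from INFO (col 8) and GT from the sample column (col 10).
line=$(printf '13\t46710350\tSniffles2.INS.1F6MC\tN\t<INS>\t60\tPASS\tSVTYPE=INS;SVLEN=51;SUPPORT=14\tGT:DR:DV\t0/1:10:14')
echo "$line" | awk -F'\t' '{
  n = split($8, info, ";")
  for (i = 1; i <= n; i++) {
    split(info[i], kv, "=")
    if (kv[1] == "SVTYPE" || kv[1] == "SVLEN") printf "%s=%s ", kv[1], kv[2]
  }
  split($10, fmt, ":")      # sample column, ordered as the FORMAT column GT:DR:DV
  printf "GT=%s\n", fmt[1]
}'
# → SVTYPE=INS SVLEN=51 GT=0/1
```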
Now you have variants! Let’s see which genes are affected by the variants… in Part C!
In this practical, you will explore high-confidence structural variants (SVs) detected from long-read sequencing data in two samples (2024 and 2025).
The dataset has been reduced to chromosome 13 around selected SVs to ensure fast loading in the Integrative Genomics Viewer (IGV).
You will:
- Visualize structural variants in IGV
- Compare genotypes between two samples
- Interpret read-level evidence
- Examine gene overlap
- Investigate biological function
| File Type | Role in IGV |
|---|---|
| FASTA | Defines coordinate system |
| BAM | Shows aligned reads (evidence) |
| VCF | Shows called variants (summary) |
| GTF | Shows gene structure (context) |
The following files are provided:

- Bos_taurus_chr13.fa
- joint_chr13_teaching_20kb.vcf.gz
- 2024_chr13_teaching_20kb.bam
- 2025_chr13_teaching_20kb.bam
- Bos_taurus_chr13.gtf.gz

Below are the structural variants detected on chromosome 13.
Choose a locus you like and investigate it in IGV.
| Chrom | Start | End | ID | Type | Length | Gene |
|---|---|---|---|---|---|---|
| 13 | 1720605 | 1721780 | Sniffles2.DEL.2FMC | DEL | -1174 | ENSBTAG00000008338 |
| 13 | 8238127 | 8238128 | Sniffles2.INS.86MC | INS | 3404 | ENSBTAG00000034441 |
| 13 | 40007605 | 40007755 | Sniffles2.DEL.1B4MC | DEL | -149 | ENSBTAG00000014178 |
| 13 | 46710350 | 46710351 | Sniffles2.INS.1F6MC | INS | 51 | ENSBTAG00000006531 |
| 13 | 47767130 | 47767178 | Sniffles2.DEL.1FEMC | DEL | -47 | ENSBTAG00000008293 |
| 13 | 74258437 | 74258438 | Sniffles2.INS.2ECMC | INS | 114 | ENSBTAG00000017328 |
| 13 | 75795925 | 75795926 | Sniffles2.BND.2F6MC | BND | NA | ENSBTAG00000013114 |
| 13 | 77829615 | 77829701 | Sniffles2.DEL.309MC | DEL | -85 | ENSBTAG00000012626 |
Enter the coordinates of your chosen locus (e.g., 13:46710350) into the IGV search bar.

Select one structural variant locus from the table provided and investigate it in IGV. Report the genomic coordinates, variant type and size, genotype comparison between the 2024 and 2025 samples, whether the variant overlaps any genes, and briefly discuss the potential functional consequence for the affected gene.
Please also see the IGV help tab to understand each symbol: https://igv.org/doc/webapp/#UserGuide/
Example locus1
Sniffles2.INS.1F6MC — 13:46710350–46710351 — 51 bp insertion — ENSBTAG00000006531 (DIP2C)
This locus corresponds to a 51 bp insertion at position 46,710,350 on chromosome 13.
In IGV, the 2024 sample shows continuous read alignment across the site with no consistent breakpoint or inserted sequence, indicating a reference genotype at this position.
In contrast, the 2025 sample displays multiple independent reads containing the same inserted sequence at the identical coordinate, with split alignment patterns and clear insertion labeling in IGV, supporting the presence of the insertion allele.
According to the gene annotation track, the variant lies within the ENSBTAG00000006531 locus, corresponding to DIP2C. If located within an exon, a 51 bp insertion could alter the coding sequence while preserving the reading frame, potentially modifying protein structure.
Example locus2
Sniffles2.BND.2F6MC — 13:75795925–75795926 — BND — ENSBTAG00000013114
At this locus, compare the two samples directly.
In the 2024 sample, read coverage is uniform, mismatches are scattered, and there is no clustering of discordant alignments, indicating absence of a structural rearrangement.
In 2025, focus on the central region where the reads form a dense, colorful block. The vertical colored lines indicate many mismatches concentrated in the same short interval. Multiple reads show discordant alignment in the same position. When many reads change relative to the reference at the exact same coordinates, this is not random sequencing error.
This pattern suggests an inversion or a highly divergent sequence present in 2025 but absent in 2024.
The locus overlaps ENSBTAG00000013114. A breakend event within a gene may disrupt coding sequence continuity or regulatory structure, depending on whether the breakpoint falls within an exon or intron.
That’s it! You are now an expert in genome analysis!
If you are interested in a similar bioinformatics or genomics project for your master’s thesis, please contact our lab: Saitou Lab
(1) “Shouldn’t we map each dataset separately?”
In principle, we could. However, each dataset corresponds to roughly one twentieth of a PromethION flow cell. A typical PromethION flow cell produces on the order of ~40 Gb of data; dividing that by 20 gives approximately 2 Gb per subset.
For a mammalian genome of ~3 Gb, this corresponds to only ~0.7× coverage. In practical terms, this means that only small, scattered portions of the genome are sequenced, while most of the genome is not covered at all. With such sparse coverage, it is generally not possible to make reliable biological inferences. So mapping each subset separately is technically possible if you are curious, but the coverage is too low to draw meaningful conclusions.
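The arithmetic above is simply total sequenced bases divided by genome size; a back-of-envelope sketch using the numbers from the text:

```shell
# Back-of-envelope coverage = (flow cell yield / number of subsets) / genome size.
awk -v yield_gb=40 -v subsets=20 -v genome_gb=3 \
    'BEGIN { per = yield_gb / subsets
             printf "per-subset=%.1f Gb coverage=%.2fx\n", per, per / genome_gb }'
```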
(2) “Why are we using last year’s data?”
To perform proper variant calling, we would need to combine all ~20 subsets and map the merged dataset to the reference genome. However, this requires substantial computational time and additional processing steps. Within the limited time available in class, running a full variant-calling pipeline from raw data to final calls is not realistic. Therefore, we use a stable, pre-prepared dataset so that we can focus on understanding the workflow and its logic (your data will likely be processed later and shown to next year’s students).
(3) “Can we perform de novo assembly instead?”
For a mammalian genome of ~3 Gb, assembly requires much higher coverage. Long-read assembly typically benefits from tens-fold coverage, for example on the order of ~50× depending on the goals (around 5 flow cells).
Using data from a single flow cell, some regions of the genome are not sequenced at all, or some regions are covered by only a single read. Assembly algorithms rely on overlapping reads to connect sequences into longer contiguous segments. When there are too few overlaps, reads cannot be reliably linked together, and large gaps remain. There is not enough information to reconstruct long stretches of the genome.
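How coverage translates into gaps can be sketched with the Lander–Waterman approximation, under which the fraction of the genome with zero coverage is roughly e^(−c) at mean coverage c (an idealized model that assumes uniformly random read placement):

```shell
# Expected uncovered fraction of the genome at mean coverage c (Lander-Waterman).
for c in 0.7 5 30; do
  awk -v c="$c" 'BEGIN { printf "coverage %sx -> ~%.3g%% of the genome uncovered\n", c, 100 * exp(-c) }'
done
```

At ~0.7× roughly half the genome receives no reads at all, while at assembly-grade coverage the expected uncovered fraction is negligible.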
We will instead perform assembly using the bacterial dataset (PROK part), where the genome is much smaller and the same sequencing output provides sufficient coverage.