Conflicting phylogenetic signals are common in plant phylogenomics and often reflect complex evolutionary histories shaped by processes such as hybridization, incomplete lineage sorting, and whole-genome duplication. Understanding these patterns is especially challenging in large, diverse families such as Asteraceae. This study, along with a growing body of phylogenomic research, emphasizes the role of reticulation and whole-genome duplication in the evolution of large, diverse clades, while also underscoring the analytical challenges these processes create. Given the complexity of many of these bioinformatic approaches, this tutorial was developed to guide researchers through the methodological process and is applicable to other study systems. Further advances in bioinformatic methods will continue to enhance theoretical studies in plant evolution.
This tutorial may serve as a template for researchers aiming to investigate the complex evolutionary processes that influence the diversity of life. Researchers may contact moore.erika.r@gmail.com or pllestad@memphis.edu with questions about the methodology. This pipeline implements code in multiple languages, mainly Bash, R, Python, and Julia, and often uses conda to quickly and cleanly install packages on an HPC.
To use conda environments on a Linux system, install miniconda3 following the steps below.
#bash
# copy the bash script needed to install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# run the bash script to actually install
bash ~/Miniconda3-latest-Linux-x86_64.sh
# activate your installer
source ~/miniconda3/bin/activate
# make miniconda3 automatically load every time you join your HPC
conda init --all
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a tool that can spot potential problems in high-throughput sequencing datasets. Input can be raw or trimmed sequence files in fastq or bam format. FastQC produces an html report that summarises the sequence quality of your samples.
Install FastQC using bioconda with the following code:
#bash
# Install program with Bioconda
conda create --name fastqc
conda activate fastqc
conda install -c bioconda fastqc
INPUT
If you have many files, you can run it as a loop. For example, with fastq files:
#bash
for filename in *.fastq
do
fastqc $filename
done
OUTPUT
This html file can be opened through any web browser. A great resource to better understand the report can be found at https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html.
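As an additional quick check before trimming, the number of reads per fastq file can be counted, since each fastq record spans four lines. This is a plain-bash helper, not part of FastQC:

```bash
# Count reads in each fastq file (4 lines per record); an unexpectedly
# low count can flag a truncated file before you invest compute time.
for filename in *.fastq
do
  echo "$filename: $(( $(wc -l < "$filename") / 4 )) reads"
done
```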
Trimmomatic (Bolger et al., 2014) is a read-trimming tool that removes Illumina adapter contamination and low-quality bases. It is important to do this step first to ensure you carry cleaned, trimmed data forward.
Installation instructions can be found at http://www.usadellab.org/cms/?page=trimmomatic.
INPUT
After installation, the .jar file needed to run Trimmomatic should be in a folder such as ‘Trimmomatic-0.36’. Move into the folder containing your sequence files and run Trimmomatic in a loop on all paired-end fastq files (here ending in R1.fastq and R2.fastq) using the following code:
#bash
for fileR1 in *R1.fastq
do
fileR2=`echo ${fileR1} | sed 's/R1/R2/'`
java -jar /PATH/TO/Trimmomatic-0.36/trimmomatic-0.36.jar PE $fileR1 $fileR2 $fileR1.tp.fastq $fileR1.tunp.fastq $fileR2.tp.fastq $fileR2.tunp.fastq ILLUMINACLIP:/PATH/TO/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36
done
One common change made to the script is changing the values for ILLUMINACLIP, LEADING, TRAILING, SLIDINGWINDOW, and MINLEN. Please refer to the GitHub or primary literature to determine which values best fit your data.
OUTPUT
[Optional] Re-run FastQC on the trimmed sequence data to see if the quality improved from trimming!
This tutorial covers different sequence assembly methods including SPAdes (Prjibelski et al., 2020) for genome assembly using target-enriched sequence data, GetOrganelle (Jin et al., 2020) for chloroplast assembly using off-target reads from target-enriched sequence data, and Trinity (Grabherr et al., 2011) for transcriptome assembly of RNA sequence data.
SPAdes (Prjibelski et al., 2020) is a versatile toolkit designed for assembly and analysis of sequencing data, and was primarily developed for Illumina sequencing data.
Installation instructions can be found at https://github.com/ablab/spades.
INPUT
#bash
for trimR1 in `find ./ -name "*R1.fastq.tp.fastq"`
do
trimR2=`echo ${trimR1} | sed 's/R1/R2/'`
/PATH/TO/bin/spades.py -k 21,33,55,77,99 --only-assembler --pe1-1 $trimR1 --pe1-2 $trimR2 -o ./$trimR1.spades_output
done
# rename each SPAdes output folder to just the sample name
for i in *;
do
mv "./$i" "./$(echo $i | grep ".fastq.spades_output" | awk '{split($0,a,"_"); print a[1]}')"
done
# copy all the contig files, renamed by sample, into a folder named ‘contigs’
mkdir ../contigs
for f in *;
do
cp "$f/contigs.fasta" "../contigs/$f.fasta";
done
OUTPUT
More information on output from SPAdes can be found at https://ablab.github.io/spades/output.html.
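Before moving on, it can be useful to sanity-check each assembly. A minimal sketch (plain awk, not part of SPAdes) that reports the contig count and total assembled bases for every fasta in the contigs folder:

```bash
# Report contig count and total bp for each assembly fasta
for f in contigs/*.fasta
do
  awk -v name="$f" '/^>/ {n++} !/^>/ {bp += length($0)} END {print name": "n" contigs, "bp" bp"}' "$f"
done
```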
Trinity (Grabherr et al., 2011) represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.
Installation instructions can be found at https://github.com/trinityrnaseq/trinityrnaseq/wiki.
INPUT
#bash
#run trinity on all trimmed paired end files in a loop
for fileR1 in *R1.tp.fastq
do
fileR2=`echo ${fileR1} | sed 's/R1/R2/'`
Trinity --seqType fq --left $fileR1 --right $fileR2 --CPU 6 --max_memory 20G
done
OUTPUT
Organellar sequences may be extracted from off-target sequence reads and assembled using GetOrganelle (https://github.com/Kinggerm/GetOrganelle) (Jin et al., 2020). Often, there will be insufficient off-target sequence data for complete organellar assembly. The integrity of complete chloroplast assemblies may be assessed using Bandage (Wick et al., 2015), and then annotated using GeSeq (Tillich et al., 2017).
#bash
conda create --name getorganelle
conda activate getorganelle
conda install -c bioconda getorganelle
INPUT
#bash
conda activate getorganelle
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o plastome_output -R 15 -k 21,45,65,85,105 -F embplant_pt
Other parameters may be adjusted to support the success of a complete plastome assembly. We prefer to increase the number of rounds (-R), double the disentangle time limit (--disentangle-time-limit 7200), and possibly decrease the word size (-w) based on the default word size reported in the log file. Refer to https://github.com/Kinggerm/GetOrganelle for more parameter adjustments.
OUTPUT
Check integrity of plastome assembly with Bandage (Wick et al., 2015) using the file “*.selected_graph.gfa” as input. Bandage can be downloaded as a GUI executable for Mac or Windows but can also be run in command-line on Linux systems using the following code from https://github.com/rrwick/Bandage/wiki/Command-line.
#bash
#Upload to your HPC
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip
unzip Bandage_Ubuntu_dynamic_v0_8_1.zip
INPUT
#bash
for file in *.selected_graph.gfa;
do
~/Bandage image $file $file.jpg;
done
OUTPUT
Figure 3.1: Bandage visualization of contig structure of the plastome assembly for Amphoricarpos autariatus. A complete plastome assembly should show this quadripartite structure: a large circle representing the large single-copy region, a small circle representing the small single-copy region, and a connecting line representing the two inverted repeat regions.
GeSeq (Tillich et al., 2017) is a web-based application developed for the rapid and accurate annotation of organellar genomes, in particular chloroplast genomes. Complete circular plastome assemblies may be annotated using GeSeq at https://chlorobox.mpimp-golm.mpg.de/geseq.html and visualized using OGDraw (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html).
Figure 3.2: Annotated chloroplast of Amphoricarpos autariatus produced by OGDraw.
PHYLUCE (Faircloth, 2016) is a software package that determines orthology from targeted loci (e.g., from target-enrichment data). It is more conservative than other orthology-determination programs (e.g., HybPiper; Johnson et al., 2016) because it removes any loci flagged as paralogous. If you have used HybPiper, you can skip down to the Model Selection section using the concatenated gene file output.
#bash
conda create --name phyluce
conda activate phyluce
conda install bioconda::phyluce
INPUT
NOTE: Make sure to change the file paths throughout and set the number of taxa appropriately at each ‘--taxa’ flag when needed.
#bash
#activate the conda environment
conda activate phyluce
# Create empty log folder
mkdir ./log
# Generate *.lastz files for each contig from SPAdes.
python /home/USER/miniconda3/envs/phyluce/bin/phyluce_assembly_match_contigs_to_probes --contigs ./contigs --probes cos_probes.fasta --output ./output --log-path ./log
#The resulting *.lastz files are in the ‘output’ folder. These files only need to be made once; they are probe-specific, not species-specific.
# Create empty output directory, in our case it is named ‘taxon-set-all’
mkdir ./taxon-set-all/
#create the initial list of loci for each taxon
python /home/USER/miniconda3/envs/phyluce/bin/phyluce_assembly_get_match_counts --locus-db ./output/probe.matches.sqlite --taxon-list-config datasets.conf --taxon-group 'subset4' --output ./taxon-set-all/all.conf --incomplete-matrix --log-path ./log
# extract FASTA data that correspond to the loci in all-taxa-incomplete.conf
python /home/USER/miniconda3/envs/phyluce/bin/phyluce_assembly_get_fastas_from_match_counts --contigs ./contigs --locus-db ./output/probe.matches.sqlite --match-count-output ./taxon-set-all/all.conf --incomplete-matrix ./taxon-set-all/all.incomplete --output ./taxon-set-all/all.fasta --log-path ./log
# Explode the monolithic FASTA
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_assembly_explode_get_fastas_file --input ./taxon-set-all/all.fasta --output exploded-fastas
# Then run the below code to get stats
for i in exploded-fastas/*.fasta;
do
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_assembly_get_fasta_lengths --input $i --csv;
done > fasta_lengths.csv
#the resulting csv file has summary stats on the FASTAS
#the column headers are: samples,contigs,total bp,mean length,95 CI length,min length,max length,median length,contigs >1kb
# Alignment without internal trimming BUT does edge trim
#CHANGE TAXON NUMBER
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_seqcap_align --input ./taxon-set-all/all.fasta --output ./taxon-set-all/mafft-nexus-trimmed/ --taxa 11 --aligner mafft --cores 4 --incomplete-matrix --log-path ./log
# Get basic stats on the alignments
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_get_align_summary_data --alignments ./taxon-set-all/mafft-nexus-trimmed/ --cores 4 --log-path ./log
# Remove locus name from alignments
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_remove_locus_name_from_files --alignments ./taxon-set-all/mafft-nexus-trimmed/ --output ./taxon-set-all/mafft-nexus-trimmed-clean/ --log-path ./log
# Generates individual gene matrix files that will be used for the pseudo-coalescent analysis
#CHANGE TAXON NUMBER
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_get_only_loci_with_min_taxa --alignments ./taxon-set-all/mafft-nexus-trimmed-clean --taxa 11 --percent 0 --output ./taxon-set-all/mafft-nexus-trimmed-clean-genes --cores 4 --log-path ./log
# Build the total concatenated data matrix from the gene matrices
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_concatenate_alignments --alignments ./taxon-set-all/mafft-nexus-trimmed-clean-genes --output ./taxon-set-all/mafft-nexus-trimmed-raxml --phylip --log-path ./log
# Converts nexus to phylip-relaxed file format
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_convert_one_align_to_another --alignments ./taxon-set-all/mafft-nexus-trimmed-clean-genes --output ./taxon-set-all/mafft-fasta --input-format nexus --output-format phylip-relaxed --cores 1 --log-path ./log
OUTPUT
(Optional) Extracting data from the sqlite database produced by PHYLUCE. PHYLUCE produces a sqlite database that contains two tables: “matches” and “match_map”. “matches” records which loci were recovered for each taxon, while “match_map” records the name of the contig that matches each locus in each taxon. We can use the following commands to save these tables as csv files.
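A minimal sketch of such an export, assuming the sqlite3 command-line client is available and the database sits at the path used earlier (./output/probe.matches.sqlite):

```bash
# Export both PHYLUCE tables to csv, with column headers
DB=./output/probe.matches.sqlite
sqlite3 -header -csv "$DB" "SELECT * FROM matches;"   > matches.csv
sqlite3 -header -csv "$DB" "SELECT * FROM match_map;" > match_map.csv
```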
These next few steps detail how to run PartitionFinder (Lanfear et al., 2016) to determine which evolutionary model best fits the data; the selected models are then used to generate gene trees.
Installation instructions can be found at www.robertlanfear.com/partitionfinder/tutorial.
INPUT
Using the charsets file, create a configuration (.cfg) file describing the matrix with the following commands:
#bash
#designate the input file. Change if needed
input_file="mafft-nexus-trimmed-raxml.charsets"
#designate the output file. Change if needed
output_file="partition_finder.cfg"
# Start by writing the static part of the output
echo "alignment = mafft-nexus-trimmed-raxml.phylip;" > "$output_file" #Change the alignment name if needed
echo "branchlengths = linked;" >> "$output_file"
echo "models = all;" >> "$output_file"
echo "model_selection = AICc;" >> "$output_file"
echo "[data_blocks]" >> "$output_file"
# Process the input file to extract and format the charset lines
awk '/^charset/ {
split($2, a, "=");
gsub(/charset |\.nexus|'\''/, "", $2); # Remove charset, .nexus, and single quotes
print $2 " = " $4;
}' "$input_file" >> "$output_file"
# Add the [schemes] section to the output
echo "[schemes]" >> "$output_file"
echo "search = rcluster;" >> "$output_file"
Using the newly made .cfg file, run the following code:
#bash
/PATH/TO/PartitionFinder.py /PATH/TO/partition_finder.cfg --raxml --rcluster-max 100 --no-ml-tree
OUTPUT
When analyzing target-enriched sequence data for phylogenetic analyses, multiple approaches may be implemented. For a concatenated approach, all gene sequences are combined into a single matrix. This method assumes that all genes share the same evolutionary history, and therefore may not always be ideal for identifying ancient speciation events where gene trees are highly divergent. This tutorial presents instructions for a concatenated phylogeny using RAxML (Stamatakis, 2014), a maximum-likelihood phylogenetic inference tool. Phylogenetic trees may be visualized with an array of tools, e.g., FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
Installation instructions can be found at https://cme.h-its.org/exelixis/web/software/raxml/cluster.html.
INPUT
Note: Change the -m function depending on the best fitting model (GTR model: GTRCAT; GTR+G model: GTRGAMMA; GTR+I+G model: GTRGAMMAI).
#bash
/PATH/TO/raxmlHPC-PTHREADS-SSE3 -T $SLURM_CPUS_PER_TASK -f a -p 12345 -x 12345 -m GTRGAMMAI -# 100 -s /PATH/TO/mafft-nexus-trimmed-raxml.phylip -n out
OUTPUT
Figure: Phylogenetic tree produced from the RAxML analysis of the concatenated matrix, visualized using FigTree.
In contrast to a concatenated approach, a pseudo-coalescent phylogenetic approach analyzes individual gene trees, accounting for the high probability that different genes likely have different evolutionary histories. This method may be better suited for resolving phylogenetic relationships in the presence of incomplete lineage sorting and hybridization, but is more computationally demanding than concatenated methods. One method to generate a coalescent phylogeny is by using Astral-III (Zhang et al., 2018) on a suite of RAxML (Stamatakis, 2014) gene trees constructed under different evolutionary models.
Installation instructions can be found at https://cme.h-its.org/exelixis/web/software/raxml/cluster.html.
When running RAxML for a pseudo-coalescent (hereafter referred to as “ASTRAL”) analysis, individual loci should be analyzed separately under the most appropriate model as indicated by the PartitionFinder results.
INPUT
To organize all loci files by model type into new directories, the following code can be utilized. In this code, loci files (e.g., uce-*.nexus) will be moved into the model-specific “*_batch_exports” folders.
#bash
#make folders of each model type in main working directory
mkdir astral_raxml
cd astral_raxml
mkdir bestTree
mkdir bootTree
mkdir Astral_try
mkdir GTR
mkdir GTRG
mkdir GTRIG
cd GTR
mkdir GTR_batch_exports
mkdir output
cd ../GTRG
mkdir GTRG_batch_exports
mkdir output
cd ../GTRIG
mkdir GTRIG_batch_exports
mkdir output
#edit the best_scheme.txt file to be used for designating loci files to proper model folder
tail -n +22 best_scheme.txt | sed -n '/^$/q;p' > best_scheme_uce_table.txt
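The extracted table is pipe-delimited with a header row; a hypothetical excerpt looks like this (the R code that follows relies on the ‘Best Model’ and ‘Partition names’ columns):

```text
Subset | Best Model | # sites | subset id | Partition names
1      | GTR+G      | 1024    | a1b2c3    | uce-1000,uce-1005
2      | GTR+I+G    | 592     | d4e5f6    | uce-1007
```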
Move each locus file to its designated model folder using the following R code. Change the uce location paths to match your own directory organization.
#R
df <- read.table(file = "best_scheme_uce_table.txt", header = TRUE, sep = "|", na.strings = "", comment.char = "", quote = "\"", fill = FALSE, nrows = 200000)
df$Best.Model <- gsub(" ", "", df$Best.Model, fixed = TRUE)
df$Partition.names <- gsub(" ", "", df$Partition.names, fixed = TRUE)
GTR <- c(unlist(strsplit(df$Partition.names[which(df$Best.Model == "GTR")],",")))
GTRG <- c(unlist(strsplit(df$Partition.names[which(df$Best.Model == "GTR+G")],",")))
GTRIG <- c(unlist(strsplit(df$Partition.names[which(df$Best.Model == "GTR+I+G")],",")))
model_GTR <- cbind(GTR, rep("GTR", times = length(GTR)))
model_GTRG <- cbind(GTRG, rep("GTRG", times = length(GTRG)))
model_GTRIG <- cbind(GTRIG, rep("GTRIG", times = length(GTRIG)))
model_df <- as.data.frame(rbind(model_GTR, model_GTRG, model_GTRIG))
colnames(model_df) <- c("uce", "model")
uce_original_location <- "/PATH/TO/taxon-set-all/mafft-nexus-trimmed-clean-genes/"
uce_model_location <- "/PATH/TO/astral_raxml/"
files <- list.files(uce_original_location)
for(i in 1:length(files)) {
file_name <- files[i]
file_uce <- gsub(".nexus", "", file_name)
file_model <- model_df$model[which(model_df$uce %in% file_uce)]
file.copy(from = paste(uce_original_location, file_name, sep = ""),to = paste(uce_model_location, file_model, "/", file_model, "_batch_exports", sep = ""))
}
After all the gene files have been moved into their appropriate model choice batch_export/ folder, the file type and ending will need to be changed from .nexus to .fasta. This can be done using the tool ElConcatenero (https://github.com/ODiogoSilva/TriFusion) using the following code.
#bash
python /home/USER/ElConcatenero/ElConcatenero.py -c -if nexus -of fasta -in *.nexus
RAxML can then be run on the individual, aligned gene files in a loop. The -m flag will need to be adjusted depending on the best fitting model (GTR model: GTRCAT; GTR+G model: GTRGAMMA; GTR+I+G model: GTRGAMMAI).
#bash
DIR_I=/PATH/TO/astral_raxml/GTR/GTR_batch_exports/*
for f in $DIR_I
do
echo "Processing $f"
file_name=$(basename $f)
/PATH/TO/raxmlHPC-PTHREADS-SSE3 -T 4 -f a -p 12345 -x 12345 -m GTRCAT -# 100 -s $f -n $file_name -w /PATH/TO/astral_raxml/GTR/output
done
OUTPUT
Output from these RAxML (Stamatakis, 2014) analyses will be used as input for the ASTRAL-III (Zhang et al., 2018) analysis.
Installation instructions can be found at https://github.com/smirarab/ASTRAL.
INPUT
After the output files are produced from RAxML, copy the RAxML_bestTree* and RAxML_bootstrap* files into their own folders. The file endings will then need to be changed and concatenated. The following code may be used to prepare files for the ASTRAL analysis.
#bash
#make new folders
cd /PATH/TO/astral_raxml/
mkdir bestTree
mkdir bootTree
#copy files to new folders
cp /PATH/TO/GTR/output/RAxML_bootstrap* ./bootTree/
cp /PATH/TO/GTR/output/RAxML_bestTree* ./bestTree/
#change fasta extension to .tre extension
cd bestTree
for f in *.fasta; do
mv -- "$f" "${f%.fasta}.tre"
done
cd ../bootTree
for f in *.fasta; do
mv -- "$f" "${f%.fasta}.tre"
done
#concatenate the .tre files
cd ../bestTree
cat RAxML_bestTree* > concat_best.tre
#make a new folder named Astral_try/ and copy the concatenated best tree file into it
cd /PATH/TO/astral_raxml/
mkdir Astral_try
cd Astral_try
cp ../bestTree/concat_best.tre .
(Optional) We have discontinued running bootstrap analyses because the creators of ASTRAL-III state that local posterior probability (LPP) support values are more reliable than bootstrap values; however, the code is provided if you would like to use it. For a bootstrap analysis, you also need to make a bs-files file that lists where the RAxML_bootstrap*.tre files are located (e.g., bootTree/).
#example bs_files
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1000.tre
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1005.tre
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1007.tre
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1008.tre
# remaining bootstrap file locations
Code to make this bs-files is below.
#bash
cd /PATH/TO/astral_raxml/bootTree/
ls *.tre > bs-files
pwd #copy the directory!
sed -i 's|RAxML|/PATH/TO/astral_raxml/bootTree/RAxML|g' bs-files #replace /PATH/TO/astral_raxml/bootTree/ with the directory copied from pwd
mv bs-files ../Astral_try/
ASTRAL-III can now be run with the following code:
#bash
#Run ASTRAL
java -jar /PATH/TO/ASTRAL/astral.5.7.3.jar -i concat_best.tre -o Astral_best.tre 2> astral_best.log
#Generate LPP values
java -jar /PATH/TO/ASTRAL/astral.5.7.3.jar -q Astral_best.tre -i concat_best.tre -o Astral_lpp.tre 2> astral_lpp.log
#Get bootstrap values (optional)
java -jar /PATH/TO/ASTRAL/astral.5.7.3.jar -i concat_best.tre -b bs-files -o Astral_bootcount.tre 2> astral_bootcount.log
OUTPUT
Figure: Phylogenetic tree produced from the ASTRAL-III analysis, visualized using FigTree.
Another tool for generating phylogenetic trees within a coalescent framework is OrthoFinder (Emms and Kelly, 2019). This program identifies orthogroups, infers rooted trees for all orthogroups, identifies gene duplication events, and infers a rooted species tree.
#bash
# Install program with Bioconda
conda create --name orthofinder
conda activate orthofinder
conda install -c bioconda orthofinder
INPUT
#bash
/PATH/TO/OrthoFinder_source/orthofinder.py -f ./pep/ -t 64 -a 16 -M msa -T raxml
OUTPUT
Refer to https://davidemms.github.io/orthofinder_tutorials/exploring-orthofinders-results.html for more detailed description of results.
Phylogenetic trees from chloroplast sequence data can be generated in a number of different ways. This tutorial describes a method for plastome phylogenetic estimation using target-enriched sequencing data. As mentioned above in the plastome assembly section, there is often insufficient off-target sequence data for complete organellar assembly.
#bash
# Install program with Bioconda
conda create --name mapping
conda activate mapping
conda install bioconda::bowtie2
conda install bioconda::samtools
conda install bioconda::bcftools
conda install bioconda::seqtk
This tutorial describes a process of mapping off-target sequence reads to a reference plastome in order to create consensus sequences which may then be aligned and used as input for phylogenetic analyses.
INPUT
First, a reference database will need to be constructed using Bowtie2. This can be done using the following code:
#bash
conda activate mapping
bowtie2-build reference_plastome.fasta reference_plastome
With the resulting reference database, the paired-end target-enriched sequence reads may be mapped onto the reference plastome. To do this in a loop with all reads with file endings “_1_tp.fastq” and “_2_tp.fastq”, use the following code:
#bash
conda activate mapping
#in your working directory, make folders for output of this pipeline
mkdir sam_files
mkdir bam_files
mkdir fastq
mkdir consensus_fastas
#run the pipeline in a loop on all paired files
for fileR1 in *_1_tp.fastq
do
fileR2=`echo ${fileR1} | sed 's/_1_/_2_/'`
filename=`echo ${fileR1} | sed 's/_1_.*//'`
#map reads to reference plastome
bowtie2 --very-sensitive -p 24 -x /PATH/TO/reference_plastome -1 "$fileR1" -2 "$fileR2" -S /PATH/TO/sam_files/"$filename".sam
#convert sam files to bam files
cat /PATH/TO/sam_files/"$filename".sam | samtools view -bS - | samtools sort -o /PATH/TO/bam_files/"$filename".bam
#Get consensus fastq file
bcftools mpileup -f /PATH/TO/reference_plastome.fasta /PATH/TO/bam_files/"$filename".bam | bcftools call -c | vcfutils.pl vcf2fq > /PATH/TO/fastq/"$filename"_cns.fastq
#vcfutils.pl is part of bcftools
#Convert .fastq to .fasta
seqtk seq -a /PATH/TO/fastq/"$filename"_cns.fastq > /PATH/TO/consensus_fastas/"$filename"_cns.fasta
done
The scaffold names need to be changed to match the name of the sample.
#R
files <- list.files(pattern = "fasta")
for (i in 1:length(files)) {
# set the file name
f <- files[i]
# load the fasta
fasta <- readLines(f)
# replace the scaffold name with the sample id
id <- gsub("_cns.fasta", "", f)
new_scaf_name <- paste0(">", id)
fasta[1] <- new_scaf_name
# overwrite the consensus file with the new scaffold name
write.table(fasta, paste0(id, "_cns.fasta"), row.names = F, col.names = F, quote = F)
}
# quit R
quit(save = "no")
Now all of the consensus files can be concatenated and aligned for phylogenetic analysis. In this tutorial, MAFFT (Katoh and Standley, 2013) will be used to align sequences.
Installation instructions can be found at https://mafft.cbrc.jp/alignment/software/source.html.
Concatenate all sequences and align them using MAFFT.
#bash
cd /PATH/TO/consensus_fastas
cat *.fasta > all_cns.fasta
mafft all_cns.fasta > all_cns_aligned.fasta
Phylogenetic analyses may be run on the aligned sequences in multiple ways. One user-friendly option is IQ-TREE (Minh et al., 2020). Comprehensive instructions are provided at http://www.iqtree.org/doc/iqtree-doc#beginners-tutorial.
#bash
conda create --name iqtree
conda activate iqtree
conda install -c bioconda iqtree
INPUT
#bash
conda activate iqtree
iqtree -s all_cns_aligned.fasta -B 1000
OUTPUT
Figure: Phylogenetic tree produced from the IQ-TREE analysis of the aligned plastome consensus sequences, visualized using FigTree.
Resulting phylogenetic trees may be visualized by an array of tools, like FigTree (http://tree.bio.ed.ac.uk/software/figtree/) or through R packages such as phytools (Revell, 2012) and ape (Paradis and Schliep, 2019). This tutorial describes how to compare and visualize phylogenetic trees to identify incongruent or divergent evolutionary histories using phytools and ape.
Installation instructions are located at https://evomics.org/resources/software/molecular-evolution-software/figtree/.
#R
install.packages("phytools")
install.packages("ape")
INPUT
#R
library(phytools)
library(ape)
tree1 <- read.tree("concatenated.tre")
tree2 <- read.tree("coalescent.tre")
#root both trees to the outgroup
tree1 <- root(tree1,"outgroup", resolve.root = T)
tree2 <- root(tree2,"outgroup", resolve.root = T)
#compare both trees
obj<-cophylo(tree1, tree2, fn=function(x) abs(x),print=TRUE, use.edge.length = TRUE)
#plot a tanglegram comparing both trees
pdf("concat_v_coal_tree.pdf")
plot(obj,link.type="curved",link.lwd=2,link.lty="solid",
link.col=make.transparent("blue",0.25),fsize=0.6)
title(main= "Concatenated vs. Coalescent Nuclear Tree", cex.main= .75, line = -.79)
dev.off()
For more detailed instructions on how to compare phylogenetic trees, refer to http://blog.phytools.org/2016/08/empirical-co-phylogenetic-cophylo.html.
OUTPUT
Figure 4.1: Comparison of concatenated and coalescent trees produced using phytools and ape in R.
Phylogenomic incongruence is routinely observed in phylogenomic studies and typically results from underlying gene tree discordance. Below are tutorials on investigating the extent of gene tree discordance among phylogenies using PhyParts (Smith et al., 2015), which summarizes how many gene trees agree or disagree with the final species topology, and Quartet Sampling (Pease et al., 2018), which uses quartets to investigate the cause of low support at each node.
PhyParts (Smith et al., 2015), along with PhypartsPieCharts (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts), are tools that indicate the percentage of concordant gene trees, percentage in the top alternative bipartition, other conflicting topologies, and missing and uninformative genes as pie charts on the nodes of a species phylogeny. This provides a visualization of discordant nodes resulting from gene tree conflict. New python scripts have also been produced that show missing and uninformative data as two separate pie slices (see https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/), providing an even better visualization of the underlying data. Below we will show how to run PhyParts on the ASTRAL tree with the original PhypartsPieCharts script, but also with this new python script that has the extra slice.
To prepare the input data for PhyParts, Phyx (Brown et al., 2017) can be used to root the species and gene trees.
Installation instructions are located at https://github.com/FePhyFoFum/phyx.
INPUT
Phyx (Brown et al., 2017) is a very helpful phylogenetic toolkit with many functions. For the purposes of running PhyParts, only the ‘rerooting and unrooting trees’ function, pxrr, is needed. Designating the proper taxon or taxa as outgroups for rooting, the following code may be used:
#bash
#for a single taxon rooting
pxrr -t Astral_best.tre -g outgroup_taxon > Astral_best_rooted.tre
#for rooting to multiple taxa
pxrr -t Astral_best.tre -g outgroup_taxon1,outgroup_taxon2,outgroup_taxon3,… > Astral_best_rooted.tre
#running pxrr on the gene trees in a loop
for t in *.tre
do
pxrr -t $t -g outgroup_taxon > $t.rooted.tre
done
#Not all gene trees will include your outgroup taxon, so the number of gene trees may decrease from the original starting point. Empty gene-tree files must be removed. A quick one-liner for this is:
find . -maxdepth 1 -type f -empty -print -delete
OUTPUT
General installation instructions can be found at https://bitbucket.org/blackrim/phyparts/src/master/; however, code is provided below to make a conda environment called ‘phyparts’. Additionally, the repository needs to be cloned to obtain the python scripts used to run PhyParts.
#bash
conda create -n phyparts python=2.7
conda activate phyparts
pip install ete3
pip install matplotlib
git clone https://bitbucket.org/blackrim/phyparts.git
INPUT
#bash
java -jar /PATH/TO/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -v -d /PATH/TO/phyluce/PhyParts/root/ -m /PATH/TO/Astral_best_rooted.tre -o output
OUTPUT
Detailed information on the output of PhyParts can be found at https://bitbucket.org/blackrim/phyparts/src/master/.
To create figures for publication, Matt Johnson (Texas Tech University) generated a visualization script, ‘phypartspiecharts.py’ (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts), that produces svg images.
INPUT
#bash
#copy the python code from GitHub
git clone https://github.com/mossmatters/phyloscripts.git
#activate conda environment
conda activate phyparts
#combine all outfiles from PhyParts
cat *.concord.node.* > phyparts.concord.tre
cat *.conflict.node.* > phyparts.conflict.tre
cat *.hist > phyparts.hist
cat *.hist.alts > phyparts.hist.alts
cat *.node.key > phyparts.node.key
cat *.concon.tre > phyparts.concon.tre
#Now you just need the rooted species tree (Astral_best_rooted.tre), what the different outfiles can be referred to as (e.g., phyparts), and knowledge on how many gene trees were used (e.g., 1000 gene trees).
python /PATH/TO/phyluce/PhyParts/phypartspiecharts.py /PATH/TO/phyluce/PhyParts/Astral_best_rooted.tre phyparts 1000 --svg_name phyparts_piechart.svg
OUTPUT
Figure 5.1: Phylogenetic discordance visualized using PhyParts
Alternatively, you can run a newer python script from Diego Morales-Briones, ‘phypartspiecharts_missing_uninformative.py’ from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/, to separate the missing and uninformative data into two separate pie slices. To do this, you need to re-run PhyParts as two separate runs. Additionally, you need to use the bootstrap gene trees instead of the best trees and root them with Phyx as described above.
INPUT
#bash
java -jar /PATH/TO/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 0 -d /astral_raxml/bootTree/root/ -m Astral_best_rooted.tre -o concat50_noBS
java -jar /PATH/TO/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -s 50 -d /astral_raxml/bootTree/root/ -m Astral_best_rooted.tre -o concat50_BSfull #NOTE: 50 is for 50 BS support. You can alter this number or use different support if using LPP or PP to better fit your data.
From there, open R and follow the script below, which modifies the *.concon.tre file.
#R
#load the ape package, which provides read.tree and write.tree
library(ape)
#Read in trees
read.tree("astral50_BSfull.concon.tre")->bs_full_concordance
read.tree("astral50_noBS.concon.tre")->No_bs
#run code
total_no_bs<-No_bs[[1]] # get a tree to add total node numbers
total_no_bs$node.label<-mapply("+", as.numeric(No_bs[[1]]$node.label), as.numeric(No_bs[[2]]$node.label)) #sum the concordant and conflicting counts to get the total number of informative gene trees at each node
total_no_bs$node.label[is.na(total_no_bs$node.label)] <- "" #remove NA values
total_no_bs$node.label[total_no_bs$node.label=="0"]<-"" #remove 0 values to avoid division by zero
append(bs_full_concordance, total_no_bs, after=2)-> full_concordance_and_total_nodes
#write tree
write.tree(full_concordance_and_total_nodes, file = "astral50_all_homologs.concon.tre") #write tree - this will replace the original *.concon.tre file.
Upload this new .concon.tre file to where you ran PhyParts and rename it to match the original run.
#bash
#rename tree to match the old run
mv astral50_all_homologs.concon.tre astral50_BSfull.concon.tre
#Run the new python script on the data, following the same guidelines as phypartspiecharts.py:
python phypartspiecharts_missing_uninformative.py Astral_best_rooted.tre astral50_BSfull 1000 #change 1000 to however many gene trees were used as input
OUTPUT
Figure 5.2: Phylogenetic discordance visualized using PhyParts with the new python script, which adds an additional pie slice separating uninformative from informative genes
Quartet Sampling (QS) (Pease et al., 2018) is a quartet-based method designed to assess the confidence, consistency, and informativeness of internal tree relationships, as well as the reliability of each terminal branch. QS determines whether a phylogeny lacks support due to insufficient information, discordance from ILS or introgression, or misplaced or erroneous taxa. Three scores (QC, QD, QI) are calculated for each internal branch of the focal tree, which together allow different interpretations of the branches, along with a Quartet Fidelity (QF) score that reports how frequently a taxon is included in concordant topologies. QS efficiently synthesizes phylogenetic support tests and offers more comprehensive and specific information on branch support than conventional measures (e.g., bootstrap).
#bash
#create a conda environment for installation
conda create -n quartetsampling numpy
conda activate quartetsampling
conda install -c bioconda python
conda install -c bioconda raxml-ng
#clone the GitHub to obtain the python scripts to run Quartet Sampling
git clone https://github.com/FePhyFoFum/quartetsampling.git
INPUT
Quartet sampling may be run using the following code with the default parameter values. For more information on adjusting the parameters see https://github.com/FePhyFoFum/quartetsampling/blob/master/quartetsampling.pdf.
#bash
python /PATH/TO/quartetsampling/pysrc/quartet_sampling.py --tree /PATH/TO/Astral_best.tre --align /PATH/TO/mafft-nexus-trimmed-raxml.phylip --reps 100 --threads 4 --lnlike 2
OUTPUT
Each file contains useful information about the different scores generated by QS. The final tree file, RESULT.labeled.tre.figtree, provides all values (QC, QI, QD, and QF). This file can be put into FigTree to visualize discordance.
Figure 5.3: Quartet Sampling results of ML concatenated tree. The Quartet Concordance (QC), Quartet Differential (QD), and Quartet Informativeness (QI) scores are found at each node as QC/QD/QI.
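As a rough aid to reading these scores, the helper below sketches one way to turn a branch's QC/QD/QI values into a verbal summary. The thresholds are illustrative choices of ours, not values prescribed by Pease et al. (2018); consult the original paper before interpreting real data.

```python
def interpret_branch(qc, qd, qi):
    """Illustrative verbal summary of one internal branch's QS scores.
    Thresholds are arbitrary examples, not from Pease et al. (2018)."""
    notes = []
    if qc > 0.2:
        notes.append("most quartets agree with the focal branch (high QC)")
    elif qc >= -0.05:
        notes.append("support is split among topologies (QC near 0)")
    else:
        notes.append("a conflicting topology is sampled more often (negative QC)")
    if qc <= 0.2:  # QD matters mainly when there is appreciable discordance
        if qd < 0.5:
            notes.append("discordance skewed toward one alternative (low QD), "
                         "a pattern sometimes associated with introgression")
        else:
            notes.append("discordance balanced between alternatives (high QD), "
                         "as expected under ILS")
    if qi < 0.5:
        notes.append("many replicates were uninformative (low QI)")
    return "; ".join(notes)

print(interpret_branch(0.9, 1.0, 0.97))   # a well-supported branch
print(interpret_branch(-0.3, 0.1, 0.9))   # conflict skewed toward one alternative
```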
Phylogenetic trees represent evolutionary relationships as a branching structure; in reality, however, evolutionary histories are more complex and include processes such as hybridization. Unlike trees, phylogenetic networks can model reticulation by incorporating nodes that represent gene flow or hybridization events, and may offer a more accurate representation of evolutionary histories. To investigate ancient hybridization, this tutorial provides instructions for using PhyloNetworks (Solís-Lemus et al., 2017), a Julia package developed to facilitate the inference of networks from genetic data and the manipulation, visualization, and analysis of network features.
PhyloNetworks (Solís-Lemus et al., 2017) is computationally very intensive. Given this, if you have a large tree it is often suggested to subset your species and gene trees into smaller trees and run those individually. For more detailed information on running and visualizing phylogenetic networks see https://juliaphylo.github.io/PhyloNetworks.jl/stable/.
#julia
julia
using Pkg
Pkg.add("PhyloNetworks")
Pkg.add("RCall")
Pkg.add("PhyloPlots")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("StatsModels")
Pkg.add("Gadfly")
Pkg.add("Plots")
INPUT
#julia
julia
using PhyloNetworks
using PhyloPlots
using CSV
using DataFrames
using StatsModels
using Gadfly
using RCall
using Plots
#load your gene tree matrix and species tree
genetrees = readMultiTopology("concat_best.tre")
q,t = countquartetsintrees(genetrees);
df = writeTableCF(q,t)
CSV.write("tableCF.csv", df);
raxmlCF = readTableCF("tableCF.csv")
astraltree = readMultiTopology("Astral_lpp.tre")
T=readTopologyLevel1("Astral_lpp.tre")
#start with 0-10 potential hybridizations on your PhyloNetworks run.
net0 = snaq!(astraltree[1], raxmlCF, hmax=0, filename="net0", seed=1234)
net1 = snaq!(net0, raxmlCF, hmax=1, filename="net1", seed=2345)
net2 = snaq!(net1, raxmlCF, hmax=2, filename="net2", seed=3456)
net3 = snaq!(net2, raxmlCF, hmax=3, filename="net3", seed=4567)
net4 = snaq!(net3, raxmlCF, hmax=4, filename="net4", seed=5678)
net5 = snaq!(net4, raxmlCF, hmax=5, filename="net5", seed=6789)
net6 = snaq!(net5, raxmlCF, hmax=6, filename="net6", seed=6789)
net7 = snaq!(net6, raxmlCF, hmax=7, filename="net7", seed=6789)
net8 = snaq!(net7, raxmlCF, hmax=8, filename="net8", seed=6789)
net9 = snaq!(net8, raxmlCF, hmax=9, filename="net9", seed=6789)
net10 = snaq!(net9, raxmlCF, hmax=10, filename="net10", seed=6789)
#determine the best number of hybridizations by plotting the network scores
#reload output files
net0 = readTopology("net0.out")
net1 = readTopology("net1.out")
net2 = readTopology("net2.out")
net3 = readTopology("net3.out")
net4 = readTopology("net4.out")
net5 = readTopology("net5.out")
#make a list of the scores
scores = [net0.loglik, net1.loglik, net2.loglik, net3.loglik, net4.loglik, net5.loglik]
#plot the scores and save
R"pdf"("score-vs-h.pdf", width=4, height=4);
R"plot"(x=0:5, y=scores, type="b", xlab="number of hybridizations h", ylab="network score");
R"dev.off"();
After plotting the scores, the optimal number of hybridizations can be estimated by identifying the value of h with the lowest network score (the score is a negative log pseudolikelihood, so lower is better). If the scores plateau, choose the first value at the plateau.
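Because adding a hybridization can only improve the fit, the practical stopping rule is the first h at which the improvement becomes negligible. A minimal sketch of that rule (the tolerance and the score values below are made up for illustration, not real SnaQ output):

```python
def first_plateau(scores, tol=1.0):
    """Return the first index h at which the score improvement drops below tol.
    scores[h] is the network score for hmax = h; lower is better.
    The tolerance is a hypothetical cutoff you should tune for your data."""
    for h in range(len(scores) - 1):
        if scores[h] - scores[h + 1] < tol:
            return h
    return len(scores) - 1

# Illustrative scores: a large drop from h=0 to h=1, then a plateau
scores = [4120.5, 3890.2, 3889.8, 3889.7, 3889.6, 3889.6]
print(first_plateau(scores))  # 1
```

Visual inspection of the score-versus-h plot remains the recommended check; this rule only formalizes the "first value at the plateau" heuristic.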
After identifying the best number of hybridizations without bootstraps, PhyloNetworks can be run with bootstraps. This step is the computational bottleneck and can take days to months to finish.
#julia
julia
using PhyloNetworks
using PhyloPlots
using CSV
using DataFrames
using StatsModels
using Gadfly
using RCall
using Plots
#load your gene tree matrix and species tree
net2 = readTopology("net2.out")
bootTrees = readBootstrapTrees("bs-files")
bootnet2 = bootsnaq(net2, bootTrees, hmax=2, nrep=100, runs=10, filename="bootsnaq2_raxmlboot", seed=2345)
The final phylonetwork can be visualized using:
#julia
julia
using PhyloNetworks, PhyloPlots, RCall
#read in network and bootstrap files
net2 = readTopology("net2.out")
bootnet2 = readMultiTopology("bootsnaq2_raxmlboot.out")
#visualize the network with edge and node labels
imagefilename = "phylonetwork_edges_nodes.pdf"
R"pdf"(imagefilename, width=8, height=6)
plot(net2, showedgenumber=true, shownodenumber=true);
R"dev.off()"
#root the network on the edge leading to the outgroup (edge=3)
rootonedge!(net2, 3)
#visualize the rooted network with edge and node labels
imagefilename = "phylonetwork_edges_nodes_rooted.pdf"
R"pdf"(imagefilename, width=8, height=6)
plot(net2, showedgenumber=true, shownodenumber=true);
R"dev.off()"
#rotate the branches so that hybrid edges do not cross (here, nodes 11 and -5)
rotate!(net2, 11)
rotate!(net2, -5)
#visualize the rooted, rotated network with hybrid edge values
imagefilename = "phylonetwork_rooted_rotated.pdf"
R"pdf"(imagefilename, width=16, height=8)
plot(net2, showgamma=true);
R"mtext"("Phylonetwork Net2", line=-1, cex =2);
R"dev.off()"
#add bootstrap support
BSe_tree_net2, tree1 = treeEdgesBootstrap(bootnet2, net2)
imagefilename = "net2_bs.pdf"
R"pdf"(imagefilename, width=16, height=8)
plot(net2, edgelabel = BSe_tree_net2, showgamma=true, edgecolor="black");
R"mtext"("Phylonetwork Net2", line=-1, cex =2);
R"dev.off()"
Figure 6.1: Phylonetwork
Multiple methods may be used to infer whole-genome duplication (WGD) events in the evolutionary history of organisms. Most often, Ks-based or synteny-based methods are used. This tutorial provides a guide for Ks-based analyses to infer WGD from transcriptomic data.
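To make the logic of a Ks-based analysis concrete, the toy simulation below (entirely synthetic numbers, not real data) mimics the typical pattern: paralog pairs from small-scale duplications produce a Ks distribution that decays away from zero, while a WGD superimposes a peak at the Ks value corresponding to the event's age, which even a simple histogram can pick out.

```python
import random

# Toy sketch of a Ks (synonymous substitution) distribution.
# All parameters are invented for illustration.
random.seed(42)
background = [random.expovariate(2.0) for _ in range(2000)]   # small-scale duplications, decaying from 0
wgd_burst = [random.gauss(1.0, 0.12) for _ in range(800)]     # simulated WGD peak near Ks ~ 1
ks = [x for x in background + wgd_burst if 0.05 < x < 3.0]    # trim saturated/near-zero values

# Fixed-width histogram; a secondary mode away from zero hints at a WGD
width = 0.1
bins = {}
for v in ks:
    b = int(v / width)
    bins[b] = bins.get(b, 0) + 1

# Look for the tallest bin beyond the initial decay (Ks > 0.5)
peak_bin = max((b for b in bins if b * width > 0.5), key=lambda b: bins[b])
print(f"secondary Ks peak near Ks = {peak_bin * width:.1f}")
```

Real analyses (e.g., the ELMM step in wgd v2 below) fit mixture models rather than eyeballing a histogram, but the underlying signal is the same secondary peak.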
wgd v2 (Chen et al., 2024) is a package used to infer and date ancient whole-genome duplication (WGD) events. This tutorial uses wgd v2 exclusively to generate Ks plots that estimate the evolutionary age of WGD events from transcriptome data that have first been translated into coding sequences using TransDecoder (https://github.com/TransDecoder/TransDecoder).
Installation instructions can be found at https://github.com/TransDecoder/TransDecoder. Alternatively, TransDecoder can be run through Singularity.
#bash
wget https://data.broadinstitute.org/Trinity/CTAT_SINGULARITY/MISC/TransDecoder/transdecoder.v5.7.1.simg
#to run with singularity, enter this before the command
singularity exec -e ~/TransDecoder/transdecoder.v5.7.1.simg
INPUT
#bash
#extract the long open reading frames
TransDecoder.LongOrfs -t assembled_transcripts.fasta --output_dir output
#predict the likely coding regions
TransDecoder.Predict -t assembled_transcripts.fasta --output_dir output
OUTPUT
The wgd v2 repository can be cloned from GitHub and installed inside a conda environment.
#bash
#clone repository
git clone https://github.com/heche-psb/wgd
#create conda environment and install programs
conda create -n wgdv2 python=3.8
conda activate wgdv2
pip install numpy==1.19.0
pip install wgd==2.0.38
conda install diamond #future dependencies
conda install bioconda::mafft #future dependencies
conda install bioconda::mcl #future dependencies
INPUT
#bash
#Construction of age distribution and ELMM analysis
wgd dmd transcripts.transdecoder.cds -o wgd_dmd
wgd ksd wgd_dmd/transcripts.transdecoder.cds.tsv transcripts.transdecoder.cds -o wgd_ksd
wgd viz -d wgd_ksd/transcripts.transdecoder.cds.tsv.ks.tsv -o wgd_ELMM
OUTPUT
Figure 7.1: Ks plot for Schlechtendalia luzulifolia produced using wgd v2
Citations of all R packages used to generate this report.
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.29. 2024. https://github.com/rstudio/rmarkdown.
[2] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.
[3] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2023. https://www.R-project.org/.
[4] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC, 2016. ISBN: 978-1138700109. https://bookdown.org/yihui/bookdown.
[5] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.41. 2024. https://github.com/rstudio/bookdown.
[6] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.
[7] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.
[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.49. 2024. https://yihui.org/knitr/.
[9] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.
[10] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.