Conflicting phylogenetic signals are common in plant phylogenomics and often reflect complex evolutionary histories shaped by processes such as hybridization, incomplete lineage sorting, and whole-genome duplication. Understanding these patterns is especially challenging in large, diverse families such as Asteraceae. This study, along with a growing body of phylogenomic research, emphasizes the role of reticulation and whole-genome duplication in the evolution of large, diverse clades, while also underscoring the analytical challenges these processes create. Given the complexity of many of these bioinformatic approaches, this tutorial was developed to guide researchers through the methodological process and is applicable to other study systems. Further advances in bioinformatic methods will continue to enhance theoretical studies in plant evolution.
This tutorial may serve as a template for researchers aiming to investigate the complex evolutionary processes that influence the diversity of life. Researchers may contact moore.erika.r@gmail.com or pllestad@memphis.edu with questions about the methodology. This pipeline implements code in multiple languages, mainly Bash, R, Python, and Julia, and often uses conda to quickly and cleanly install packages on an HPC.
To use conda environments on a Linux system, install miniconda3 following the steps below.
#bash
# copy the bash script needed to install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# run the bash script to actually install
bash ~/Miniconda3-latest-Linux-x86_64.sh
# activate your installer
source ~/miniconda3/bin/activate
# make miniconda3 automatically load every time you join your HPC
conda init --all
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a tool that can spot potential problems in high-throughput sequencing datasets. Input can be raw or trimmed sequence files in fastq or bam format. FastQC produces an html report that summarises the sequence quality of your samples.
Install FastQC using bioconda with the following code:
#bash
# Install program with Bioconda
conda create --name fastqc
conda activate fastqc
conda install -c bioconda fastqc
INPUT
If you have many files, you can run it as a loop. For example, with fastq files:
#bash
for filename in *.fastq
do
fastqc $filename
done
OUTPUT
This html file can be opened through any web browser. A great resource to better understand the report can be found at https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html.
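As an additional quick check before trimming, the number of reads per fastq file can be counted, since each fastq record spans four lines. This is a plain-bash helper, not part of FastQC:

```bash
# Count reads in each fastq file (4 lines per record); an unexpectedly
# low count can flag a truncated file before you invest compute time.
for filename in *.fastq
do
  echo "$filename: $(( $(wc -l < "$filename") / 4 )) reads"
done
```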
Trimmomatic (Bolger et al., 2014) is a read-trimming tool that removes Illumina adapter contamination and low-quality bases. It is important to do this step first to ensure you carry cleaned, trimmed data forward.
Installation instructions can be found at http://www.usadellab.org/cms/?page=trimmomatic.
INPUT
After installation, the .jar file needed to run Trimmomatic should be in a folder such as ‘Trimmomatic-0.36’. Move into the folder containing your sequence files and run Trimmomatic in a loop on all paired-end fastq files (here ending in R1.fastq and R2.fastq) using the following code:
#bash
for fileR1 in *R1.fastq
do
fileR2=`echo ${fileR1} | sed 's/R1/R2/'`
java -jar /PATH/TO/Trimmomatic-0.36/trimmomatic-0.36.jar PE $fileR1 $fileR2 $fileR1.tp.fastq $fileR1.tunp.fastq $fileR2.tp.fastq $fileR2.tunp.fastq ILLUMINACLIP:/PATH/TO/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:36
done
One common change made to the script is changing the values for ILLUMINACLIP, LEADING, TRAILING, SLIDINGWINDOW, and MINLEN. Please refer to the GitHub or primary literature to determine which values best fit your data.
OUTPUT
[Optional] Re-run FastQC on the trimmed sequence data to see if the quality improved from trimming!
This tutorial covers different sequence assembly methods including SPAdes (Prjibelski et al., 2020) for genome assembly using target-enriched sequence data, GetOrganelle (Jin et al., 2020) for chloroplast assembly using off-target reads from target-enriched sequence data, and Trinity (Grabherr et al., 2011) for transcriptome assembly of RNA sequence data.
SPAdes (Prjibelski et al., 2020) is a versatile toolkit designed for assembly and analysis of sequencing data, and was primarily developed for Illumina sequencing data.
Installation instructions can be found at https://github.com/ablab/spades.
INPUT
#bash
for trimR1 in `find ./ -name "*R1.fastq.tp.fastq"`
do
trimR2=`echo ${trimR1} | sed 's/R1/R2/'`
/PATH/TO/bin/spades.py -k 21,33,55,77,99 --only-assembler --pe1-1 $trimR1 --pe1-2 $trimR2 -o ./$trimR1.spades_output
done
# rename each SPAdes output folder to just the sample name
for i in *;
do
mv "./$i" "./$(echo $i | grep ".fastq.spades_output" | awk '{split($0,a,"_"); print a[1]}')"
done
# copy all the contig files, renamed by sample, into a folder named ‘contigs’
mkdir ../contigs
for f in *;
do
cp "$f/contigs.fasta" "../contigs/$f.fasta";
done
OUTPUT
More information on output from SPAdes can be found at https://ablab.github.io/spades/output.html.
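Before moving on, it can be useful to sanity-check each assembly. A minimal sketch (plain awk, not part of SPAdes) that reports the contig count and total assembled bases for every fasta in the contigs folder:

```bash
# Report contig count and total bp for each assembly fasta
for f in contigs/*.fasta
do
  awk -v name="$f" '/^>/ {n++} !/^>/ {bp += length($0)} END {print name": "n" contigs, "bp" bp"}' "$f"
done
```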
Trinity (Grabherr et al., 2011) represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.
Installation instructions can be found at https://github.com/trinityrnaseq/trinityrnaseq/wiki.
INPUT
#bash
#run trinity on all trimmed paired end files in a loop
for fileR1 in *R1.tp.fastq
do
fileR2=`echo ${fileR1} | sed 's/R1/R2/'`
Trinity --seqType fq --left $fileR1 --right $fileR2 --CPU 6 --max_memory 20G
done
OUTPUT
Organellar sequences may be extracted from off-target sequence reads and assembled using GetOrganelle (https://github.com/Kinggerm/GetOrganelle) (Jin et al., 2020). Often, there will be insufficient off-target sequence data for complete organellar assembly. The integrity of complete chloroplast assemblies may be assessed using Bandage (Wick et al., 2015), and then annotated using GeSeq (Tillich et al., 2017).
#bash
conda create --name getorganelle
conda activate getorganelle
conda install -c bioconda getorganelle
INPUT
#bash
conda activate getorganelle
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o plastome_output -R 15 -k 21,45,65,85,105 -F embplant_pt
Other parameters may be adjusted to support the success of a complete plastome assembly. We prefer to increase the number of rounds (-R), double the disentangle time limit (--disentangle-time-limit 7200), and possibly decrease the word size (-w) based on the default word size reported in the log file. Refer to https://github.com/Kinggerm/GetOrganelle for more parameter adjustments.
OUTPUT
Check integrity of plastome assembly with Bandage (Wick et al., 2015) using the file “*.selected_graph.gfa” as input. Bandage can be downloaded as a GUI executable for Mac or Windows but can also be run in command-line on Linux systems using the following code from https://github.com/rrwick/Bandage/wiki/Command-line.
#bash
#Upload to your HPC
wget https://github.com/rrwick/Bandage/releases/download/v0.8.1/Bandage_Ubuntu_dynamic_v0_8_1.zip
unzip Bandage_Ubuntu_dynamic_v0_8_1.zip
INPUT
#bash
for file in *.selected_graph.gfa;
do
~/Bandage image $file $file.jpg;
done
OUTPUT
Figure 3.1: Bandage visualization of contig structure of the plastome assembly for Amphoricarpos autariatus. A complete plastome assembly should show this quadripartite structure: a large circle representing the large single-copy region, a small circle representing the small single-copy region, and a connecting line representing the two inverted repeat regions.
GeSeq (Tillich et al., 2017) is a web-based application developed for the rapid and accurate annotation of organellar genomes, in particular chloroplast genomes. Complete circular plastome assemblies may be annotated using GeSeq at https://chlorobox.mpimp-golm.mpg.de/geseq.html and visualized using OGDraw (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html).
Figure 3.2: Annotated chloroplast of Amphoricarpos autariatus produced by OGDraw.
PHYLUCE (Faircloth, 2016) is a software package that determines orthology from targeted loci (e.g., from target-enrichment data). It is more conservative than other orthology-determination programs (e.g., HybPiper; Johnson et al., 2016) because it removes any loci flagged as paralogous. If you have used HybPiper, you can skip down to the Model Selection section using the concatenated gene file output.
#bash
conda create --name phyluce
conda activate phyluce
conda install bioconda::phyluce
INPUT
NOTE: Make sure to change the file paths throughout and set the number of taxa appropriately at each ‘--taxa’ flag when needed.
#bash
#activate the conda environment
conda activate phyluce
# Create empty log folder
mkdir ./log
# Generate *.lastz files for each contig from SPAdes.
python /home/USER/miniconda3/envs/phyluce/bin/phyluce_assembly_match_contigs_to_probes --contigs ./contigs --probes cos_probes.fasta --output ./output --log-path ./log
#The resulting *.lastz files are in the ‘output’ folder. These files only need to be made once; they are probe-specific, not species-specific.
# Create empty output directory, in our case it is named ‘taxon-set-all’
mkdir ./taxon-set-all/
#create the initial list of loci for each taxon
python /home/USER/miniconda3/envs/phyluce/bin/phyluce_assembly_get_match_counts --locus-db ./output/probe.matches.sqlite --taxon-list-config datasets.conf --taxon-group 'subset4' --output ./taxon-set-all/all.conf --incomplete-matrix --log-path ./log
# extract FASTA data that correspond to the loci in all-taxa-incomplete.conf
python /home/USER/miniconda3/envs/phyluce/bin/phyluce_assembly_get_fastas_from_match_counts --contigs ./contigs --locus-db ./output/probe.matches.sqlite --match-count-output ./taxon-set-all/all.conf --incomplete-matrix ./taxon-set-all/all.incomplete --output ./taxon-set-all/all.fasta --log-path ./log
# Explode the monolithic FASTA
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_assembly_explode_get_fastas_file --input ./taxon-set-all/all.fasta --output exploded-fastas
# Then run the below code to get stats
for i in exploded-fastas/*.fasta;
do
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_assembly_get_fasta_lengths --input $i --csv;
done > fasta_lengths.csv
#the resulting csv file has summary stats on the FASTAS
#the column headers are: samples,contigs,total bp,mean length,95 CI length,min length,max length,median length,contigs >1kb
# Alignment without internal trimming BUT does edge trim
#CHANGE TAXON NUMBER
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_seqcap_align --input ./taxon-set-all/all.fasta --output ./taxon-set-all/mafft-nexus-trimmed/ --taxa 11 --aligner mafft --cores 4 --incomplete-matrix --log-path ./log
# Get basic stats on the alignments
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_get_align_summary_data --alignments ./taxon-set-all/mafft-nexus-trimmed/ --cores 4 --log-path ./log
# Remove locus name from alignments
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_remove_locus_name_from_files --alignments ./taxon-set-all/mafft-nexus-trimmed/ --output ./taxon-set-all/mafft-nexus-trimmed-clean/ --log-path ./log
# Generates individual gene matrix files that will be used for the pseudo-coalescent analysis
#CHANGE TAXON NUMBER
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_get_only_loci_with_min_taxa --alignments ./taxon-set-all/mafft-nexus-trimmed-clean --taxa 11 --percent 0 --output ./taxon-set-all/mafft-nexus-trimmed-clean-genes --cores 4 --log-path ./log
# Build the total concatenated data matrix from the gene matrices
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_concatenate_alignments --alignments ./taxon-set-all/mafft-nexus-trimmed-clean-genes --output ./taxon-set-all/mafft-nexus-trimmed-raxml --phylip --log-path ./log
# Converts nexus to phylip-relaxed file format
python /PATH/TO/miniconda3/envs/phyluce/bin/phyluce_align_convert_one_align_to_another --alignments ./taxon-set-all/mafft-nexus-trimmed-clean-genes --output ./taxon-set-all/mafft-fasta --input-format nexus --output-format phylip-relaxed --cores 1 --log-path ./log
OUTPUT
(Optional) Extracting data from the sqlite database produced by PHYLUCE. PHYLUCE produces a sqlite database that contains two tables: “matches” and “match_map”. “matches” records which loci were recovered for each taxon, while “match_map” records the name of the contig that matches each locus in each taxon. We can use the following commands to save these tables as csv files.
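A minimal sketch of such an export, assuming the sqlite3 command-line client is available and the database sits at the path used earlier (./output/probe.matches.sqlite):

```bash
# Export both PHYLUCE tables to csv, with column headers
DB=./output/probe.matches.sqlite
sqlite3 -header -csv "$DB" "SELECT * FROM matches;"   > matches.csv
sqlite3 -header -csv "$DB" "SELECT * FROM match_map;" > match_map.csv
```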
These next few steps detail how to run PartitionFinder (Lanfear et al., 2016) to determine which evolutionary model best fits the data; the selected models are then used to generate gene trees.
Installation instructions can be found at www.robertlanfear.com/partitionfinder/tutorial.
INPUT
Using the charsets file, create a configuration (.cfg) file describing the matrix with the following commands:
#bash
#designate the input file. Change if needed
input_file="mafft-nexus-trimmed-raxml.charsets"
#designate the output file. Change if needed
output_file="partition_finder.cfg"
# Start by writing the static part of the output
echo "alignment = mafft-nexus-trimmed-raxml.phylip;" > "$output_file" #Change the alignment name if needed
echo "branchlengths = linked;" >> "$output_file"
echo "models = all;" >> "$output_file"
echo "model_selection = AICc;" >> "$output_file"
echo "[data_blocks]" >> "$output_file"
# Process the input file to extract and format the charset lines
awk '/^charset/ {
split($2, a, "=");
gsub(/charset |\.nexus|'\''/, "", $2); # Remove charset, .nexus, and single quotes
print $2 " = " $4;
}' "$input_file" >> "$output_file"
# Add the [schemes] section to the output
echo "[schemes]" >> "$output_file"
echo "search = rcluster;" >> "$output_file"
Using the newly made .cfg file, run the following code:
#bash
/PATH/TO/PartitionFinder.py /PATH/TO/partition_finder.cfg --raxml --rcluster-max 100 --no-ml-tree
OUTPUT
When analyzing target-enriched sequence data for phylogenetic analyses, multiple approaches may be implemented. For a concatenated approach, all gene sequences are combined into a single matrix. This method assumes that all genes share the same evolutionary history, and therefore may not always be ideal for identifying ancient speciation events where gene trees are highly divergent. This tutorial presents instructions for a concatenated phylogeny using RAxML (Stamatakis, 2014), a maximum-likelihood phylogenetic inference tool. Phylogenetic trees may be visualized with an array of tools, e.g., FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
Installation instructions can be found at https://cme.h-its.org/exelixis/web/software/raxml/cluster.html.
INPUT
Note: Change the -m function depending on the best fitting model (GTR model: GTRCAT; GTR+G model: GTRGAMMA; GTR+I+G model: GTRGAMMAI).
#bash
/PATH/TO/raxmlHPC-PTHREADS-SSE3 -T $SLURM_CPUS_PER_TASK -f a -p 12345 -x 12345 -m GTRGAMMAI -# 100 -s /PATH/TO/mafft-nexus-trimmed-raxml.phylip -n out
OUTPUT
Figure: Phylogenetic tree produced from the RAxML analysis of the concatenated matrix, visualized using FigTree.
In contrast to a concatenated approach, a pseudo-coalescent phylogenetic approach analyzes individual gene trees, accounting for the high probability that different genes likely have different evolutionary histories. This method may be better suited for resolving phylogenetic relationships in the presence of incomplete lineage sorting and hybridization, but is more computationally demanding than concatenated methods. One method to generate a coalescent phylogeny is by using Astral-III (Zhang et al., 2018) on a suite of RAxML (Stamatakis, 2014) gene trees constructed under different evolutionary models.
Installation instructions can be found at https://cme.h-its.org/exelixis/web/software/raxml/cluster.html.
When running RAxML for a pseudo-coalescent (hereafter referred to as “ASTRAL”) analysis, individual loci should be analyzed separately under the most appropriate model as indicated by the PartitionFinder results.
INPUT
To organize all loci files by model type into new directories, the following code can be utilized. In this code, loci files (e.g., uce-*.nexus) will be moved into the model-specific “*_batch_exports” folders.
#bash
#make folders of each model type in main working directory
mkdir astral_raxml
cd astral_raxml
mkdir bestTree
mkdir bootTree
mkdir Astral_try
mkdir GTR
mkdir GTRG
mkdir GTRIG
cd GTR
mkdir GTR_batch_exports
mkdir output
cd ../GTRG
mkdir GTRG_batch_exports
mkdir output
cd ../GTRIG
mkdir GTRIG_batch_exports
mkdir output
#edit the best_scheme.txt file to be used for designating loci files to proper model folder
tail -n +22 best_scheme.txt | sed -n '/^$/q;p' > best_scheme_uce_table.txt
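The extracted table is pipe-delimited with a header row; a hypothetical excerpt looks like this (the R code that follows relies on the ‘Best Model’ and ‘Partition names’ columns):

```text
Subset | Best Model | # sites | subset id | Partition names
1      | GTR+G      | 1024    | a1b2c3    | uce-1000,uce-1005
2      | GTR+I+G    | 592     | d4e5f6    | uce-1007
```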
Move each locus file to its designated model folder using the following R code. Change the uce location paths to match your own directory organization.
#R
df <- read.table(file = "best_scheme_uce_table.txt", header = TRUE, sep = "|", na.strings = "", comment.char = "", quote = "\"", fill = FALSE, nrows = 200000)
df$Best.Model <- gsub(" ", "", df$Best.Model, fixed = TRUE)
df$Partition.names <- gsub(" ", "", df$Partition.names, fixed = TRUE)
GTR <- c(unlist(strsplit(df$Partition.names[which(df$Best.Model == "GTR")],",")))
GTRG <- c(unlist(strsplit(df$Partition.names[which(df$Best.Model == "GTR+G")],",")))
GTRIG <- c(unlist(strsplit(df$Partition.names[which(df$Best.Model == "GTR+I+G")],",")))
model_GTR <- cbind(GTR, rep("GTR", times = length(GTR)))
model_GTRG <- cbind(GTRG, rep("GTRG", times = length(GTRG)))
model_GTRIG <- cbind(GTRIG, rep("GTRIG", times = length(GTRIG)))
model_df <- as.data.frame(rbind(model_GTR, model_GTRG, model_GTRIG))
colnames(model_df) <- c("uce", "model")
uce_original_location <- "/PATH/TO/taxon-set-all/mafft-nexus-trimmed-clean-genes/"
uce_model_location <- "/PATH/TO/astral_raxml/"
files <- list.files(uce_original_location)
for(i in 1:length(files)) {
file_name <- files[i]
file_uce <- gsub(".nexus", "", file_name)
file_model <- model_df$model[which(model_df$uce %in% file_uce)]
file.copy(from = paste(uce_original_location, file_name, sep = ""),to = paste(uce_model_location, file_model, "/", file_model, "_batch_exports", sep = ""))
}
After all the gene files have been moved into their appropriate model choice batch_export/ folder, the file type and ending will need to be changed from .nexus to .fasta. This can be done using the tool ElConcatenero (https://github.com/ODiogoSilva/TriFusion) using the following code.
#bash
python /home/USER/ElConcatenero/ElConcatenero.py -c -if nexus -of fasta -in *.nexus
RAxML can then be run on the individual, aligned gene files in a loop. The -m flag will need to be adjusted depending on the best fitting model (GTR model: GTRCAT; GTR+G model: GTRGAMMA; GTR+I+G model: GTRGAMMAI).
#bash
DIR_I=/PATH/TO/astral_raxml/GTR/GTR_batch_exports/*
for f in $DIR_I
do
echo "Processing $f"
file_name=$(basename $f)
/PATH/TO/raxmlHPC-PTHREADS-SSE3 -T 4 -f a -p 12345 -x 12345 -m GTRCAT -# 100 -s $f -n $file_name -w /PATH/TO/astral_raxml/GTR/output
done
OUTPUT
Output from these RAxML (Stamatakis, 2014) analyses will be used as input for the ASTRAL-III (Zhang et al., 2018) analysis.
Installation instructions can be found at https://github.com/smirarab/ASTRAL.
INPUT
After the output files are produced from RAxML, copy the RAxML_bestTree* and RAxML_bootstrap* files into their own folders. The file endings will then need to be changed and concatenated. The following code may be used to prepare files for the ASTRAL analysis.
#bash
#make new folders
cd /PATH/TO/astral_raxml/
mkdir bestTree
mkdir bootTree
#copy files to new folders
cp /PATH/TO/GTR/output/RAxML_bootstrap* ./bootTree/
cp /PATH/TO/GTR/output/RAxML_bestTree* ./bestTree/
#change fasta extension to .tre extension
cd bestTree
for f in *.fasta; do
mv -- "$f" "${f%.fasta}.tre"
done
cd ../bootTree
for f in *.fasta; do
mv -- "$f" "${f%.fasta}.tre"
done
#concatenate the .tre files
cd ../bestTree
cat RAxML_bestTree* > concat_best.tre
#make a new folder named Astral_try/ and copy the concatenated best tree file into it
cd /PATH/TO/astral_raxml/
mkdir Astral_try
cd Astral_try
cp ../bestTree/concat_best.tre .
(Optional) We have discontinued running bootstrap analyses because the creators of ASTRAL-III state that local posterior probability (LPP) support values are more reliable than bootstrap values; however, the code is provided if you would like to use it. For a bootstrap analysis, you also need to make a bs-files file that lists where the RAxML_bootstrap*.tre files are located (e.g., bootTree/).
#example bs_files
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1000.tre
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1005.tre
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1007.tre
/PATH/TO/astral_raxml/bootTree/RAxML_bootstrap.uce-1008.tre
# remaining bootstrap file locations
Code to make this bs-files is below.
#bash
cd /PATH/TO/astral_raxml/bootTree/
ls *.tre > bs-files
pwd #copy the directory!
sed -i 's|RAxML|/PATH/TO/astral_raxml/bootTree/RAxML|g' bs-files #replace /PATH/TO/astral_raxml/bootTree/ with the directory copied from pwd
mv bs-files ../Astral_try/
ASTRAL-III can now be run with the following code:
#bash
#Run ASTRAL
java -jar /PATH/TO/ASTRAL/astral.5.7.3.jar -i concat_best.tre -o Astral_best.tre 2> astral_best.log
#Generate LPP values
java -jar /PATH/TO/ASTRAL/astral.5.7.3.jar -q Astral_best.tre -i concat_best.tre -o Astral_lpp.tre 2> astral_lpp.log
#Get bootstrap values (optional)
java -jar /PATH/TO/ASTRAL/astral.5.7.3.jar -i concat_best.tre -b bs-files -o Astral_bootcount.tre 2> astral_bootcount.log
OUTPUT
Figure: Phylogenetic tree produced from the ASTRAL-III analysis, visualized using FigTree.
Another tool for generating phylogenetic trees within a coalescent framework is OrthoFinder (Emms and Kelly, 2019). This program identifies orthogroups, infers rooted trees for all orthogroups, identifies gene duplication events, and infers a rooted species tree.
#bash
# Install program with Bioconda
conda create --name orthofinder
conda activate orthofinder
conda install -c bioconda orthofinder
INPUT
#bash
/PATH/TO/OrthoFinder_source/orthofinder.py -f ./pep/ -t 64 -a 16 -M msa -T raxml
OUTPUT
Refer to https://davidemms.github.io/orthofinder_tutorials/exploring-orthofinders-results.html for more detailed description of results.
Phylogenetic trees from chloroplast sequence data can be generated in a number of different ways. This tutorial describes a method for plastome phylogenetic estimation using target-enriched sequencing data. As mentioned above in the plastome assembly section, there is often insufficient off-target sequence data for complete organellar assembly.
#bash
# Install program with Bioconda
conda create --name mapping
conda activate mapping
conda install bioconda::bowtie2
conda install bioconda::samtools
conda install bioconda::bcftools
conda install bioconda::seqtk
This tutorial describes a process of mapping off-target sequence reads to a reference plastome in order to create consensus sequences which may then be aligned and used as input for phylogenetic analyses.
INPUT
First, a reference database will need to be constructed using Bowtie2. This can be done using the following code:
#bash
conda activate mapping
bowtie2-build reference_plastome.fasta reference_plastome
With the resulting reference database, the paired-end target-enriched sequence reads may be mapped onto the reference plastome. To do this in a loop with all reads with file endings “_1_tp.fastq” and “_2_tp.fastq”, use the following code:
#bash
conda activate mapping
#in your working directory, make folders for output of this pipeline
mkdir sam_files
mkdir bam_files
mkdir fastq
mkdir consensus_fastas
#run the pipeline in a loop on all paired files
for fileR1 in *_1_tp.fastq
do
fileR2=`echo ${fileR1} | sed 's/_1_/_2_/'`
filename=`echo ${fileR1} | sed 's/_1_.*//'`
#map reads to reference plastome
bowtie2 --very-sensitive -p 24 -x /PATH/TO/reference_plastome -1 "$fileR1" -2 "$fileR2" -S /PATH/TO/sam_files/"$filename".sam
#convert sam files to bam files
cat /PATH/TO/sam_files/"$filename".sam | samtools view -bS - | samtools sort -o /PATH/TO/bam_files/"$filename".bam
#Get consensus fastq file
bcftools mpileup -f /PATH/TO/reference_plastome.fasta /PATH/TO/bam_files/"$filename".bam | bcftools call -c | vcfutils.pl vcf2fq > /PATH/TO/fastq/"$filename"_cns.fastq
#vcfutils.pl is part of bcftools
#Convert .fastq to .fasta
seqtk seq -a /PATH/TO/fastq/"$filename"_cns.fastq > /PATH/TO/consensus_fastas/"$filename"_cns.fasta
done
The scaffold names need to be changed to match the name of the sample.
#R
files <- list.files(pattern = "fasta")
for (i in 1:length(files)) {
# set the file name
f <- files[i]
# load the fasta
fasta <- readLines(f)
# replace the scaffold name with the sample id
id <- gsub("_cns.fasta", "", f)
new_scaf_name <- paste0(">", id)
fasta[1] <- new_scaf_name
# overwrite the consensus file with the new scaffold name
write.table(fasta, paste0(id, "_cns.fasta"), row.names = F, col.names = F, quote = F)
}
# quit R
quit(save = "no")
Now all of the consensus files can be concatenated and aligned for phylogenetic analysis. In this tutorial, MAFFT (Katoh and Standley, 2013) will be used to align sequences.
Installation instructions can be found at https://mafft.cbrc.jp/alignment/software/source.html.
Concatenate all sequences and align them using MAFFT.
#bash
cd /PATH/TO/consensus_fastas
cat *.fasta > all_cns.fasta
mafft all_cns.fasta > all_cns_aligned.fasta
Phylogenetic analyses may be run on the aligned sequences in multiple ways. One user-friendly option is IQ-TREE (Minh et al., 2020). Comprehensive instructions are provided at http://www.iqtree.org/doc/iqtree-doc#beginners-tutorial.
#bash
conda create --name iqtree
conda activate iqtree
conda install -c bioconda iqtree
INPUT
#bash
conda activate iqtree
iqtree -s all_cns_aligned.fasta -B 1000
OUTPUT
Figure: Phylogenetic tree produced from the IQ-TREE analysis of the aligned plastome consensus sequences, visualized using FigTree.
Resulting phylogenetic trees may be visualized by an array of tools, like FigTree (http://tree.bio.ed.ac.uk/software/figtree/) or through R packages such as phytools (Revell, 2012) and ape (Paradis and Schliep, 2019). This tutorial describes how to compare and visualize phylogenetic trees to identify incongruent or divergent evolutionary histories using phytools and ape.
Installation instructions are located at https://evomics.org/resources/software/molecular-evolution-software/figtree/.
#R
install.packages("phytools")
install.packages("ape")
INPUT
#R
library(phytools)
library(ape)
tree1 <- read.tree("concatenated.tre")
tree2 <- read.tree("coalescent.tre")
#root both trees to the outgroup
tree1 <- root(tree1,"outgroup", resolve.root = T)
tree2 <- root(tree2,"outgroup", resolve.root = T)
#compare both trees
obj<-cophylo(tree1, tree2, fn=function(x) abs(x),print=TRUE, use.edge.length = TRUE)
#plot a tanglegram comparing both trees
pdf("concat_v_coal_tree.pdf")
plot(obj,link.type="curved",link.lwd=2,link.lty="solid",
link.col=make.transparent("blue",0.25),fsize=0.6)
title(main= "Concatenated vs. Coalescent Nuclear Tree", cex.main= .75, line = -.79)
dev.off()
For more detailed instructions on how to compare phylogenetic trees, refer to http://blog.phytools.org/2016/08/empirical-co-phylogenetic-cophylo.html.
OUTPUT
Figure 4.1: Comparison of concatenated and coalescent trees produced using phytools and ape in R.
Phylogenomic incongruence is routinely observed in phylogenomic studies and typically results from underlying gene tree discordance. Below are tutorials on investigating the extent of gene tree discordance among phylogenies using PhyParts (Smith et al., 2015), which summarizes how many gene trees agree or disagree with the final species topology, and Quartet Sampling (Pease et al., 2018), which uses quartets to investigate the cause of low support at each node.
PhyParts (Smith et al., 2015), along with PhypartsPieCharts (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts), are tools that indicate the percentage of concordant gene trees, percentage in the top alternative bipartition, other conflicting topologies, and missing and uninformative genes as pie charts on the nodes of a species phylogeny. This provides a visualization of discordant nodes resulting from gene tree conflict. New python scripts have also been produced that show missing and uninformative data as two separate pie slices (see https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/), providing an even better visualization of the underlying data. Below we will show how to run PhyParts on the ASTRAL tree with the original PhypartsPieCharts script, but also with this new python script that has the extra slice.
To prepare the input data for PhyParts, Phyx (Brown et al., 2017) can be used to root the species and gene trees.
Installation instructions are located at https://github.com/FePhyFoFum/phyx.
INPUT
Phyx (Brown et al., 2017) is a very helpful phylogenetic toolkit with many functions. For the purposes of running PhyParts, only the ‘rerooting and unrooting trees’ function, pxrr, is needed. Designating the proper taxon or taxa as outgroups for rooting, the following code may be used:
#bash
#for a single taxon rooting
pxrr -t Astral_best.tre -g outgroup_taxon > Astral_best_rooted.tre
#for rooting to multiple taxa
pxrr -t Astral_best.tre -g outgroup_taxon1,outgroup_taxon2,outgroup_taxon3,… > Astral_best_rooted.tre
#running pxrr on the gene trees in a loop
for t in *.tre
do
pxrr -t $t -g outgroup_taxon > $t.rooted.tre
done
#Not all gene trees will include your outgroup taxon, so the number of gene trees may decrease from the original starting point. Empty gene-tree files must be removed. A quick one-liner for this is:
find . -maxdepth 1 -type f -empty -print -delete
OUTPUT
General installation instructions can be found at https://bitbucket.org/blackrim/phyparts/src/master/; however, code is provided below to make a conda environment called ‘phyparts’. Additionally, the repository needs to be cloned to obtain the python scripts used to run PhyParts.
#bash
conda create -n phyparts python=2.7
conda activate phyparts
pip install ete3
pip install matplotlib
git clone https://bitbucket.org/blackrim/phyparts.git
INPUT
#bash
java -jar /PATH/TO/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -v -d /PATH/TO/phyluce/PhyParts/root/ -m /PATH/TO/Astral_best_rooted.tre -o output
OUTPUT
Detailed information on the output of PhyParts can be found at https://bitbucket.org/blackrim/phyparts/src/master/.
To create figures for publication, Matt Johnson (Texas Tech University) generated a visualization script, ‘phypartspiecharts.py’ (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts), that produces svg images.
INPUT
#bash
#copy the python code from GitHub
git clone https://github.com/mossmatters/phyloscripts.git
#activate conda environment
conda activate phyparts
#combine all outfiles from PhyParts
cat *.concord.node.* > phyparts.concord.tre
cat *.conflict.node.* > phyparts.conflict.tre
cat *.hist > phyparts.hist
cat *.hist.alts > phyparts.hist.alts
cat *.node.key > phyparts.node.key
cat *.concon.tre > phyparts.concon.tre
#Now you just need the rooted species tree (Astral_best_rooted.tre), what the different outfiles can be referred to as (e.g., phyparts), and knowledge on how many gene trees were used (e.g., 1000 gene trees).
python /PATH/TO/phyluce/PhyParts/phypartspiecharts.py /PATH/TO/phyluce/PhyParts/Astral_best_rooted.tre phyparts 1000 --svg_name phyparts_piechart.svg
OUTPUT
Figure 5.1: Phylogenetic discordance visualized using PhyParts
Alternatively, you can run a newer python script from Diego Morales-Briones, ‘phypartspiecharts_missing_uninformative.py’ from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/, to separate the missing and uninformative data into two separate pie slices. To do this, you need to re-run PhyParts as two separate runs. Additionally, you need to use the bootstrap gene trees instead of the best trees and root them with Phyx as described above.
INPUT
#bash
java -jar /PATH/TO/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 0 -d /astral_raxml/bootTree/root/ -m Astral_best_rooted.tre -o concat50_noBS
java -jar /PATH/TO/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -s 50 -d /astral_raxml/bootTree/root/ -m Astral_best_rooted.tre -o concat50_BSfull #NOTE: 50 is for 50 BS support. You can alter this number or use different support if using LPP or PP to better fit your data.
From there, open R and follow the script below, which modifies the *.concon.tre file.
#R
#load the ape package, which provides read.tree and write.tree
library(ape)
#Read in trees
read.tree("astral50_BSfull.concon.tre")->bs_full_concordance
read.tree("astral50_noBS.concon.tre")->No_bs
#run code
total_no_bs<-No_bs[[1]] # get a tree to add total node numbers
total_no_bs$node.label<-mapply("+", as.numeric(No_bs[[1]]$node.label), as.numeric(No_bs[[2]]$node.label)) #sum the concordant and conflicting counts to get the total number of informative gene trees at each node
total_no_bs$node.label[is.na(total_no_bs$node.label)] <- "" #remove NA values
total_no_bs$node.label[total_no_bs$node.label=="0"]<-"" #remove 0 values to avoid division by zero
append(bs_full_concordance, total_no_bs, after=2)-> full_concordance_and_total_nodes
#write tree
write.tree(full_concordance_and_total_nodes, file = "astral50_all_homologs.concon.tre") #write tree - this will replace the original *.concon.tre file.
Upload this new .concon.tre file to where you ran PhyParts and rename it to match the original run.
#bash
#rename tree to match the old run
mv astral50_all_homologs.concon.tre astral50_BSfull.concon.tre
#Run the new python script on the data, following the same guidelines as phypartspiecharts.py:
python phypartspiecharts_missing_uninformative.py Astral_best_rooted.tre astral50_BSfull 1000 #change 1000 to however many gene trees were used as input
OUTPUT
Figure 5.2: Phylogenetic discordance visualized using PhyParts with the new python script, which adds an additional pie slice separating uninformative from informative genes
Quartet Sampling (QS) (Pease et al., 2018) is a quartet-based method designed to assess the confidence, consistency, and informativeness of internal tree relationships, as well as the reliability of each terminal branch. QS determines whether a phylogeny lacks support due to insufficient information, discordance from ILS or introgression, or misplaced or erroneous taxa. Three scores (QC, QD, QI) are calculated for each internal branch of the focal tree, which together allow different interpretations of the branches, along with a Quartet Fidelity (QF) score that reports how frequently a taxon is included in concordant topologies. QS efficiently synthesizes phylogenetic support tests and offers more comprehensive and specific information on branch support than conventional measures (e.g., bootstrap).
#bash
#create a conda environment for installation
conda create -n quartetsampling numpy
conda activate quartetsampling
conda install -c bioconda python
conda install -c bioconda raxml-ng
#clone the GitHub to obtain the python scripts to run Quartet Sampling
git clone https://github.com/FePhyFoFum/quartetsampling.git
INPUT
Quartet sampling may be run using the following code with the default parameter values. For more information on adjusting the parameters see https://github.com/FePhyFoFum/quartetsampling/blob/master/quartetsampling.pdf.
#bash
python /PATH/TO/quartetsampling/pysrc/quartet_sampling.py --tree /PATH/TO/Astral_best.tre --align /PATH/TO/mafft-nexus-trimmed-raxml.phylip --reps 100 --threads 4 --lnlike 2
OUTPUT
Each file contains useful information about the different scores generated by QS. The final tree file, RESULT.labeled.tre.figtree, provides all values (QC, QI, QD, and QF). This file can be put into FigTree to visualize discordance.
Figure 5.3: Quartet Sampling results of ML concatenated tree. The Quartet Concordance (QC), Quartet Differential (QD), and Quartet Informativeness (QI) scores are found at each node as QC/QD/QI.
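As a rough aid to reading these scores, the helper below sketches one way to turn a branch's QC/QD/QI values into a verbal summary. The thresholds are illustrative choices of ours, not values prescribed by Pease et al. (2018); consult the original paper before interpreting real data.

```python
def interpret_branch(qc, qd, qi):
    """Illustrative verbal summary of one internal branch's QS scores.
    Thresholds are arbitrary examples, not from Pease et al. (2018)."""
    notes = []
    if qc > 0.2:
        notes.append("most quartets agree with the focal branch (high QC)")
    elif qc >= -0.05:
        notes.append("support is split among topologies (QC near 0)")
    else:
        notes.append("a conflicting topology is sampled more often (negative QC)")
    if qc <= 0.2:  # QD matters mainly when there is appreciable discordance
        if qd < 0.5:
            notes.append("discordance skewed toward one alternative (low QD), "
                         "a pattern sometimes associated with introgression")
        else:
            notes.append("discordance balanced between alternatives (high QD), "
                         "as expected under ILS")
    if qi < 0.5:
        notes.append("many replicates were uninformative (low QI)")
    return "; ".join(notes)

print(interpret_branch(0.9, 1.0, 0.97))   # a well-supported branch
print(interpret_branch(-0.3, 0.1, 0.9))   # conflict skewed toward one alternative
```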
Phylogenetic trees represent evolutionary relationships as a branching structure; in reality, however, evolutionary histories are more complex and include processes such as hybridization. Unlike trees, phylogenetic networks can model reticulation by incorporating nodes that represent gene flow or hybridization events, and may offer a more accurate representation of evolutionary histories. To investigate ancient hybridization, this tutorial provides instructions for using PhyloNetworks (Solís-Lemus et al., 2017), a Julia package developed to facilitate the inference of networks from genetic data and the manipulation, visualization, and analysis of network features.
PhyloNetworks (Solís-Lemus et al., 2017) is computationally very intensive. Given this, if you have a large tree it is often suggested to subset your species and gene trees into smaller trees and run those individually. For more detailed information on running and visualizing phylogenetic networks see https://juliaphylo.github.io/PhyloNetworks.jl/stable/.
#julia
julia
using Pkg
Pkg.add("PhyloNetworks")
Pkg.add("RCall")
Pkg.add("PhyloPlots")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("StatsModels")
Pkg.add("Gadfly")
Pkg.add("Plots")
INPUT
#julia
julia
using PhyloNetworks
using PhyloPlots
using CSV
using DataFrames
using StatsModels
using Gadfly
using RCall
using Plots
#load your gene tree matrix and species tree
genetrees = readMultiTopology("concat_best.tre")
q,t = countquartetsintrees(genetrees);
df = writeTableCF(q,t)
CSV.write("tableCF.csv", df);
raxmlCF = readTableCF("tableCF.csv")
astraltree = readMultiTopology("Astral_lpp.tre")
T=readTopologyLevel1("Astral_lpp.tre")
#start with 0-10 potential hybridizations on your PhyloNetworks run.
net0 = snaq!(astraltree[1], raxmlCF, hmax=0, filename="net0", seed=1234)
net1 = snaq!(net0, raxmlCF, hmax=1, filename="net1", seed=2345)
net2 = snaq!(net1, raxmlCF, hmax=2, filename="net2", seed=3456)
net3 = snaq!(net2, raxmlCF, hmax=3, filename="net3", seed=4567)
net4 = snaq!(net3, raxmlCF, hmax=4, filename="net4", seed=5678)
net5 = snaq!(net4, raxmlCF, hmax=5, filename="net5", seed=6789)
net6 = snaq!(net5, raxmlCF, hmax=6, filename="net6", seed=6789)
net7 = snaq!(net6, raxmlCF, hmax=7, filename="net7", seed=6789)
net8 = snaq!(net7, raxmlCF, hmax=8, filename="net8", seed=6789)
net9 = snaq!(net8, raxmlCF, hmax=9, filename="net9", seed=6789)
net10 = snaq!(net9, raxmlCF, hmax=10, filename="net10", seed=6789)
#determine the best number of hybridizations by plotting the network scores
#reload output files
net0 = readTopology("net0.out")
net1 = readTopology("net1.out")
net2 = readTopology("net2.out")
net3 = readTopology("net3.out")
net4 = readTopology("net4.out")
net5 = readTopology("net5.out")
#make a list of the scores
scores = [net0.loglik, net1.loglik, net2.loglik, net3.loglik, net4.loglik, net5.loglik]
#plot the scores and save
R"pdf"("score-vs-h.pdf", width=4, height=4);
R"plot"(x=0:5, y=scores, type="b", xlab="number of hybridizations h", ylab="network score");
R"dev.off"();
After plotting the scores, the optimal number of hybridizations can be estimated by identifying the value of h with the lowest network score (the score is a negative log pseudolikelihood, so lower is better). If the scores plateau, choose the first value at the plateau.
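Because adding a hybridization can only improve the fit, the practical stopping rule is the first h at which the improvement becomes negligible. A minimal sketch of that rule (the tolerance and the score values below are made up for illustration, not real SnaQ output):

```python
def first_plateau(scores, tol=1.0):
    """Return the first index h at which the score improvement drops below tol.
    scores[h] is the network score for hmax = h; lower is better.
    The tolerance is a hypothetical cutoff you should tune for your data."""
    for h in range(len(scores) - 1):
        if scores[h] - scores[h + 1] < tol:
            return h
    return len(scores) - 1

# Illustrative scores: a large drop from h=0 to h=1, then a plateau
scores = [4120.5, 3890.2, 3889.8, 3889.7, 3889.6, 3889.6]
print(first_plateau(scores))  # 1
```

Visual inspection of the score-versus-h plot remains the recommended check; this rule only formalizes the "first value at the plateau" heuristic.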
After identifying the best number of hybridizations without bootstraps, PhyloNetworks can be run with bootstraps. This step is the computational bottleneck and can take days to months to finish.
#julia
julia
using PhyloNetworks
using PhyloPlots
using CSV
using DataFrames
using StatsModels
using Gadfly
using RCall
using Plots
#load your gene tree matrix and species tree
net2 = readTopology("net2.out")
bootTrees = readBootstrapTrees("bs-files")
bootnet2 = bootsnaq(net2, bootTrees, hmax=2, nrep=100, runs=10, filename="bootsnaq2_raxmlboot", seed=2345)
The final phylonetwork can be visualized using:
#julia
julia
using PhyloNetworks, PhyloPlots, RCall
#read in network and bootstrap files
net2 = readTopology("net2.out")
bootnet2 = readMultiTopology("bootsnaq2_raxmlboot.out")
#visualize the network with edge and node labels
imagefilename = "phylonetwork_edges_nodes.pdf"
R"pdf"(imagefilename, width=8, height=6)
plot(net2, showedgenumber=true, shownodenumber=true);
R"dev.off()"
#root the network on the edge leading to the outgroup (edge=3)
rootonedge!(net2, 3)
#visualize the rooted network with edge and node labels
imagefilename = "phylonetwork_edges_nodes_rooted.pdf"
R"pdf"(imagefilename, width=8, height=6)
plot(net2, showedgenumber=true, shownodenumber=true);
R"dev.off()"
#rotate the branches so that hybrid edges do not cross (here, nodes 11 and -5)
rotate!(net2, 11)
rotate!(net2, -5)
#visualize the rooted, rotated network with hybrid edge values
imagefilename = "phylonetwork_rooted_rotated.pdf"
R"pdf"(imagefilename, width=16, height=8)
plot(net2, showgamma=true);
R"mtext"("Phylonetwork Net2", line=-1, cex =2);
R"dev.off()"
#add bootstrap support
BSe_tree_net2, tree1 = treeEdgesBootstrap(bootnet2, net2)
imagefilename = "net2_bs.pdf"
R"pdf"(imagefilename, width=16, height=8)
plot(net2, edgelabel = BSe_tree_net2, showgamma=true, edgecolor="black");
R"mtext"("Phylonetwork Net2", line=-1, cex =2);
R"dev.off()"
Figure 6.1: Phylonetwork
Multiple methods may be used to infer whole-genome duplication (WGD) events in the evolutionary history of organisms. Most often, Ks-based or synteny-based methods are used. This tutorial provides a guide for Ks-based analyses to infer WGD from transcriptomic data.
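To make the logic of a Ks-based analysis concrete, the toy simulation below (entirely synthetic numbers, not real data) mimics the typical pattern: paralog pairs from small-scale duplications produce a Ks distribution that decays away from zero, while a WGD superimposes a peak at the Ks value corresponding to the event's age, which even a simple histogram can pick out.

```python
import random

# Toy sketch of a Ks (synonymous substitution) distribution.
# All parameters are invented for illustration.
random.seed(42)
background = [random.expovariate(2.0) for _ in range(2000)]   # small-scale duplications, decaying from 0
wgd_burst = [random.gauss(1.0, 0.12) for _ in range(800)]     # simulated WGD peak near Ks ~ 1
ks = [x for x in background + wgd_burst if 0.05 < x < 3.0]    # trim saturated/near-zero values

# Fixed-width histogram; a secondary mode away from zero hints at a WGD
width = 0.1
bins = {}
for v in ks:
    b = int(v / width)
    bins[b] = bins.get(b, 0) + 1

# Look for the tallest bin beyond the initial decay (Ks > 0.5)
peak_bin = max((b for b in bins if b * width > 0.5), key=lambda b: bins[b])
print(f"secondary Ks peak near Ks = {peak_bin * width:.1f}")
```

Real analyses (e.g., the ELMM step in wgd v2 below) fit mixture models rather than eyeballing a histogram, but the underlying signal is the same secondary peak.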
wgd v2 (Chen et al., 2024) is a package used to infer and date ancient whole-genome duplication (WGD) events. This tutorial uses wgd v2 exclusively to generate Ks plots that estimate the evolutionary age of WGD events from transcriptome data that have first been translated into coding sequences using TransDecoder (https://github.com/TransDecoder/TransDecoder).
Installation instructions can be found at https://github.com/TransDecoder/TransDecoder. Alternatively, TransDecoder can be run through Singularity.
#bash
wget https://data.broadinstitute.org/Trinity/CTAT_SINGULARITY/MISC/TransDecoder/transdecoder.v5.7.1.simg
#to run with singularity, enter this before the command
singularity exec -e ~/TransDecoder/transdecoder.v5.7.1.simg
INPUT
#bash
#extract the long open reading frames
TransDecoder.LongOrfs -t assembled_transcripts.fasta --output_dir output
#predict the likely coding regions
TransDecoder.Predict -t assembled_transcripts.fasta --output_dir output
OUTPUT
The wgd v2 repository can be cloned from GitHub and installed inside a conda environment.
#bash
#clone repository
git clone https://github.com/heche-psb/wgd
#create conda environment and install programs
conda create -n wgdv2 python=3.8
conda activate wgdv2
pip install numpy==1.19.0
pip install wgd==2.0.38
conda install diamond #future dependencies
conda install bioconda::mafft #future dependencies
conda install bioconda::mcl #future dependencies
INPUT
#bash
#Construction of age distribution and ELMM analysis
wgd dmd transcripts.transdecoder.cds -o wgd_dmd
wgd ksd wgd_dmd/transcripts.transdecoder.cds.tsv transcripts.transdecoder.cds -o wgd_ksd
wgd viz -d wgd_ksd/transcripts.transdecoder.cds.tsv.ks.tsv -o wgd_ELMM
OUTPUT
Figure 7.1: Ks plot for Schlechtendalia luzulifolia produced using wgd v2
Citations of all R packages used to generate this report.
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.29. 2024. https://github.com/rstudio/rmarkdown.
[2] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.
[3] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2023. https://www.R-project.org/.
[4] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC, 2016. ISBN: 978-1138700109. https://bookdown.org/yihui/bookdown.
[5] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.41. 2024. https://github.com/rstudio/bookdown.
[6] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.
[7] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.
[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.49. 2024. https://yihui.org/knitr/.
[9] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.
[10] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.