Computing sequence summary statistics

segul generates summary statistics for DNA and amino acid sequences. By default, the datatype is DNA. The general form of the command is:

segul summary <input-option> [alignment-path] --input-format [sequence-format-keyword] --datatype [datatype]

The summary function produces three sets of summary statistics:

Summary statistics for all the alignments printed in the terminal and written to the log file (segul.log).
Summary statistics for each alignment/locus written to a csv file (default name: locus_summary.csv).
Summary statistics for each taxon written to a csv file (default name: taxon_summary.csv).

Learn more about specifying the output directory and filenames here.

Computing sequence summary statistics for DNA sequences

Because the segul datatype defaults to DNA, we don't need to pass the --datatype option in the command. For example, to generate summary statistics for alignments in the folder alignments/:

segul summary -d alignments/ -f nexus

If we use the --input or -i option, the command will be:

segul summary -i alignments/*.nexus

Below is an example of segul terminal output for DNA sequence summary statistics. This output is based on alignments from Oliveros et al. (2019). An example of the csv files generated from the same dataset can be found through this link.

=========================================================
SEGUL v0.11.1
An alignment tool for phylogenomics
---------------------------------------------------------
Input dir         : oliveros_et_al_2019/
File counts       : 4,060
Input format      : Nexus
Data type         : DNA
Task              : Sequence summary statistics

🌘 Finished computing summary stats!

General Summmary
Total taxa        : 221
Total loci        : 4,060
Total sites       : 2,464,926
Missing data      : 38,227,233
%Missing data     : 7.32%
GC content        : 0.36
AT content        : 0.56
Characters        : 522,529,858
Nucleotides       : 484,302,625

Alignment Summmary
Min length        : 155 bp
Max length        : 1,410 bp
Mean length       : 607.12 bp

Taxon Summmary
Min taxa          : 177
Max taxa          : 221
Mean taxa         : 210.84

Character Count
?                 : 36,681,846
-                 : 1,545,387
A                 : 147,184,543
C                 : 94,814,080
G                 : 94,526,406
T                 : 147,777,596

Data Matrix Completeness
100% taxa         : 15
95% taxa          : 3,069
90% taxa          : 3,729
85% taxa          : 3,961
80% taxa          : 4,060

Conserved Sequences
Con. loci         : 0
%Con. loci        : 0.00%
Con. sites        : 1,261,559
%Con. sites       : 0.51%
Min con. sites    : 16
Max con. sites    : 885
Mean con. sites   : 0.51

Variable Sequences
Var. loci         : 4,060
%Var. loci        : 100.00%
Var. sites        : 1,203,367
%Var. sites       : 0.49%
Min var. sites    : 15
Max var. sites    : 814
Mean var. sites   : 0.49

Parsimony Informative
Inf. loci         : 4,060
%Inf. loci        : 100.00%
Inf. sites        : 811,688
%Inf. sites       : 0.33%
Min inf. sites    : 2
Max inf. sites    : 631
Mean inf. sites   : 0.33

Output Files
Alignment summary : locus_per_locus.csv
Log file          : segul.log

Execution time    : 4.3725607s
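The general summary figures in the output above can be cross-checked by hand. The Python sketch below is only an illustration: the formulas are inferred from the printed numbers, not taken from the segul source code, so treat them as assumptions. Under these assumptions, missing data is the sum of the ? and - counts, and GC and AT content are computed over all characters (including missing data). The same arithmetic suggests that the values labelled %Con. sites, %Var. sites, and %Inf. sites are proportions of total sites rather than percentages.

```python
# Values copied from the terminal output above (Oliveros et al. 2019 dataset).
counts = {
    "?": 36_681_846,
    "-": 1_545_387,
    "A": 147_184_543,
    "C": 94_814_080,
    "G": 94_526_406,
    "T": 147_777_596,
}
total_loci = 4_060
total_sites = 2_464_926
conserved_sites = 1_261_559
variable_sites = 1_203_367

# Assumption: missing data = '?' + '-' characters.
missing = counts["?"] + counts["-"]
nucleotides = sum(counts[base] for base in "ACGT")
characters = missing + nucleotides

# Assumption: ratios are taken over all characters, not nucleotides only.
pct_missing = 100 * missing / characters
gc_content = (counts["G"] + counts["C"]) / characters
at_content = (counts["A"] + counts["T"]) / characters

mean_length = total_sites / total_loci
prop_conserved = conserved_sites / total_sites
prop_variable = variable_sites / total_sites

print(f"Missing data   : {missing:,}")
print(f"Characters     : {characters:,}")
print(f"Nucleotides    : {nucleotides:,}")
print(f"%Missing data  : {pct_missing:.2f}%")
print(f"GC content     : {gc_content:.2f}")
print(f"AT content     : {at_content:.2f}")
print(f"Mean length    : {mean_length:.2f} bp")
print(f"Conserved prop.: {prop_conserved:.2f}")
print(f"Variable prop. : {prop_variable:.2f}")
```

Every value printed by this sketch agrees with the corresponding line of the segul output, which supports the inferred formulas for this dataset.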

Computing sequence summary statistics for amino acid sequences

To compute the summary statistics for amino acid sequences, we need to use the --datatype aa option. For example:

segul summary -d alignments/ -f nexus --datatype aa

If we use the --input or -i option, the command will be:

segul summary -i alignments/*.nexus --datatype aa

Setting up data matrix completeness interval

By default, segul prints data matrix completeness in 5% decrements. It starts at 100% and continues until a completeness level covers all the alignments or approaches zero. With the default interval, if no level covers all the alignments, segul stops printing at 5%. In the Oliveros et al. (2019) dataset above, segul stops at 80% because that level already covers all 4,060 alignments.

To change the interval setting, use the --interval option. For example:

segul summary -i alignments/*.nexus --interval 1

segul supports intervals of 1, 2, 5, and 10. Using the Oliveros et al. (2019) dataset with an interval of 1, the data matrix completeness result is as below:

Data Matrix Completeness
100% taxa         : 15
99% taxa          : 520
98% taxa          : 1,219
97% taxa          : 1,953
96% taxa          : 2,496
95% taxa          : 3,069
94% taxa          : 3,301
93% taxa          : 3,445
92% taxa          : 3,548
91% taxa          : 3,636
90% taxa          : 3,729
89% taxa          : 3,786
88% taxa          : 3,841
87% taxa          : 3,880
86% taxa          : 3,908
85% taxa          : 3,961
84% taxa          : 3,980
83% taxa          : 4,005
82% taxa          : 4,021
81% taxa          : 4,050
80% taxa          : 4,060
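To make the stopping rule concrete, here is a minimal Python sketch of how such a completeness table could be built. It is not segul's implementation. The function name matrix_completeness, the toy taxon counts, and the rule that a locus counts toward a level when it contains at least that percentage of all taxa are assumptions made for illustration.

```python
def matrix_completeness(taxa_per_locus, total_taxa, interval=5):
    """Return (percent, n_loci) rows from 100% downward in steps of `interval`.

    Hypothetical helper: mirrors the behaviour described in the text,
    not the actual segul code.
    """
    if interval not in (1, 2, 5, 10):
        raise ValueError("interval must be 1, 2, 5, or 10")

    rows = []
    # range() stops before 0, so the last possible level equals `interval`
    # (5% with the default setting).
    for percent in range(100, 0, -interval):
        # Assumption: a locus qualifies when n / total_taxa >= percent / 100.
        # Integer arithmetic avoids floating-point rounding at the threshold.
        n_loci = sum(1 for n in taxa_per_locus if n * 100 >= percent * total_taxa)
        rows.append((percent, n_loci))
        # Stop as soon as one level covers every alignment.
        if n_loci == len(taxa_per_locus):
            break
    return rows


# Toy data (hypothetical): five loci drawn from a dataset of 10 taxa.
toy_taxa_per_locus = [10, 10, 9, 8, 8]

for percent, n_loci in matrix_completeness(toy_taxa_per_locus, total_taxa=10):
    print(f"{percent}% taxa : {n_loci}")
```

With the toy data, the loop stops at 80% because every locus has at least 8 of the 10 taxa, which parallels the 80% cut-off in the Oliveros et al. (2019) output.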

Specifying the output directory and filenames

By default, the two csv files are saved in the SEGUL-Stats directory. You can change the directory name using the --output or -o option. For example:

segul summary -d alignments/ -f nexus -o alignment_stats

You can also add a prefix to the csv filenames using the --prefix option. For example:

segul summary -d alignments/ -f nexus -o alignment_stats --prefix my_alignment

The command above will create a directory named alignment_stats/ and write the csv output files in it. With the --prefix option, the taxon summary will be written to my_alignment_taxon_summary.csv and the locus summary to my_alignment_locus_summary.csv. Note that, as mentioned above, the summary statistics for all alignments are written to the log file segul.log.
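If you post-process the results in a script, the naming rule above can be written down as a small helper. The Python sketch below is an illustration only: expected_outputs and print_headers are hypothetical names, and the underscore join between prefix and filename is taken from the example filenames above. No column names are assumed; the header row is printed as it appears in each file.

```python
import csv
from pathlib import Path


def expected_outputs(output_dir="SEGUL-Stats", prefix=None):
    """Return the csv paths described above (hypothetical helper)."""
    names = ["locus_summary.csv", "taxon_summary.csv"]
    if prefix:
        # Per the example above: <prefix>_<default filename>
        names = [f"{prefix}_{name}" for name in names]
    return [Path(output_dir) / name for name in names]


def print_headers(paths):
    """Print the header row of each csv file that exists."""
    for path in paths:
        if not path.exists():
            print(f"{path}: not found")
            continue
        with path.open(newline="") as handle:
            header = next(csv.reader(handle), [])
        print(f"{path}: {header}")


paths = expected_outputs("alignment_stats", prefix="my_alignment")
print_headers(paths)
```

Calling expected_outputs() with no arguments returns the default locations inside SEGUL-Stats.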
