Usage
If you installed tiara using our instructions, make sure you are in the environment you created. So before running tiara, activate it (assuming you named the environment tiara-env):
conda activate tiara-env
Basic usage:
tiara -i sample_input.fasta -o out.txt
The sequences in the fasta file should be at least 3000 bases long (the default minimum length). We do not recommend classifying sequences shorter than 1000 base pairs.
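If you do need to classify shorter sequences, the minimum length can be lowered with the -m option described under Program options below, for example:
tiara -i sample_input.fasta -o out.txt -m 1000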
Tiara creates two files:
- out.txt, a tab-separated file with the header: sequence id, first stage classification result, second stage classification result.
- log_out.txt, containing model parameters and a classification summary.
Advanced:
tiara -i sample_input.fasta -o out.txt --tf mit pla pro -t 4 -p 0.65 0.60 --probabilities
In addition to creating the files above, this command writes, in the folder where tiara is run, three fasta files containing the sequences from sample_input.fasta classified as mitochondria, plastid and prokarya (the --tf mit pla pro option). The number of threads is set to 4 (-t 4), and the probability cutoffs in the first and second stage of classification are set to 0.65 and 0.6, respectively. The probabilities of belonging to individual classes are also written to out.txt, thanks to the --probabilities option.
Program options (any order is fine):
- -i input, --input input: A path to an input fasta file. Can be gzipped (identified by the .gz suffix).
- -o output, --output output: A path to the output file. If not provided, the result is printed to stdout.
- -m MIN_LEN, --min_len MIN_LEN: Minimum length of a sequence. Default: 3000.
- -p cutoff [cutoff ...], --prob_cutoff cutoff [cutoff ...]: Probability threshold needed for classification to a class. If two floats are provided, the first is used in the first stage, the second in the second stage. Default: [0.65, 0.65].
- --to_fasta class [class ...], --tf class [class ...]: Write sequences to fasta files specified in the arguments to this option. The arguments are: mit - mitochondria, pla - plastid, bac - bacteria, arc - archaea, euk - eukarya, unk - unknown, pro - prokarya, all - all classes present in the input fasta (to separate fasta files).
- -t THREADS, --threads THREADS: Number of threads used.
- --probabilities, --pr: Whether to write probabilities of individual classes for each sequence.
- -v, --verbose: Whether to display additional messages and a progress bar during classification.
- --gz, --gzip: gzip all program outputs (adds a .gz extension to the -o output).
Advanced options:
- --first_stage_kmer, --k1: k-mer length used in the first stage of classification. Default: 6. Available options: 4, 5, 6.
- --second_stage_kmer, --k2: k-mer length used in the second stage of classification. Default: 7. Available options: 4, 5, 6, 7.
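For instance, a run on a gzipped input that writes every class to a separate fasta file, uses 8 threads and gzips all outputs could look like this (file names here are only placeholders):
tiara -i assembly.fasta.gz -o assembly_out.txt --tf all -t 8 --gz -v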
Running training scripts
First, clone and go to the repo:
git clone https://github.com/ibe-uw/tiara.git
cd tiara
TF-IDF
You can train your own TF-IDF models using the following command (run from the repo):
python -m tiara.training.train_tfidf <directory with input fasta files> <output directory>
The input fasta files have to include archaea.fasta, eukarya.fasta, mitochondria.fasta, plastids.fasta and bacteria.fasta.
Editing the script to suit other needs should be easy.
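For example, assuming the required fasta files sit in a directory named training_fasta (both directory names here are placeholders):
python -m tiara.training.train_tfidf training_fasta tfidf_out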
Training neural networks
Run
python -m tiara.training.train_models <input dir> <output dir> <n_cores>
This will produce models with the best parameters for each k-mer length (4-6 for the first stage, 4-7 for the second).
The <input dir> has to have the same structure as the output dir.
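A hypothetical run on 8 cores (directory names are placeholders):
python -m tiara.training.train_models training_data nnet_models 8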
Hyperparameter search
The input fasta directory has to have two directories: train and validation. Each one has to have either all types of inputs described above or only mitochondria.fasta and plastids.fasta, depending on the stage of classification.
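For example, a directory containing all input types could be laid out as follows (the top-level name is only a placeholder):
hyperparam_data/
    train/
        archaea.fasta
        bacteria.fasta
        eukarya.fasta
        mitochondria.fasta
        plastids.fasta
    validation/
        archaea.fasta
        bacteria.fasta
        eukarya.fasta
        mitochondria.fasta
        plastids.fasta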
Hyperparameter search using the default TF-IDF models can be done by calling:
python -m tiara.training.hyperparameter_search_first_stage <input dir> <output filename> <kmer length> <n_cores>
Similarly for tiara.training.hyperparameter_search_second_stage.
The output file includes a confusion matrix and statistics such as accuracy and F1, calculated at the end of the default 50-epoch training. An additional file is also produced (with a histories_ prefix added), containing statistics calculated after each epoch.
Using tiara as a package
Although most of the code is specific to our tool, the TfidfWeighter class could be useful for any researcher wanting to represent nucleotide sequences by oligonucleotide frequency (basically bag-of-words), weighted with tf-idf. The algorithm is explained in the docstrings. We provide several models for different k-mer lengths in the tiara/models/tfidf-models folder.
The basic usage would be:
>>> from tiara import TfidfWeighter
# to read some more documentation
>>> help(TfidfWeighter)
# k means k-mer length
# sequences longer than fragment_len will be split into fragments,
# so they will be treated as separate words in tf-idf
>>> tfidf = TfidfWeighter(k=5, fragment_len=5000)
>>> tfidf.fit(["list", "of", "fasta", "files"]) # or filename of a single fasta file
>>> transformed_sequences = tfidf.transform(["ACCGTTTGCAC", "AACGCGACGTGCGAGTTT"]) # or a single nucleotide sequence
# saving the model
# it creates two files: params.txt with hyperparameters of the model (k, fragment_len, etc.)
# and model.npy, a numpy array containing idf vector
# in a directory specified in the argument
# if no directory is specified it will generate the name automatically,
# based on data names used to train the model and other parameters
>>> tfidf.save_params("model_directory")
# loading the model
>>> tfidf = TfidfWeighter.load_params("model_directory")
There are also a couple of functions that calculate the oligonucleotide frequency of a sequence or a set of sequences.
>>> from tiara import oligofreq, single_oligofreq, multiple_oligofreq
>>> sample_sequence = "AGCTGCGCGACGCGAGCGTGCGCT"
# the oligofreq function has a simple implementation and works for any k-mer length, but is slower than the other two
>>> oligofreq(sample_sequence, 2)
array([0., 1., 2., 0., 0., 0., 6., 2., 2., 7., 0., 1., 0., 0., 2., 0.],
dtype=float32)
# the functions single_oligofreq and multiple_oligofreq are faster than oligofreq,
# but they use a "hardcoded" dictionary that gives the position of each k-mer in the array,
# so they only work for k-mer lengths from 1 to 7
# they use numba under the hood
>>> single_oligofreq(sample_sequence, 2)
array([0., 1., 2., 0., 0., 0., 6., 2., 2., 7., 0., 1., 0., 0., 2., 0.],
dtype=float32)
>>> multiple_oligofreq([sample_sequence, sample_sequence], 2)
array([[0., 1., 2., 0., 0., 0., 6., 2., 2., 7., 0., 1., 0., 0., 2., 0.],
[0., 1., 2., 0., 0., 0., 6., 2., 2., 7., 0., 1., 0., 0., 2., 0.]],
dtype=float32)