The bistro algorithm
bistro is an algorithm, an R package, and a function within said R package. It employs methods from forensics to identify matches between bloodmeals and the people they bit using short tandem repeat (STR) profiles of human blood from freshly fed bloodmeals and from people. It can be used to match multi-source bloodmeals and bloodmeals with incomplete STR profiles. Note that while you can use it for matching any STR profiles for research purposes, we very much do not recommend using it for forensic purposes. For more details about the algorithm, please refer to the bistro manuscript (will add link once it’s posted).
A core part of the algorithm is the contLikSearch()
function from the euroformix
package, which is used to
calculate log10 likelihood ratios (log10LRs). The numerator of the
log10LR is the likelihood that the person was bitten by the vector and
the denominator is the likelihood that someone else was bitten by the
vector. Here is more information about euroformix: - Manuscript: EuroForMix: An open
source software based on a continuous model to evaluate STR DNA profiles
from a mixture of contributors with artefacts - GitHub - Website that explains GUI
bistro uses the log10LRs to identify bloodmeal-human matches using a per-bloodmeal dynamic threshold approach.
Running bistro in parallel
First, if you want to run bistro on many samples and would prefer to run it in parallel, potentially on a cluster, check out our template bistro Snakemake pipeline.
Installing the bistro
R package
To install bistro
, run the following commands:
install.packages("remotes") # install remotes if needed
remotes::install_github("duke-malaria-collaboratory/bistro")
If you run into errors, try the following:
- Windows: install Rtools
- Mac: install Xcode from the Mac App store
Getting started with bistro
The most important function in bistro
is
bistro()
. This document will take you through all of the
required and optional inputs, as well as the output.
In summary, you must provide:
- Bloodmeal STR profiles
- Human STR profiles
- The STR genotyping kit name
- An allele peak height threshold (in relative fluorescence units, RFUs) for the bloodmeal STR profiles
The function outputs matches for each bloodmeal-human pair.
To follow along with the vignette, first load
bistro
:
Note that when we load bistro
, we also load euroformix
because bistro
depends on the euroformix
package.
Required inputs
Before we run bistro()
using example data, let’s learn
more about the required inputs.
Two required, and one optional but often recommended, datasets are
required to run bistro()
: bloodmeal and human STR profiles
are required, and human population allele frequencies are optional. The
other two required inputs are the STR genotyping kit name and the
bloodmeal STR allele peak height threshold.
Bloodmeal STR profiles (bloodmeal_profiles
)
The bloodmeal STR profiles dataset should have one row per allele for
each bloodmeal and marker, and include the allele peak height.
Homozygous markers should only be included once (i.e. as one row). It
must have four columns named SampleName, Marker, Allele, and Height. The
SampleName column should hold character values. Here is an example
included with bistro
:
head(bloodmeal_profiles)
#> SampleName Marker Allele Height
#> 1 evid1 AMEL X 2136
#> 2 evid1 AMEL Y 1015
#> 3 evid1 D10S1248 13 1856
#> 4 evid1 D10S1248 14 155
#> 5 evid1 D10S1248 15 1045
#> 6 evid1 D12S391 18 297
In this example dataset, we have 4 different bloodmeals:
unique(bloodmeal_profiles$SampleName)
#> [1] "evid1" "evid2" "evid3" "evid4"
Human STR profiles (human_profiles
)
The human STR profiles dataset takes a similar format to the bloodmeal STR profiles dataset - it should have one row per allele for each human and marker, but does not require peak heights. Again, homozygous markers should only be included once. The dataset must have three columns named SampleName, Marker, and Allele. The SampleName column should hold character values. Here is an example:
head(human_profiles)
#> SampleName Marker Allele
#> 1 00-JP0001-14_20142342311_NO-3241 AMEL X
#> 2 00-JP0001-14_20142342311_NO-3241 AMEL Y
#> 3 00-JP0001-14_20142342311_NO-3241 D10S1248 12
#> 4 00-JP0001-14_20142342311_NO-3241 D10S1248 13
#> 5 00-JP0001-14_20142342311_NO-3241 D12S391 17
#> 6 00-JP0001-14_20142342311_NO-3241 D12S391 18
In this example dataset, we have 3 different human profiles:
unique(human_profiles$SampleName)
#> [1] "00-JP0001-14_20142342311_NO-3241" "P1"
#> [3] "P2"
STR genotyping kit (kit
)
bistro
requires as input the STR genotyping kit name.
The euroformix
package includes kit parameters for 23
common STR genotyping kits. It also includes a handy function to list
all available kits:
getKit()
#> [1] "testkit" "ESX16" "ESX17" "ESX17Fast"
#> [5] "ESI17Fast" "Fusion" "Fusion 6C" "SGMPlus"
#> [9] "Identifiler" "NGM" "NGMSElect" "NGMDetect"
#> [13] "GlobalFiler" "PowerPlex16" "PowerPlex21" "24plex"
#> [17] "ESSPlex" "ESSplexPlus" "ESSplexSEPlus" "ESSplexSEQS"
#> [21] "ForenSeq" "VFP" "MiniFiler"
If the kit you used to genotype samples is not available in the
defaults, you will have to modify kit.txt
within the
euroformix
package with the appropriate kit parameters. The
required parameters can usually be found in the kit manufacturer’s
documentation or or on their website.
Bloodmeal allele peak threshold (peak_thresh
)
bistro()
also requires a bloodmeal allele peak threshold
(in RFUs) to be provided. All peaks with heights below this threshold
will be removed before matching. If prior filtering was performed on the
bloodmeal profiles based on peak height, then this number should be
equal to or greater than that threshold.
Human population allele frequencies
(pop_allele_freqs
)
Human population frequencies for each allele at each locus may be supplied. If you would prefer, population allele frequencies can be computed from the human STR profiles (see below for more details). Note that the latter option is not suggested when a small set of human reference profiles is being used.
The euroformix website provides some publicly available population frequency datasets. Additionally, you can get United States population allele frequencies for many alleles from NIST.
If you would like to input population allele frequencies, the dataset should contain one column for each STR marker and one row for each allele. The alleles should be listed in the first column of the dataset and the column should be named “Allele”. The entries for each marker-allele combination are the population allele frequency or NA if that allele does not exist for that marker. Here is an example with the loci from the PowerPlex ESX 17 System (ESX17) kit.
head(pop_allele_freqs)
#> Allele D3S1358 TH01 D21S11 D18S51 D10S1248 D1S1656 D2S1338
#> 1 5.0 NA 0.002598441 NA NA NA NA NA
#> 2 6.0 NA 0.209274435 NA NA NA NA NA
#> 3 6.3 NA NA NA NA NA NA NA
#> 4 7.0 NA 0.212472516 NA 0.0008984726 NA NA NA
#> 5 8.0 NA 0.083649810 NA NA NA NA NA
#> 6 8.2 NA NA NA NA NA NA NA
#> D16S539 D22S1045 VWA D8S1179 FGA D2S441 D12S391 D19S433 SE33
#> 1 NA NA NA NA NA NA NA NA NA
#> 2 0.0008992806 NA NA NA NA NA NA NA NA
#> 3 NA NA NA NA NA NA NA NA NA
#> 4 NA NA NA NA NA NA NA NA NA
#> 5 0.0086930456 NA NA 0.01130226 NA NA NA NA NA
#> 6 NA NA NA NA NA NA NA NA 0.002506945
Running bistro()
Now you have enough background to run bistro()
, so let’s
test it out using the example data included in the package:
bistro_output <- bistro(
bloodmeal_profiles = bloodmeal_profiles,
human_profiles = human_profiles,
pop_allele_freqs = pop_allele_freqs,
kit = "ESX17",
peak_thresh = 200
)
#> 1/17 markers in kit but not in pop_allele_freqs: AMEL
#> Formatting bloodmeal profiles
#> Removing 6 peaks under the threshold of 200 RFU.
#> For 1/4 bloodmeal ids, all peaks are below the threshold
#> Formatting human profiles
#> Markers being used: D10S1248, D12S391, D16S539, D18S51, D19S433, D1S1656, D21S11, D22S1045, D2S1338, D2S441, D3S1358, D8S1179, FGA, SE33, TH01, VWA
#> Calculating log10LRs
#> # bloodmeal ids: 3
#> # human ids: 3
#> Bloodmeal id 1/3
#> Human id 1/3
#> Human id 2/3
#> Human id 3/3
#> Bloodmeal id 2/3
#> Human id 1/3
#> Human id 2/3
#> Human id 3/3
#> Bloodmeal id 3/3
#> Human id 1/3
#> Human id 2/3
#> Human id 3/3
#> Identifying matches
You’ll notice a bunch of messages print out to the screen:
- One of the markers (AMEL) is in the kit but not in
pop_allele_freqs
. - 6 peaks under the 200 RFU threshold were removed, and after filtering one of the bloodmeals did not have any peaks remaining above the threshold.
- Notification that log10LRs (log10 likelihood ratios) are being calculated for each bloodmeal-human pair (for 4 bloodmeals and 3 humans). This can take a while, so the progress is also printed out.
- Notification that matches are being identified.
Let’s take a look at the output:
bistro_output
#> # A tibble: 4 × 8
#> bloodmeal_id locus_count est_noc match human_id log10_lr notes thresh_low
#> <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <dbl>
#> 1 evid1 16 2 yes P1 21.8 passed al… 9.5
#> 2 evid1 16 2 yes P2 10.3 passed al… 9.5
#> 3 evid2 1 2 no NA NA all log10… NA
#> 4 evid3 8 1 no NA NA all log10… NA
The output contains 8 columns:
-
bloodmeal_id
: bloodmeal sample name. -
est_noc
: estimated number of contributors (NOCs) to the bloodmeal. When this equals one, it’s predicted to be a single-source bloodmeal; when it is greater than one, it’s predicted to be a multi-source bloodmeal. Note that sources here mean humans only. -
locus_count
: number of loci STR-typed in the bloodmeal. -
match
: whether the bloodmeal STR profile matched to a human in the database (yes or no). -
human_id
: human match for the bloodmeal (NA if no match). -
log10_lr
: log10 likelihood ratio of the bloodmeal-human match (NA if no match). The numerator of the log10LR is the likelihood that the person was bitten by the bloodmeal and the denominator is the likelihood that someone else was bitten by the bloodmeal. -
notes
: why the bloodmeal does or doesn’t have a match (see below for more details). -
thresh_low
: log10LR threshold at which the match was made.
An important thing to notice is that, even though we have only 4 bloodmeals, there are 5 rows in the output dataset. This is because, while each bloodmeal will appear at least once in the dataset, multiple rows are present for bloodmeals that match to multiple people (one row per match). For example, here, evid1 matches to 2 different people (P1 and P2) Please keep this in mind when performing data analysis.
The notes column provides more information on why there was or was not a match:
- passed all filters: there was a match.
- > min NOC matches: there were more matches than expected and they could not be resolved. This can happen when there are closely related people in the human reference dataset.
- no shared alleles: no people in the human reference dataset had alleles that overlapped with the alleles in the bloodmeal.
- all log10LRs < 1.5: none of the log10LRs were high enough to be considered a match.
- euroformix error: the
euroformix::contLikSearch()
function did not run successfully. This often happens with very incomplete bloodmeal profiles (i.e. where only a few markers were succesfully STR-typed). - no peaks above threshold: all bloodmeal STR peaks were below the peak height threshold.
- timed out: log-likelihood took too long to calculate. If you want to
give it more time, you can modify the
time_limit
argument (see below).
Using computed population allele frequencies
If you want to have the population allele frequencies computed from
the inputted human reference profiles, you must set
calc_allele_freqs = TRUE
:
bistro(
bloodmeal_profiles = bloodmeal_profiles,
human_profiles = human_profiles,
calc_allele_freqs = TRUE,
kit = "ESX17",
peak_thresh = 200
)
Note that, while we use the builtin dataset as an example here, we do not recommend using this method for a human dataset with only 3 people. An example of when it may make sense is if you have a relatively comprehensive set of human reference profiles for your study site.
Running bistro()
on a subset of samples
bistro()
can take a long time to run, especially if you
have many samples or a closely related reference population. Therefore,
before running it with all of your samples, we highly recommend testing
it out on a small subset of samples to make sure everything is working
as you expect. To do this, you can use the bloodmeal_ids
and human_ids
arguments to tell bistro()
what
ids you want to compare:
bistro(
bloodmeal_profiles = bloodmeal_profiles,
human_profiles = human_profiles,
pop_allele_freqs = pop_allele_freqs,
kit = "ESX17",
peak_thresh = 200,
bloodmeal_ids = c("evid1", "evid2"),
human_ids = "P1"
)
This can also be useful if you would like to parallelize
bistro()
, for example by using a workflow manager such as
Snakemake or NextFlow. If you’d like to try it out, here
is a template Snakemake pipeline for running bistro. We highly recommend
doing this if you are planning to run it on many samples.
Other arguments
Other arguments you can modify when running bistro()
include:
Used by bistro
:
-
rm_twins
: whether or not to remove identical STR profiles that are likely twins prior to identifying matches. Note that matches with these people cannot be resolved (default: TRUE) -
time_limit
: longest amount of time allowed to calculate the log-likelihood for one bloodmeal-human pair, in minutes (default: 3)
Used by euroformix::contLikSearch()
when calculating
log-likelihoods:
-
rm_markers
: markers to not use for calculating LRs. By default, AMEL is removed because it is not standard to use for calculating LRs as it is a sex-specific marker -
model_degrad
: whether or not to model peak degradation (default: TRUE) -
model_bw_stutt
andmodel_fw_stutt
: whether or not to model peak backward and forward stutter (default: FALSE) -
difftol
: difference tolerance in log-likelihood across 2 iterations (default: 1) -
threads
: number of threads to use when computing log-likelihoods (default: 4) -
seed
: seed for reproducible results (default: 1) -
return_lr
: whether or not to return all bloodmeal-human log10LRs. This is useful if you want to investigate the matches or perform static threshold matching (see below) (default: FALSE)
Other matching algorithms
We have also implemented a few other matching algorithms including:
-
match_exact()
: Exact matching of all markers and alleles -
match_similarity()
: Similarity matching based on the highest similarity value between humans in a reference database -
match_static_thresh()
: Matching based on static log10LR threshold.
These algorithms take similar inputs to bistro()
, or in
the case of match_static_thresh()
, a bistro()
output.
Look out for our upcoming manuscript for a comparison of these methods.
Identifying bloodmeals derived from the same host without a reference human database
If you don’t have a reference database of human samples but would
still like to identify bloodmeals derived from the same host, you can
use the create_db_from_bloodmeals()
function, and then
input that human database into the bistro()
function. To
see how this works, first load example data from the bistro
manuscript:
samples <- readr::read_csv("https://raw.githubusercontent.com/duke-malaria-collaboratory/bistro_validation/main/data/provedit/provedit_samples_mass200thresh.csv")
To create a human database from these samples, you can run
create_db_from_bloodmeals()
, which creates human profiles
from complete single-source bloodmeal profiles:
hu_db <- create_db_from_bloodmeals(samples, "identifiler", 200)
Then you can run bistro()
as usual:
Running bistro step-by-step
If you’re interested in learning how to run sub-functions of
bistro()
individually, head over to
vignette("step-by-step")
.
Have questions or need help troubleshooting?
Open up an issue on our GitHub page or contact us (Christine: christine.markwalter@duke.edu, Zena: zenalapp@gmail.com) and we can help out.