Introduction to using bistro

The bistro algorithm

bistro is an algorithm, an R package, and a function within said R package. It employs methods from forensics to identify matches between bloodmeals and the people they bit using short tandem repeat (STR) profiles of human blood from freshly fed bloodmeals and from people. It can be used to match multi-source bloodmeals and bloodmeals with incomplete STR profiles. Note that while you can use it for matching any STR profiles for research purposes, we very much do not recommend using it for forensic purposes. For more details about the algorithm, please refer to the bistro manuscript (will add link once it’s posted).

A core part of the algorithm is the contLikSearch() function from the euroformix package, which is used to calculate log10 likelihood ratios (log10LRs). The numerator of the log10LR is the likelihood that the person was bitten by the vector and the denominator is the likelihood that someone else was bitten by the vector. Here is more information about euroformix: - Manuscript: EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts - GitHub - Website that explains GUI

bistro uses the log10LRs to identify bloodmeal-human matches using a per-bloodmeal dynamic threshold approach.

Running bistro in parallel

First, if you want to run bistro on many samples and would prefer to run it in parallel, potentially on a cluster, check out our template bistro Snakemake pipeline.

Installing the `bistro` R package

To install bistro, run the following commands:

install.packages("remotes") # install remotes if needed
remotes::install_github("duke-malaria-collaboratory/bistro")

If you run into errors, try the following:

Windows: install Rtools
Mac: install Xcode from the Mac App store

Getting started with `bistro`

The most important function in bistro is bistro(). This document will take you through all of the required and optional inputs, as well as the output.

In summary, you must provide:

Bloodmeal STR profiles
Human STR profiles
The STR genotyping kit name
An allele peak height threshold (in relative fluorescence units, RFUs) for the bloodmeal STR profiles

The function outputs matches for each bloodmeal-human pair.

To follow along with the vignette, first load bistro:

library(bistro)
#> Loading required package: euroformix

Note that when we load bistro, we also load euroformix because bistro depends on the euroformix package.

Required inputs

Before we run bistro() using example data, let’s learn more about the required inputs.

Two required, and one optional but often recommended, datasets are required to run bistro(): bloodmeal and human STR profiles are required, and human population allele frequencies are optional. The other two required inputs are the STR genotyping kit name and the bloodmeal STR allele peak height threshold.

Bloodmeal STR profiles (`bloodmeal_profiles`)

The bloodmeal STR profiles dataset should have one row per allele for each bloodmeal and marker, and include the allele peak height. Homozygous markers should only be included once (i.e. as one row). It must have four columns named SampleName, Marker, Allele, and Height. The SampleName column should hold character values. Here is an example included with bistro:

head(bloodmeal_profiles)
#>   SampleName   Marker Allele Height
#> 1      evid1     AMEL      X   2136
#> 2      evid1     AMEL      Y   1015
#> 3      evid1 D10S1248     13   1856
#> 4      evid1 D10S1248     14    155
#> 5      evid1 D10S1248     15   1045
#> 6      evid1  D12S391     18    297

In this example dataset, we have 4 different bloodmeals:

unique(bloodmeal_profiles$SampleName)
#> [1] "evid1" "evid2" "evid3" "evid4"

Human STR profiles (`human_profiles`)

The human STR profiles dataset takes a similar format to the bloodmeal STR profiles dataset - it should have one row per allele for each human and marker, but does not require peak heights. Again, homozygous markers should only be included once. The dataset must have three columns named SampleName, Marker, and Allele. The SampleName column should hold character values. Here is an example:

head(human_profiles)
#>                         SampleName   Marker Allele
#> 1 00-JP0001-14_20142342311_NO-3241     AMEL      X
#> 2 00-JP0001-14_20142342311_NO-3241     AMEL      Y
#> 3 00-JP0001-14_20142342311_NO-3241 D10S1248     12
#> 4 00-JP0001-14_20142342311_NO-3241 D10S1248     13
#> 5 00-JP0001-14_20142342311_NO-3241  D12S391     17
#> 6 00-JP0001-14_20142342311_NO-3241  D12S391     18

In this example dataset, we have 3 different human profiles:

unique(human_profiles$SampleName)
#> [1] "00-JP0001-14_20142342311_NO-3241" "P1"                              
#> [3] "P2"

STR genotyping kit (`kit`)

bistro requires as input the STR genotyping kit name. The euroformix package includes kit parameters for 23 common STR genotyping kits. It also includes a handy function to list all available kits:

getKit()
#>  [1] "testkit"       "ESX16"         "ESX17"         "ESX17Fast"    
#>  [5] "ESI17Fast"     "Fusion"        "Fusion 6C"     "SGMPlus"      
#>  [9] "Identifiler"   "NGM"           "NGMSElect"     "NGMDetect"    
#> [13] "GlobalFiler"   "PowerPlex16"   "PowerPlex21"   "24plex"       
#> [17] "ESSPlex"       "ESSplexPlus"   "ESSplexSEPlus" "ESSplexSEQS"  
#> [21] "ForenSeq"      "VFP"           "MiniFiler"

If the kit you used to genotype samples is not available in the defaults, you will have to modify kit.txt within the euroformix package with the appropriate kit parameters. The required parameters can usually be found in the kit manufacturer’s documentation or or on their website.

Bloodmeal allele peak threshold (`peak_thresh`)

bistro() also requires a bloodmeal allele peak threshold (in RFUs) to be provided. All peaks with heights below this threshold will be removed before matching. If prior filtering was performed on the bloodmeal profiles based on peak height, then this number should be equal to or greater than that threshold.

Human population allele frequencies (`pop_allele_freqs`)

Human population frequencies for each allele at each locus may be supplied. If you would prefer, population allele frequencies can be computed from the human STR profiles (see below for more details). Note that the latter option is not suggested when a small set of human reference profiles is being used.

The euroformix website provides some publicly available population frequency datasets. Additionally, you can get United States population allele frequencies for many alleles from NIST.

If you would like to input population allele frequencies, the dataset should contain one column for each STR marker and one row for each allele. The alleles should be listed in the first column of the dataset and the column should be named “Allele”. The entries for each marker-allele combination are the population allele frequency or NA if that allele does not exist for that marker. Here is an example with the loci from the PowerPlex ESX 17 System (ESX17) kit.

head(pop_allele_freqs)
#>   Allele D3S1358        TH01 D21S11       D18S51 D10S1248 D1S1656 D2S1338
#> 1    5.0      NA 0.002598441     NA           NA       NA      NA      NA
#> 2    6.0      NA 0.209274435     NA           NA       NA      NA      NA
#> 3    6.3      NA          NA     NA           NA       NA      NA      NA
#> 4    7.0      NA 0.212472516     NA 0.0008984726       NA      NA      NA
#> 5    8.0      NA 0.083649810     NA           NA       NA      NA      NA
#> 6    8.2      NA          NA     NA           NA       NA      NA      NA
#>        D16S539 D22S1045 VWA    D8S1179 FGA D2S441 D12S391 D19S433        SE33
#> 1           NA       NA  NA         NA  NA     NA      NA      NA          NA
#> 2 0.0008992806       NA  NA         NA  NA     NA      NA      NA          NA
#> 3           NA       NA  NA         NA  NA     NA      NA      NA          NA
#> 4           NA       NA  NA         NA  NA     NA      NA      NA          NA
#> 5 0.0086930456       NA  NA 0.01130226  NA     NA      NA      NA          NA
#> 6           NA       NA  NA         NA  NA     NA      NA      NA 0.002506945

Running `bistro()`

Now you have enough background to run bistro(), so let’s test it out using the example data included in the package:

bistro_output <- bistro(
  bloodmeal_profiles = bloodmeal_profiles,
  human_profiles = human_profiles,
  pop_allele_freqs = pop_allele_freqs,
  kit = "ESX17",
  peak_thresh = 200
)
#> 1/17 markers in kit but not in pop_allele_freqs: AMEL
#> Formatting bloodmeal profiles
#> Removing 6 peaks under the threshold of 200 RFU.
#> For 1/4 bloodmeal ids, all peaks are below the threshold
#> Formatting human profiles
#> Markers being used: D10S1248, D12S391, D16S539, D18S51, D19S433, D1S1656, D21S11, D22S1045, D2S1338, D2S441, D3S1358, D8S1179, FGA, SE33, TH01, VWA
#> Calculating log10LRs
#> # bloodmeal ids: 3
#> # human ids: 3
#> Bloodmeal id 1/3
#> Human id 1/3
#> Human id 2/3
#> Human id 3/3
#> Bloodmeal id 2/3
#> Human id 1/3
#> Human id 2/3
#> Human id 3/3
#> Bloodmeal id 3/3
#> Human id 1/3
#> Human id 2/3
#> Human id 3/3
#> Identifying matches

You’ll notice a bunch of messages print out to the screen:

One of the markers (AMEL) is in the kit but not in pop_allele_freqs.
6 peaks under the 200 RFU threshold were removed, and after filtering one of the bloodmeals did not have any peaks remaining above the threshold.
Notification that log10LRs (log10 likelihood ratios) are being calculated for each bloodmeal-human pair (for 4 bloodmeals and 3 humans). This can take a while, so the progress is also printed out.
Notification that matches are being identified.

Let’s take a look at the output:

bistro_output
#> # A tibble: 4 × 8
#>   bloodmeal_id locus_count est_noc match human_id log10_lr notes      thresh_low
#>   <chr>              <int>   <dbl> <chr> <chr>       <dbl> <chr>           <dbl>
#> 1 evid1                 16       2 yes   P1           21.8 passed al…        9.5
#> 2 evid1                 16       2 yes   P2           10.3 passed al…        9.5
#> 3 evid2                  1       2 no    NA           NA   all log10…       NA  
#> 4 evid3                  8       1 no    NA           NA   all log10…       NA

The output contains 8 columns:

bloodmeal_id: bloodmeal sample name.
est_noc: estimated number of contributors (NOCs) to the bloodmeal. When this equals one, it’s predicted to be a single-source bloodmeal; when it is greater than one, it’s predicted to be a multi-source bloodmeal. Note that sources here mean humans only.
locus_count: number of loci STR-typed in the bloodmeal.
match: whether the bloodmeal STR profile matched to a human in the database (yes or no).
human_id: human match for the bloodmeal (NA if no match).
log10_lr: log10 likelihood ratio of the bloodmeal-human match (NA if no match). The numerator of the log10LR is the likelihood that the person was bitten by the bloodmeal and the denominator is the likelihood that someone else was bitten by the bloodmeal.
notes: why the bloodmeal does or doesn’t have a match (see below for more details).
thresh_low: log10LR threshold at which the match was made.

An important thing to notice is that, even though we have only 4 bloodmeals, there are 5 rows in the output dataset. This is because, while each bloodmeal will appear at least once in the dataset, multiple rows are present for bloodmeals that match to multiple people (one row per match). For example, here, evid1 matches to 2 different people (P1 and P2) Please keep this in mind when performing data analysis.

The notes column provides more information on why there was or was not a match:

passed all filters: there was a match.
> min NOC matches: there were more matches than expected and they could not be resolved. This can happen when there are closely related people in the human reference dataset.
no shared alleles: no people in the human reference dataset had alleles that overlapped with the alleles in the bloodmeal.
all log10LRs < 1.5: none of the log10LRs were high enough to be considered a match.
euroformix error: the euroformix::contLikSearch() function did not run successfully. This often happens with very incomplete bloodmeal profiles (i.e. where only a few markers were succesfully STR-typed).
no peaks above threshold: all bloodmeal STR peaks were below the peak height threshold.
timed out: log-likelihood took too long to calculate. If you want to give it more time, you can modify the time_limit argument (see below).

Using computed population allele frequencies

If you want to have the population allele frequencies computed from the inputted human reference profiles, you must set calc_allele_freqs = TRUE:

bistro(
  bloodmeal_profiles = bloodmeal_profiles,
  human_profiles = human_profiles,
  calc_allele_freqs = TRUE,
  kit = "ESX17",
  peak_thresh = 200
)

Note that, while we use the builtin dataset as an example here, we do not recommend using this method for a human dataset with only 3 people. An example of when it may make sense is if you have a relatively comprehensive set of human reference profiles for your study site.

Running `bistro()` on a subset of samples

bistro() can take a long time to run, especially if you have many samples or a closely related reference population. Therefore, before running it with all of your samples, we highly recommend testing it out on a small subset of samples to make sure everything is working as you expect. To do this, you can use the bloodmeal_ids and human_ids arguments to tell bistro() what ids you want to compare:

bistro(
  bloodmeal_profiles = bloodmeal_profiles,
  human_profiles = human_profiles,
  pop_allele_freqs = pop_allele_freqs,
  kit = "ESX17",
  peak_thresh = 200,
  bloodmeal_ids = c("evid1", "evid2"),
  human_ids = "P1"
)

This can also be useful if you would like to parallelize bistro(), for example by using a workflow manager such as Snakemake or NextFlow. If you’d like to try it out, here is a template Snakemake pipeline for running bistro. We highly recommend doing this if you are planning to run it on many samples.

Other arguments

Other arguments you can modify when running bistro() include:

Used by bistro:

rm_twins: whether or not to remove identical STR profiles that are likely twins prior to identifying matches. Note that matches with these people cannot be resolved (default: TRUE)
time_limit: longest amount of time allowed to calculate the log-likelihood for one bloodmeal-human pair, in minutes (default: 3)

Used by euroformix::contLikSearch() when calculating log-likelihoods:

rm_markers: markers to not use for calculating LRs. By default, AMEL is removed because it is not standard to use for calculating LRs as it is a sex-specific marker
model_degrad: whether or not to model peak degradation (default: TRUE)
model_bw_stutt and model_fw_stutt: whether or not to model peak backward and forward stutter (default: FALSE)
difftol: difference tolerance in log-likelihood across 2 iterations (default: 1)
threads: number of threads to use when computing log-likelihoods (default: 4)
seed: seed for reproducible results (default: 1)
return_lr: whether or not to return all bloodmeal-human log10LRs. This is useful if you want to investigate the matches or perform static threshold matching (see below) (default: FALSE)

Other matching algorithms

We have also implemented a few other matching algorithms including:

match_exact(): Exact matching of all markers and alleles
match_similarity(): Similarity matching based on the highest similarity value between humans in a reference database
match_static_thresh(): Matching based on static log10LR threshold.

These algorithms take similar inputs to bistro(), or in the case of match_static_thresh(), a bistro() output.

Look out for our upcoming manuscript for a comparison of these methods.

Identifying bloodmeals derived from the same host without a reference human database

If you don’t have a reference database of human samples but would still like to identify bloodmeals derived from the same host, you can use the create_db_from_bloodmeals() function, and then input that human database into the bistro() function. To see how this works, first load example data from the bistro manuscript:

samples <- readr::read_csv("https://raw.githubusercontent.com/duke-malaria-collaboratory/bistro_validation/main/data/provedit/provedit_samples_mass200thresh.csv")

To create a human database from these samples, you can run create_db_from_bloodmeals(), which creates human profiles from complete single-source bloodmeal profiles:

hu_db <- create_db_from_bloodmeals(samples, "identifiler", 200)

Then you can run bistro() as usual:

bistro(samples, hu_db, "identifiler", 200,
  bloodmeal_ids = "A02_RD14-0003-23d3a-0.2IP-Q1.2_001.10sec.fsa",
  human_ids = paste0("H", 20:24),
  calc_allele_freqs = TRUE
)

Running bistro step-by-step

If you’re interested in learning how to run sub-functions of bistro() individually, head over to vignette("step-by-step").

Have questions or need help troubleshooting?

Open up an issue on our GitHub page or contact us (Christine: christine.markwalter@duke.edu, Zena: zenalapp@gmail.com) and we can help out.

The bistro algorithm

Running bistro in parallel

Installing the bistro R package

Getting started with bistro

Required inputs

Bloodmeal STR profiles (bloodmeal_profiles)

Human STR profiles (human_profiles)

STR genotyping kit (kit)

Bloodmeal allele peak threshold (peak_thresh)

Human population allele frequencies (pop_allele_freqs)

Running bistro()