Choosing BIGSI Parameters

TLDR

If you're indexing bacteria with expected length ~5,000,000 reasonable parameters are k=31 m=25,000,000 h=3.

Assuming you're going to be searching for queries > 50bp long than choose bloom filter parameters which result in a false positive rate p<=0.3.

Choosing parameters

The choice of BIGSI parameters (m, h), depends on: the maximum number of k-mers expected in any colour (K_max), the number of datasets/colours (N) expected, the shortest length of the query sequence to be supported (L_min), the k-mer size (k) and the maximum number of acceptable false discoveries per query (q_max). Since each query L will consist of L = L_min - k + 1 k-mers, the expected number of false discoveries (V) for any query can be calculated as q=E[V]=Np^L, where p is the false positive rate of the bloom filter. Therefore, the desired false positive rate per bloom filter for q_max is p = (q_max/N)^(1/L). For a given cardinality (n) and desired false positive rate, optimal bloom filter parameters can be determined by:

m = - n*ln(p) / (ln(2)^2)
h = - ln(p) / ln(2)

Resulting in optimal BIGSI parameters of:

m = - (K_max ln(q_max/N)) / (L ln(2)^2 )

h = - ln(q_max/N) / (L * ln(2) )