Choosing BIGSI Parameters

TLDR

If you're indexing bacterial genomes with an expected length of ~5,000,000 bp, reasonable parameters are k=31, m=25,000,000, h=3.

Assuming you're going to be searching with queries longer than 50 bp, choose Bloom filter parameters that result in a false positive rate p <= 0.3.
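As a quick sanity check on the suggested parameters, the sketch below estimates the false positive rate of a single Bloom filter using the standard approximation p = (1 - e^(-hn/m))^h. This formula is not stated in this document, and the function name is hypothetical. It also assumes that the number of distinct k-mers per genome (n) is roughly equal to the genome length.

```python
import math


def bloom_false_positive_rate(m, h, n):
    """Approximate false positive rate of a Bloom filter.

    m: number of bits in the filter
    h: number of hash functions
    n: number of distinct items (k-mers) inserted
    Uses the standard approximation p = (1 - exp(-h * n / m)) ** h.
    """
    return (1.0 - math.exp(-h * n / m)) ** h


# Assumption: a ~5,000,000 bp genome contributes about 5,000,000 distinct k-mers.
p = bloom_false_positive_rate(m=25_000_000, h=3, n=5_000_000)
print(f"p = {p:.3f}")
```

Under these assumptions the estimate comes out at roughly 0.09, comfortably below the p <= 0.3 threshold.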

Choosing parameters

The choice of BIGSI parameters (m, h) depends on: the maximum number of k-mers expected in any colour (K_max), the number of datasets/colours (N) expected, the shortest length of the query sequence to be supported (L_min), the k-mer size (k) and the maximum number of acceptable false discoveries per query (q_max). Since a query of length L consists of l = L - k + 1 k-mers, the expected number of false discoveries (V) for any query can be calculated as q = E[V] = N p^l, where p is the false positive rate of the Bloom filter. Therefore, the desired false positive rate per Bloom filter for q_max is p = (q_max/N)^(1/l_min), where l_min = L_min - k + 1. For a given cardinality (n) and desired false positive rate, the optimal Bloom filter parameters can be determined by:

m = -n ln(p) / (ln 2)^2
h = -ln(p) / ln 2

Substituting n = K_max and the value of p above gives the optimal BIGSI parameters:

m = -K_max ln(q_max/N) / ((L_min - k + 1) (ln 2)^2)

h = -ln(q_max/N) / ((L_min - k + 1) ln 2)
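The formulas above can be sketched in Python as follows. This is a minimal illustration, not part of BIGSI itself. The function name and the example values (N = 1,000,000 colours, L_min = 50, q_max = 1) are hypothetical. Rounding m up and h to the nearest integer (with a minimum of 1) is an assumption made here so that the results are usable as a bit count and a hash count.

```python
import math


def bigsi_parameters(k_max, n_colours, l_min, k, q_max):
    """Return (p, m, h) for the given design constraints.

    k_max:     maximum number of k-mers expected in any colour (K_max)
    n_colours: number of datasets/colours (N)
    l_min:     shortest query length to support, in bases (L_min)
    k:         k-mer size
    q_max:     maximum acceptable false discoveries per query
    """
    n_kmers = l_min - k + 1  # k-mers in the shortest supported query
    if n_kmers < 1:
        raise ValueError("l_min must be at least k")
    # Per-filter false positive rate such that N * p^n_kmers == q_max.
    p = (q_max / n_colours) ** (1.0 / n_kmers)
    # Optimal Bloom filter size and hash count for cardinality k_max.
    m = -k_max * math.log(p) / (math.log(2) ** 2)
    h = -math.log(p) / math.log(2)
    # Assumption: round m up to whole bits and h to the nearest integer >= 1.
    return p, math.ceil(m), max(1, round(h))


# Hypothetical example values, chosen only for illustration.
p, m, h = bigsi_parameters(
    k_max=5_000_000, n_colours=1_000_000, l_min=50, k=31, q_max=1
)
print(f"p = {p:.3f}, m = {m:,}, h = {h}")
```

Because h is rounded to an integer, the realised false positive rate can differ slightly from the target p, so it may be worth recomputing it for the final (m, h) before building an index.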