BItsliced Genomic Signature Index [bigsi]

Welcome to the BIGSI documentation. You'll find comprehensive guides and documentation to help you start working with BIGSI as quickly as possible.

BIGSIs (BItsliced Genomic Signature Indexes) allow efficient indexing and search of very large collections of whole-genome sequencing (WGS) data, in particular bacterial or viral WGS data sets. BIGSI can index and query raw or assembled data.

A prebuilt index is available for download at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/all-microbial-index-v0.3 and a hosted demo is available at http://www.bigsi.io/.

Please cite our paper if you use this tool in your research:
'Ultra-fast search of all deposited bacterial and viral genomic data' http://dx.doi.org/10.1038/s41587-018-0010-1

Choosing BIGSI Parameters

TLDR

If you're indexing bacterial genomes with an expected length of ~5,000,000 bp, reasonable parameters are k=31, m=25,000,000, h=3.

Assuming you're going to be searching for queries > 50 bp long, choose Bloom filter parameters that result in a false positive rate p <= 0.3.
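As a rough sanity check on the numbers above, the standard Bloom filter approximation p ≈ (1 - e^(-hn/m))^h can be evaluated for the TLDR parameters. This is a minimal sketch, not part of BIGSI itself: the function name is hypothetical, and it assumes the number of distinct k-mers n is roughly equal to the genome length.

```python
import math


def bloom_false_positive_rate(m: int, h: int, n: int) -> float:
    """Approximate false positive rate of a Bloom filter.

    m: filter size in bits, h: number of hash functions,
    n: number of distinct items (k-mers) inserted.
    Uses the standard approximation (1 - exp(-h*n/m))**h.
    """
    return (1.0 - math.exp(-h * n / m)) ** h


# TLDR parameters. Assumption: n is taken as ~genome length (5,000,000 k-mers).
p = bloom_false_positive_rate(m=25_000_000, h=3, n=5_000_000)
print(f"approximate false positive rate: {p:.3f}")
```

Under these assumptions the rate comes out at roughly 0.09, comfortably below the p <= 0.3 guideline.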

Choosing parameters

The choice of BIGSI parameters (m, h) depends on: the maximum number of k-mers expected in any colour (K_max), the expected number of datasets/colours (N), the shortest query sequence length to be supported (L_min), the k-mer size (k), and the maximum acceptable number of false discoveries per query (q_max). Since a query sequence Q consists of L = |Q| - k + 1 k-mers, where |Q| is its length in bases, the expected number of false discoveries (V) for any query can be calculated as q = E[V] = N p^L, where p is the false positive rate of the Bloom filter. Therefore, the desired false positive rate per Bloom filter for a given q_max is p = (q_max/N)^(1/L_min). For a given cardinality (n) and desired false positive rate, the optimal Bloom filter size in bits (m) and number of hash functions (h) can be determined by:

m = -n ln(p) / (ln 2)^2
h = -ln(p) / ln(2)
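The two Bloom filter formulas can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, the input values (n = 5,000,000 and p = 0.1) are chosen as an example rather than taken from the BIGSI documentation, and rounding m up and h to the nearest integer is an assumption of this sketch.

```python
import math


def optimal_bloom_parameters(n: int, p: float) -> tuple:
    """Return (m, h) for cardinality n and target false positive rate p.

    m = -n ln(p) / (ln 2)^2   (bits)
    h = -ln(p) / ln(2)        (hash functions)
    Rounding is a choice made for this sketch: m is rounded up,
    h to the nearest integer (at least 1).
    """
    m = -n * math.log(p) / (math.log(2) ** 2)
    h = -math.log(p) / math.log(2)
    return math.ceil(m), max(1, round(h))


# Hypothetical inputs: 5,000,000 k-mers, target false positive rate 0.1.
m, h = optimal_bloom_parameters(n=5_000_000, p=0.1)
print(f"m = {m:,} bits, h = {h}")
```

With these example inputs the result is a filter of roughly 24 million bits with 3 hash functions, which is in the same range as the TLDR values.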

Substituting p = (q_max/N)^(1/L_min) and n = K_max gives the optimal BIGSI parameters:

m = -K_max ln(q_max/N) / (L_min (ln 2)^2)

h = -ln(q_max/N) / (L_min ln(2))
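To make the closed-form expressions concrete, here is a minimal sketch that computes m and h from K_max, N, L_min and q_max, using p = (q_max/N)^(1/L_min) as defined above. The function name and all input values are hypothetical, and the rounding (m up, h to the nearest integer) is an assumption of the sketch rather than something the documentation specifies.

```python
import math


def bigsi_parameters(k_max: int, n_colours: int, l_min: int, q_max: float) -> tuple:
    """Return (m, h) from the closed-form BIGSI expressions.

    m = -K_max ln(q_max/N) / (L_min (ln 2)^2)
    h = -ln(q_max/N) / (L_min ln 2)
    """
    log_ratio = math.log(q_max / n_colours)  # negative whenever q_max < N
    m = -k_max * log_ratio / (l_min * math.log(2) ** 2)
    h = -log_ratio / (l_min * math.log(2))
    # Rounding is a choice made for this sketch.
    return math.ceil(m), max(1, round(h))


# Hypothetical inputs, chosen only for illustration.
K_MAX, N, L_MIN, Q_MAX = 5_000_000, 1_000_000, 20, 1e-6
m, h = bigsi_parameters(K_MAX, N, L_MIN, Q_MAX)

# Implied per-filter false positive rate, and a round-trip check that
# N * p^L_min recovers q_max.
p = (Q_MAX / N) ** (1.0 / L_MIN)
print(f"m = {m:,} bits, h = {h}, implied p = {p:.3f}, N*p^L = {N * p ** L_MIN:.2e}")
```

For these example inputs the implied per-filter false positive rate is about 0.25, and the parameters come out at roughly 14.4 million bits with 2 hash functions.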