Compound Library Comparison


Authors: Chao Ma chm57@pitt.edu; Sean Xie xix15@pitt.edu

Developed at Department of Computational and Systems Biology, University of Pittsburgh

Version          Updated on 11/17/2010




Library comparison characterizes the degree of similarity or overlap between two compound collections. It is used for chemical diversity analysis, combinatorial library creation, and data mining. Chemical similarity calculation plays an essential role in compound library comparison: the similarity between a pair of compounds is usually measured by the Tanimoto coefficient. Comparing two libraries requires pairwise Tanimoto calculations, so the running time grows with the product of the two library sizes (quadratic when the libraries are of similar size). Despite advances in computer hardware, processing large public databases such as ZINC and PubChem remains a challenge.
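As an illustration of the calculation described above, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal CPU sketch in C++, not the CudaCLA implementation. It assumes fingerprints are packed into 32-bit words; the names Fingerprint, popcount32, tanimoto and allVsAll are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation: one fingerprint = a fixed number of 32-bit words.
using Fingerprint = std::vector<std::uint32_t>;

// Count the set bits of one word (portable, no compiler intrinsics).
inline int popcount32(std::uint32_t w) {
    return static_cast<int>(std::bitset<32>(w).count());
}

// Tanimoto coefficient: |A AND B| / |A OR B|.
// Two all-zero fingerprints are assigned 0 here (an assumed convention).
double tanimoto(const Fingerprint& a, const Fingerprint& b) {
    assert(a.size() == b.size());
    int both = 0, either = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        both += popcount32(a[i] & b[i]);
        either += popcount32(a[i] | b[i]);
    }
    return either == 0 ? 0.0 : static_cast<double>(both) / either;
}

// All-vs-all comparison: one coefficient per (candidate, reference) pair,
// so the work grows as candidates.size() * references.size().
std::vector<std::vector<double>> allVsAll(const std::vector<Fingerprint>& candidates,
                                          const std::vector<Fingerprint>& references) {
    std::vector<std::vector<double>> scores(candidates.size(),
                                            std::vector<double>(references.size()));
    for (std::size_t i = 0; i < candidates.size(); ++i)
        for (std::size_t j = 0; j < references.size(); ++j)
            scores[i][j] = tanimoto(candidates[i], references[j]);
    return scores;
}
```

The nested loop in allVsAll is the part that makes large libraries expensive: every pair is independent, which is also why the problem maps well onto many concurrent threads.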


The design of modern graphics processing units (GPUs) offers a potential solution to this problem. GPUs generally have much higher memory bandwidth than traditional CPUs. In addition, many threads can be launched concurrently to perform parallel computation on a GPU. These features allow GPUs to achieve better performance on arithmetic-intensive tasks.


In this project, software was developed to accelerate compound library comparison on GPUs. In our test calculation, the program (on a GTS 250) ran up to 30 times as fast as the commercial software Sybyl SELECTOR Compare Database (on an Intel Xeon 3.0 GHz). The GPU program is implemented in Visual Studio C++ 2005 with the CUDA development toolkit. The binary file and source code are available to the public. The following functions are implemented in the program:

·         Graphical user interface

·         All-vs-all Tanimoto matrix calculation using sparse array or compressed integer fingerprints

·         Open architecture to import any binary molecular fingerprints for similarity calculation

·         Parallel nearest neighbor search and histogram creation

The source code can be easily revised for k-nearest-neighbor search or threshold similarity search as well.
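The k-nearest-neighbor and threshold variants mentioned above differ only in how one row of similarity scores is reduced. The C++ sketch below works on a row of precomputed Tanimoto values for one candidate compound; it is an illustration under that assumption, not code from the CudaCLA source, and the names topK and aboveThreshold are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// k-nearest-neighbor variant: the k highest scores of one row, best first.
// With k = 1 this reduces to the plain nearest-neighbor score.
std::vector<double> topK(std::vector<double> row, std::size_t k) {
    k = std::min(k, row.size());
    std::partial_sort(row.begin(),
                      row.begin() + static_cast<std::ptrdiff_t>(k),
                      row.end(),
                      std::greater<double>());
    row.resize(k);
    return row;
}

// Threshold variant: indices of the reference compounds whose score
// is at least `threshold`.
std::vector<std::size_t> aboveThreshold(const std::vector<double>& row, double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    std::vector<std::size_t> hits;
    for (std::size_t j = 0; j < row.size(); ++j)
        if (row[j] >= threshold) hits.push_back(j);
    return hits;
}
```

A GPU version would typically keep only the running best values per candidate rather than storing a full row, but the reduction logic is the same.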


Source Code and Binary


The binary is compiled for the Microsoft Windows platform. Download from here. CUDA runtime DLL file: download

System requirements

Supported operating system: Windows XP, Windows Vista or Windows 7

Supported hardware: CUDA-compatible graphics cards. Click here for a full list of compatible devices.

Please install the latest graphics card drivers.

The program is developed as “green” (portable) software, so no installation is required.


Download the source code from here.

The source code is a Visual Studio C++ 2005 project. Microsoft Foundation Classes (MFC) are required to compile it. Please install the latest NVIDIA CUDA Toolkit and NVIDIA GPU Computing SDK.


The program and source code are free for scientific use. Please contact me if you plan to use the software or source code for commercial purposes. The software must not be redistributed without prior permission of the author.


How to Use


Download the binary and run the program, “CudaCLA.exe”. The interface is displayed below:

·         Hardware Information. This section displays the specifications of the CUDA-compatible device, including GPU model, graphics memory, number of multiprocessors and clock rate. An error message is displayed if no compatible device is found or no compatible driver is installed.

·         Library Information. The pink box shows details of the imported candidate and reference libraries. In library comparison, the library being compared against is referred to as the reference library, while the other is named the candidate library. The goal is to characterize how well a candidate library is represented in a reference library. This section shows the size of each library, the length of the molecular fingerprints and the corresponding fingerprint format. In cheminformatics, a similarity score is usually represented by the Tanimoto coefficient between a pair of molecular fingerprints. The program can index and import two types of fingerprints: dense format and sparse format. Many popular fingerprints have high bit coverage, i.e. a high ratio of the number of “1” bits to the fingerprint length. MACCS, Unity and FP2 fingerprints are typical examples. Other fingerprints, such as the Molprint2D fingerprint, are highly sparse. The compressed integer format works well for dense fingerprints, and the sparse array format is designed for sparse fingerprints. To compare a candidate library with a reference library, both must be in the same format, i.e. either sparse array format or integer format.

·         Job Status. The progress and throughput are displayed in the light blue box while a library comparison is in progress.
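To make the sparse format described in the list above concrete, the following C++ sketch computes a Tanimoto coefficient from two sparse arrays. It assumes each fingerprint is stored as a sorted, duplicate-free list of the positions of its “1” bits; this layout and the names SparseFingerprint, sparseTanimoto and bitCoverage are assumptions for illustration, and the actual CudaCLA sparse format is defined in the source code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sparse layout: sorted, duplicate-free positions of the "1" bits.
using SparseFingerprint = std::vector<std::uint32_t>;

// Tanimoto from sparse arrays: walk both sorted lists once and count the
// positions they share; the union size follows from |A| + |B| - |A and B|.
double sparseTanimoto(const SparseFingerprint& a, const SparseFingerprint& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::size_t i = 0, j = 0, common = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            ++common;
            ++i;
            ++j;
        }
    }
    const std::size_t unionSize = a.size() + b.size() - common;
    return unionSize == 0 ? 0.0 : static_cast<double>(common) / unionSize;
}

// Bit coverage: number of "1" bits divided by the fingerprint length.
double bitCoverage(const SparseFingerprint& fp, std::size_t fingerprintLength) {
    assert(fingerprintLength > 0);
    return static_cast<double>(fp.size()) / fingerprintLength;
}
```

The cost of the merge depends on the number of set bits rather than on the fingerprint length, which is why a sparse array suits fingerprints with low bit coverage, while packed integers suit dense ones.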

In principle, the program can import compound libraries encoded by any binary molecular fingerprint. To illustrate this, a module is implemented to process and index raw Unity fingerprints generated by the Tripos Sybyl software.
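As a rough idea of what such a conversion step involves, the C++ sketch below turns a text bit string into either packed 32-bit words or a sparse array of bit positions. It is only an illustration: the input is assumed to be a string of '0' and '1' characters, the bit ordering is an assumption, and neither the raw Unity file layout nor the CudaCLA data format is reproduced here. The names packBits and toSparse are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack a text bit string such as "0110..." into 32-bit words.
// Bit k of the string goes to word k / 32, bit position k % 32
// (an assumed ordering; a real format may differ).
std::vector<std::uint32_t> packBits(const std::string& bits) {
    std::vector<std::uint32_t> words((bits.size() + 31) / 32, 0u);
    for (std::size_t k = 0; k < bits.size(); ++k) {
        assert(bits[k] == '0' || bits[k] == '1');
        if (bits[k] == '1') {
            words[k / 32] |= (std::uint32_t{1} << (k % 32));
        }
    }
    return words;
}

// Alternative for sparse fingerprints: keep only the positions of the "1" bits,
// in increasing order.
std::vector<std::uint32_t> toSparse(const std::string& bits) {
    std::vector<std::uint32_t> on;
    for (std::size_t k = 0; k < bits.size(); ++k) {
        assert(bits[k] == '0' || bits[k] == '1');
        if (bits[k] == '1') on.push_back(static_cast<std::uint32_t>(k));
    }
    return on;
}
```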


Take the integer fingerprint as an example: “Index as Reference Library” and “Index as Candidate Library” in the menu convert raw Unity fingerprints into the pre-defined CudaCLA data format, for the reference library and candidate library respectively. The details of the CudaCLA format can be found in the source code. Once converted into CudaCLA format, libraries can be imported through “Load Reference Library” or “Load Candidate Library”. The “Compare Database” command runs the library comparison on the GPU. By default, library comparison in the CudaCLA program is based on Unity fingerprints. Other types of fingerprints can also be supported: write a script or program that converts those fingerprints into the CudaCLA data format, and then load the processed files. The general flowchart is shown below:

Note that users have the option to exclude self-comparison when the reference and candidate libraries are the same.
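The reason for this option is that, when a library is compared with itself, every compound trivially matches itself with a Tanimoto coefficient of 1. The C++ sketch below shows the effect on a nearest-neighbor search over a precomputed square score matrix; the matrix layout and the name nearestNeighborScores are assumptions for illustration, not taken from the CudaCLA source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// scores[i][j] = Tanimoto of compound i vs compound j of the same library,
// so the matrix is square and the diagonal holds the self-comparisons.
std::vector<double> nearestNeighborScores(const std::vector<std::vector<double>>& scores,
                                          bool excludeSelf) {
    std::vector<double> best(scores.size(), 0.0);
    for (std::size_t i = 0; i < scores.size(); ++i) {
        assert(scores[i].size() == scores.size());
        for (std::size_t j = 0; j < scores[i].size(); ++j) {
            if (excludeSelf && i == j) continue;  // skip the trivial self match
            if (scores[i][j] > best[i]) best[i] = scores[i][j];
        }
    }
    return best;
}
```

Without the exclusion, every nearest-neighbor score would be 1 and the resulting distribution would carry no information about the library's internal diversity.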

The following two screenshots present computation progress and result:

The left picture shows the progress bar, elapsed time and average throughput. The throughput is given in kilo Tanimoto per second (thousands of Tanimoto calculations per second). The right picture shows the distribution of Tanimoto coefficients in a 100-bin histogram. The percentage value is displayed (in the yellow bar) for the bin under the mouse cursor. Users can switch between text mode and histogram mode by clicking “Show Status” or “Show Histogram” in the “Option” menu. The histogram can be exported through the “Export” function in the “Option” menu.
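The quantities shown in these views can be sketched in a few lines of C++. The code below bins Tanimoto values into 100 equal-width bins, computes the percentage for one bin, and derives a throughput figure in kilo Tanimoto per second. The binning convention (a value of exactly 1.0 goes into the last bin) and the names kBins, histogram, binPercentage and kiloTanimotoPerSecond are assumptions; the program's own conventions may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBins = 100;

// Count Tanimoto values (each in [0, 1]) into 100 equal-width bins.
// A value of exactly 1.0 is placed in the last bin (an assumed convention).
std::array<std::size_t, kBins> histogram(const std::vector<double>& values) {
    std::array<std::size_t, kBins> counts{};
    for (double v : values) {
        assert(v >= 0.0 && v <= 1.0);
        std::size_t bin = static_cast<std::size_t>(v * kBins);
        if (bin >= kBins) bin = kBins - 1;
        ++counts[bin];
    }
    return counts;
}

// Percentage of all values that fall into one bin.
double binPercentage(const std::array<std::size_t, kBins>& counts, std::size_t bin) {
    assert(bin < kBins);
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total == 0 ? 0.0 : 100.0 * counts[bin] / total;
}

// Throughput in kilo Tanimoto per second: pairs compared / seconds / 1000.
double kiloTanimotoPerSecond(double pairsCompared, double elapsedSeconds) {
    assert(elapsedSeconds > 0.0);
    return pairsCompared / elapsedSeconds / 1000.0;
}
```

For an all-vs-all run, the number of pairs compared is the product of the two library sizes, minus the self-comparisons if they are excluded.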


Sample Datasets


Unity fingerprints have been generated for three libraries from the TimTec and Maybridge screening collections:

APL10K, APL25K, MAYBRIDGE. These files are in raw text format.

Start the program CudaCLA.exe and follow these instructions:

Compressed Fingerprint -> Index as Reference Library. Choose the downloaded “APL10K.txt” in the pop-up dialog box, then save the processed file as “APL10K.ifp”.

Compressed Fingerprint -> Index as Candidate Library. Choose the downloaded “APL25K.txt” in the pop-up dialog box, then save the processed file as “APL25K.ifp”.


Load reference and candidate library files:

Compressed Fingerprint -> Load Reference Library. Select APL10K.ifp.

Compressed Fingerprint -> Load Candidate Library. Select APL25K.ifp.

Compressed Fingerprint -> Compare Database


After the calculation is complete, a histogram of Tanimoto coefficients is displayed. Move the mouse cursor over a bin to examine its value. Go to Option -> Export to save the result.

Similar procedures apply to the sparse array algorithm under the “Sparse Array” menu.