Monday, September 8, 2025

Going Native - Java Native Interface (JNI)

Why do humans like old things?

Dr. Noonien Soong, Star Trek the Next Generation

Although I haven't actually done it very much, I've always been fascinated by the idea of calling into old code from modern applications. Who knows what value is locked away in those old libraries? Graphics, matrix calculations, statistics, quantum mechanical calculations, etc. I want to be able to do it all!

In reality, old code is probably dusty, unmaintained, and harder to use than modern code. I'm still interested in being able to access it.

As discussed in the introduction to this series, I wrote a small library to do matrix calculations on rational numbers so that I could focus on the calculations without worrying about data loss or errors due to rounding. This post is about calling into it from Java via the Java Native Interface (JNI).

Starting point

To get a basic education I started with Baeldung's Tutorial on JNI. This gave a few basic examples of how to write the Java code, compile it, generate the C header file, write and compile the implementation of that, and call it all together. In particular, the Using Objects and Calling Java Methods From Native Code section was a good introduction to generating Java objects from the native side of the world.

I did most of this work in an Ubuntu image in a WSL on a Windows machine. I used Java 11 SDK for the Java compilation steps.

Calling RMatrix from Java, creating Java objects from the results

I wrote some simple Java classes as counterparts of the C structs. Then I wrote a simple driver to create small matrix, call the Gauss Factorization method, and display the U matrix. To call the native method I decided to pass a three-dimensional integer array. The first two dimensions represent the height and width of the matrix. The third dimension is either a one- or two-element array representing the numerator and denominator of a rational number, with a one-dimensional array representing a denominator of 1. This matches well with the behavior of String.split("/").


package org.jtodd.jni;

public class JRashunal {
    private int numerator;
    private int denominator;

    public JRashunal(int numerator, int denominator) {
        this.numerator = numerator;
        this.denominator = denominator;
    }

    @Override
    public String toString() {
        if (denominator == 1) {
            return String.format("{%d}", numerator);
        } else {
            return String.format("{%d/%d}", numerator, denominator);
        }
    }
}

package org.jtodd.jni;

public class JRashunalMatrix {
    private int height;
    private int width;
    private JRashunal[] data;

    public JRashunalMatrix(int height, int width, JRashunal[] data) {
        this.height = height;
        this.width = width;
        this.data = data;
    }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < height; ++i) {
            builder.append("[ ");
            for (int j = 0; j < width; ++j) {
                builder.append(data[i * width + j]);
                builder.append(" ");
            }
            builder.append("]\n");
        }
        return builder.toString();
    }
}

package org.jtodd.jni;

public class RMatrixJNI {
    
    static {
        System.loadLibrary("jnirmatrix");
    }

    public static void main(String[] args) {
        RMatrixJNI app = new RMatrixJNI();
        int data[][][] = {
            { { 1    }, { 2 }, { 3, 2 }, },
            { { 4, 3 }, { 5 }, { 6    }, },
        };
        JRashunalMatrix u = app.factor(data);
        System.out.println(u);
    }

    private native JRashunalMatrix factor(int data[][][]);
}

The Java code is compiled and the C header file is generated in the same step:


$ javac -cp build -h build -d build RMatrixJNI.java JRashunal.java JRashunalMatrix.java

Other blogs, including Baeldung's, show you what a JNI header file looks like, so I won't copy it all here. The most important line is the declaration of the method defined in the Java class:


JNIEXPORT jobject JNICALL Java_org_jtodd_jni_RMatrixJNI_factor
  (JNIEnv *, jobject, jobjectArray);

This is the method that you have to implement in the code you write. Include the header file generated by the Java compiler, as well as the Rashunal and RMatrix libraries. I named the C file the same as the header file generated by the compiler.


#include "rashunal.h"
#include "rmatrix.h"
#include "org_jtodd_jni_RMatrixJNI.h"

JNIEXPORT jobject JNICALL Java_org_jtodd_jni_RMatrixJNI_factor (JNIEnv *env, jobject thisObject, jobjectArray jdata)
{
  ...
}

After I got this far, Baeldung couldn't help me much anymore. I turned to the full list of functions defined in the JNI specification. This let me get the dimensions of the Java array and allocate the array of C Rashunals:


    long height = (long)(*env)->GetArrayLength(env, jdata);
    jarray first_row = (*env)->GetObjectArrayElement(env, jdata, 0);
    long width = (long)(*env)->GetArrayLength(env, first_row);

    size_t total = height * width;
    Rashunal **data = malloc(sizeof(Rashunal *) * total);

It took some fiddling, but then I figured out how to get data from the elements of the 2D array, create C Rashunals, create the C RMatrix, and factor it:


    for (size_t i = 0; i < total; ++i) {
        size_t row_index = i / width;
        size_t col_index = i % width;
        jarray row = (*env)->GetObjectArrayElement(env, jdata, row_index);
        jarray jel = (*env)->GetObjectArrayElement(env, row, col_index);
        long el_count = (long)(*env)->GetArrayLength(env, jel);
        jint *el = (*env)->GetIntArrayElements(env, jel, JNI_FALSE);
        int numerator = (int)el[0];
        int denominator = el_count == 1 ? 1 : (int)el[1];
        data[i] = n_Rashunal(numerator, denominator);
    }
    RMatrix *m = new_RMatrix(height, width, data);
    Gauss_Factorization *f = RMatrix_gelim(m);

    const RMatrix *u = f->u;
    size_t u_height = RMatrix_height(u);
    size_t u_width = RMatrix_width(u);

The really tricky part was finding the Java class and constructor definitions from within the native code. The JNI uses something called descriptors to refer to primitives and objects:

  • The descriptors for primitives are single letters: I for integer, Z for boolean, etc.
  • The descriptor for a class is the fully-qualified class name, preceded by an L and trailed by a semicolon: Lorg/jtodd/jni/JRashunal;.
  • The descriptor for an array is the primitive/class descriptor preceded by an opening bracket: [I, [Lorg/jtodd/jni/JRashunal;. Multidimensional arrays add an opening bracket for each dimension of the array.
  • The descriptor for a method is the argument descriptors in parentheses, followed by the descriptor of the return value.
    • You can see an example of this in the header file generated by the Java compiler for the Java class: it accepts a three-dimensional array of integers and returns a JRashunalMatrix, so the signature is ([[[I)Lorg/jtodd/jni/JRashunalMatrix;.
  • If a method has multiple arguments, the descriptors are concatenated with no delimiter. This caused me a lot of grief because I couldn't find any documentation about it. ChatGPT finally gave me the clue to this. It also told me a handy tool to find the method signature of a compiled class: javap -s -p fully.qualified.ClassName.
So in our native code we first need to find the class descriptions, then we need to find the constructors for those classes. The documentation for the GetMethodID says the name of the constructor is , and the return type is void (V):

    jclass j_rashunal_class = (*env)->FindClass(env, "org/jtodd/jni/JRashunal");
    jclass j_rmatrix_class = (*env)->FindClass(env, "org/jtodd/jni/JRashunalMatrix");
    jmethodID j_rashunal_constructor = (*env)->GetMethodID(env, j_rashunal_class, "", "(II)V");
    jmethodID j_rmatrix_constructor = (*env)->GetMethodID(env, j_rmatrix_class, "", "(II[Lorg/jtodd/jni/JRashunal;)V");

(Those II's look like the Roman numeral 2!)

That was the hard part. Although the syntax is ugly, allocating and populating an array of JRashunals and creating a JRashunalMatrix was pretty straightforward:


    jobjectArray j_rashunal_data = (*env)->NewObjectArray(env, u_height * u_width, j_rashunal_class, NULL);
    for (size_t i = 0; i < total; ++i) {
        const Rashunal *r = RMatrix_get(u, i / width + 1, i % width + 1);
        jobject j_rashunal = (*env)->NewObject(env, j_rashunal_class, j_rashunal_constructor, r->numerator, r->denominator);
        (*env)->SetObjectArrayElement(env, j_rashunal_data, i, j_rashunal);
        free((Rashunal *)r);
    }
    jobject j_rmatrix = (*env)->NewObject(env, j_rmatrix_class, j_rmatrix_constructor, RMatrix_height(u), RMatrix_width(u), j_rashunal_data);

Compiling, linking, and running

Up to now I've assumed you understand the basics of C syntax, compiling, linking, and running. I won't assume that for the rest of this because it got pretty tricky and took me a while to figure it out.

I've laid out my project like this:


$ tree .
.
├── JRashunal.java
├── JRashunalMatrix.java
├── RMatrixJNI.java
├── build
│   ├── all generated and compiled code
└── org_jtodd_jni_RMatrixJNI.c

4 directories, 31 files

I set `JAVA_HOME` to the root of the Java 11 SDK I'm using. To compile the C file:


$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
$ cc -c -fPIC \
  -Ibuild \
  -I${JAVA_HOME}/include \
  -I${JAVA_HOME}/include/linux \
  org_jtodd_jni_RMatrixJNI.c \
  -o build/org_jtodd_jni_RMatrixJNI.o

Adjust the includes to find the JNI header files for your platform. If you installed Rashunal and RMatrix to a recognized location (/usr/local/include for me) the compiler should find them on its own. If not, add includes to them as well.


$ cc -shared -fPIC -o build/libjnirmatrix.so build/org_jtodd_jni_RMatrixJNI.o -L/usr/local/lib -lrashunal -lrmatrix -lc

To create the shared library you have to link in the Rashunal and RMatrix libraries, hence the additional link location and link switches.


$ LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH java -cp build \
  -Djava.library.path=/home/john/workspace/JavaJNI/build org.jtodd.jni.RMatrixJNI
[ {1} {2} {3/2} ]
[ {0} {1} {12/7} ]

This is the tricky one. Since we created a shared library (libjnirmatrix.so), we need to provide the runtime the path to the linked Rashunal and RMatrix libraries. This isn't done through the java.library.path variable; that tells the JVM where to find the JNI header file. You need the system-specific load path to tell the system runtime (not the JVM runtime) where to find the linked libraries. On Linux and MacOS that's the LD_LIBRARY_PATH variable. Thanks again, ChatGPT!

Whew, that's a lot of work! And who wants to type those absurdly long CLI commands?

Doing it in a modern build system: Gradle

I'm continuing in my Ubuntu WSL shell with Java 11 and Gradle 8.3.

$ mkdir jrmatrix
$ gradle init
# I chose a basic project with Groovy as the DSL and the new APIs

Apparently Gradle's Java and native plugins don't play nicely in the same project, so the first thing I did was separate the project into app (Java) and native (C) subprojects. All of the automatically-generated files could stay the way they were. I just needed to make a small change to settings.gradle:


rootProject.name = 'jrmatrix'

include('app', 'native')

Then I made folders for each subproject.

In the app subproject I created the typical Java folder structure:


$ tree .
.
├── build.gradle
└── src
    └── main
        └── java
            └── org
                └── jtodd
                    └── jni
                        ├── JRashunal.java
                        ├── JRashunalMatrix.java
                        ├── RMatrixJNI.java
                        └── jrmatrix
                            └── App.java

26 directories, 11 files

The build.gradle file in app needed switches to tell the Java compiler to generate the JNI header file and where to put it. This is no longer a separate command (javah), but is an additional switch on the javac command. In addition, I wanted the run task to depend on the native compilation and linking steps I'll describe later in the native subproject.


plugins {
    id 'application'
    id 'java'
}

tasks.named("compileJava") {
    def headerDir = file("${buildDir}/generated/jni")
    options.compilerArgs += ["-h", headerDir.absolutePath]
    outputs.dir(headerDir)
}

application {
    mainClass = 'org.jtodd.jni.jrmatrix.App'
    applicationDefaultJvmArgs = [
        "-Djava.library.path=" + project(":native").layout.buildDirectory.dir("libs/shared").get().asFile.absolutePath
    ]
}

tasks.named("run") {
    dependsOn project(":native").tasks.named("linkJni")
}

Most of the fun was in the native project. I set it up with a similar folder structure to Java projects:


$ tree .
.
├── build.gradle
└── src
    └── main
        └── c
            └── org_jtodd_jni_RMatrixJNI.c

6 directories, 4 files

Gradle has a cpp-library plugin, but it seems tailored just to C++, not to C. There is also a c-library plugin, but that seems not to be bundled with Gradle 8.3, so I decided to skip it. The other alternative was a bare-bones c plugin, which apparently is pretty basic. Much of the code will look similar to what I did earlier at the command line.

Like I did there, I had to write separate steps to compile and link the implementation of the JNI header file with the Rashunal and RMatrix libraries. After a couple of refactorings to pull out common definitions of the includes and not hardcoding the names of C source files I wound up with this:


apply plugin: 'c'

def jniHeaders = { project(":app").layout.buildDirectory.dir("generated/jni").get().asFile }
def jvmHome = { file(System.getenv("JAVA_HOME")) }
def outputDir = { file("$buildDir/libs/shared") }

def osSettings = {
    def os = org.gradle.nativeplatform.platform.internal.DefaultNativePlatform.currentOperatingSystem
    def baseInclude = new File(jvmHome(), "/include")
    def includeOS
    def libName
    if (os.isLinux()) {
        includeOS = new File(jvmHome(), "/include/linux")
        libName = "libjnirmatrix.so"
    } else if (os.isMacOsX()) {
        includeOS = new File(jvmHome(), "/include/darwin")
        libName = "libjnirmatrix.dylib"
    } else if (os.isWindows()) {
        includeOS = new File(jvmHome(), "/include/win32")
        libName = "jnirmatrix.dll"
    } else if (os.isFreeBSD()) {
        includeOS = new File(jvmHome(), "/include/freebsd")
        libName = "libjnirmatrix.so"
    } else {
        throw new GradleException("Unsupported OS: $os")
    }
    [baseInclude, includeOS, libName]
}

def sourceDir = file("src/main/c")
def cSources = fileTree(dir: sourceDir, include: "**/*.c")
def objectFiles = cSources.files.collect { file ->
    new File(outputDir(), file.name.replaceAll(/\.c$/, ".o")).absolutePath
}

tasks.register('compileJni', Exec) {
    dependsOn project(":app").tasks.named("compileJava")
    outputs.dir outputDir()
    doFirst { outputDir().mkdirs() }

    def (baseInclude, includeOS, _) = osSettings()

    def compileArgs = cSources.files.collect { file ->
        [
            '-c',
            '-fPIC',
            '-I', jniHeaders().absolutePath,
            '-I', baseInclude.absolutePath,
            '-I', includeOS.absolutePath,
            file.absolutePath,
            '-o', new File(outputDir(), file.name.replaceAll(/\.c$/, ".o")).absolutePath
        ]
    }.flatten()

    commandLine 'gcc', *compileArgs
}

tasks.register('linkJni', Exec) {
    dependsOn tasks.named("compileJni")
    outputs.dir outputDir()
    doFirst { outputDir().mkdirs() }

    def (baseInclude, includeOS, libName) = osSettings()

    commandLine 'gcc',
        '-shared',
        '-fPIC',
        '-o', new File(outputDir(), libName).absolutePath,
        *objectFiles,
        '-I', jniHeaders().absolutePath,
        '-I', baseInclude.absolutePath,
        '-I', includeOS.absolutePath,
        '-L', '/usr/local/lib',
        '-l', 'rashunal',
        '-l', 'rmatrix',
        '-Wl,-rpath,/usr/local/lib'
}

tasks.named('build') {
    dependsOn tasks.named('compileJni')
    dependsOn tasks.named('linkJni')
}

Gradle subprojects have references to each other, so this library can get references to app's output directory to reference the JNI header file. The compileJni task is set to depend on app's compileJava task, and native's build task is set to depend on the compileJni and linkJni tasks defined in this file.

This worked if I explicitly called app's compileJava task and native's build task, but it failed after a clean task. It turned out Java's compile task wouldn't detect the deletion of the JNI header file as a change that required rebuilding, so I added the build directory as an output to the task (outputs.dir(headerDir)). Thus deleting that file (or cleaning the project) caused recompilation and rebuilding.

The nice thing is that this runs with a single command now (`./gradlew run`). Much nicer than entering all the command line commands by hand!

Reflection

As expected, this works but is very fragile. Particularly calling Java code from the native code depends on knowledge of the class and method signatures. If those change in the Java code, the project will compile and start just fine, but blow up pretty explosively and nastily with unclear explanations at runtime.

I was surprised by Gradle's basic tooling for C projects. I thought there would be more help than paralleling the command line so closely. I'll have to look into the `c-library` plugin to see if it offers any more help. I'm also surprised by how few blogs and Stack Overflow posts I found about this: apparently this isn't something very many people do (or live to tell the tale!).

https://github.com/proftodd/GoingNative/tree/main/jrmatrix

Thursday, September 4, 2025

Going Native - Calling Native Code

For a long time I've been interested in methods to call native code (which I've learned almost always means C and/or C++) from modern languages. Ever since I started working with Java I heard about something called JNI and that it could be used to unlock ancient mysteries hidden away in native code. At the end of this summer I had some time and finally decided to give it a try.

The native code

Another subject I've been interested in for a long time is linear algebra, specifically matrices and their uses. To try to get a better understanding of them I wrote a couple of small C libraries using them. My linear algebra guide is Linear Algebra and Its Applications (3rd ed) by Gilbert Strang.

One is a basic library to handle the arithmetic of rational numbers: Rashunal.

The other is a library to handle matrices of Rashunals and some basic operations on them (add, multiply, row operations, Gauss factoring). Todos include matrix inversion, minors, calculation of determinants. Finding eigenvalues and eigenvectors requires solving high-degree polynmial equations, so I'm not sure I'll get to that. RMatrix.

I wrote these to the best of my ability using CMake, Unity, and Valgrind. They should be cross-platform and installable on any system (Linux,MacOS, Windows). They have basic unit tests and have been checked for memory integrity. However, they should not be used for any production environment. Do not build your great AI model or statistics package on them!

Prerequisites and Setup

  • CMake
  • Make - Installation is different on different systems
    • Linux - almost certainly available on your distribution's package installer
      • Debian - sudo apt update && sudo apt install make
      • Red Hat - sudo yum update && sudo yum install make
    • MacOS - brew install make
    • Windows - choco install make
  • Your favorite C compiler
Clone the repositories above. The READMEs have instructions on how to compile and install the library code. Briefly:

$ cd Rashunal
$ cmake -S . -B build
$ cd build
$ make && make test && make install
$ cd ..
$ cd rmatrix
$ cmake -S . -B build
$ cd build
$ make && make test && make install

Approach

In each language I target, I'd like to write an application that can read a data file representing a matrix of rational numbers, send it to the native code, do something fairly complicated with it, and return it to the caller in some useful form. To demonstrate I wrote a model application in the RMatrix repository. It reads a matrix in a simple space-delimited format, converts it to an RMatrix, factors it into P-1, L, D, and U matrices, and displays the L and U matrices in the terminal window.


$ cd driver
$ cat example.txt
-2  1/3 -3/4
 6 -1    8
 8  3/2 -7
$ make
$ cat example.txt | ./driver
Height is 3
Width is 3
L:
[  1 0 0 ]
[ -3 1 0 ]
[ -4 0 1 ]
U:
[ 1 -1/6  3/8  ]
[ 0  1   60/17 ]
[ 0  0    1    ]

Next steps

In further entries in this series I'll demonstrate calling this library from several modern, widely-used, high-level languages. The ones I'm planning to target include:

  • Java
  • C#
  • Python
  • (maybe) Swift

But who knows how deep this rabbit hole will go?

Sunday, October 20, 2019

Reactions to edges 2: Breaking reactions down into edges

Once I had the reactant atom maps, breaking reactions down into edges became a simple matter of generating all pairs of reactants-products and then filtering them based on whether at least one atom of the reactant was mapped to the product:
def _splitReaction(rxn: IReaction, rctMap: Map[IAtom, IAtom]): List[IReaction] = {
  val rctPrdPairs: Iterable[(IAtomContainer, IAtomContainer)] = for (
    rct <- rxn.getReactants.atomContainers.asScala;
    prd <- rxn.getProducts.atomContainers.asScala
  ) yield (rct, prd)
  val mappedPairs: Iterable[(IAtomContainer, IAtomContainer)] = rctPrdPairs.filter({ case (rct, prd) => {
    rctMap.isEmpty ||
    rct.atoms.asScala.exists(r => rctMap.get(r) match {
      case None => false
      case Some(null) => false
      case Some(p) => p.getContainer == prd
    })
  } })
  mappedPairs.map({ case (rct, prd) => {
    val newRxn: IReaction = new Reaction()
    newRxn.addReactant(AtomContainerManipulator.copyAndSuppressedHydrogens(rct))
    newRxn.addProduct(AtomContainerManipulator.copyAndSuppressedHydrogens(prd))
    newRxn.setProperty(CDKConstants.SMILES, sg.create(newRxn))
    newRxn
  }}).toList
I remember being astonished when I worked this out. The first time I did this in Java it was a LOT harder and took a LOT more lines of code. Part of that was working in Java, but this time I also made much better use of the CDK interface to create a new reaction and clone the reactant and product before adding them to it. I guess I have grown as a programmer in the intervening years! I added the SMILES of the edge to the emitted edge to aid in visualization and debugging later. Later I generating the InChIKey of the reactant and product as well to group identical edges into single records in Spark.

I added the escape valve of an empty reactant map. To preserve information I include all combinations of reactants and products.

I also have to do a check on the image of the reactant atom in the reactant map. The way I constructed the map, the reactant atom could be mapped to None or to Some(null), which caused exceptions when I ran this over the whole NextMove reaction set.

CDK's IsomorphismTester made checking the correctness of the splitting much easier than going atom-by-atom:

test("it maps all reactants to all products if no map is present") {
  val rxn: IReaction = sp.parseReactionSmiles(unmappedEsterSmiles)
  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap = Splitter.getMaps(rxn)
  val edges: List[IReaction] = Splitter._splitReaction(rxn, rctMap)
  assert(edges.size == 2)
  assert(edges.forall(e => e.getReactantCount == 1 && e.getProductCount == 1))
  assert(new IsomorphismTester(edges(0).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(1).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(1)))
  assert(new IsomorphismTester(edges(0).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(1).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
}

test("it uses maps to correctly split reactions") {
  val rxn: IReaction = sp.parseReactionSmiles(chiralResolution)
  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap = Splitter.getMaps(rxn)
  val edges: List[IReaction] = Splitter._splitReaction(rxn, rctMap)
  assert(edges.size == 4)
  assert(edges.forall(e => e.getReactantCount == 1 && e.getProductCount == 1))
  // DL-alanine to L-alanine
  assert(new IsomorphismTester(edges(0).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(0).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
  // DL-alanine to n-acetyl-D-alanine
  assert(new IsomorphismTester(edges(1).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(1)))
  assert(new IsomorphismTester(edges(1).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(1)))
  // acetic anhydride to n-acetyl-D-alanine
  assert(new IsomorphismTester(edges(2).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(2)))
  assert(new IsomorphismTester(edges(2).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(1)))
  // acetic anhydride to acetic acid
  assert(new IsomorphismTester(edges(3).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(2)))
  assert(new IsomorphismTester(edges(3).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(2)))
}

Wednesday, October 9, 2019

Reactions to edges 1: Massaging the mapping information

The biggest problem in any programming project is deciding how to represent the data. The data structure that best models a set of chemical reactions is a hypergraph, similar to a graph, but each directed edge can link multiple nodes on either end. That sounded too complicated to analyze to me. Back when I worked at the large chemical information company, I had the idea to break reactions down into something I called edges, each with a single reactant and a single product, which could be modeled as a graph. Moreover, it would be easy to compare individual reactants and products, compared to sets of them, so I could group together identical edges coming from different reactions. This will be important later on.

So, if you have a chemical reaction that looks like this:

A + B -> C + D

A simple way to break this down into edges would be to take every combination of reactants and products:

A -> C A -> D
B -> C B -> D
But I thought I could do better. Reactions can be enhanced with a reaction map, which describes which reactant atoms wind up as which product atoms. Thus, you can decide which reactant substances are directly related to which product substances. In other words, if A breaks apart and contributes atoms to both C and D, but B only contributes to D, the B -> C edge can be dropped. Fortunately, the NextMove patent reaction data set comes with reaction maps for most, if not all, of the reactions.

Now the CDK has a Mappings class, but it is not automatically generated when you parse a smiles string into a reaction. Instead, each mapped atom receives a property (CDKConstants.ATOM_ATOM_MAPPING) whose value is an integer that is the same in reactant and product. So my first job was to massage those properties into something I could use.

import scala.collection.mutable
import scala.jdk.CollectionConverters._

def getMaps(rxn: IReaction): (Map[Int, Mapping], Map[IAtom, IAtom], Map[IAtom, IAtom]) = {
  val atomMap: mutable.Map[Int, Mapping] = mutable.Map[Int, Mapping]()
  val rctMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
  val prdMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
  for (rct <- rxn.getReactants.atomContainers.asScala) {
    for (atom <- rct.atoms.asScala) {
      if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
        atomMap.put(atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int], new Mapping(atom, null))
      }
      rctMap.put(atom, null)
    }
  }
  for (prd <- rxn.getProducts.atomContainers.asScala) {
    for (atom <- prd.atoms.asScala) {
      if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
        val mapNumber: Int = atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int]
        if (atomMap.contains(mapNumber)) {
          val oldMapping: Mapping = atomMap(mapNumber)
          val rctAtom: IAtom = oldMapping.getChemObject(0).asInstanceOf[IAtom]
          val newMapping: Mapping = new Mapping(rctAtom, atom)
          atomMap.put(mapNumber, newMapping)
          rctMap.put(rctAtom, atom)
          prdMap.put(atom, rctAtom)
        } else {
          atomMap.put(mapNumber, new Mapping(null, atom))
          prdMap.put(atom, null)
        }
      } else {
        prdMap.put(atom, null)
      }
    }
  }
  (atomMap.toMap, rctMap.toMap, prdMap.toMap)
}
It took me the longest time to figure out how to convert a Scala collection to a Java collection, which is what the CDK expects, since the interface has changed a couple of times recently. As far as I can tell, this is the current way to do it in Scala 2.13.

This is a pretty typical, straightforward way to do it using mutable maps, iterating over the reactant and product substance and atom collections, and consulting the properties to see if this atom has been mapped. I use the Mappings class to hold mapped reactant and product atoms. I wasn't sure at this point which map would be most useful, so I generated maps of mapping numbers to reactant-product pairs, maps of reactant atoms to product atoms, and maps of product atoms to reactant atoms.

When iterating over the reactant atoms I don't yet know which product atom will correspond to the reactant atom. So I generate a set of reactant atom -> null Mappings. Then, when I'm iterating over the product atoms, when I find a match, I replace that Mapping with the correct reactant atom -> product atom Mapping. If there's no match, then I create a null -> product atom Mapping.

Well, that worked, but it's not very Scalaic (the Scala counterpart to Pythonic?). Specifically, all that explicit iteration is ugg-lee. So here's what I refactored it to:

def getMaps(rxn: IReaction): Map[IAtom, IAtom] = {
  val rctMapNumbers: Map[Int, IAtom] = rxn.getReactants.atomContainers.asScala.flatMap(r => r.atoms.asScala)
    .flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
    .map({ case (n, a) => (n.asInstanceOf[Int], a) })
    .toMap
  val prdMapNumbers: Map[Int, IAtom] = rxn.getProducts.atomContainers.asScala.flatMap(p => p.atoms.asScala)
    .flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
    .map({ case (n, a) => (n.asInstanceOf[Int], a) })
    .toMap
  val atomMap: Map[Int, IMapping] = (rctMapNumbers.keySet ++ prdMapNumbers.keySet).map(n => {
    (n, new Mapping(rctMapNumbers.getOrElse(n, null), prdMapNumbers.getOrElse(n, null)))
  }).toMap
  val rctMap: Map[IAtom, IAtom] = rctMapNumbers
    .map({ case (n, a) => (a, atomMap.getOrElse(n, new Mapping(a, null))) })
    .map({ case (a: IAtom, m: IMapping) => (a, m.getChemObject(1).asInstanceOf[IAtom]) })
  rctMap
}
Generating the maps of numbers to reactant and product atoms is repetitive and could be pulled out into a method, but I usually don't start worrying about that until I do it more than twice. I like using flatMap, not only to generate a single list of atoms from the reactants and products, but also to essentially filter out the atoms that aren't mapped by zipping the map number with Some(a). I also like the trick of combining the keys of rctMapNumbers and prdMapNumbers into a single deduplicated set.

To test it I picked a simple reaction and created the expected map explicitly, and compared it atom by atom to what the method generated.

val esterSmiles = "[CH3:5][CH2:6][OH:7].[CH3:1][C:2](=[O:3])[OH:4]>[H+]>[CH3:5][CH2:6][O:7][C:2](=[O:3])[CH3:1].[OH2:4] Ethyl esterification [1.7.3]"

test("it produces correct mappings") {
  val rxn: IReaction = sp.parseReactionSmiles(esterSmiles)
  val ethanol: IAtomContainer = rxn.getReactants.getAtomContainer(0)
  val aceticAcid: IAtomContainer = rxn.getReactants.getAtomContainer(1)
  val ethylAcetate: IAtomContainer = rxn.getProducts.getAtomContainer(0)

  val expectedRctMap: Map[IAtom, IAtom] = Map(
    aceticAcid.getAtom(0) -> ethylAcetate.getAtom(5),
    aceticAcid.getAtom(1) -> ethylAcetate.getAtom(3),
    aceticAcid.getAtom(2) -> ethylAcetate.getAtom(4),
    aceticAcid.getAtom(3) -> null,
    ethanol.getAtom(0) -> ethylAcetate.getAtom(0),
    ethanol.getAtom(1) -> ethylAcetate.getAtom(1),
    ethanol.getAtom(2) -> ethylAcetate.getAtom(2)
  )

  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap: Map[IAtom, IAtom] = Splitter.getMaps(rxn)

  assert(expectedRctMap == rctMap)
}
So now I have a map of reactant atoms to product atoms. Next time I'll show how I used that to split reactions into edges of mapped reactant-product pairs.

Sunday, September 29, 2019

My current project: A reaction graph - purpose and start

Chemical reactions have always been what fascinated me most about chemistry: stoichiometry in high school, organic reaction mechanisms in college, and transformations later on. When I was employed at a large chemical information company I learned about graph theory and treating sets of organic reactions as a graph. I loved the idea of processing a set of chemical reactions so that you could immediately know whether it was possible to get from any reactant to any product. While I was at that large chemical information company I worked out an algorithm to do it in MapReduce. Calculating the full transitive closure was impossible even on the cluster I had access to, but I always wondered if implementing it in Spark and Scala would make it feasible. A year or so ago I heard about the NextMove patent reaction database, so I decided to see if I could make it work using what I had learned in the meantime.

I downloaded the reaction set and took a look. The reactions are recorded in SMILES with reaction mapping. Fantastic! That's just what my algorithm needed.

I use the Chemistry Development Kit to process chemical information in my code. Spark is natively implemented in Scala, with alternate interfaces in Java and Python. I thought Scala would handle the abstract concepts I needed and do the processing best out of the three.

The first step was to parse out a reaction object from the SMILES string. One of the nice things about Scala is that it runs on the JVM, so you can import and use Java objects easily. So this one was simple:

var sp: SmilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance())

def parseSmiles(smiles: String): IReaction  = sp.parseReactionSmiles(smiles)
Lastly, I am unashamed to be completely biased toward organic substances, so I wanted to filter out any inorganic compounds, even if they happened to be mapped in the NextMove reaction. This was my first approach:
    val rctIterator = rxn.getReactants.atomContainers.iterator
    while (rctIterator.hasNext) {
      val rct: IAtomContainer = rctIterator.next
      val formula: IMolecularFormula = MolecularFormulaManipulator.getMolecularFormula(rct)
      if (!MolecularFormulaManipulator.containsElement(formula, Elements.CARBON)) {
        rctIterator.remove
      }
    }
and something similar for the products. I then refactored that into a method that I call twice, once for the reactants and again for the products.
  def filterInorganicsFromReaction(rxn: IReaction): Unit = {
    filterInorganicsFromMolListIterator(rxn.getReactants.atomContainers.iterator)
    filterInorganicsFromMolListIterator(rxn.getProducts.atomContainers.iterator)
  }

  def filterInorganicsFromMolListIterator(subIt: java.util.Iterator[IAtomContainer]): Unit = {
    while (subIt.hasNext) {
      val sub: IAtomContainer = subIt.next
      val formula: IMolecularFormula = MolecularFormulaManipulator.getMolecularFormula(sub)
      if (!MolecularFormulaManipulator.containsElement(formula, Elements.CARBON)) {
        subIt.remove
      }
    }
  }

Sunday, September 22, 2019

Installing OpenBabel on my Mac

I have been interested in chemistry since high school, and programming since grad school. I dream sometimes about getting a job in cheminformatics or chemical information. To do that, I probably need practice, so I decided to install OpenBabel. This is a library of C functions to do interesting chemistry-related things, like convert between various chemistry file formats, cluster similar compounds, calculate descriptors, etc. It has a set of Python bindings which makes it a lot easier to work with. Sounds like fun, doesn't it?
  • Documentation: http://openbabel.org/docs/current/index.html
  • Compilation and installation instructions: http://openbabel.org/docs/current/Installation/install.html

Preliminaries

  • Downloaded and extracted a tar.gz file from http://sourceforge.net/projects/openbabel/files/openbabel/2.3.1/openbabel-2.3.1.tar.gz/download
  • Tried an out-of-source build, and CMake complained about missing prerequisites

Prerequisites

  • Eigen
    • OpenBabel says it needs version 2, but brew only knows about version 3.x. I installed version 3 anyway.
    • This created an entry in /usr/local/include/eigen3 which pointed to ../Cellar/eigen/3.3.7/include/eigen3 which contains the Eigen and unsupported folders.
    • Eigen is a header-only project, so I cloned it from its Github mirror and checked out branch 2.0.17.
    • Then I created a symlink from /usr/local/include/eigen2 to ../Cellar/eigen/2.0.17/include/eigen2 and copied the Eigen and unsupported folders there.
    • Might confuse me if I ever try to brew uninstall Eigen 2, but c'est la vie.
  • wxWidgets
    • The first two Google hits didn't do it. Finally I searched `wxwidgets mac homebrew`, and sure enough, there's a formula:
    • brew install wxmac
  • Cairo
    • Did this out of order. The error stack mentioned Cairo, so I first tried installing it with Homebrew:
    • brew install cairo
    • After all this, CMake still gave an error, but it seemed related to pkg-config. Another quick Google search fixed that:
    • brew install pkg-config
      Which, if I recall, is a good thing to have for general C development.
    • CMake was still complaining that Cairo wasn't found, despite the presence of `cairo` in `/usr/local/include`. Looking more closely at the output I saw "Package 'libffi', required by 'gobject-2.0', not found"
    • Another Google search on that turned up setting the path to libffi as:
    • PKG_CONFIG_PATH="/usr/local/opt/libffi/lib/pkgconfig" cmake -S . -B build
After that the CMake build completed without error finding all the packages I thought I would need.

Compilation and installation

Since I want to use this in Python I attempted to set the Python bindings. I wasn't finding the Python 3 libraries, but I finally found a web page that said I could point OpenBabel to Python 3 instead of 2 by giving the path to the executable. For the heck of it I also built the GUI. Being a good TDD person I also wanted to run the tests. My final CMake command was:
PKG_CONFIG_PATH="/usr/local/opt/libffi/lib/pkgconfig" cmake -S . -B build -Wno-dev -DBUILD_GUI=ON -DPYTHON_BINDINGS=ON -DPYTHON_EXECUTABLE=/usr/local/bin/python3 -DENABLE_TESTS=ON
make -C build
make test -C build
That caused a problem with a missing header. Searching for a solution to that, I found that there's a formula for OpenBabel, which would allow me to bypass all this dependency installation by hand stuff.
brew install open-babel
Which went without hitch. But just to be thorough, it also suggested cloning and building from Github. When I did that:
git clone git@github.com:openbabel/openbabel.git
and used the CMake and make commands, this time there was no problem with the missing header. But there were no Python bindings. Looking at the build output, I found I needed to add -DRUN_SWIG=ON, and that that failed without SWIG installed. So brew install swig and then the Python bindings were generated.
$ python3
>>> import openbabel
>>>
QED

Introduction and Purpose

I'm trying blogging again! I do some hobby programming at home, and unlike work, I don't have any place convenient to keep my notes. So that's what this will be, a place for me to keep notes and write explanations to other programmers about problems I encounter and how I overcome them. The other programmers may be just me, but hey, that's okay. My company uses pair programming extensively, but as an introvert, there are times when I just like to try things on my own. So that's where the title of the blog comes from: just a solitary programmer, trying to do things on his own and writing about it. I hope you (and I) learn something. Happy coding!