Sunday, October 20, 2019

Reactions to edges 2: Breaking reactions down into edges

Once I had the reactant atom maps, breaking reactions down into edges became a simple matter of generating all pairs of reactants-products and then filtering them based on whether at least one atom of the reactant was mapped to the product:
def _splitReaction(rxn: IReaction, rctMap: Map[IAtom, IAtom]): List[IReaction] = {
  val rctPrdPairs: Iterable[(IAtomContainer, IAtomContainer)] = for (
    rct <- rxn.getReactants.atomContainers.asScala;
    prd <- rxn.getProducts.atomContainers.asScala
  ) yield (rct, prd)
  val mappedPairs: Iterable[(IAtomContainer, IAtomContainer)] = rctPrdPairs.filter({ case (rct, prd) => {
    rctMap.isEmpty ||
    rct.atoms.asScala.exists(r => rctMap.get(r) match {
      case None => false
      case Some(null) => false
      case Some(p) => p.getContainer == prd
    })
  } })
  mappedPairs.map({ case (rct, prd) => {
    val newRxn: IReaction = new Reaction()
    newRxn.addReactant(AtomContainerManipulator.copyAndSuppressedHydrogens(rct))
    newRxn.addProduct(AtomContainerManipulator.copyAndSuppressedHydrogens(prd))
    newRxn.setProperty(CDKConstants.SMILES, sg.create(newRxn))
    newRxn
  }}).toList
I remember being astonished when I worked this out. The first time I did this in Java it was a LOT harder and took a LOT more lines of code. Part of that was working in Java, but this time I also made much better use of the CDK interface to create a new reaction and clone the reactant and product before adding them to it. I guess I have grown as a programmer in the intervening years! I added the SMILES of the edge to the emitted edge to aid in visualization and debugging later. Later I generating the InChIKey of the reactant and product as well to group identical edges into single records in Spark.

I added the escape valve of an empty reactant map. To preserve information I include all combinations of reactants and products.

I also have to do a check on the image of the reactant atom in the reactant map. The way I constructed the map, the reactant atom could be mapped to None or to Some(null), which caused exceptions when I ran this over the whole NextMove reaction set.

CDK's IsomorphismTester made checking the correctness of the splitting much easier than going atom-by-atom:

test("it maps all reactants to all products if no map is present") {
  val rxn: IReaction = sp.parseReactionSmiles(unmappedEsterSmiles)
  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap = Splitter.getMaps(rxn)
  val edges: List[IReaction] = Splitter._splitReaction(rxn, rctMap)
  assert(edges.size == 2)
  assert(edges.forall(e => e.getReactantCount == 1 && e.getProductCount == 1))
  assert(new IsomorphismTester(edges(0).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(1).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(1)))
  assert(new IsomorphismTester(edges(0).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(1).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
}

test("it uses maps to correctly split reactions") {
  val rxn: IReaction = sp.parseReactionSmiles(chiralResolution)
  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap = Splitter.getMaps(rxn)
  val edges: List[IReaction] = Splitter._splitReaction(rxn, rctMap)
  assert(edges.size == 4)
  assert(edges.forall(e => e.getReactantCount == 1 && e.getProductCount == 1))
  // DL-alanine to L-alanine
  assert(new IsomorphismTester(edges(0).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(0).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
  // DL-alanine to n-acetyl-D-alanine
  assert(new IsomorphismTester(edges(1).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(1)))
  assert(new IsomorphismTester(edges(1).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(1)))
  // acetic anhydride to n-acetyl-D-alanine
  assert(new IsomorphismTester(edges(2).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(2)))
  assert(new IsomorphismTester(edges(2).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(1)))
  // acetic anhydride to acetic acid
  assert(new IsomorphismTester(edges(3).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(2)))
  assert(new IsomorphismTester(edges(3).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(2)))
}

Wednesday, October 9, 2019

Reactions to edges 1: Massaging the mapping information

The biggest problem in any programming project is deciding how to represent the data. The data structure that best models a set of chemical reactions is a hypergraph, similar to a graph, but each directed edge can link multiple nodes on either end. That sounded too complicated to analyze to me. Back when I worked at the large chemical information company, I had the idea to break reactions down into something I called edges, each with a single reactant and a single product, which could be modeled as a graph. Moreover, it would be easy to compare individual reactants and products, compared to sets of them, so I could group together identical edges coming from different reactions. This will be important later on.

So, if you have a chemical reaction that looks like this:

A + B -> C + D

A simple way to break this down into edges would be to take every combination of reactants and products:

A -> C A -> D
B -> C B -> D
But I thought I could do better. Reactions can be enhanced with a reaction map, which describes which reactant atoms wind up as which product atoms. Thus, you can decide which reactant substances are directly related to which product substances. In other words, if A breaks apart and contributes atoms to both C and D, but B only contributes to D, the B -> C edge can be dropped. Fortunately, the NextMove patent reaction data set comes with reaction maps for most, if not all, of the reactions.

Now the CDK has a Mappings class, but it is not automatically generated when you parse a smiles string into a reaction. Instead, each mapped atom receives a property (CDKConstants.ATOM_ATOM_MAPPING) whose value is an integer that is the same in reactant and product. So my first job was to massage those properties into something I could use.

import scala.collection.mutable
import scala.jdk.CollectionConverters._

def getMaps(rxn: IReaction): (Map[Int, Mapping], Map[IAtom, IAtom], Map[IAtom, IAtom]) = {
  val atomMap: mutable.Map[Int, Mapping] = mutable.Map[Int, Mapping]()
  val rctMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
  val prdMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
  for (rct <- rxn.getReactants.atomContainers.asScala) {
    for (atom <- rct.atoms.asScala) {
      if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
        atomMap.put(atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int], new Mapping(atom, null))
      }
      rctMap.put(atom, null)
    }
  }
  for (prd <- rxn.getProducts.atomContainers.asScala) {
    for (atom <- prd.atoms.asScala) {
      if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
        val mapNumber: Int = atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int]
        if (atomMap.contains(mapNumber)) {
          val oldMapping: Mapping = atomMap(mapNumber)
          val rctAtom: IAtom = oldMapping.getChemObject(0).asInstanceOf[IAtom]
          val newMapping: Mapping = new Mapping(rctAtom, atom)
          atomMap.put(mapNumber, newMapping)
          rctMap.put(rctAtom, atom)
          prdMap.put(atom, rctAtom)
        } else {
          atomMap.put(mapNumber, new Mapping(null, atom))
          prdMap.put(atom, null)
        }
      } else {
        prdMap.put(atom, null)
      }
    }
  }
  (atomMap.toMap, rctMap.toMap, prdMap.toMap)
}
It took me the longest time to figure out how to convert a Scala collection to a Java collection, which is what the CDK expects, since the interface has changed a couple of times recently. As far as I can tell, this is the current way to do it in Scala 2.13.

This is a pretty typical, straightforward way to do it using mutable maps, iterating over the reactant and product substance and atom collections, and consulting the properties to see if this atom has been mapped. I use the Mappings class to hold mapped reactant and product atoms. I wasn't sure at this point which map would be most useful, so I generated maps of mapping numbers to reactant-product pairs, maps of reactant atoms to product atoms, and maps of product atoms to reactant atoms.

When iterating over the reactant atoms I don't yet know which product atom will correspond to the reactant atom. So I generate a set of reactant atom -> null Mappings. Then, when I'm iterating over the product atoms, when I find a match, I replace that Mapping with the correct reactant atom -> product atom Mapping. If there's no match, then I create a null -> product atom Mapping.

Well, that worked, but it's not very Scalaic (the Scala counterpart to Pythonic?). Specifically, all that explicit iteration is ugg-lee. So here's what I refactored it to:

def getMaps(rxn: IReaction): Map[IAtom, IAtom] = {
  val rctMapNumbers: Map[Int, IAtom] = rxn.getReactants.atomContainers.asScala.flatMap(r => r.atoms.asScala)
    .flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
    .map({ case (n, a) => (n.asInstanceOf[Int], a) })
    .toMap
  val prdMapNumbers: Map[Int, IAtom] = rxn.getProducts.atomContainers.asScala.flatMap(p => p.atoms.asScala)
    .flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
    .map({ case (n, a) => (n.asInstanceOf[Int], a) })
    .toMap
  val atomMap: Map[Int, IMapping] = (rctMapNumbers.keySet ++ prdMapNumbers.keySet).map(n => {
    (n, new Mapping(rctMapNumbers.getOrElse(n, null), prdMapNumbers.getOrElse(n, null)))
  }).toMap
  val rctMap: Map[IAtom, IAtom] = rctMapNumbers
    .map({ case (n, a) => (a, atomMap.getOrElse(n, new Mapping(a, null))) })
    .map({ case (a: IAtom, m: IMapping) => (a, m.getChemObject(1).asInstanceOf[IAtom]) })
  rctMap
}
Generating the maps of numbers to reactant and product atoms is repetitive and could be pulled out into a method, but I usually don't start worrying about that until I do it more than twice. I like using flatMap, not only to generate a single list of atoms from the reactants and products, but also to essentially filter out the atoms that aren't mapped by zipping the map number with Some(a). I also like the trick of combining the keys of rctMapNumbers and prdMapNumbers into a single deduplicated set.

To test it I picked a simple reaction and created the expected map explicitly, and compared it atom by atom to what the method generated.

val esterSmiles = "[CH3:5][CH2:6][OH:7].[CH3:1][C:2](=[O:3])[OH:4]>[H+]>[CH3:5][CH2:6][O:7][C:2](=[O:3])[CH3:1].[OH2:4] Ethyl esterification [1.7.3]"

test("it produces correct mappings") {
  val rxn: IReaction = sp.parseReactionSmiles(esterSmiles)
  val ethanol: IAtomContainer = rxn.getReactants.getAtomContainer(0)
  val aceticAcid: IAtomContainer = rxn.getReactants.getAtomContainer(1)
  val ethylAcetate: IAtomContainer = rxn.getProducts.getAtomContainer(0)

  val expectedRctMap: Map[IAtom, IAtom] = Map(
    aceticAcid.getAtom(0) -> ethylAcetate.getAtom(5),
    aceticAcid.getAtom(1) -> ethylAcetate.getAtom(3),
    aceticAcid.getAtom(2) -> ethylAcetate.getAtom(4),
    aceticAcid.getAtom(3) -> null,
    ethanol.getAtom(0) -> ethylAcetate.getAtom(0),
    ethanol.getAtom(1) -> ethylAcetate.getAtom(1),
    ethanol.getAtom(2) -> ethylAcetate.getAtom(2)
  )

  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap: Map[IAtom, IAtom] = Splitter.getMaps(rxn)

  assert(expectedRctMap == rctMap)
}
So now I have a map of reactant atoms to product atoms. Next time I'll show how I used that to split reactions into edges of mapped reactant-product pairs.

Sunday, September 29, 2019

My current project: A reaction graph - purpose and start

Chemical reactions have always been what fascinated me most about chemistry: stoichiometry in high school, organic reaction mechanisms in college, and transformations later on. When I was employed at a large chemical information company I learned about graph theory and treating sets of organic reactions as a graph. I loved the idea of processing a set of chemical reactions so that you could immediately know whether it was possible to get from any reactant to any product. While I was at that large chemical information company I worked out an algorithm to do it in MapReduce. Calculating the full transitive closure was impossible even on the cluster I had access to, but I always wondered if implementing it in Spark and Scala would make it feasible. A year or so ago I heard about the NextMove patent reaction database, so I decided to see if I could make it work using what I had learned in the meantime.

I downloaded the reaction set and took a look. The reactions are recorded in SMILES with reaction mapping. Fantastic! That's just what my algorithm needed.

I use the Chemistry Development Kit to process chemical information in my code. Spark is natively implemented in Scala, with alternate interfaces in Java and Python. I thought Scala would handle the abstract concepts I needed and do the processing best out of the three.

The first step was to parse out a reaction object from the SMILES string. One of the nice things about Scala is that it runs on the JVM, so you can import and use Java objects easily. So this one was simple:

var sp: SmilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance())

def parseSmiles(smiles: String): IReaction  = sp.parseReactionSmiles(smiles)
Lastly, I am unashamed to be completely biased toward organic substances, so I wanted to filter out any inorganic compounds, even if they happened to be mapped in the NextMove reaction. This was my first approach:
    val rctIterator = rxn.getReactants.atomContainers.iterator
    while (rctIterator.hasNext) {
      val rct: IAtomContainer = rctIterator.next
      val formula: IMolecularFormula = MolecularFormulaManipulator.getMolecularFormula(rct)
      if (!MolecularFormulaManipulator.containsElement(formula, Elements.CARBON)) {
        rctIterator.remove
      }
    }
and something similar for the products. I then refactored that into a method that I call twice, once for the reactants and again for the products.
  def filterInorganicsFromReaction(rxn: IReaction): Unit = {
    filterInorganicsFromMolListIterator(rxn.getReactants.atomContainers.iterator)
    filterInorganicsFromMolListIterator(rxn.getProducts.atomContainers.iterator)
  }

  def filterInorganicsFromMolListIterator(subIt: java.util.Iterator[IAtomContainer]): Unit = {
    while (subIt.hasNext) {
      val sub: IAtomContainer = subIt.next
      val formula: IMolecularFormula = MolecularFormulaManipulator.getMolecularFormula(sub)
      if (!MolecularFormulaManipulator.containsElement(formula, Elements.CARBON)) {
        subIt.remove
      }
    }
  }

Sunday, September 22, 2019

Installing OpenBabel on my Mac

I have been interested in chemistry since high school, and programming since grad school. I dream sometimes about getting a job in cheminformatics or chemical information. To do that, I probably need practice, so I decided to install OpenBabel. This is a library of C functions to do interesting chemistry-related things, like convert between various chemistry file formats, cluster similar compounds, calculate descriptors, etc. It has a set of Python bindings which makes it a lot easier to work with. Sounds like fun, doesn't it?
  • Documentation: http://openbabel.org/docs/current/index.html
  • Compilation and installation instructions: http://openbabel.org/docs/current/Installation/install.html

Preliminaries

  • Downloaded and extracted a tar.gz file from http://sourceforge.net/projects/openbabel/files/openbabel/2.3.1/openbabel-2.3.1.tar.gz/download
  • Tried an out-of-source build, and CMake complained about missing prerequisites

Prerequisites

  • Eigen
    • OpenBabel says it needs version 2, but brew only knows about version 3.x. I installed version 3 anyway.
    • This created an entry in /usr/local/include/eigen3 which pointed to ../Cellar/eigen/3.3.7/include/eigen3 which contains the Eigen and unsupported folders.
    • Eigen is a header-only project, so I cloned it from its Github mirror and checked out branch 2.0.17.
    • Then I created a symlink from /usr/local/include/eigen2 to ../Cellar/eigen/2.0.17/include/eigen2 and copied the Eigen and unsupported folders there.
    • Might confuse me if I ever try to brew uninstall Eigen 2, but c'est la vie.
  • wxWidgets
    • The first two Google hits didn't do it. Finally I searched `wxwidgets mac homebrew`, and sure enough, there's a formula:
    • brew install wxmac
  • Cairo
    • Did this out of order. The error stack mentioned Cairo, so I first tried installing it with Homebrew:
    • brew install cairo
    • After all this, CMake still gave an error, but it seemed related to pkg-config. Another quick Google search fixed that:
    • brew install pkg-config
      Which, if I recall, is a good thing to have for general C development.
    • CMake was still complaining that Cairo wasn't found, despite the presence of `cairo` in `/usr/local/include`. Looking more closely at the output I saw "Package 'libffi', required by 'gobject-2.0', not found"
    • Another Google search on that turned up setting the path to libffi as:
    • PKG_CONFIG_PATH="/usr/local/opt/libffi/lib/pkgconfig" cmake -S . -B build
After that the CMake build completed without error finding all the packages I thought I would need.

Compilation and installation

Since I want to use this in Python I attempted to set the Python bindings. I wasn't finding the Python 3 libraries, but I finally found a web page that said I could point OpenBabel to Python 3 instead of 2 by giving the path to the executable. For the heck of it I also built the GUI. Being a good TDD person I also wanted to run the tests. My final CMake command was:
PKG_CONFIG_PATH="/usr/local/opt/libffi/lib/pkgconfig" cmake -S . -B build -Wno-dev -DBUILD_GUI=ON -DPYTHON_BINDINGS=ON -DPYTHON_EXECUTABLE=/usr/local/bin/python3 -DENABLE_TESTS=ON
make -C build
make test -C build
That caused a problem with a missing header. Searching for a solution to that, I found that there's a formula for OpenBabel, which would allow me to bypass all this dependency installation by hand stuff.
brew install open-babel
Which went without hitch. But just to be thorough, it also suggested cloning and building from Github. When I did that:
git clone git@github.com:openbabel/openbabel.git
and used the CMake and make commands, this time there was no problem with the missing header. But there were no Python bindings. Looking at the build output, I found I needed to add -DRUN_SWIG=ON, and that that failed without SWIG installed. So brew install swig and then the Python bindings were generated.
$ python3
>>> import openbabel
>>>
QED

Introduction and Purpose

I'm trying blogging again! I do some hobby programming at home, and unlike work, I don't have any place convenient to keep my notes. So that's what this will be, a place for me to keep notes and write explanations to other programmers about problems I encounter and how I overcome them. The other programmers may be just me, but hey, that's okay. My company uses pair programming extensively, but as an introvert, there are times when I just like to try things on my own. So that's where the title of the blog comes from: just a solitary programmer, trying to do things on his own and writing about it. I hope you (and I) learn something. Happy coding!