The Solitary Programmer: October 2019

Once I had the reactant atom maps, breaking reactions down into edges became a simple matter of generating all pairs of reactants-products and then filtering them based on whether at least one atom of the reactant was mapped to the product:

def _splitReaction(rxn: IReaction, rctMap: Map[IAtom, IAtom]): List[IReaction] = {
  val rctPrdPairs: Iterable[(IAtomContainer, IAtomContainer)] = for (
    rct <- rxn.getReactants.atomContainers.asScala;
    prd <- rxn.getProducts.atomContainers.asScala
  ) yield (rct, prd)
  val mappedPairs: Iterable[(IAtomContainer, IAtomContainer)] = rctPrdPairs.filter({ case (rct, prd) => {
    rctMap.isEmpty ||
    rct.atoms.asScala.exists(r => rctMap.get(r) match {
      case None => false
      case Some(null) => false
      case Some(p) => p.getContainer == prd
    })
  } })
  mappedPairs.map({ case (rct, prd) => {
    val newRxn: IReaction = new Reaction()
    newRxn.addReactant(AtomContainerManipulator.copyAndSuppressedHydrogens(rct))
    newRxn.addProduct(AtomContainerManipulator.copyAndSuppressedHydrogens(prd))
    newRxn.setProperty(CDKConstants.SMILES, sg.create(newRxn))
    newRxn
  }}).toList

I remember being astonished when I worked this out. The first time I did this in Java it was a LOT harder and took a LOT more lines of code. Part of that was working in Java, but this time I also made much better use of the CDK interface to create a new reaction and clone the reactant and product before adding them to it. I guess I have grown as a programmer in the intervening years! I added the SMILES of the edge to the emitted edge to aid in visualization and debugging later. Later I generating the InChIKey of the reactant and product as well to group identical edges into single records in Spark.

I added the escape valve of an empty reactant map. To preserve information I include all combinations of reactants and products.

I also have to do a check on the image of the reactant atom in the reactant map. The way I constructed the map, the reactant atom could be mapped to None or to Some(null), which caused exceptions when I ran this over the whole NextMove reaction set.

CDK's IsomorphismTester made checking the correctness of the splitting much easier than going atom-by-atom:

test("it maps all reactants to all products if no map is present") {
  val rxn: IReaction = sp.parseReactionSmiles(unmappedEsterSmiles)
  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap = Splitter.getMaps(rxn)
  val edges: List[IReaction] = Splitter._splitReaction(rxn, rctMap)
  assert(edges.size == 2)
  assert(edges.forall(e => e.getReactantCount == 1 && e.getProductCount == 1))
  assert(new IsomorphismTester(edges(0).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(1).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(1)))
  assert(new IsomorphismTester(edges(0).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(1).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
}

test("it uses maps to correctly split reactions") {
  val rxn: IReaction = sp.parseReactionSmiles(chiralResolution)
  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap = Splitter.getMaps(rxn)
  val edges: List[IReaction] = Splitter._splitReaction(rxn, rctMap)
  assert(edges.size == 4)
  assert(edges.forall(e => e.getReactantCount == 1 && e.getProductCount == 1))
  // DL-alanine to L-alanine
  assert(new IsomorphismTester(edges(0).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(0)))
  assert(new IsomorphismTester(edges(0).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(0)))
  // DL-alanine to n-acetyl-D-alanine
  assert(new IsomorphismTester(edges(1).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(1)))
  assert(new IsomorphismTester(edges(1).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(1)))
  // acetic anhydride to n-acetyl-D-alanine
  assert(new IsomorphismTester(edges(2).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(2)))
  assert(new IsomorphismTester(edges(2).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(1)))
  // acetic anhydride to acetic acid
  assert(new IsomorphismTester(edges(3).getReactants.getAtomContainer(0)).isIsomorphic(rxn.getReactants.getAtomContainer(2)))
  assert(new IsomorphismTester(edges(3).getProducts.getAtomContainer(0)).isIsomorphic(rxn.getProducts.getAtomContainer(2)))
}

The biggest problem in any programming project is deciding how to represent the data. The data structure that best models a set of chemical reactions is a hypergraph, similar to a graph, but each directed edge can link multiple nodes on either end. That sounded too complicated to analyze to me. Back when I worked at the large chemical information company, I had the idea to break reactions down into something I called edges, each with a single reactant and a single product, which could be modeled as a graph. Moreover, it would be easy to compare individual reactants and products, compared to sets of them, so I could group together identical edges coming from different reactions. This will be important later on.

So, if you have a chemical reaction that looks like this:

A + B -> C + D

A simple way to break this down into edges would be to take every combination of reactants and products:

A -> C	A -> D
B -> C	B -> D

But I thought I could do better. Reactions can be enhanced with a reaction map, which describes which reactant atoms wind up as which product atoms. Thus, you can decide which reactant substances are directly related to which product substances. In other words, if A breaks apart and contributes atoms to both C and D, but B only contributes to D, the B -> C edge can be dropped. Fortunately, the NextMove patent reaction data set comes with reaction maps for most, if not all, of the reactions.

Now the CDK has a Mappings class, but it is not automatically generated when you parse a smiles string into a reaction. Instead, each mapped atom receives a property (CDKConstants.ATOM_ATOM_MAPPING) whose value is an integer that is the same in reactant and product. So my first job was to massage those properties into something I could use.

import scala.collection.mutable
import scala.jdk.CollectionConverters._

def getMaps(rxn: IReaction): (Map[Int, Mapping], Map[IAtom, IAtom], Map[IAtom, IAtom]) = {
  val atomMap: mutable.Map[Int, Mapping] = mutable.Map[Int, Mapping]()
  val rctMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
  val prdMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
  for (rct <- rxn.getReactants.atomContainers.asScala) {
    for (atom <- rct.atoms.asScala) {
      if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
        atomMap.put(atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int], new Mapping(atom, null))
      }
      rctMap.put(atom, null)
    }
  }
  for (prd <- rxn.getProducts.atomContainers.asScala) {
    for (atom <- prd.atoms.asScala) {
      if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
        val mapNumber: Int = atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int]
        if (atomMap.contains(mapNumber)) {
          val oldMapping: Mapping = atomMap(mapNumber)
          val rctAtom: IAtom = oldMapping.getChemObject(0).asInstanceOf[IAtom]
          val newMapping: Mapping = new Mapping(rctAtom, atom)
          atomMap.put(mapNumber, newMapping)
          rctMap.put(rctAtom, atom)
          prdMap.put(atom, rctAtom)
        } else {
          atomMap.put(mapNumber, new Mapping(null, atom))
          prdMap.put(atom, null)
        }
      } else {
        prdMap.put(atom, null)
      }
    }
  }
  (atomMap.toMap, rctMap.toMap, prdMap.toMap)
}

It took me the longest time to figure out how to convert a Scala collection to a Java collection, which is what the CDK expects, since the interface has changed a couple of times recently. As far as I can tell, this is the current way to do it in Scala 2.13.

This is a pretty typical, straightforward way to do it using mutable maps, iterating over the reactant and product substance and atom collections, and consulting the properties to see if this atom has been mapped. I use the Mappings class to hold mapped reactant and product atoms. I wasn't sure at this point which map would be most useful, so I generated maps of mapping numbers to reactant-product pairs, maps of reactant atoms to product atoms, and maps of product atoms to reactant atoms.

When iterating over the reactant atoms I don't yet know which product atom will correspond to the reactant atom. So I generate a set of reactant atom -> null Mappings. Then, when I'm iterating over the product atoms, when I find a match, I replace that Mapping with the correct reactant atom -> product atom Mapping. If there's no match, then I create a null -> product atom Mapping.

Well, that worked, but it's not very Scalaic (the Scala counterpart to Pythonic?). Specifically, all that explicit iteration is ugg-lee. So here's what I refactored it to:

def getMaps(rxn: IReaction): Map[IAtom, IAtom] = {
  val rctMapNumbers: Map[Int, IAtom] = rxn.getReactants.atomContainers.asScala.flatMap(r => r.atoms.asScala)
    .flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
    .map({ case (n, a) => (n.asInstanceOf[Int], a) })
    .toMap
  val prdMapNumbers: Map[Int, IAtom] = rxn.getProducts.atomContainers.asScala.flatMap(p => p.atoms.asScala)
    .flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
    .map({ case (n, a) => (n.asInstanceOf[Int], a) })
    .toMap
  val atomMap: Map[Int, IMapping] = (rctMapNumbers.keySet ++ prdMapNumbers.keySet).map(n => {
    (n, new Mapping(rctMapNumbers.getOrElse(n, null), prdMapNumbers.getOrElse(n, null)))
  }).toMap
  val rctMap: Map[IAtom, IAtom] = rctMapNumbers
    .map({ case (n, a) => (a, atomMap.getOrElse(n, new Mapping(a, null))) })
    .map({ case (a: IAtom, m: IMapping) => (a, m.getChemObject(1).asInstanceOf[IAtom]) })
  rctMap
}

Generating the maps of numbers to reactant and product atoms is repetitive and could be pulled out into a method, but I usually don't start worrying about that until I do it more than twice. I like using flatMap, not only to generate a single list of atoms from the reactants and products, but also to essentially filter out the atoms that aren't mapped by zipping the map number with Some(a). I also like the trick of combining the keys of rctMapNumbers and prdMapNumbers into a single deduplicated set.

To test it I picked a simple reaction and created the expected map explicitly, and compared it atom by atom to what the method generated.

val esterSmiles = "[CH3:5][CH2:6][OH:7].[CH3:1][C:2](=[O:3])[OH:4]>[H+]>[CH3:5][CH2:6][O:7][C:2](=[O:3])[CH3:1].[OH2:4] Ethyl esterification [1.7.3]"

test("it produces correct mappings") {
  val rxn: IReaction = sp.parseReactionSmiles(esterSmiles)
  val ethanol: IAtomContainer = rxn.getReactants.getAtomContainer(0)
  val aceticAcid: IAtomContainer = rxn.getReactants.getAtomContainer(1)
  val ethylAcetate: IAtomContainer = rxn.getProducts.getAtomContainer(0)

  val expectedRctMap: Map[IAtom, IAtom] = Map(
    aceticAcid.getAtom(0) -> ethylAcetate.getAtom(5),
    aceticAcid.getAtom(1) -> ethylAcetate.getAtom(3),
    aceticAcid.getAtom(2) -> ethylAcetate.getAtom(4),
    aceticAcid.getAtom(3) -> null,
    ethanol.getAtom(0) -> ethylAcetate.getAtom(0),
    ethanol.getAtom(1) -> ethylAcetate.getAtom(1),
    ethanol.getAtom(2) -> ethylAcetate.getAtom(2)
  )

  Splitter.filterInorganicsFromReaction(rxn)
  val rctMap: Map[IAtom, IAtom] = Splitter.getMaps(rxn)

  assert(expectedRctMap == rctMap)
}

So now I have a map of reactant atoms to product atoms. Next time I'll show how I used that to split reactions into edges of mapped reactant-product pairs.

The Solitary Programmer

Sunday, October 20, 2019

Reactions to edges 2: Breaking reactions down into edges

Wednesday, October 9, 2019

Reactions to edges 1: Massaging the mapping information