The biggest problem in any programming project is deciding how to represent the data. The data structure that best models a set of chemical reactions is a
hypergraph, similar to a graph, but each directed edge can link multiple nodes on either end. That sounded too complicated to analyze to me. Back when I worked at the large chemical information company, I had the idea to break reactions down into something I called edges, each with a single reactant and a single product, which could be modeled as a graph. Moreover, it would be easy to compare individual reactants and products, compared to sets of them, so I could group together identical edges coming from different reactions. This will be important later on.
So, if you have a chemical reaction that looks like this:
A + B -> C + D
A simple way to break this down into edges would be to take every combination of reactants and products:
A -> C |
A -> D |
B -> C |
B -> D |
But I thought I could do better. Reactions can be enhanced with a reaction map, which describes which reactant atoms wind up as which product atoms. Thus, you can decide which reactant substances are directly related to which product substances. In other words, if A breaks apart and contributes atoms to both C and D, but B only contributes to D, the B -> C edge can be dropped. Fortunately, the NextMove patent reaction data set comes with reaction maps for most, if not all, of the reactions.
Now the CDK has a Mappings class, but it is not automatically generated when you parse a smiles string into a reaction. Instead, each mapped atom receives a property (CDKConstants.ATOM_ATOM_MAPPING) whose value is an integer that is the same in reactant and product. So my first job was to massage those properties into something I could use.
import scala.collection.mutable
import scala.jdk.CollectionConverters._
def getMaps(rxn: IReaction): (Map[Int, Mapping], Map[IAtom, IAtom], Map[IAtom, IAtom]) = {
val atomMap: mutable.Map[Int, Mapping] = mutable.Map[Int, Mapping]()
val rctMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
val prdMap: mutable.Map[IAtom, IAtom] = mutable.Map[IAtom, IAtom]()
for (rct <- rxn.getReactants.atomContainers.asScala) {
for (atom <- rct.atoms.asScala) {
if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
atomMap.put(atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int], new Mapping(atom, null))
}
rctMap.put(atom, null)
}
}
for (prd <- rxn.getProducts.atomContainers.asScala) {
for (atom <- prd.atoms.asScala) {
if (atom.getProperties.containsKey(CDKConstants.ATOM_ATOM_MAPPING)) {
val mapNumber: Int = atom.getProperties.get(CDKConstants.ATOM_ATOM_MAPPING).asInstanceOf[Int]
if (atomMap.contains(mapNumber)) {
val oldMapping: Mapping = atomMap(mapNumber)
val rctAtom: IAtom = oldMapping.getChemObject(0).asInstanceOf[IAtom]
val newMapping: Mapping = new Mapping(rctAtom, atom)
atomMap.put(mapNumber, newMapping)
rctMap.put(rctAtom, atom)
prdMap.put(atom, rctAtom)
} else {
atomMap.put(mapNumber, new Mapping(null, atom))
prdMap.put(atom, null)
}
} else {
prdMap.put(atom, null)
}
}
}
(atomMap.toMap, rctMap.toMap, prdMap.toMap)
}
It took me the longest time to figure out how to convert a Scala collection to a Java collection, which is what the CDK expects, since the interface has changed a couple of times recently. As far as I can tell, this is the current way to do it in Scala 2.13.
This is a pretty typical, straightforward way to do it using mutable maps, iterating over the reactant and product substance and atom collections, and consulting the properties to see if this atom has been mapped. I use the Mappings class to hold mapped reactant and product atoms. I wasn't sure at this point which map would be most useful, so I generated maps of mapping numbers to reactant-product pairs, maps of reactant atoms to product atoms, and maps of product atoms to reactant atoms.
When iterating over the reactant atoms I don't yet know which product atom will correspond to the reactant atom. So I generate a set of reactant atom -> null Mappings. Then, when I'm iterating over the product atoms, when I find a match, I replace that Mapping with the correct reactant atom -> product atom Mapping. If there's no match, then I create a null -> product atom Mapping.
Well, that worked, but it's not very Scalaic (the Scala counterpart to Pythonic?). Specifically, all that explicit iteration is ugg-lee. So here's what I refactored it to:
def getMaps(rxn: IReaction): Map[IAtom, IAtom] = {
val rctMapNumbers: Map[Int, IAtom] = rxn.getReactants.atomContainers.asScala.flatMap(r => r.atoms.asScala)
.flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
.map({ case (n, a) => (n.asInstanceOf[Int], a) })
.toMap
val prdMapNumbers: Map[Int, IAtom] = rxn.getProducts.atomContainers.asScala.flatMap(p => p.atoms.asScala)
.flatMap(a => a.getProperties.asScala.get(CDKConstants.ATOM_ATOM_MAPPING) zip Some(a))
.map({ case (n, a) => (n.asInstanceOf[Int], a) })
.toMap
val atomMap: Map[Int, IMapping] = (rctMapNumbers.keySet ++ prdMapNumbers.keySet).map(n => {
(n, new Mapping(rctMapNumbers.getOrElse(n, null), prdMapNumbers.getOrElse(n, null)))
}).toMap
val rctMap: Map[IAtom, IAtom] = rctMapNumbers
.map({ case (n, a) => (a, atomMap.getOrElse(n, new Mapping(a, null))) })
.map({ case (a: IAtom, m: IMapping) => (a, m.getChemObject(1).asInstanceOf[IAtom]) })
rctMap
}
Generating the maps of numbers to reactant and product atoms is repetitive and could be pulled out into a method, but I usually don't start worrying about that until I do it more than twice. I like using flatMap, not only to generate a single list of atoms from the reactants and products, but also to essentially filter out the atoms that aren't mapped by zipping the map number with Some(a). I also like the trick of combining the keys of rctMapNumbers and prdMapNumbers into a single deduplicated set.
To test it I picked a simple reaction and created the expected map explicitly, and compared it atom by atom to what the method generated.
val esterSmiles = "[CH3:5][CH2:6][OH:7].[CH3:1][C:2](=[O:3])[OH:4]>[H+]>[CH3:5][CH2:6][O:7][C:2](=[O:3])[CH3:1].[OH2:4] Ethyl esterification [1.7.3]"
test("it produces correct mappings") {
val rxn: IReaction = sp.parseReactionSmiles(esterSmiles)
val ethanol: IAtomContainer = rxn.getReactants.getAtomContainer(0)
val aceticAcid: IAtomContainer = rxn.getReactants.getAtomContainer(1)
val ethylAcetate: IAtomContainer = rxn.getProducts.getAtomContainer(0)
val expectedRctMap: Map[IAtom, IAtom] = Map(
aceticAcid.getAtom(0) -> ethylAcetate.getAtom(5),
aceticAcid.getAtom(1) -> ethylAcetate.getAtom(3),
aceticAcid.getAtom(2) -> ethylAcetate.getAtom(4),
aceticAcid.getAtom(3) -> null,
ethanol.getAtom(0) -> ethylAcetate.getAtom(0),
ethanol.getAtom(1) -> ethylAcetate.getAtom(1),
ethanol.getAtom(2) -> ethylAcetate.getAtom(2)
)
Splitter.filterInorganicsFromReaction(rxn)
val rctMap: Map[IAtom, IAtom] = Splitter.getMaps(rxn)
assert(expectedRctMap == rctMap)
}
So now I have a map of reactant atoms to product atoms. Next time I'll show how I used that to split reactions into edges of mapped reactant-product pairs.