Chemical reactions have always been what fascinated me most about chemistry: stoichiometry in high school, organic reaction mechanisms in college, and transformations later on. While working at a large chemical information company I learned about graph theory and about treating a set of organic reactions as a graph. I loved the idea of processing a set of chemical reactions so that you could immediately know whether it was possible to get from any reactant to any product. While I was there I worked out an algorithm to do it in MapReduce. Calculating the full transitive closure was impossible even on the cluster I had access to, but I always wondered whether implementing it in Spark and Scala would make it feasible. A year or so ago I heard about the NextMove patent reaction database, so I decided to see if I could make it work using what I had learned in the meantime.
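To make the "can I get from this reactant to that product" question concrete, here is a minimal single-machine sketch in plain Scala. It is not the MapReduce or Spark algorithm described in this post. It assumes, purely for illustration, that molecules are identified by strings and that every reactant of a reaction is linked to every product (the names `ReactionGraph`, `buildEdges`, and `reachableFrom` are hypothetical).

```scala
import scala.collection.mutable

object ReactionGraph {
  // Directed edges reactant -> product. Molecule identifiers are plain
  // strings here (an assumption; in practice e.g. canonical SMILES).
  type Edges = Map[String, Set[String]]

  // Each reaction is given as (reactants, products). Linking every reactant
  // to every product is a simplification made for this sketch.
  def buildEdges(reactions: Seq[(Seq[String], Seq[String])]): Edges = {
    val pairs = for {
      (reactants, products) <- reactions
      r <- reactants
      p <- products
    } yield r -> p
    pairs.groupBy(_._1).map { case (r, ps) => r -> ps.map(_._2).toSet }
  }

  // Breadth-first search: every molecule reachable from `start`
  // in one or more reaction steps.
  def reachableFrom(edges: Edges, start: String): Set[String] = {
    val seen = mutable.Set.empty[String]
    val queue = mutable.Queue(start)
    while (queue.nonEmpty) {
      val current = queue.dequeue()
      edges.getOrElse(current, Set.empty[String]).foreach { next =>
        // add returns true only the first time a molecule is seen
        if (seen.add(next)) queue.enqueue(next)
      }
    }
    seen.toSet
  }
}
```

Running `reachableFrom` once for every molecule gives the full transitive closure, and that all-pairs result is what becomes so expensive on a large reaction set.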
I downloaded the reaction set and took a look. The reactions are recorded in SMILES with reaction mapping. Fantastic! That's just what my algorithm needed.
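For readers who have not seen mapped reaction SMILES, the sketch below shows the layout in plain Scala, without any chemistry toolkit. A reaction SMILES has three `>`-separated fields (reactants, agents, products), molecules within a field are separated by `.`, and atom-map numbers are written inside bracket atoms as `:n`. The object and method names are hypothetical, and the sketch assumes the input is the bare reaction SMILES with no extra columns after it.

```scala
object ReactionSmiles {
  final case class Parts(reactants: Seq[String], agents: Seq[String], products: Seq[String])

  // Split one field into its '.'-separated components, dropping empties.
  private def molecules(field: String): Seq[String] =
    field.split('.').filter(_.nonEmpty).toSeq

  def split(rxnSmiles: String): Parts = {
    // limit -1 keeps empty fields, so "A>>B" still yields three fields
    val fields = rxnSmiles.split(">", -1)
    require(fields.length == 3, s"expected reactants>agents>products, got: $rxnSmiles")
    Parts(molecules(fields(0)), molecules(fields(1)), molecules(fields(2)))
  }

  // Atom-map numbers sit at the end of a bracket atom, e.g. [CH3:1].
  private val AtomMap = """:(\d+)\]""".r

  def atomMaps(smiles: String): Set[Int] =
    AtomMap.findAllMatchIn(smiles).map(_.group(1).toInt).toSet
}
```

For an esterification written as `[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]>[H+]>[CH3:1][C:2](=[O:3])[O:6][CH3:5].[OH2:4]`, `split` gives two reactants, one agent, and two products. The shared map numbers show which reactant atoms end up in which product.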
I use the Chemistry Development Kit to process chemical information in my code. Spark is natively implemented in Scala, with alternate interfaces in Java and Python. I thought Scala would handle the abstract concepts I needed and do the processing best out of the three.
The first step was to parse out a reaction object from the SMILES string. One of the nice things about Scala is that it runs on the JVM, so you can import and use Java objects easily. So this one was simple:
val sp: SmilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance())
def parseSmiles(smiles: String): IReaction = sp.parseReactionSmiles(smiles)
Lastly, I am unashamed to be completely biased toward organic substances, so I wanted to filter out any inorganic compounds, even if they happened to be mapped in the NextMove reaction. This was my first approach:
val rctIterator = rxn.getReactants.atomContainers.iterator
while (rctIterator.hasNext) {
  val rct: IAtomContainer = rctIterator.next()
  val formula: IMolecularFormula = MolecularFormulaManipulator.getMolecularFormula(rct)
  // Drop any reactant whose formula contains no carbon
  if (!MolecularFormulaManipulator.containsElement(formula, Elements.CARBON)) {
    rctIterator.remove()
  }
}
and something similar for the products. I then refactored that into a method that I call twice, once for the reactants and again for the products.
def filterInorganicsFromReaction(rxn: IReaction): Unit = {
  filterInorganicsFromMolListIterator(rxn.getReactants.atomContainers.iterator)
  filterInorganicsFromMolListIterator(rxn.getProducts.atomContainers.iterator)
}

// Removes, in place, every molecule whose formula contains no carbon
def filterInorganicsFromMolListIterator(subIt: java.util.Iterator[IAtomContainer]): Unit = {
  while (subIt.hasNext) {
    val sub: IAtomContainer = subIt.next()
    val formula: IMolecularFormula = MolecularFormulaManipulator.getMolecularFormula(sub)
    if (!MolecularFormulaManipulator.containsElement(formula, Elements.CARBON)) {
      subIt.remove()
    }
  }
}
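The refactoring could go one step further by passing the test itself as a function, so the same loop serves any filter and not only the carbon check. The sketch below shows that pattern on a plain `java.util.ArrayList`. `removeWhere`, `Mol`, and `hasCarbon` are hypothetical stand-ins for illustration and are not CDK types or methods.

```scala
import java.util.{ArrayList, Iterator => JIterator}

object IteratorFilter {
  // Removes, in place, every element for which `drop` returns true.
  // The iterator must support remove(), as the one from ArrayList does.
  def removeWhere[A](it: JIterator[A])(drop: A => Boolean): Unit =
    while (it.hasNext) {
      if (drop(it.next())) it.remove()
    }

  // Hypothetical stand-in for a molecule: a name plus its element symbols.
  final case class Mol(name: String, elements: Set[String])

  def hasCarbon(m: Mol): Boolean = m.elements.contains("C")

  // Small usage example: keep only the carbon-containing entries.
  def demo(): Seq[String] = {
    val mols = new ArrayList[Mol]()
    mols.add(Mol("ethanol", Set("C", "H", "O")))
    mols.add(Mol("sodium chloride", Set("Na", "Cl")))
    mols.add(Mol("water", Set("H", "O")))
    removeWhere(mols.iterator())(m => !hasCarbon(m))

    // Collect the names of whatever is left in the list.
    val names = Seq.newBuilder[String]
    val it = mols.iterator()
    while (it.hasNext) names += it.next().name
    names.result()
  }
}
```

With the predicate as a parameter, the two-call structure of `filterInorganicsFromReaction` would stay the same and only the predicate would change from one filter to the next.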