22 Jan 2015 The curse of chemical names

I have a list of 3000 chemicals from a client who has painstakingly assembled a set of safety/tox information. "All" I had to do was add the chemical information for each relevant molecule. This is easy to do using the HSPiP software. Because I'm one of its authors I can even tweak the code to get it to do exactly what I want, such as read in the list of chemicals and output any of the known data from the 10,000 chemicals within HSPiP.
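Matching a client list against a reference database by chemical name usually needs some normalisation before exact comparison, because trivial differences in case, spacing and hyphenation defeat naive string matching. The sketch below is my own illustration of that step (it is not HSPiP's actual code, and the names and SMILES in the usage example are just placeholders):

```python
import re

def normalise(name: str) -> str:
    """Lower-case, trim, and collapse spacing/punctuation variants so
    that 'n-Hexane', 'N-HEXANE ' and 'n hexane' all compare equal."""
    name = name.lower().strip()
    name = re.sub(r"[\s\-_,]+", "", name)  # drop spaces, hyphens, commas
    return name

def match_lists(client_names, reference):
    """Split a list of raw names into matched and unmatched, given a
    reference dict mapping normalised name -> SMILES."""
    matched, unmatched = {}, []
    for raw in client_names:
        key = normalise(raw)
        if key in reference:
            matched[raw] = reference[key]
        else:
            unmatched.append(raw)
    return matched, unmatched
```

Even with this kind of normalisation, a few thousand of the names will still fail to match, which is exactly the situation described below.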

Sadly, there were only about 900 clear matches between the two databases, so I was left with 2100 to sort out. "All" I had to do was find the SMILES strings (an elegant shorthand chemical notation) for each of those chemicals, then feed that list to HSPiP, which can automatically read the SMILES, do some internal calculations and create a list of properties such as boiling point, vapour pressure and LogP, along with the Hansen Solubility Parameters which are a key part of the safety function for which I'm creating the database. The problem is that, like most datasets, the original data is "dead" in that it just has names and CAS #. In theory it's easy to go from either of those to a "live" SMILES string. You can, for example, type the name into Wikipedia, which generally provides an accurate SMILES for any reasonable chemical for which it has data. But Wikipedia's coverage is too limited, and automating the process is not likely to be a great success.

The obvious answer is ChemSpider, an astounding proof of what a few people with vision and energy can accomplish and then another proof of why eventually it's a good idea to be surrounded by a great organisation, in this case the Royal Society of Chemistry. As a side-track, traditionally such organisations tend to ossify over time. But the RSC has constantly reinvented itself and is in great shape for the 21st century. Those who know me will know that I don't often say nice things about large organisations.

Coming back on track. ChemSpider allows you to throw a bunch of chemical names or CAS # at its API and it will spit out useful data such as SMILES. This is great in theory and, as it happens, for my particular dataset a complete disaster. It's debatable whether it's more work to pre-screen what is sent to ChemSpider, to post-screen what it returns, or simply to do most things by hand. The issue is that names are more-or-less worthless if they haven't themselves been curated by someone who can spot every little detail that can wreck an IUPAC name, or who can spot whether a "common" name really is common, or unique.
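The post-screening side can at least be partly automated. The sketch below is hypothetical (it assumes the batch lookup has already been flattened into (CAS, SMILES) pairs, which is not the real ChemSpider response format): it simply flags a CAS # as ambiguous whenever the structures returned for it disagree, so only those entries need checking by hand.

```python
from collections import defaultdict

def screen_hits(hits):
    """Group (cas, smiles) pairs from a batch lookup and split them
    into unambiguous CAS numbers (one structure) and ambiguous ones
    (conflicting structures that need a human decision)."""
    by_cas = defaultdict(set)
    for cas, smiles in hits:
        by_cas[cas].add(smiles)
    clean = {cas: next(iter(s)) for cas, s in by_cas.items() if len(s) == 1}
    ambiguous = {cas: sorted(s) for cas, s in by_cas.items() if len(s) > 1}
    return clean, ambiguous
```

For example, two hits for 50-00-0 (formaldehyde) that agree on C=O go in the clean pile, while a CAS # that comes back with both CCO and CCC gets flagged for manual review.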

What about CAS #s? These are a disaster too. Because CAS (for sound commercial reasons) has never opened itself to automatic outside checking, and because there is no way (outside CAS) to check a CAS number, there are millions of data entries (and web pages) with wrong CAS numbers, and many datasets (such as ChemSpider) have multiple CAS #s for the same chemical - some genuine, e.g. distinguishing between optically active and racemic versions, many just typos somewhere along the line. So an automated request to ChemSpider might come back with 4 chemicals sharing the same CAS number, 3 of which are (more or less) the same while the 4th is something totally different. Another reason why CAS is a disaster in terms of finding chemicals is that numbers exist (rightly, I suppose) for things like: Distillates (petroleum), depentanizer overheads; Low boiling point naphtha — unspecified; [A complex combination of hydrocarbons obtained from a catalytic cracked gas stream. It consists of aliphatic hydrocarbons having carbon numbers predominantly in the range of C4 through C6.], CAS # 68477-89-4. There is no hope of finding a representative structure in ChemSpider! So I've ended up doing everything by hand.
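One small consolation: while there is no public way to confirm that a CAS # points at the right chemical, the number format itself does carry a published check digit, so simple typos can at least be caught automatically before anything is sent off for lookup. A minimal validator:

```python
def is_valid_cas(cas: str) -> bool:
    """Check a CAS Registry Number's check digit: the digits before
    the final one, read right-to-left, are weighted 1, 2, 3, ...;
    the weighted sum mod 10 must equal the last digit."""
    parts = cas.split("-")
    if len(parts) != 3 or len(parts[2]) != 1:
        return False
    if not all(p.isdigit() for p in parts):
        return False
    body = parts[0] + parts[1]  # all digits before the check digit
    check = int(parts[2])
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check
```

So 7732-18-5 (water) and the petroleum-distillate entry 68477-89-4 quoted above both validate, while a one-digit typo like 7732-18-6 is rejected. This catches mangled numbers, but of course says nothing about a valid number attached to the wrong chemical.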

It is easy to moan. But I want to end on a positive note. I got involved with ChemSpider near its start (and had the pleasure of meeting the amazing Antony Williams at the RSC) and it was always made clear that ChemSpider contained millions of errors, of which my concerns about names and CAS # are a part. The ChemSpider team are constantly using smart tools to improve the quality of the data. But no conceivable team could hack through all the errors, and any "smart" tool that can fix 100,000 errors can easily create 1000 new ones. So they have always relied on a very large, mostly willing, highly knowledgeable and usually opinionated group of people called its users. In the early days it was rather hard to identify and fix an error. But during today's work I happened to find 2 very bad "chemicals" (actually nonsense) with the same CAS # as the real ones. A click of a button, a quick description of the problem and a suggestion to fix it, another button and ... 5 min later ChemSpider comes back saying "Thanks - yep, that's an error, we've nuked that entry." I've probably found 200 errors over the years - a totally insignificant number. But the point is that if every user reported every error they found then, although the task of getting, say, SMILES from a list of chemicals would still not be easy, it would be much less hard.