13 December 2015 CAS as the Evil Empire.

I was once at a meeting at the Royal Society of Chemistry getting some training on the wonderful ChemSpider web service. During the meeting, in some frustration, I said that CAS (the Chemical Abstracts Service) was "the evil empire". Because an excellent person from CAS was there, the remark was taken as being somewhat light-hearted. However, a few days later my remark was repeated at a large gathering of European chemists and, I am told, drew a hearty round of applause. Clearly CAS is not loved by everybody. Why?

The reason for my outburst was the frustration felt by anyone who searches ChemSpider (or most other chemical databases) for a chemical. Although ChemSpider works near-miracles with name searches (it can generally sort out common informal names, though it has more trouble with formal names if you miss out some minor detail), name searches can regularly fail because naming is so complex. Chemical formula searches can drown you in outputs. If you have a SMILES string then life is good - ChemSpider can track down your chemical very rapidly - but the SMILES string is usually the very thing I am on ChemSpider trying to find.

The obvious thing we all try is the CAS number. We know that all chemicals have such a number and that each number is unique. So enter the CAS and all will be well.

But it's not. CAS searches can be frustrating. We might (unknown to us) have the wrong number. Or ChemSpider might have the wrong CAS ascribed to a given chemical. Or it might have no CAS number at all for that chemical. Or the same CAS number might appear in several different, unrelated chemicals. Why, given that ChemSpider is so awesome, can't it do a simple thing like getting its CAS numbers correct?

The answer provided at the meeting explains my frustration. CAS is a black box. A chemical name goes in, a number comes out and then sort of drifts around the world with everyone assuming that if X says that the CAS number for chemical Y is A-B-C then that must be the case. There seems to be no way, even for people at ChemSpider, to go into the CAS database and check that A-B-C is the correct number, or to put in D-E-F and get the correct chemical name and, maybe, something that is of practical use such as a SMILES.

So the world is awash with incorrect CAS numbers and there seems to be nothing anyone can do about it. Now if I were boss of CAS I would make the whole thing open to everyone so that smart people could curate vast datasets (such as ChemSpider) and ordinary folk like me could, given a table of chemicals with their CAS numbers, quickly confirm that the chemicals were the right ones and get key information such as SMILES, so that we could then do something active with that dataset (such as predict all the HSP values using the Y-MB capability built into HSPiP). I have spent countless tedious hours trying to curate datasets with 10s, 100s and in rare cases 1000s of chemicals. You can get 90% right very quickly and the other 10% soak up hours. Then you find that one or two you thought were OK were also wrong, so you have to double-check.
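One check needs no access to the CAS database at all: the final digit of a CAS number is a check digit, so a mistyped number can often be caught before any searching starts. The Python sketch below assumes the published rule (weight the other digits 1, 2, 3, ... from right to left, add them up and take the remainder modulo 10); the function name `cas_checksum_ok` is purely illustrative. It says nothing about whether the number belongs to the chemical you think it does, only that it is well formed.

```python
import re

# Shape of a CAS number: 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def cas_checksum_ok(cas: str) -> bool:
    """Return True if `cas` is well formed and its check digit is consistent."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


for cas in ["64-17-5", "7732-18-5", "7732-18-4"]:
    print(cas, cas_checksum_ok(cas))
```

Ethanol (64-17-5) and water (7732-18-5) pass; the deliberately mistyped 7732-18-4 fails. A wrong-but-valid number sails straight through, of course, which is exactly the problem described above.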

We should not be doing this in the 21st century. It should be trivial to go from CAS to reliable name and structure. But it's not. OK, I know there's another problem, so let's discuss that.

If we could all quickly look up CAS numbers and get a reliable chemical we would then face another problem. Large numbers of substances have CAS numbers but are not pure chemicals. They are things like "Volatile alkane carbon fraction with BPt between 60-100 deg". Unfortunately I often get requests for the HSP or "representative" properties of such materials. What we need, once CAS becomes properly open, is a "representative" chemical for these sorts of materials. In this specific case it would be hexane or heptane - not exact but good enough. Our new best friends in CAS might want to provide a sort of Wiki service which allows the user community to agree on the best representative compound - we'd quickly populate it with structures good enough for our needs.

But I'm dreaming. CAS, of course, are not the Evil Empire. They are providing an important service and it must cost a lot of money to run it. Maybe they can't give it away for free. But just think of the countless hours chemists would save if CAS really were open as the single search point for anyone who wants to know "What chemical is A-B-C?" For me that would be like Christmas almost every day.

Are there alternatives? SMILES are wonderful because you can immediately do something with them. However, they have plenty of limitations, one of which is that although there is a unique "canonical" SMILES for each structure, anyone can enter the SMILES any way they like. Ethanol is canonically CCO, but OCC works just as well for everything other than searching. InChI are wonderful because they too are "live", so you go from code to chemical unambiguously, and they are only ever canonical, so if you do an InChI search for ethanol you are guaranteed success. The downside is that they can be long and cumbersome and are more-or-less impossible to create by hand, whereas I regularly find it quicker to create a SMILES from scratch than to look it up. A unique InChIKey makes it easier to enter a search code as it is always a fixed size even if the molecule itself creates an enormous InChI; however, the InChIKey is as meaningless as a CAS number so you need a good reference database to go from Key to chemical.
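The fixed size of an InChIKey at least makes it easy to spot a truncated or mangled key in a dataset. Here is a minimal Python sketch, assuming the usual layout of 27 characters (14 capital letters, a hyphen, 10 capital letters, a hyphen, 1 capital letter). The name `looks_like_inchikey` is illustrative, and the test checks shape only, not whether the key corresponds to any real chemical.

```python
import re

# Assumed layout: 14 capital letters, hyphen, 10 capital letters, hyphen, 1 capital letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if `text` has the fixed 27-character InChIKey shape."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


# Ethanol: the InChI grows with the molecule, the key never does.
ethanol_inchi = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

print(len(ethanol_key), looks_like_inchikey(ethanol_key))
print(looks_like_inchikey(ethanol_inchi))
```

The key passes and the full InChI does not, which is all a shape test can tell you; going from key to chemical still needs that reference database.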

Much as I love InChI, its adoption seems to be minimal. CAS numbers are so darned easy and ubiquitous that they are not going to get displaced any time soon. So we just have to hope that the nice folk at CAS decide that opening up the database would be a jolly good idea after all.

One final moan about CAS numbers. But this time the Evil Empire is Microsoft. If you paste a dataset with 1000 CAS numbers into Excel you can work away for a few hours doing what you have to do - only to come across a strange date in your CAS column. We've all done it and it's immensely frustrating. Suppose my CAS number is 2015-12-5 (I can't find it in ChemSpider). Excel is "smart" and interprets this as 5 December 2015 and formats it accordingly. Maybe you can excuse this smartness. But if the CAS is 2215-12-5 (2-methoxyfumaric acid) it renders it as 5 December 2215. Now not many of us will ever want to enter that as a date - why can't Excel think "Hmmm - unlikely, I'll leave it as it is"? There is no way one can turn off Excel's "smartness", so most of us have had our Excel datasets ruined by this conversion. OK, it's not too hard to unconvert it, and if you remember in advance you make sure your CAS column is a Text field rather than General, but golly it's a nuisance.
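The "unconvert" step can be sketched in Python for the case where the spreadsheet has been read back in and the damaged cells arrive as date objects. This assumes the year-month-day reading described above (2215-12-5 became 5 December 2215); the function name `cas_from_date` is illustrative. A day above 9 cannot have come from a CAS number, whose final part is a single digit, so such values are rejected rather than guessed at.

```python
import datetime


def cas_from_date(value: datetime.date) -> str:
    """Rebuild the CAS-style text that a spreadsheet turned into a date."""
    if value.day > 9:
        # The last part of a CAS number is one digit, so this was a real date.
        raise ValueError(f"{value.isoformat()} cannot have been a CAS number")
    # The middle part always has two digits, so pad the month with a zero.
    return f"{value.year}-{value.month:02d}-{value.day}"


print(cas_from_date(datetime.date(2215, 12, 5)))  # prints 2215-12-5
```

It is still worth checking the rebuilt values afterwards, since nothing here proves the original cell really held a CAS number rather than a genuine date.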

Oh, by the way. I have to point out that CAS Registry Number is a Registered Trademark of the American Chemical Society. I don't want to upset CAS any more than I have already done.