Squaring the SQL Circle
Much like the ongoing rumblings in the consumer market over the relative merits of rivals iOS and Android, the SQL/NoSQL face-off occupying the world’s database developers, (and giving rise to countless blogs and online forums) shows no sign of abating. In the past few years, there has been increasing interest in the topic and the recent buzz about sentiment trading has only added fuel to the NoSQL fire.
Of course, although the topic is often labelled as SQL/NoSQL, the real underlying issue is the comparison between relational and non-relational databases. Indeed, we could even class it as traditional databases versus new databases such as Cassandra and MongoDB. It has earned this title because SQL is the lingua franca of relational databases, but is little used in non-relational databases.
In financial markets, the primary requirement of the database is for storing and analysing very high volumes of financial market data, with best possible performance. Therefore, the key to identifying the best possible database for your firm’s needs is to establish the relative importance of performance and cost.
Database performance is critical to algorithmic trading, risk management, complex event processing, compliance and reference data management. Financial market databases typically have simple structures that are already well handled by the tabular format of traditional SQL databases. Where there are hundreds of millions of records that follow a relatively standard format, it is possible to make certain assumptions that can give better performance by orders of magnitude. In this case, the NoSQL databases offer no advantage, and not only will they have higher latency and comparatively poor performance, but they may also lack the required features such as support for specialised data types.
One obvious consideration is how the data itself may be stored. For example, join and query performance depends on being able to access values in specific columns very efficiently, so is much better when the underlying data is stored by column rather than by record. This benefit extends under the covers: corporate adjustments can be applied on the fly as data is read in, there is no need for special indices or complex query optimizers, and since columns usually have identical data type throughout, they can be very efficiently compressed. In contrast, key-value store is doubly slow – not only is data stored by record, but each record itself may have a complicated structure.
Also, in the typical database use for market data and applications such as network monitoring, a single server machine could handle an entire database. For example, the NYSE TAQ feed currently has around a billion records per day (down from a peak of more than 2 billion). This can easily be handled by a single machine, and indeed the entire day’s feed can be stored in memory. This is the preferred set up, since it means that the machine can respond immediately to queries, instead of having to farm out queries over a network. From a design perspective, it is also a more elegant option – not to mention easier to maintain. It can even run complex analytics as the data is received. If a server is heavily loaded because of a large number of clients or complex queries, then the solution would be to set up another server with its own copy or a replica of the database, i.e. the data feed would then go to two or more machines.
Since the work involved in storing, accessing and updating NoSQL records is much larger, the volumes that any one machine can handle are much lower than for traditional databases. In fact, NoSQL databases are typically intended to be distributed over several machines, and so must manage distributed queries, load balancing, backups, replication and so on. Such databases typically have poor query performance. For example, a join may need to read from several machines, and each read has to work with the internal structure of the key-value record. For this reason, some of these databases do not support ad hoc queries like joins, instead requiring custom routines for such data access.
NoSQL databases typically use a key-value store. Suppose a database is used for storing website orders. An order might be considered as a single transaction, but might encompass several items, special instructions and other meta data. Storing that in a traditional SQL database might require adding many rows to several tables. This obviously complicates the design, and in particular, would be troublesome for a distributed database. However, in the key-value store model, an entire transaction would be stored in a single record, where the key might be the transaction number, and the value might be a list of items that correspond to the columns of a traditional database, but where each item in turn can have structured contents such as lists of parts, prices, timestamps and so on. Necessarily, the work involved in storing, accessing and updating such records is much larger than for records in a typical SQL database.
There’s no doubt that NoSQL developments offer a great deal of potential when deployed in the circumstances for which they were intended. However, they are currently less well-suited to the demanding, time-series oriented applications required by financial market data. For the current needs of trading firms, SQL databases offer superior performance and cost, but this debate looks set to run and run.