Kx Systems: Data Management for Successful Algo Trading in China
With the growing interest in algo trading in China comes an increasing awareness of the need for high-performance management of market data. The computer algorithms underlying any automated trading system must use actual market data to make decisions, and the same data feeds associated activities such as pre-trade risk management, regulatory compliance and reference data. A related issue is that the growing number of market data sources and increasing product complexity mean that data is best handled by centralized high-performance data management systems that can feed applications throughout the enterprise.
The data volumes can be huge. Worldwide equity markets alone produce around 3 billion records per day, with options and FX volumes at least as large. An essential goal is the ability to keep up with these market volumes, and in particular with bursts of activity during the day and with the expected growth over the next several years. All this presents an exciting challenge for China’s investment industry: to develop or acquire the technology to manage this data effectively.
At the moment, many financial institutions in China either do not keep their own market data at all, relying on third-party vendors where necessary, or else store their data in a traditional database format that does not support the high-performance queries demanded by algo trading or other real-time data processing. Some companies have written custom-built applications with data stored in flat files – this works fine for storing the data, but tends to be inconvenient for ad-hoc queries, as these must be hand-coded in a low-level programming language.
Another common problem is that data is stored in exactly the same format in which it was received – but again, this format is often inefficient to query. What is needed is a way of storing the data that also permits very efficient queries on it.
In planning for this, it is helpful to consider real-time data (current market activity) and historical data (previous market activity) separately. At least one of these is needed, and nearly all companies need both. For example, even trading algorithms or risk management systems that work solely on real-time data might nevertheless be developed and back-tested on historical data. Each has its own special concerns.
For real-time data, a key goal is the lowest possible business latency, i.e. the time from the arrival of new data from the exchange to the trading application that makes decisions based on it. The competition for low latency hinges on the fact that if two companies run similar trading strategies, the one that delivers new data to its algorithms first will perform better. The measure that matters is how much faster than the competition this can be done, not the absolute time taken from the exchange. Typically, the key is to reduce the number of steps through which the data is routed, the time taken by each step, and the time spent converting data between formats along the way.
Once the exchange data has been received, some key strategies for minimizing latency are:
* Store real-time data in-memory. This gives the fastest performance on analytics and queries, and is quite feasible on modern machines, where RAM can run to hundreds of GB. Data is then written out to the historical database at the end of the day.
* Run analytics directly on the data as it is received, rather than storing data first and then sending it to a separate process for evaluation.
* Eliminate the time and memory cost of marshalling data between different formats by using a single database format throughout – for streaming queries, CEP, intra-day storage and history.
* Use publish and subscribe mechanisms to offload processing from the main server to chained servers, thus making maximum use of cores and machines.
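The in-memory and publish-and-subscribe strategies above can be sketched as follows – a minimal, illustrative Python example, not an actual Kx product API (the `TickPlant` and `VWAP` names are hypothetical). Each tick is appended to in-memory column lists and pushed straight to subscribers, which update their analytics as the data arrives rather than querying it afterwards:

```python
from collections import defaultdict

class TickPlant:
    """Minimal in-memory tick store with publish/subscribe (illustrative sketch)."""
    def __init__(self):
        # Column-style in-memory storage: one list per field.
        self.table = {"sym": [], "price": [], "size": []}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, sym, price, size):
        # Store the intra-day data in memory for fast queries...
        self.table["sym"].append(sym)
        self.table["price"].append(price)
        self.table["size"].append(size)
        # ...and run analytics directly as each tick arrives.
        for callback in self.subscribers:
            callback(sym, price, size)

class VWAP:
    """Running volume-weighted average price, updated on every tick."""
    def __init__(self):
        self.notional = defaultdict(float)
        self.volume = defaultdict(float)

    def on_tick(self, sym, price, size):
        self.notional[sym] += price * size
        self.volume[sym] += size

    def value(self, sym):
        return self.notional[sym] / self.volume[sym]

plant = TickPlant()
vwap = VWAP()
plant.subscribe(vwap.on_tick)
plant.publish("600519", 1700.0, 100)
plant.publish("600519", 1710.0, 300)
print(vwap.value("600519"))  # 1707.5
```

In a production system, the subscribers would typically be chained servers on other cores or machines rather than in-process callbacks, but the principle is the same: the analytics see each tick once, at arrival, with no intermediate store-and-reload step.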
For historical data, the main problem is the sheer size of the data, typically running into the terabytes. This data must be stored in a way that permits efficient access for trading and monitoring applications. Unlike the real-time database, a historical database of this size cannot be held in RAM; it must be stored on disk.
Several strategies can be used to maximize performance on large historical databases.
* Typically, the historical data is partitioned across several drives, so queries can run in parallel on the partitions: a server farms out queries to slave processes, each accessing a specific drive. Within each drive, data is further partitioned by date, with records sorted by symbol, giving the best performance on typical queries.
* Data should be stored in columns, which makes typical queries orders of magnitude faster than with traditional row-oriented databases.
* Data compression can be used to reduce storage requirements, and it also reduces the amount of data moved across disks and networks.
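The partitioning and columnar-storage strategies above can be sketched in Python as a toy on-disk layout (the helper names and directory scheme here are illustrative assumptions, not the actual kdb+ format): each date gets its own directory, each column its own file, so a query touches only the dates and columns it needs, and independent date partitions can be queried in parallel:

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_partition(root, date, table):
    """Store one day's records as one file per column (columnar layout)."""
    part = os.path.join(root, date)
    os.makedirs(part, exist_ok=True)
    for col, values in table.items():
        with open(os.path.join(part, col), "wb") as f:
            pickle.dump(values, f)

def read_column(root, date, col):
    with open(os.path.join(root, date, col), "rb") as f:
        return pickle.load(f)

def avg_price(root, date, sym):
    """Read only the two columns this query needs, from one date partition."""
    syms = read_column(root, date, "sym")
    prices = read_column(root, date, "price")
    hits = [p for s, p in zip(syms, prices) if s == sym]
    return sum(hits) / len(hits)

root = tempfile.mkdtemp()
write_partition(root, "2013.06.03", {"sym": ["a", "b", "a"], "price": [10.0, 99.0, 12.0]})
write_partition(root, "2013.06.04", {"sym": ["a", "b"], "price": [14.0, 98.0]})

# Date partitions are independent, so per-date queries can be farmed out
# to parallel workers (in production, to slave processes on separate drives).
dates = ["2013.06.03", "2013.06.04"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda d: avg_price(root, d, "a"), dates))
print(results)  # [11.0, 14.0]
```

The columnar layout is what makes the query cheap: a table with dozens of columns can be scanned by reading only the two column files involved, and adding per-file compression to such a layout is straightforward.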
Putting data front and center
As the trend toward algo trading takes hold in China, early adopters have an opportunity to steal a march on their competitors. In other markets, high-performance management and warehousing of market data have often been treated as an afterthought when implementing an automated trading strategy, primarily because expanding and updating existing platforms can be an arduous and resource-intensive process, while these elements are not in themselves directly revenue-generating.
Yet Chinese firms making the move to algo trading and installing new IT infrastructure have the chance to put data front and center, and benefit from the lowered risk and greater alpha that can result. Centralized high-performance data management systems feed vital information throughout the enterprise at lightning speed, while optimized warehousing ensures that large volumes of stored data can be queried quickly and efficiently. At the same time, managing real-time data and historical data via a single central system ensures the lowest latency and maximum performance across both.