Big Data, Advances in Computational Sciences, and Oncology Care


As information technology has matured, the embrace of data-driven analytics by organizations and industries to guide services has profoundly changed the human experience. Just-in-time inventory, predictive pricing models, and simple applications such as web searches have generated immense gains in productivity, efficiency, and reliability. The past 40 years have also seen a revolution in the use of data in health care. Within medicine there is an explosion of registries, such as the National Anesthesia Clinical Outcomes Registry (NACOR) from the Anesthesia Quality Institute, which tracks clinical events and provides practice benchmarks and quality reporting for its members. Requiring a mix of manual data entry and automated data extraction, these efforts attempt to build a data storehouse that helps identify trends and illuminate practice. Within the wider scope of medicine, efforts are underway to incorporate information from many diverse sources, such as electronic health records (EHRs), genomic testing, claims, and public data sets. These advances will profoundly sculpt the face of health care, from patient experience to patient outcomes. The impact of these forces cannot be overstated.

Little Data to Big Data

Database technology is ubiquitous in our modern existence. Its roots trace back to the 1970s, when computer systems were still fairly novel. Commonly, the mainframe vendor engineered a closed software platform and was the sole source for the end-user applications: the hardware, operating system, and software were available only from that one vendor. Since there were no competitors at the software layer, standards were internal and ad hoc. This created hardware vendor silos that inhibited third-party developers. With no competition, general product advancement was slow and development costly. Each application was a unique development effort and rarely leveraged work from other products. Because of this vertical integration, it was common for application designers to create and recreate ways to store, access, and aggregate data.

To overcome this, researchers at the IBM Corporation began to investigate how a layer of abstraction could be constructed between the application and the data. In creating this boundary, application programmers were no longer required to code their own data handling routines or other tools to manipulate the data cache. This break between data management and application programming was foundational, and a standard interface was created to bridge the divide. In 1970, Dr. Edgar F. Codd of IBM's San Jose Research Laboratory published his work describing the relational model of data, and in the mid-1970s an IBM team led by Donald Chamberlin and Raymond Boyce built on that model to create a standardized data query language called SEQUEL. SEQUEL became the bridge and interface between data and application. Shortened to SQL over time, the language provided standard tools for programmers to manipulate data without regard to how it was actually stored or represented at the file level. In creating this split, the database management system (DBMS) became a separate service to the programmer, much like the screen display or the network. SQL was initially proprietary to IBM, but in 1979 a company named Relational Software shipped its own competitive and compatible database management product built on the published SQL specification. Relational Software would later be known by the corporate name of Oracle.

The database that Codd described in his publications, and that IBM implemented in software, was known as the relational model. The main competing structure was the hierarchical model, which could be thought of as a tree with branches, or nodes, for different items. Relational databases organize data into tables. Tables can stand alone, like a spreadsheet, or be linked together by keys. The keys trace a map of the data, linking the data elements and defining how they relate to each other. A SQL query leverages these relationships to pull data that conforms to both the constraints of the data model and the constraints of the SQL language. Given the standardization of database technology and the explosion in data storage, relational databases became part of the fabric of computing. They enforced discipline on the data model through the syntax of the SQL language and ensured that, when using a database engine (relational database management system; RDBMS), data could not be inadvertently corrupted through SQL-based manipulation. Designed in the 1970s, these structures worked well and delivered robust and reliable performance. As computing, storage, and networking continued to evolve, however, the flood of data requiring processing and analysis expanded at an exponential rate. SQL was born in the primacy of the mainframe. In the world of the web, the Internet of Things (IoT), and the digitization of whole new categories of daily artifacts, SQL and the relational database began to show their 50-year roots as they were challenged by the rise of the server.
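
As a concrete illustration, the short sketch below uses Python's built-in sqlite3 module to create two tables linked by a key and to join them with a SQL query. The table and column names (patient, encounter, patient_id, visit_type) are hypothetical and chosen only for illustration; the point is that the programmer works entirely through SQL while the DBMS handles physical storage.

```python
import sqlite3

# In-memory database: the DBMS, not the application, decides how rows are
# physically stored and retrieved.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables linked by a key (patient_id); the key defines how the data
# elements relate to each other.
cur.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE encounter ("
    "encounter_id INTEGER PRIMARY KEY, "
    "patient_id INTEGER REFERENCES patient(patient_id), "
    "visit_type TEXT)"
)
cur.execute("INSERT INTO patient VALUES (1, 'Doe, Jane')")
cur.execute("INSERT INTO encounter VALUES (100, 1, 'preoperative evaluation')")

# The query leverages the key relationship; the programmer never touches
# the underlying file format.
cur.execute(
    "SELECT p.name, e.visit_type "
    "FROM patient p JOIN encounter e ON p.patient_id = e.patient_id"
)
print(cur.fetchall())  # [('Doe, Jane', 'preoperative evaluation')]
```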

NoSQL Technology

SQL-based technologies were hitting a performance wall. As the size of a database increased, it required more processing horsepower, and although processors were accelerating in capacity, the rate of data generation outstripped their ability to scale. In the 1990s the World Wide Web (WWW) was beginning to grow by leaps and bounds. Architects in this environment scaled web server performance by spreading the load across multiple compute cores on multiple servers; they were able to scale horizontally. The RDBMS vendors struggled to use this type of architecture. The mainframes where SQL was born were single-processor devices, and because of the way the RDBMS interacted with data, it was technically difficult to make the application work across multiple servers. It was constrained to scale only vertically, i.e., a single compute core had to run faster to deliver improved performance. The pressure to create a more scalable and flexible model continued to build.

Driven by the large web utilities (Facebook, Google, LinkedIn, and others), internal research and development resources were tasked with exploring alternative paths. The goal was a flexible framework that could handle a diverse set of data types (video, audio, text, discrete, and binary) using racks of inexpensive hardware. Unlike the RDBMS/SQL solution, with its standard interface, defined language, and deterministic performance, the new framework, now known as NoSQL, was extremely heterogeneous and spawned a number of database management systems built to optimize specific classes of use cases. Over the past decade, work has gone into classifying data storage and retrieval challenges, and a given task typically can be addressed using one of four categories of NoSQL topology: key-value, column-oriented, document-oriented, and graph databases. Underlying these tools is the ability to replicate data across multiple servers, harness the power of numerous processors, and scale horizontally. These database management systems underpin the infrastructure of Facebook, Twitter, Verizon, and other large data consumers.
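
The four topologies can be sketched with plain Python data structures; the records below are toy examples intended only to show how each topology organizes data, and are not drawn from any specific NoSQL product.

```python
# Toy, in-memory illustrations of the four common NoSQL topologies.
# All field names and values are hypothetical.

# 1. Key-value: opaque values retrieved by a single key.
key_value = {"session:8821": b"serialized blob"}

# 2. Column-oriented: values for one column stored together, which makes
#    scans and aggregation over a single attribute cheap.
column_store = {
    "patient_id": [1, 2, 3],
    "heart_rate": [72, 88, 61],
}

# 3. Document-oriented: each record is a self-describing document whose
#    fields can vary from one document to the next.
document_store = [
    {"_id": 1, "name": "Doe, Jane", "allergies": ["penicillin"]},
    {"_id": 2, "name": "Roe, Rich", "implants": {"type": "pacemaker"}},
]

# 4. Graph: nodes and edges make relationships first-class, so traversals
#    ("which providers treated this patient?") are direct lookups.
graph = {
    "nodes": {"patient:1": {}, "provider:9": {}},
    "edges": [("provider:9", "treated", "patient:1")],
}
```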

In health care, the data environment is still evolving. As the installed base of EHRs continues to climb, volumes of granular data about and surrounding the management of each encounter are recorded in digital form. Dwarfing this is the volume of imaging data captured. Penetration of digital radiology workflows is substantial, and the images captured amount to petabytes of data within individual institutions. By 2013, more than 90% of US hospitals had installed digital imaging, and adoption of 3D image reconstruction in hospitals now exceeds 40%. Technology is now transforming pathology workflow, with digital whole slide image capture becoming more widely adopted. This is a massive data management exercise, as individual whole slide images can run to gigabytes each. Additionally, the slide images required to support a diagnosis include scans of the entire specimen at different levels of magnification. For liquid slides, such as bone marrow or blood smears, the volume increases further because images must also be captured at different focal planes.

Within the operating room (OR) and critical care environments, streaming data is ubiquitous. The current generation of anesthesia machines, infusion pumps, noninvasive monitors, and ventilators emits a continuous river of information on a second-by-second basis. Physiologic and machine data from a large hospital can comprise hundreds of kilobits of discrete data per second, 24 hours a day. Buried in these data are clues warning of patient deterioration, sepsis, and intraoperative events. The tools necessary to find these signals must operate on data in motion rather than data at rest. While NoSQL technology can be used to store data, its value is manifest when it is also used to process data. In this setting, streaming data from different patients flow into a data processing engine and are segregated by patient. These individual patient data streams can then be processed by algorithms colocated on the same servers as the data; this combination of processing and storage on the same platform sets NoSQL apart from a performance standpoint. Such algorithm/storage pipelines can continuously scan the data for predetermined signals. With a traditional RDBMS, by contrast, applications outside the database would continuously make SQL calls for data to feed the analytics, and this analytic SQL traffic would compete with the SQL traffic driven by data ingestion. Leveraging a NoSQL data structure that incorporates pipelines and the ability to process data locally enables analysis with low-cost hardware and high throughput.
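
The idea of per-patient pipelines with colocated processing can be made concrete with a minimal sketch: a toy stream handler that segregates incoming vital-sign messages by patient and runs a simple threshold check against each patient's local buffer. The message format, the heart-rate threshold, and the function names are assumptions for illustration only, not a clinical algorithm or a specific product's API.

```python
from collections import defaultdict, deque

# Per-patient buffers live alongside the processing logic, mirroring the
# colocated storage/algorithm pipelines described above.
BUFFER_SIZE = 60          # keep roughly the last minute of 1 Hz samples
HR_ALERT_THRESHOLD = 130  # illustrative threshold, not a clinical rule

buffers = defaultdict(lambda: deque(maxlen=BUFFER_SIZE))

def ingest(message):
    """Route one streaming message, e.g. {'patient': 'A', 'hr': 118},
    to its patient's buffer and run the local check immediately."""
    patient = message["patient"]
    buffers[patient].append(message["hr"])
    return check_deterioration(patient)

def check_deterioration(patient):
    """Toy signal detector: flag a sustained elevated heart rate."""
    recent = list(buffers[patient])[-5:]
    if len(recent) == 5 and all(hr > HR_ALERT_THRESHOLD for hr in recent):
        return f"ALERT: sustained tachycardia for patient {patient}"
    return None

# Simulated stream from two monitors interleaved on one pipeline.
stream = [{"patient": "A", "hr": 135}] * 5 + [{"patient": "B", "hr": 72}]
for msg in stream:
    alert = ingest(msg)
    if alert:
        print(alert)
```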

Medicine is in an age of information. Whether the application requires a SQL-type database or a NoSQL data processing engine to handle the ingestion and analysis of terabytes, the tools and technology to cope with this torrent of data are well developed and readily available.

Computational Advances

The first operational electronic computing machines were built during World War II to assist in breaking the codes of enemy message traffic. These were purpose-built machines and used the technology of the day: telephone relays, vacuum tubes, and paper tape. With the commercial advent of the transistor in the mid-1950s, an enormous change took place as vacuum tubes were replaced by transistors and, later, integrated circuits. The function of a tube, switching from on to off to represent a logic one or zero, went from the size of a baseball to that of a flea. By the early 1960s, designers were able to put more than one transistor in a package, and by connecting multiple transistors together on a single piece of silicon they could create "chips" used as logic building blocks.

During this era, Gordon Moore was director of research and development at Fairchild Semiconductor, and through his observation of the technological trends within the industry, the underlying physical science, and the economies of scale, he formulated what has come to be known as Moore's law. Dr. Moore stated, "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term, this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years." In this statement from 1965, he predicted a log-linear relationship that would hold true for more than 50 years (see Fig. 61.1). Gordon Moore went on to cofound Intel Corporation.

Fig. 61.1, From Max Roser. https://ourworldindata.org/uploads/2019/05/Transistor-Count-over-time-to-2018.png , CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=79751151 .
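
The log-linear relationship Moore described can be written compactly as below; the doubling period T (commonly quoted as roughly 18 to 24 months) is included for illustration and is not a figure stated in the chapter.

```latex
% Transistor count N(t) under a constant doubling period T,
% starting from N_0 transistors at time t_0:
N(t) = N_0 \cdot 2^{(t - t_0)/T}
% Taking the logarithm gives a straight line in t, i.e., the
% log-linear trend plotted in Fig. 61.1:
\log_2 N(t) = \log_2 N_0 + \frac{t - t_0}{T}
```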

The relentless advance of semiconductor technology has paved the way for smaller devices and increased computing capacity. Paired with this is the bountiful harvest of connected technologies benefiting from an increased understanding of materials science and physics. Storage technology, networking bandwidth, and video displays all have similar cost-performance curves, with incredible capabilities contained within inexpensive packages. Modeled after Moore's law and driven by some of the same economic pressures, Edholm's law (describing data rates on wired and wireless networks; see Fig. 61.2) predicts a similar log-linear improvement in performance in these industries.

Fig. 61.2, From Cherry, S. Edholm’s law of bandwidth. IEEE Spectrum . 2004;41(7):58–60. doi: 10.1109/MSPEC.2004.1309810 .

Cloud Computing

In the mid-2000s, as predicted by Edholm's law, network bandwidth had increased to the point where connectivity was no longer the usual limiting factor in application functionality. Previously, on-premises compute and storage in specialized data centers had been seen as the only choice for hosting business-class applications. With many applications moving to the WWW, mobile platforms becoming more robust, and server compute power continually increasing, application owners began looking to outsource the management of hardware.

Cloud computing refers to the use of shared resources (storage, compute, and networking) that are often located in IT infrastructure owned by a third party. The cloud has been widely adopted because of its ease of use and low upfront capital costs. In July 2002, Amazon launched Amazon Web Services (AWS), opening to customers a platform that until then had been used as an internal resource. From the end-user perspective, items such as maintenance, upkeep, and security become contractual terms handled by the cloud provider rather than tasks staffed by the customer. Eliminating this overhead lets organizations take an incremental approach to the technology, with the ability to increase size or capability rapidly, often needing only a contract amendment. Individual researchers using these services for analytics or machine learning applications are charged only for the resources they use, when they use them, lowering the barrier to accessing world-class platform technology. Today Amazon (AWS), Microsoft (Azure), Google (Google Cloud Platform), VMware, and IBM (IBM Cloud) are dominant forces within the industry.
