Big Data

This abstract was mainly created because I once held a presentation on this topic. Alongside that, some ideas and correlations are explained and written down here. The original slides are available here

Definition

To understand what Big Data is and how it works, I want to present an example from 2019. Back then, a German data scientist by the name of David Kriesel held a talk about the Deutsche Bahn (the German national railway). The talk set out to show how pretty much every positive stereotype about both rail travel and Germany was wrong. David's idea was to collect publicly accessible data about train operations from the DB and then visualize and interpret that data with regard to performance metrics. If you are interested, the whole talk alongside companion material is available here.

David collected data from public API endpoints in cooperation with the DB, resulting in quite hilarious facts about internal operations. This is called data acquisition. The dataset, distilled from a projected 14 TB of network traffic, is technically still analyzable by hand; however, I think it qualifies as Big Data because it is both too impractical and too time-intensive to acquire and work with manually. He then analyzed the compiled ~120 GB dataset using a database management system and some carefully formulated queries. This resulted in data-derived facts about the dataset, where accuracy and applicability are directly related to the query formulation. A good query can even distinguish bad data from good data: if there are errors in the collected data, an integrated plausibility check can fix a lot of them.
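To make the query-driven part of this concrete, here is a minimal sketch of what such an analysis could look like. The database file, the `stops` table and its columns are assumptions for illustration, not the schema David actually used; the point is that the plausibility check lives right inside the query.

```python
# Minimal sketch of query-based analysis with a built-in plausibility check.
# The table name and columns (station, scheduled, actual) are assumptions,
# not the schema used in the actual DB talk.
import sqlite3

conn = sqlite3.connect("train_stops.db")  # hypothetical database file

query = """
SELECT
    station,
    AVG((julianday(actual) - julianday(scheduled)) * 24 * 60) AS avg_delay_min,
    COUNT(*) AS n_stops
FROM stops
WHERE actual IS NOT NULL
  AND scheduled IS NOT NULL
  -- plausibility check: discard stops with delays below -15 min or above 24 h,
  -- which are most likely logging errors rather than real events
  AND (julianday(actual) - julianday(scheduled)) * 24 * 60 BETWEEN -15 AND 1440
GROUP BY station
ORDER BY avg_delay_min DESC;
"""

for station, avg_delay, n in conn.execute(query):
    print(f"{station}: {avg_delay:.1f} min average delay over {n} stops")
```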

The (complex) data derived this way usually needs to be visualized for qualitative analysis; the result is called a visualization. Common visualizations include diagrams, flow charts or maps. Maps are ordered diagrams spanning a 2D plane of related properties, where the topological and functional relations of data points or point clouds are more important than their precise location in the plot. Oftentimes these diagrams do not provide axis scaling but instead use edges and vertices, in the graph-theory sense, to relate specific points.
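As a simple illustration of such a visualization step, here is a sketch that turns aggregated numbers into a diagram. The delay values are made up for demonstration; in practice they would come out of an aggregation query like the one above.

```python
# Minimal visualization sketch: average delay per hour of day.
# The numbers are invented placeholders, only the mechanism matters.
import matplotlib.pyplot as plt

hours = list(range(24))
avg_delay_min = [2.1, 1.8, 1.5, 1.4, 1.9, 2.8, 4.5, 6.2, 5.9, 5.1, 4.8, 4.9,
                 5.2, 5.0, 5.3, 5.8, 6.5, 7.1, 6.8, 6.0, 5.2, 4.1, 3.2, 2.5]

plt.figure(figsize=(8, 4))
plt.bar(hours, avg_delay_min)
plt.xlabel("Hour of day")
plt.ylabel("Average delay (minutes)")
plt.title("Hypothetical average train delay by hour")
plt.tight_layout()
plt.show()
```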

Generally, data qualifying as Big Data can't be worked on directly but rather needs to be machine-digested and visualized to then, in turn, be analyzed by a human. The whole idea behind the tools and methods related to Big Data is to make barely workable data understandable for humans. The amount of output data somewhat scales with the amount of input data, so people always have an interest in working with Big Data. As my dodgy wording suggests, there is no universally accepted yet accurate definition of Big Data. The Wikipedia definition is based on qualities of the dataset, namely volume, variety, velocity, veracity and value. Big Data thus needs to be too much to deal with manually in a lifetime (usually multiple TB), too diverse to be representable with simple means like tables, and too fast for humans to work on directly (i.e. real-time). It also needs to be reliable, 'good' data on which conclusions can be based (veracity), as well as worth the effort relative to the cause (value). Note that all of these qualities can be expressed explicitly and derived analytically from the dataset.

These qualities enable good scientific meta-analysis of given projects; however, they do not make it easy to identify what Big Data really is (about). Simply put, I define Big Data as data that cannot be (statistically or otherwise) analyzed by a human directly. The data needs to be pre-digested by a machine, then compiled into a humanly analyzable dataset (usually a visualization), and can thus be analyzed manually. The process of collecting Big Data and then looking for patterns in it is usually referred to as data mining. Data in this context is meant in its elemental form: two or more presumably correlated values from a common context. Values may include numbers, text, images or other data; basically anything goes, unlike statistical data, which is almost exclusively quantitative. Different types of data linked together in this way are often called multimodal data and are especially relevant to machine learning. The source data is also more diluted and spread out compared to 'regular' statistical data. Some companies utilize internal statistics with data from their customers, mostly to improve their own services and thus generate more value. This practice is sometimes described in ads as Big Data for marketing reasons, although there really are just 5 data scientists maintaining Excel spreadsheets in the background. This practice of company-based large-scale data analysis is really just called business analytics.

Alongside Big Data, the term Small Data was also coined for human-analyzable data. I am not a fan of this classification; however, it is useful to know in case you encounter it in, e.g., articles. The main takeaway from this section is that Big Data has no universal, clear-cut definition (yet). The contextual definitions of the bold terms are important for the rest of the article.

Origins

In order for Big Data to eventually receive both the publicity and commercial interest it has today, certain prerequisites had to be met:

Data Generation

The first thing is having data to analyze. This requires the entire lineage of cultural advancements from conscious thinking via writing and numbers up to machine storage. Statistical projects are inherently bound to the amount of data that can be stored. Humans struggle with parallel tasks and can (usually) hardly keep track of three independent counters simultaneously. Being able to write down data efficiently meant larger datasets and more complex conclusions. Simultaneously, measuring more data meant more fields of potential research. The methods of measuring evolved and enabled us to express many physical properties in numbers. Today we are able to mechanically measure distances in the Angstrom range (0.1 nm) with atomic force microscopes (or form tracers in mechanical engineering), which really just leaves me speechless. Additionally, we tend to collect more and more data. Data storage is getting cheaper by the day, and with a digital lifestyle we record and store everything. The broad movement of storing more and more data is called datafication.

Data Analysis

As will be explained in the Science paragraph, the development of analytical methods was a direct prerequisite. The field of classical statistics was the precursor to Big Data, and a lot of the tools carry over or were purposefully adapted. Additionally, the development of calculation machines and later information technology was essential, as it first enabled the machine digestion of (Big) Data.

Humans being able to do science at all (and Big Data, as it is a universal tool) really is a culmination of 12,000 years of human culture. All aspects of advancement play into this achievement, and Big Data is almost like another evolutionary step in science, another cultural advancement. What changed is that humans previously used their minds to detect patterns in data. To this day, this is the best tool we have available for pattern recognition. However, as per the definition of Big Data, simply throwing human pattern recognition at Big Data does not work out. As a solution, we taught machines to first analyze data for us with pattern recognition algorithms we derived from our own way of thinking. For Big Data, we additionally taught machines to augment data and generate ('compile') subordinate datasets, which in turn are the datasets we are used to.

Science

The foundation of Big Data is statistics. This is the branch of mathematics it is categorized under, and the tools used there carry over as well. The general approach is similar to any statistical project (planning, gathering, checking, analysis and interpretation); the difference is in the source data and its specific handling. The advancements on the hardware level mentioned above enabled the machine handling of data and wouldn't have been possible without certain key discoveries over the years. Two notable hardware advancements were virtualization and parallelization. The gist of these two techniques is to 1. unite computing resources into one virtual computing system which can work on calculations regardless of physical constraints like power or space, and 2. work on multiple tasks simultaneously and efficiently. Thousands of engineers and mathematicians enabled this through the craziest of contributions over the years.
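To illustrate the parallelization idea, here is a minimal sketch that splits a dataset into chunks and aggregates them on several cores at once. The data is synthetic; in a real pipeline the chunks would be files or database partitions distributed across a (possibly virtualized) cluster.

```python
# Minimal parallelization sketch: independent partial aggregates per chunk,
# combined at the end. Synthetic data stands in for real partitions.
from concurrent.futures import ProcessPoolExecutor
import random

def partial_sum_and_count(chunk):
    """Aggregate one chunk independently of all others."""
    return sum(chunk), len(chunk)

if __name__ == "__main__":
    data = [random.random() for _ in range(1_000_000)]
    chunks = [data[i::8] for i in range(8)]  # 8 roughly equal slices

    with ProcessPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))

    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print(f"mean over {count} values: {total / count:.4f}")
```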

Part of these contributions were due to the rise of video games. The need for fast and accurate 3D calculations created a new demand for operations on data represented as vectors (i.e. arrays). The resulting advancements in graphics processing units and the accompanying software meant that operations on large, abstract datasets could be done efficiently. This is great both for video games and for our use case, complex statistical data.
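A small sketch of what such a vectorized operation looks like in practice: one expression over a whole array instead of a loop over individual values. The point data here is random placeholder content.

```python
# Minimal sketch of vectorized array operations.
# On a GPU-backed library, the same expression would be dispatched
# to thousands of parallel cores.
import numpy as np

positions = np.random.rand(1_000_000, 3)  # a million 3D points (placeholder data)
offset = np.array([1.0, 0.5, -2.0])

shifted = positions + offset                  # one operation over all points
distances = np.linalg.norm(shifted, axis=1)   # length of each shifted vector

print(distances[:5])
```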

Another advancement was the utilization of so-called tensors and tensor calculations. Tensors are data structures that essentially map vectors into one larger structure so that all vectors coexist in a common context, the same vector space. This is similar to functions: y = x linearly maps x to y. A tensor maps vectors from spaces V and W into a common vector space U = V ⊗ W (over a common field K).

Tensors are perfect for Big Data because they offer a way for digital data from different contexts and types to be unified in one tensor. One could essentially make a tensor from correlated images, text and sensor data like temperatures and then use that tensor to compute on these value groups en masse. There are prefabricated solutions for this, backed by industry giants like Nvidia through hardware or Google through TensorFlow.
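Here is a minimal sketch of how such multimodal data might be grouped into tensors. The shapes and random placeholder values are assumptions chosen purely for illustration; real pipelines (e.g. in TensorFlow) work the same way, just at far larger scale.

```python
# Minimal sketch of grouping multimodal data into tensors (toy shapes, random data).
import numpy as np

batch = 4
images = np.random.rand(batch, 64, 64, 3)               # image tensor: batch x height x width x channels
token_ids = np.random.randint(0, 30_000, (batch, 16))   # text as a tensor of token ids
temperature = np.random.normal(20.0, 5.0, (batch, 1))   # sensor readings

# Each modality lives in its own tensor, but they share the batch dimension,
# so sample i in one tensor corresponds to sample i in the others.
for name, t in [("images", images), ("token_ids", token_ids), ("temperature", temperature)]:
    print(name, t.shape)
```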

Fields of Use

Big Data has a lot of applications. One focus of the aforementioned presentation was to outline the impact of Big Data on society, so I will mainly be talking about government intelligence in this article. To give civilian use cases, I included LAION and Science@Home, which are also passion projects of mine.

Science@Home

Science@Home is a group of researchers initially based around Jacob Sherson of Aarhus University. The idea was as quintessentially 'science guy' as it was ingenious: instead of doing calculations and experiments themselves, one could also take the path of least resistance and let other people do it for them. Science@Home originated as the video game "Quantum Moves", where players had no right or wrong way to play the game; instead there were multiple more or less optimal solutions to the problems. The video game reported player solutions back to the researchers, creating a huge dataset of controlled, mathematically defined experiments with slightly different input variables resulting in diverse outputs. The analysis of this data could in turn be used to gain knowledge about certain quantum-mechanical behaviors.

This project was expanded over the years to allow more, different experiments to be run, resulting in good video games and good (quite big) data. To reiterate: this project used the common sense of thousands of people to solve problems with Big Data by tricking them into doing logic via playing video games.

LAION

LAION is a non-profit organization (a German e.V.) working on datasets and models in the field of machine learning. Essentially, they try to teach a computer to translate text or generate art and images. The people working with LAION use example data previously generated by humans and publicly accessible through the internet to make a computer develop thinking procedures which can replicate the input data. This is similar to a child learning to speak by replicating what the parents say. All of this is a massive undertaking and a lesson in Big Data project execution.

The biggest achievement of LAION is their database of text and image pairs. Analyzing previously, automatically collected data from all over the internet ("web scraping") using volunteers' computers (similar to Science@Home) netted a huge dataset linking words to images. This revolutionized the AI world, as proper training for computer-generated versions of things humans make was now possible. A result of this was the Stable Diffusion model/method. Properly utilized, it creates images indistinguishable from human-made imagery, with every content imaginable. Luckily, LAION added filters for disturbing imagery to their dataset, so there are not as many realistic disturbing AI-generated images out there yet. These filters can be bypassed and are not perfect, though, so disturbing AI-generated imagery exists in no small amount.
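As a rough illustration of how such a scraped dataset gets filtered, here is a toy sketch. The records, similarity scores and the nsfw flag are placeholders standing in for what a model- or heuristic-based filter would produce; this is not LAION's actual pipeline.

```python
# Toy sketch: filtering scraped text-image pairs by caption-image similarity
# and a content flag. All values below are invented placeholders.
pairs = [
    {"url": "https://example.org/cat.jpg",  "caption": "a cat on a sofa",  "similarity": 0.34, "nsfw": False},
    {"url": "https://example.org/logo.png", "caption": "best deals 2022",  "similarity": 0.08, "nsfw": False},
    {"url": "https://example.org/x.jpg",    "caption": "unrelated text",   "similarity": 0.41, "nsfw": True},
]

SIMILARITY_THRESHOLD = 0.3  # drop pairs where caption and image barely match

kept = [p for p in pairs
        if p["similarity"] >= SIMILARITY_THRESHOLD and not p["nsfw"]]

print(f"kept {len(kept)} of {len(pairs)} pairs")
```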

International Intelligence & OSINT

Governments do know the concept of morals. They usually follow them and try to enforce them, even on a global scale. However, it is sometimes in the interest of a country to play dirty, and governments sometimes decide to ignore rules and morals in their own favor. One example of this is wartime. Whenever a country is at war, you can be certain that it will use every tool in its arsenal. Currently (2022/2023) there is an interesting war going on in Ukraine. The nice thing about our digital age is that we can nowadays spectate the war from the comfort of our own homes.

We have the ability to use Google-Earth-like services such as SkyWatch or Sentinel Hub to watch individual soldiers and vehicles move out on a mission, and we can assess the success of that mission without even needing to speak the same language. A lot of people are interested in the outcome of the war, so they tend to keep an eye out for interesting data that may pop up. People tend to organize in groups like Project Owl that collect and assess information from the theatre. The results of this are, e.g., maps like liveuamap or uacontrolmap with easily digestible information.

This practice goes back to WWII, when the US Office of Strategic Services employed people to analyze foreign data sources, in this case German or Japanese newspapers, history books or even sales figures. This allowed the US to better assess its targets and the strength of both the opposing forces and their respective populations.

The value of data like this cannot be overstated. It is literally war-winning and can prevent unnecessary missions and thus casualties. Nowadays, the sheer number of available data sources means that OSINT dips into Big Data.

National Intelligence & Surveillance

Big players dealing with a lot of both money and people will always have an interest in knowing their target demographic. This applies to government entities as well as commercial players.

If you are interested in the government side of things, I suggest you look into Edward Snowden and his work regarding PRISM and XKeyscore. The associated concepts are base knowledge, and it is somewhat of a civic duty to inform yourself (just like voting in elections). Concepts like these are hotly debated, often pushed by politicians and often opposed by activists. Laws like the EU's GDPR are a step in the right direction.

Another (often state-mandated) big player is health insurance. Companies have an interest in maximizing their profits, by paying out as little as possible and generating as much income as possible, while keeping prices low enough to attract new customers and thus stay in business. Insurance companies do this by assessing how expensive a customer is for them and letting them pay the associated price. The price assessment is done through Big Data generated by the customers. Certain factors like obesity, smoking or inheritable illnesses in the corresponding family mean a customer is more likely to get ill themselves, and thus more expensive for the insurance company. Companies set up policies to both limit the intake of new high-risk customers and define payment rules. Some treatments might not be covered by health insurance simply for cost reasons.
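To make the pricing mechanism tangible, here is a toy sketch of risk-based premium calculation. The factors, weights and base premium are entirely made up; they illustrate the principle, not any real insurer's policy.

```python
# Toy sketch of risk-based pricing with invented weights.
def monthly_premium(age, smoker, bmi, family_history):
    base = 80.0                        # hypothetical base premium in EUR
    score = 0.0
    score += max(0, age - 30) * 0.02   # older customers cost more on average
    score += 0.5 if smoker else 0.0
    score += 0.3 if bmi >= 30 else 0.0
    score += 0.4 if family_history else 0.0
    return base * (1.0 + score)

print(f"{monthly_premium(45, smoker=True, bmi=27, family_history=False):.2f} EUR")
```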

The policies of any insurance, really, are decided upon based on customer data. That data comes from multiple streams and is sometimes shared between agencies (e.g. the Schufa with German banks). The data we generate means companies can assess what we do and how we do it. It also means they can calculate how likely we are to make certain decisions or experience certain events. With our digital lives we leave traces everywhere and are essentially transparent to players with sufficient data.

Final thoughts

Big Data is an interesting field of mathematics. The impact generated from a giant mass of small data points is overwhelming. There are some really interesting applications for Big Data, necessitating both research and development in the field. Big Data is the next logical evolution of statistics, now also searching for patterns where previously none could be found. Big Data will also shape our future decisively (and does so already). Alongside all the new good, there is also the bad and the ugly: Big Data can be used for malicious purposes and endangers both democracy and economic integrity. Big Data is math that requires morals, and that makes it stand out from other fields.