Welcome to the second installment of our five-part series, where we explore the transformative world of Big Data and its pivotal role in advancing artificial intelligence (AI). In this piece, we trace the evolution from a data-starved AI landscape to one flourishing with an abundance of diverse, high-quality data, a shift that has been fundamental to the development of sophisticated AI models.
Layer 2: The Rise of Big Data
The Problem
In the early days of AI, diverse, well-organized data was scarce. What little data existed was scattered across different formats and systems, often trapped behind proprietary barriers. This situation was akin to having a powerful car with no fuel. Training algorithms need vast amounts of diverse, real-world data to generalize well and become useful, and the lack of such data was a severe hindrance to the progress of AI.
Big Data Revolution
The explosion of the internet, social media, IoT devices, and digital technologies generated a deluge of data. Suddenly, a treasure trove of information was available - textual data from websites, images and videos from social media, transactional data from businesses, sensor data from industrial machinery, and so on.
Before Big Data
The lack of extensive and varied datasets made training robust AI models an uphill task. For example, attempting to train a machine learning model to recognize human speech or sentiment across different languages and accents was nearly impossible, because sufficiently diverse speech data simply wasn't available.
After Big Data
With vast and varied datasets available, complex AI models could be trained with high accuracy. A prime example is OpenAI's GPT models, which have been trained on a mixture of licensed data, human-created data, and publicly available text. This kind of extensive data allowed the creation of a model capable of understanding and generating human-like text across various languages and contexts.
In the case of GPT-3, the raw Common Crawl portion of its training data alone amounted to roughly 45 terabytes of compressed plaintext, which was filtered down to about 570 gigabytes of higher-quality text before training. Without such an extensive and diverse dataset, a model of its sophistication and capability would not have been attainable.
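To give a flavour of what that kind of preprocessing involves, here is a deliberately simplified sketch, not OpenAI's actual pipeline: filtering a raw web corpus typically means dropping documents that are too short or low-quality and removing duplicates. The function name and thresholds below are illustrative assumptions.

```python
import hashlib

def clean_corpus(documents, min_words=50):
    """Toy corpus filter: drop very short documents and exact duplicates.
    Real pipelines add quality classifiers, language ID, and fuzzy dedup."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text.split()) < min_words:   # too short to be useful
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:           # exact duplicate of an earlier doc
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept
```

Even this crude filtering shows why the gap between "raw data collected" and "data actually used for training" can be enormous.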
Vector Databases and Beyond
Traditional SQL and NoSQL databases were not designed for the workloads AI models create, especially similarity search over high-dimensional data. New tools emerged to fill this gap: approximate nearest-neighbor libraries such as FAISS and Annoy, and, building on the same ideas, purpose-built vector databases. These made it possible to store and retrieve high-dimensional data points efficiently, which is essential for tasks like similarity search over image or text embeddings, and they made it far more practical to put AI models to work at scale, contributing to the boom in AI capabilities.
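To make the idea concrete, here is a minimal sketch of similarity search with FAISS: it indexes a batch of random stand-in embeddings and retrieves the nearest neighbours of a query vector. The dimensionality, counts, and data are illustrative assumptions, not values from any real system.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128      # embedding dimensionality (illustrative)
n = 10_000   # number of stored vectors (illustrative)

# Stand-in embeddings; in practice these would come from an ML model.
rng = np.random.default_rng(42)
vectors = rng.random((n, d), dtype=np.float32)

# Build a flat (exact) L2 index and add the vectors to it.
index = faiss.IndexFlatL2(d)
index.add(vectors)

# Find the 5 nearest stored vectors for a single query embedding.
query = rng.random((1, d), dtype=np.float32)
distances, ids = index.search(query, 5)
print(ids[0], distances[0])
```

A flat index compares the query against every stored vector; approximate index types trade a little accuracy for much faster search on large collections, which is the core trick vector databases build on.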
This revolution in data accessibility set the stage for AI models to learn, adapt, and excel, making today's cutting-edge AI applications not just a theoretical possibility but a practical reality.
Fun Fact
It is often estimated that around 90% of the world's data was created in the last two years alone. This exponential growth in data availability, much of which is harvested from social media, IoT devices, and other digital sources, has been a critical driver of AI's success.
Timeline
- 2001: Doug Cutting's open-source search library Lucene joins the Apache Software Foundation, laying the groundwork for Nutch and, later, Hadoop.
- 2006: Hadoop, a major big data framework, is spun out of the Nutch project at Yahoo!, which goes on to use it to build its web search index.
- 2011: IBM's Watson wins at "Jeopardy!" demonstrating the power of big data in AI.
- 2011-2013: Cloud data warehouses such as Google BigQuery and Amazon Redshift become generally available.
- 2014: Apache Spark 1.0 is released, enabling faster big data processing and analytics.
- Ongoing: Big data technologies, and the availability of data itself, continue to grow.