The entire history of deep learning can be seen as a series of advancements that have gradually raised the ceiling imposed by these constraints, enabling the creation of increasingly intelligent systems.
This post is essential for understanding the evolution of deep learning that took us from simple feed-forward neural networks to Agentic AI (the buzzword of 2025). I have tried to rephrase everything from that gem of an article and add my own perspective as well. I learned a great deal from it, probably because it so clearly emphasizes a single trend running through all the key advancements.
Artificial Intelligence, particularly deep learning, faces a set of fundamental constraints that shape its capabilities and progress. These constraints can be grouped into seven key categories: data, parameters (model size), optimization and regularization, architecture, compute power, compute efficiency, and energy consumption. Understanding these limitations is essential for making sense of how we reached the current state of the field, as well as for anticipating where future developments may lead. Without this perspective, it is impossible to fully appreciate the trajectory of AI research and its potential directions.
A comprehensive grasp of these constraints enables us to reflect on several core questions: How is progress in deep learning achieved? Where do transformative ideas originate? How do we even define intelligence? What does the future of deep learning look like? And, ultimately, will incremental improvements in these systems be sufficient to replace humans in certain domains? These questions are not merely philosophical; they carry significant implications for both technology and society.
From an economic standpoint, intelligence can be viewed as the ability to effectively model reality. In this sense, the value of intelligent systems lies in their capacity to construct complex statistical representations of the world that support the execution of economically meaningful tasks. Today, nearly all practical AI systems are designed with this objective in mind: building models of reality that can deliver utility across a wide range of applications.
Within this framework, the central goal of deep learning becomes clear: to develop accurate models of reality for useful tasks. This involves three interconnected steps: first, recognizing that the true structure of reality can be represented as highly complex probability distributions; second, designing neural networks with sufficient representational power to capture these distributions; and third, training these networks to approximate the underlying patterns and laws that govern the real world.
This process reduces to two fundamental pillars. The first is data collection, which involves gathering relevant and high-quality information about reality. The second is data modeling, which focuses on designing and training neural networks that can learn effectively from this information. The overall capability of any AI system depends entirely on how well these two steps are executed in practice.
With this perspective in place, we can now examine each of the seven constraints in greater depth, understanding how they influence the progress, limitations, and potential of modern artificial intelligence.
Data
Constraint #1: A model can only be as good as the dataset it was trained on.
We have established that the fundamental objective of deep learning is to model the probability distributions that define reality.
For any specific task, we refer to the distribution we aim to model as the true distribution. To closely approximate this distribution, we collect a lot of samples, forming a dataset. While this dataset contains information about the true distribution, it does not capture it in its entirety. As a result, the dataset serves as an approximation, which we designate as the empirical distribution.
At best, a neural network can learn to model this empirical distribution. However, since our ultimate objective is to approximate the true distribution, it is crucial that the empirical distribution serves as an effective proxy. The fidelity of this approximation sets an upper bound on the potential accuracy of any model trained on the dataset.
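To make this concrete, here is a minimal toy sketch (in Python/NumPy, not from the original post) of how an empirical estimate only approximates the true distribution, and how more samples carry more information about it:

```python
# Toy illustration: the empirical mean of a finite sample only approximates the
# true mean, and the gap tends to shrink as the dataset grows. The dataset sets
# the ceiling on how well any model can recover the underlying truth.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 3.0                                   # parameter of the "true distribution"

for n in (10, 1_000, 100_000):
    samples = rng.normal(loc=true_mean, scale=1.0, size=n)   # the collected dataset
    print(n, samples.mean())                      # empirical estimate drifts toward 3.0
```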
The degree to which a model can approximate the true distribution is constrained by the information embedded within the dataset. Improving the dataset's representational fidelity requires increasing the total amount of relevant information it contains. This can be achieved by:
- Improving data quality - Ensuring that each sample provides more informative insights about the true distribution.
- Increasing data quantity - Expanding the dataset with additional samples that introduce new, valuable information.
It is extremely important to note that simply increasing the volume of data does not inherently improve model performance. Instead, the key objective is to enrich the dataset with meaningful information that enables the model to construct a more accurate approximation of reality. There are three major breakthroughs that have significantly addressed this data constraint.
Large Labeled Datasets
Early machine learning models were constrained by limited datasets, typically curated by individual research teams. Despite the theoretical advantages of deep learning, these models struggled to outperform traditional machine learning techniques due to insufficient training data.
The introduction of large, labeled datasets such as MNIST and ImageNet marked a pivotal moment in deep learning. These datasets provided the necessary scale and quality to demonstrate the superiority of deep neural networks over conventional approaches. Models like LeNet and AlexNet leveraged these datasets to establish deep learning as a competitive paradigm.
Although these datasets are now obsolete, their impact was profound. For instance, AlexNet, which absolutely changed the field of deep learning, would not have been possible without ImageNet. This advancement represented the first significant step in overcoming data limitations by expanding dataset scale.
However, reliance on manually labeled datasets proved inherently unscalable. To push the boundaries further, a new approach to data collection was required.
Internet-Scale Data
The internet represents the largest repository of human-generated data, yet its application in deep learning was initially unclear. Unlike curated labeled datasets, internet data is not specifically structured for model training, making it challenging to extract high-quality information for a given task.
The introduction of BERT fundamentally transformed this paradigm. By pioneering the transfer learning approach now used by all modern large language models (LLMs), BERT demonstrated how internet-scale datasets could be leveraged effectively. The model was pre-trained on vast internet data (high volume, variable quality) and subsequently fine-tuned on smaller, high-quality datasets.
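As a rough illustration of that recipe, here is a minimal sketch of fine-tuning a pre-trained BERT checkpoint on a tiny labeled set, assuming the Hugging Face transformers library and PyTorch; the texts and labels are hypothetical placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained on internet-scale data; we only adapt it to a small downstream task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g., binary sentiment classification
)

# A tiny, high-quality labeled dataset (placeholder examples).
texts = ["a wonderful, moving film", "a dull and lifeless plot"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few fine-tuning passes over the small dataset
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```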
This innovation not only enabled more powerful models but also shifted industry perspectives on AI’s potential. Notably, a Google executive suggested that advancements in AI could eventually render traditional search engines obsolete. The LoRA paper further explored why transfer learning, as introduced by BERT, has been so effective.
Training Assistants
While BERT and GPT models showcased impressive technical capabilities, their widespread adoption remained limited until the emergence of ChatGPT.
This transition was facilitated by InstructGPT, which leveraged Reinforcement Learning from Human Feedback (RLHF) to refine the base GPT-3 model. The model was first fine-tuned on human-written demonstrations of helpful responses, then further optimized against a reward model trained on human preference rankings. The result was a structured and practical communication style, making it highly effective as an AI assistant.
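One concrete piece of that pipeline is the reward model, trained so that responses humans preferred score higher than rejected ones. A minimal sketch of the standard pairwise preference loss (my simplification in PyTorch; the scores stand in for the outputs of a hypothetical reward network):

```python
import torch
import torch.nn.functional as F

def reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the preferred response's score above the rejected response's score.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores for a batch of 4 human comparisons.
chosen = torch.randn(4)
rejected = torch.randn(4)
print(reward_loss(chosen, rejected))
```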
The success of InstructGPT underscored the importance of high-quality data in fine-tuning models. Despite the existence of numerous fine-tuned models, InstructGPT outperformed its predecessors due to the superior quality of its training data.
The Future of Data Sources
As internet-generated content continues to expand exponentially, it remains a valuable source for training deep learning models. However, the quality of internet-scale datasets poses a critical challenge. Whereas deep learning aims to model reality, internet data serves as an imperfect and highly compressed representation of the real world.
To address this limitation, alternative data sources - such as data collected by humanoid robots - are becoming increasingly important. Direct interaction with the physical environment could provide richer and more precise data for AI models. This potential is reflected in the recent collaboration between OpenAI, Microsoft, and Figure, a company specializing in humanoid robots.
That said, current scaling laws indicate that existing models have not yet reached the full capacity of learning from internet-scale data. As a result, data is not yet the primary constraint on AI development.
Now, we can begin examining the factors that govern how effectively a neural network can model data.
As previously mentioned, the primary determinant of a model's ability is how well it can approximate the empirical distribution, and one crucial factor is the number of parameters within the neural network.
Parameters
Constraint #2: Model capacity is limited by the number of parameters.
How accurately a model can capture the empirical distribution of a dataset ultimately comes down to the representational capacity of the neural network. The model must have enough parameters to provide the degrees of freedom needed to accurately capture the underlying patterns within the data.
Determining the exact number of parameters needed for a given dataset is very difficult. However, when a model shows signs of underfitting the data, the most straightforward solution is to scale up the number of parameters - that is, to increase the depth of the network and add more parameters per layer. Given the sheer complexity of modern internet-scale datasets, this method has consistently proven effective at improving model performance, provided that enough compute power is available.
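As a minimal sketch (an assumed PyTorch example, not from the original post) of what "scaling up" means in practice, depth and width are just constructor arguments, and the parameter count grows accordingly:

```python
import torch.nn as nn

def make_mlp(in_dim: int, out_dim: int, hidden_dim: int, depth: int) -> nn.Sequential:
    layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
    for _ in range(depth - 1):                       # more depth -> more layers
        layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]  # more width -> more parameters per layer
    layers.append(nn.Linear(hidden_dim, out_dim))
    return nn.Sequential(*layers)

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

small = make_mlp(784, 10, hidden_dim=128, depth=2)
large = make_mlp(784, 10, hidden_dim=1024, depth=8)
print(count_params(small), count_params(large))      # the larger model has far more trainable parameters
```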
The representational capacity of a neural network is fundamentally constrained by the number of parameters it contains. However, scaling up a model’s parameters is not an independent process—it is inherently tied to other constraints, such as data quality, computational resources, and optimization techniques.
To understand the significance of this constraint, we can examine historical breakthroughs where increasing model capacity really advanced deep learning.
Increasing Depth
Early neural networks were limited to single-layer architectures, significantly restricting their ability to model complex functions. The introduction of hidden layers, as discussed in the original backpropagation paper, dramatically increased the representational power of neural networks. This enabled models to solve problems that were previously infeasible, such as shift registers and XOR logic gates—relatively simple by modern standards but groundbreaking at the time.
A landmark example of depth-driven improvement was AlexNet, which demonstrated the direct impact of increasing parameters on performance. With five convolutional layers, AlexNet significantly outperformed all previous models in the ImageNet competition, showcasing the power of deeper architectures. It is often joked, of course, that after this monumental event we became completely oblivious to how these neural networks actually learn from data on their own.
However, during this period, model size was only one of many factors constraining neural network performance, rather than the dominant limitation.
Scaling Laws
The GPT series provided a fundamental shift in understanding how scaling affects performance. These models demonstrated that, for internet-scale datasets, simply increasing the number of parameters led to consistent and substantial improvements.
Scaling laws have since confirmed that model performance follows predictable trends as a function of parameter count. This empirical observation has incentivized the creation of larger and more powerful models.
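For intuition, these laws take a power-law form. Here is a rough sketch of the parameter-count law reported by Kaplan et al. (2020); treat the exact constants as approximations rather than definitive values:

```python
# Rough illustration of a parameter-count scaling law.
# Constants are approximate values from Kaplan et al. (2020) and are assumptions here.
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    # Test loss falls as a power law while models remain under-parameterized
    # relative to the dataset.
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```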
Importantly, the success of scaling is not due to a general principle that "more parameters always lead to greater intelligence." Rather, it is a reflection of the fact that current models still lack the representational capacity to fully capture the complexity of internet-scale datasets.
Optimization
Constraint #3: The efficacy of optimization & regularization approaches constrains the number of parameters a network can handle while still being able to converge and generalize.
While increasing the number of parameters in a model appears to be the most obvious way to enhance performance, this approach is not infinitely scalable. In fact, it can introduce two issues:
- Slower convergence: more complex matrix computations mean longer training to converge to an optimal solution, or even failure to reach one at all.
- Overfitting: the model starts to capture trivial noise rather than meaningful patterns.
This is where more intricate solutions become necessary.
Taming Gradients
A critical challenge in training deep networks is the vanishing and exploding gradients problem, which arises due to the multiplication of weights over many layers. This leads to gradients either shrinking to near-zero (preventing learning) or growing uncontrollably (causing instability).
For years, this issue severely limited the feasible depth of neural networks. It was not until the introduction of ResNet (Residual Networks), with its skip connections, that gradients could flow uninterrupted through deep architectures. This innovation effectively removed a major depth constraint, enabling models with hundreds (and later thousands) of layers.
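A minimal sketch of a residual block in PyTorch (a simplified version of ResNet's basic block, assuming the input and output channel counts match):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # y = F(x) + x : the identity shortcut gives gradients a direct path
        # around the transformation, which is what makes very deep stacks trainable.
        return self.relu(self.body(x) + x)
```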
Dropout
Overfitting remained a persistent problem, as large models would memorize noise rather than learn meaningful representations. A naive solution to this was ensemble learning, where multiple independent models were trained on the same task, and their outputs averaged. However, this approach was computationally prohibitive.
Dropout provided a computationally efficient alternative by randomly deactivating a subset of neurons during each training iteration. This effectively forced the model to simulate an ensemble of subnetworks, greatly improving generalization.
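A minimal usage sketch (an assumed PyTorch example): the same layer behaves differently in training and evaluation modes:

```python
import torch
import torch.nn as nn

# nn.Dropout randomly zeroes activations during training, approximating an
# ensemble of thinned subnetworks at a fraction of the cost.
layer = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.5))

x = torch.randn(4, 128)
layer.train()   # dropout active: roughly half the activations are zeroed, the rest rescaled by 1/(1-p)
print(layer(x))
layer.eval()    # dropout disabled at inference: all activations are used as-is
print(layer(x))
```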
Taming Activations
As networks grew deeper, another issue emerged: internal covariate shift. This occurs when the activation distributions of earlier layers change dynamically during training, disrupting learning in later layers and thus slowing convergence.
Batch Normalization (BatchNorm) and Layer Normalization (LayerNorm) addressed this by standardizing activations, ensuring they remained within stable ranges.
- BatchNorm was particularly effective in stabilizing deep convolutional networks.
- LayerNorm played a crucial role in enabling deeper recurrent architectures like RNNs and LSTMs, paving the way for Transformer models.
These techniques, along with residual connections, removed key depth constraints, making deep learning architectures significantly more stable.
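A minimal sketch of the difference between the two (an assumed PyTorch example with toy shapes): BatchNorm normalizes each feature across the batch, while LayerNorm normalizes each sample across its own features:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)          # batch of 32 samples, 64 features each

batch_norm = nn.BatchNorm1d(64)  # standardizes each feature across the batch dimension
layer_norm = nn.LayerNorm(64)    # standardizes each sample across its feature dimension

print(batch_norm(x).shape, layer_norm(x).shape)  # both preserve the (32, 64) shape
```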
Momentum-Based Optimization
The earliest optimization algorithms, such as Stochastic Gradient Descent (SGD), updated model parameters at fixed step sizes based on instantaneous gradients. This often resulted in inefficient convergence and slow training.
The introduction of momentum-based optimizers, particularly Adam, significantly improved optimization efficiency. Adam uses adaptive moment estimation to track gradient history, allowing step sizes to adapt dynamically. This often results in faster and more stable convergence.
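For reference, here is a minimal sketch of the Adam update for a single parameter tensor, written out explicitly with the usual default hyperparameters; it mirrors the published update rule rather than any particular library's internals:

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat.sqrt() + eps)  # per-parameter adaptive step size
    return param, m, v

# One step on a toy parameter.
w, g = torch.zeros(3), torch.ones(3)
m, v = torch.zeros(3), torch.zeros(3)
w, m, v = adam_step(w, g, m, v, t=1)
```

In practice, of course, one simply calls `torch.optim.Adam(model.parameters())` and lets the library handle this bookkeeping.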
Today, Adam (or its variants) remains the default optimization algorithm for training deep learning models.
Important: most state-of-the-art models today utilize all of the aforementioned optimization techniques. For instance, the Transformer architecture uses Dropout, Layer Normalization, and residual connections throughout its encoder and decoder layers, and was trained using the Adam optimizer.
As a result, many of the previous constraints on depth and generalization have been effectively mitigated. Moreover, with internet-scale datasets, models remain under-parameterized, meaning overfitting is not currently a major concern.
Optimization and regularization remain foundational constraints for model scalability. As architectures continue to evolve, future advancements may be necessary to overcome new bottlenecks in training ever-larger models.
Update: With the recent introduction of DeepSeek R1, we have, for the first time, witnessed a model with capabilities comparable to OpenAI's o1 at roughly 1/20 of the training cost. This is crucial because it means we no longer have to rely solely on raw compute power to increase a model's performance.
Architecture
Constraint #4: The quality of the network architecture constrains the representational capacity of a model.
Neural network architecture plays a crucial role in defining how effectively a model can utilize its parameters. By embedding inductive biases - predefined structures that align with the problem being solved - models can store and process information more efficiently. For example, rather than allowing a deep neural network to learn spatial relationships from scratch, Convolutional Neural Networks (CNNs) explicitly model these relationships, making image processing far more effective.
Feature Maps
The first major architectural advancement in deep learning was the introduction of CNNs. Inspired by the human visual system, CNNs introduced a powerful inductive bias for handling image data. Specifically, they process images using feature maps that detect spatial patterns, allowing them to recognize objects and high-level features regardless of their position in an image (translation invariance).
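A minimal sketch (an assumed PyTorch example) of what a feature map is in code: a convolutional layer slides the same small filters across the whole image, producing one response map per filter:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 224, 224)       # one RGB image
feature_maps = conv(image)
print(feature_maps.shape)                 # torch.Size([1, 16, 224, 224]): 16 feature maps
```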
CNNs:
- Enabled the creation of successful models like LeNet and AlexNet.
- Outperformed traditional handcrafted feature engineering.
- Remain relevant, even in image generation tasks.
Time-series Data
While CNNs enabled breakthroughs in image processing, deep learning models struggled with sequential data (e.g., text, speech, time series). The introduction of Recurrent Neural Networks (RNNs) allowed models to store and process information over time, introducing memory into neural networks.
However, vanilla RNNs suffered from short-term memory limitations: information decayed over long sequences due to vanishing gradients. It was not until the emergence of the Long Short-Term Memory (LSTM) architecture that learning long-range dependencies became possible. LSTMs were a major breakthrough in language modeling, but they still had a limitation: they processed data sequentially, making them slow to train.
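A minimal usage sketch (an assumed PyTorch example): the LSTM carries a cell state across time steps, which is what lets it retain information over longer ranges, but the sequence must still be processed step by step:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=1, batch_first=True)
sequence = torch.randn(8, 100, 32)        # batch of 8 sequences, 100 steps, 32 features
outputs, (hidden, cell) = lstm(sequence)  # processed step by step, hence slow to train
print(outputs.shape)                      # torch.Size([8, 100, 64])
```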
Attention Mechanism
A significant leap in architecture came with the introduction of the attention mechanism. Initially used as an enhancement for LSTMs, Attention allowed models to focus on the most relevant parts of an input sequence. However, the Transformer architecture (introduced in the 2017 paper Attention Is All You Need) removed recurrence entirely, relying solely on self-attention to model sequences. This was revolutionary at the time because:
- High parallelization → Unlike RNNs and LSTMs, Transformers process entire sequences simultaneously, making them significantly faster.
- Better long-range dependency modeling → Attention mechanisms allow models to directly connect distant words in a sentence, eliminating the limitations of LSTMs.
This single innovation led to GPT, BERT, and modern large language models (LLMs), permanently shaping the field of deep learning.
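At its core, the mechanism is a few matrix multiplications. Here is a minimal sketch of scaled dot-product self-attention (my simplification; real Transformers add multiple heads, learned projections, and masking):

```python
import math
import torch

def self_attention(q, k, v):
    # Every position attends to every other position in a single matrix multiply,
    # which is what makes the computation parallelizable across the whole sequence.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(1, 10, 64)              # one sequence of 10 tokens, 64-dim embeddings
print(self_attention(x, x, x).shape)    # torch.Size([1, 10, 64])
```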
Randomness
While CNNs enabled models to analyze images, image generation remained a major challenge. This is because to make images 'out of thin air', models would need to learn to create both high-level features and complex details.
- Variational Autoencoders (VAEs), on the one hand, introduced latent-space representations: a model learns to encode an image into a compressed latent distribution, samples from it with added noise, and then reconstructs the image.
- Diffusion Models, by contrast, start with pure noise and gradually denoise it into an image (a minimal sketch of the forward noising step follows below).
Diffusion models (e.g., Stable Diffusion, DALL·E) now outperform GANs and are the state-of-the-art in generative image modeling.
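Here is a minimal sketch of the forward (noising) half of that idea, following the standard DDPM formulation with a linear variance schedule; the learned reverse denoising model and its training loop are omitted:

```python
import torch

def forward_noise(x0, t, num_steps=1000):
    # Linear variance schedule; alpha_bar is the cumulative signal-retention factor.
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x0)
    # Blend the clean image with Gaussian noise according to the schedule at step t.
    return alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise, noise

image = torch.randn(1, 3, 64, 64)          # stand-in for a real training image
noisy, eps = forward_noise(image, t=500)   # halfway through the noising schedule
```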
Embeddings
Another major advance in neural network architecture was the introduction of embedding spaces. Embeddings are numerical representations of real-world data, such as text, speech, images, and videos. Embeddings can provide compact representations of different data types while simultaneously allowing comparison of two different data objects to determine their similarity or difference.
- Word2Vec demonstrated that word meanings could be captured in high-dimensional vectors, enabling semantic algebra (e.g., King - Man + Woman = Queen).
- The CLIP model, based on Transformers, extended this concept to multimodal learning, mapping text and images to a shared embedding space.
Perhaps the most captivating property of embeddings is their versatility: virtually any data type can be represented as a high-dimensional vector, paving the way for efficient storage, search, and retrieval (a small sketch follows below).
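A minimal sketch (an assumed PyTorch example with a toy vocabulary and untrained, random weights) of how embeddings are represented and compared:

```python
import torch
import torch.nn.functional as F

vocab = {"king": 0, "queen": 1, "apple": 2}
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Each token maps to a dense vector; similarity is measured geometrically.
king = embedding(torch.tensor(vocab["king"]))
queen = embedding(torch.tensor(vocab["queen"]))
print(F.cosine_similarity(king, queen, dim=0))  # similarity score in [-1, 1]
```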
Since the introduction of Transformers, architectural innovation has slowed down significantly. Instead of designing new architectures, research has largely focused on:
- Scaling existing Transformer models (e.g., GPT-4o, Claude, Gemini)
- Combining different architectures (e.g., Diffusion models with CNNs, CLIP with Transformers)
Compute Power
Constraint #5: The total available compute constrains the maximum number of trainable parameters a model can have.
Even with efficient architectures, effective optimization, and strong regularization, there is still a fundamental constraint on deep learning models: compute availability.
During training, every parameter requires gradient computation and updates at each time step. As parameter count grows, the computational cost scales up, making backpropagation the limiting factor.
Parallelized Training
Reiterating on the revolution that AlexNet ushered in: it was the first deep learning model to leverage GPUs for neural network training, accelerating computation significantly. More importantly, it was also the first model trained across multiple GPUs simultaneously, using inter-GPU memory sharing - an innovation originally designed for gaming. This eventually led to modern distributed training, allowing models to be trained on thousands of GPUs at once.
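A minimal sketch of the simplest modern descendant of that idea, data parallelism in PyTorch (an assumed example; large-scale training uses DistributedDataParallel and more elaborate sharding):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)    # each batch is split across GPUs, gradients are averaged
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```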
Riding Tailwinds
Until the past decade, the advancement of GPUs was driven not by deep learning but by the tailwinds of the gaming market. In this way, deep learning benefited from a bit of luck: the compute tailwinds created by the gaming industry enabled deep learning to take off in a way that likely would not have happened otherwise. It was not until 2020 that NVIDIA released the A100, a GPU built specifically for AI applications. This marked a turning point: compute was no longer just a byproduct of gaming - it became the primary target.
Since then, AI-specialized accelerators have emerged (e.g., NVIDIA's H100, Google TPUs, Apple's Neural Engine), and the upcoming B100 GPUs are expected to push AI compute efficiency even further.
The Compute Arms Race
The power-law scaling trends seen in BERT, RoBERTa, GPT-3, and others showed that more compute means more intelligence. As a result, major tech players rushed to acquire compute, leading to GPU shortages and an explosion in demand. It even led Sam Altman to declare compute "the currency of the future." This surge in demand has sent NVIDIA's stock soaring, making it one of the most valuable companies in the world as of January 2025.
Compute Efficiency
Constraint #6: The software implementations for training constrain the efficiency of compute utilization.
As compute power increases, effectively utilizing this power is not guaranteed—it requires software-level innovations to optimize resource usage. While adding more GPUs can improve performance, software inefficiencies waste computational power, making model training slower, costlier, and less scalable.
CUDA - Making GPUs Accessible for Deep Learning
Initially, programming GPUs was complex because it required a low-level parallel programming paradigm. NVIDIA's CUDA (Compute Unified Device Architecture) introduced a C-like programming model, making GPU programming far more approachable for researchers.
AlexNet (2012) used CUDA to implement custom GPU kernels for fast convolution operations, unlocking CNN parallelization.
Kernel Libraries
People rarely have to write low-level kernels anymore: libraries like PyTorch and JAX already ship highly optimized implementations of the most common operations, making it easy for modern deep learning engineers to use GPUs without dipping into low-level code.
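A minimal sketch of what that looks like in practice (an assumed PyTorch example): a single high-level call dispatches to a tuned vendor kernel when a GPU is available:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b      # executed by a pre-written, tuned kernel (e.g., cuBLAS) on the GPU
print(c.device)
```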
That said, there are still opportunities for improving the compute efficiency of model implementations.
Energy
Constraint #7: The energy available to draw from the grid in a single location constrains the amount of compute that can be used for a training run.
Even if compute supply chains meet demand and resources are unlimited, compute is still constrained by energy availability. Large-scale AI training requires physically clustered compute in data centers, which must support the energy needs of thousands of GPUs operating simultaneously. As AI models scale, the energy demands of training runs grow exponentially, potentially hitting hard physical and regulatory limits.
Concluding Notes
Having considered each constraint individually, let's categorize them into hard constraints and forms of leverage.
Maximizing leverage is important for individual training runs, but it is progress on the hard constraints that has really pushed forward the base intelligence of today's models.
The hard constraints are data, compute, and energy - these are rate-limited by slow processes: data is currently limited by the growth of the internet and other data collection methods, compute by individual companies' resources and supply chains, and energy will eventually be rate-limited by regulation.
Meanwhile, parameters, optimization & regularization, architecture, and compute efficiency can be thought of as forms of leverage on the hard constraints - they are all relatively easy to vary and can be optimized to maximize a model's intelligence given a fixed budget of data, compute, and energy.
This is again indicative of the scaling laws - our models have not shown signs of coming close to fully modeling the information in current internet-scale datasets, so we continue to scale up models by increasing compute and parameters.
An important question is: what can this progression of progress in deep learning tell us about our own intelligence? As discussed at the beginning, one way to view intelligence is as the ability to model the complex probability distributions that describe reality, and then run active inference on neural networks to accomplish things (usually of economic value) in the world. It seems that the combination of data, compute, energy, scaling, and, of course, a sufficiently powerful learning algorithm constitutes systems that appear intelligent. If intelligence really is just a function of data, compute, energy, and training, then it might seem inevitable that artificial intelligence will soon surpass us.