Mahdi Karami's homepage


View My GitHub Profile

I am a Machine Learning/AI Scientist with a PhD in Statistical Machine Learning and over 14 years of experience in research and industry. I have a strong background in various AI/ML domains, including Deep Generative Models, NLP, Computer Vision, Efficient AI, and Graph Neural Networks. My work has been published at top AI conferences in collaboration with leading researchers from Google Brain and DeepMind. I also have valuable industrial experience in deep learning, NLP, and Foundation Models, with proficiency in Python, particularly PyTorch and TensorFlow. I bring a combination of problem-solving, creativity, out-of-the-box thinking, a passion for learning, project management, and strong communication skills.

Work and Professional Experience

Background and Research Interests

Publications and Research

Large Language Models & Efficient Systems for ML

Abstract In the rapidly evolving landscape of deep learning, the quest for models that balance expressivity with computational efficiency has never been more critical. This paper introduces Orchid, a novel architecture that reimagines sequence modeling by incorporating a new data-dependent convolution mechanism. Orchid is designed to address the inherent limitations of traditional attention mechanisms, particularly their quadratic complexity, without compromising the ability to capture long-range dependencies and in-context learning. At the core of Orchid lies the data-dependent convolution layer, which dynamically adjusts its kernel conditioned on input data using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in the adaptive convolution operation. The dynamic nature of the data-dependent convolution kernel, coupled with gating operations, grants Orchid high expressivity while maintaining efficiency and quasilinear scalability for long sequences. We rigorously evaluate Orchid across multiple domains, including language modeling and image classification, to showcase its performance and generality. Our experiments demonstrate that the Orchid architecture not only outperforms traditional attention-based architectures, such as BERT and Vision Transformers, with smaller model sizes, but also extends the feasible sequence length beyond the limitations of dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling.
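As a rough illustration of the mechanism (my own sketch under stated assumptions, not the paper's code), the snippet below pairs a hypothetical conditioning network with an FFT-based circular convolution and a multiplicative gate. Mean pooling is used as the sequence summary because it is invariant to circular shifts, so the generated kernel, and hence the adaptive convolution, stays shift-equivariant:

```python
import torch
import torch.nn as nn

class DataDependentConv(nn.Module):
    """Sketch of a data-dependent long convolution (illustrative only)."""
    def __init__(self, dim: int, kernel_len: int = 128):
        super().__init__()
        self.dim, self.kernel_len = dim, kernel_len
        # Hypothetical conditioning network: maps a pooled summary of the
        # input sequence to a per-channel convolution kernel.
        self.cond = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, dim * kernel_len),
        )
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, d = x.shape                        # (batch, length, channels)
        # Mean pooling is invariant to circular shifts, so the kernel does
        # not move when the input is shifted: shift equivariance is kept.
        k = self.cond(x.mean(dim=1)).view(b, self.kernel_len, d)
        # Circular convolution in the Fourier domain: O(L log L), not O(L^2).
        x_f = torch.fft.rfft(x, n=l, dim=1)
        k_f = torch.fft.rfft(k, n=l, dim=1)
        y = torch.fft.irfft(x_f * k_f, n=l, dim=1)
        return y * torch.sigmoid(self.gate(x))   # multiplicative gating

x = torch.randn(2, 1024, 64)
print(DataDependentConv(64)(x).shape)            # torch.Size([2, 1024, 64])
```

Because the convolution is evaluated in the Fourier domain, the cost grows quasilinearly in the sequence length, which is the property the abstract contrasts with the quadratic cost of dense attention.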

Graph Generative Networks

Abstract Most real-world graphs exhibit a hierarchical structure, which is often overlooked by existing graph generation methods. To address this limitation, we propose a novel graph generative network that captures the hierarchical nature of graphs and successively generates the graph sub-structures in a coarse-to-fine fashion. At each level of hierarchy, this model generates communities in parallel, followed by the prediction of cross-edges between communities using separate neural networks. This modular approach enables scalable graph generation for large and complex graphs. Moreover, we model the output distribution of edges in the hierarchical graph with a multinomial distribution and derive a recursive factorization for this distribution. This enables us to generate community graphs with integer-valued edge weights in an autoregressive manner. Empirical studies demonstrate the effectiveness and scalability of our proposed generative model, achieving state-of-the-art performance in terms of graph quality across various benchmark datasets.
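The recursive factorization is easy to see in isolation: a multinomial over edge weights factorizes exactly into a chain of binomials, one edge at a time, which is what permits autoregressive sampling of integer-valued weights. Below is a small self-contained sketch of that identity (my illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_edge_weights(total: int, probs: np.ndarray) -> np.ndarray:
    """Sample Multinomial(total, probs) edge by edge, using the recursive
    factorization of a multinomial into conditional binomials."""
    weights = np.zeros(len(probs), dtype=int)
    remaining, mass = total, 1.0
    for i, p in enumerate(probs[:-1]):
        # Given the counts drawn so far, edge i's weight is binomial with
        # success probability renormalized over the remaining mass.
        w = rng.binomial(remaining, p / mass)
        weights[i] = w
        remaining -= w
        mass -= p
    weights[-1] = remaining                      # whatever count is left
    return weights

probs = np.array([0.5, 0.3, 0.2])
print(sample_edge_weights(10, probs))            # three weights summing to 10
```

Sampling one edge at a time this way is still an exact draw from the joint multinomial, which is what keeps autoregressive generation of integer edge weights consistent with the modeled output distribution.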

Generative Models

Abstract Normalizing flows construct a complex probability density by transforming a simple base density, such as a standard normal distribution, via a chain of smooth, invertible mappings (bijections). Flow-based generative networks can be used to construct high-quality generative probabilistic models, but training and sample generation require repeated evaluation of Jacobian determinants and function inverses. In this work, we investigated a set of novel normalizing flows based on circular and symmetric convolutions. We showed that these transforms admit efficient Jacobian determinant computation and inverse mapping (deconvolution) in 𝒪(𝑁 log 𝑁) time. Based on these invertible convolution filters, we proposed a nonlinear data-adaptive convolution transformation whose expressiveness is increased by allowing the layer's kernel to adapt to the layer's input. Another outcome of this work was an analytic approach to designing nonlinear gates and understanding their role through the lens of their contribution to the latent variables' distributions. We showed that specific regularizers, such as sparsity, can be induced on intermediate activations by designing customized pointwise nonlinear gates.
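To make the 𝒪(𝑁 log 𝑁) claim concrete, here is a minimal numpy sketch (my own illustration, not the original code): the Jacobian of a circular convolution is a circulant matrix, so its log-determinant is the sum of log-magnitudes of the kernel's DFT coefficients, and the inverse map is a pointwise division in the Fourier domain:

```python
import numpy as np

def circular_conv_flow(x, kernel):
    """Forward pass of a circular-convolution flow layer."""
    n = len(x)
    k_f = np.fft.fft(kernel, n)                    # DFT of the zero-padded kernel
    y = np.real(np.fft.ifft(np.fft.fft(x) * k_f))  # y = kernel circularly convolved with x
    log_det = np.sum(np.log(np.abs(k_f)))          # log|det Jacobian| of x -> y
    return y, log_det

def circular_deconv(y, kernel):
    """Inverse mapping (deconvolution), also O(N log N)."""
    n = len(y)
    k_f = np.fft.fft(kernel, n)
    return np.real(np.fft.ifft(np.fft.fft(y) / k_f))

x = np.random.randn(8)
k = np.array([1.0, 0.5, 0.25])        # kernel with no zero DFT coefficient
y, log_det = circular_conv_flow(x, k)
print(np.allclose(circular_deconv(y, k), x))       # True: exact inversion
```

Invertibility only requires that no DFT coefficient of the kernel vanishes, which a flow layer can enforce through its parameterization of the kernel.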

Federated Learning and Privacy Preserving ML

Deep Generative Multi-View (Multi-Modal) Learning

Abstract We proposed an interpretable deep generative framework for multi-view learning based on a probabilistic formulation of canonical correlation analysis (CCA). The model combines a linear multi-view layer in the latent space with deep generative networks as observation models. The proposed model decomposes the variability between views into a shared latent representation that describes the common underlying sources of variation and a set of view-specific components. We designed an efficient learning algorithm using a variational inference procedure incorporating the solution of probabilistic CCA. This also offered a flexible data fusion method in the latent space. Importantly, the proposed model can be generalized to an arbitrary number of views. An empirical analysis confirms that the proposed deep multi-view model can discover subtle relationships between multiple views and recover rich representations.
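Structurally, the generative side of such a model can be sketched in a few lines (a schematic under my own assumptions, not the released code): a shared latent variable feeds every view's decoder, while each view also receives its own private latent:

```python
import torch
import torch.nn as nn

class TwoViewGenerator(nn.Module):
    """Schematic two-view generative model: shared + view-specific latents."""
    def __init__(self, shared_dim: int = 8, private_dim: int = 4, out_dim: int = 32):
        super().__init__()
        self.shared_dim, self.private_dim = shared_dim, private_dim
        def decoder():                  # one deep observation model per view
            return nn.Sequential(nn.Linear(shared_dim + private_dim, 64),
                                 nn.ReLU(), nn.Linear(64, out_dim))
        self.dec1, self.dec2 = decoder(), decoder()

    def sample(self, n: int):
        z = torch.randn(n, self.shared_dim)    # common sources of variation
        e1 = torch.randn(n, self.private_dim)  # view-specific component, view 1
        e2 = torch.randn(n, self.private_dim)  # view-specific component, view 2
        x1 = self.dec1(torch.cat([z, e1], dim=-1))
        x2 = self.dec2(torch.cat([z, e2], dim=-1))
        return x1, x2

x1, x2 = TwoViewGenerator().sample(16)
print(x1.shape, x2.shape)   # torch.Size([16, 32]) torch.Size([16, 32])
```

Generalizing to more views amounts to adding one decoder and one private latent per view; the variational inference procedure tied to probabilistic CCA, which the abstract describes, is what fits such a model to data.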

Sequence Modelling

Abstract Maximum likelihood estimation of linear dynamical systems (LDS) is typically considered hard, since latent states and transition parameters must be inferred jointly. Given that expectation-maximization does not scale and is prone to local minima, moment-matching approaches from the subspace identification literature have become standard, despite known statistical efficiency issues. In this work, we instead reconsidered likelihood maximization of LDS with generalized-linear observation models. Key to the approach was a reformulation of the LDS model as a two-view convex optimization problem, which allowed us to approximate the estimation task as a form of matrix factorization and hence apply recent global optimization techniques. Furthermore, a novel proximal mapping update was analytically derived for this two-view reformulation, significantly simplifying the optimization procedure. The resulting algorithm was simple to use and flexible enough to incorporate different losses and regularizers, and empirical studies demonstrated that this estimation strategy outperforms widely used identification algorithms, such as subspace identification methods, in terms of both accuracy and runtime.
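The two-view idea can be caricatured in a few lines (a toy stand-in under my assumptions, not the paper's algorithm): consecutive observations x_t and x_{t+1} are both linear functions of the same latent state s_t, so stacking them as paired views and fitting a shared low-rank factor by alternating regularized least squares mimics the joint recovery of states and parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k, lam = 200, 8, 3, 0.1

# Simulate a stable LDS: s_{t+1} = A s_t + noise, observed as x_t = C s_t + noise.
A = 0.9 * np.linalg.qr(rng.standard_normal((k, k)))[0]   # spectral radius 0.9
C = rng.standard_normal((d, k))
s, X = np.zeros(k), np.empty((T, d))
for t in range(T):
    X[t] = C @ s + 0.1 * rng.standard_normal(d)
    s = A @ s + 0.1 * rng.standard_normal(k)

# Two views that share the latent state s_t: x_t ~ C s_t and x_{t+1} ~ C A s_t.
Y = np.hstack([X[:-1], X[1:]])                           # (T-1, 2d)

# Alternating ridge-regularized least squares on Y ~ S @ W, a crude stand-in
# for the convex two-view reformulation with its derived proximal update.
S = rng.standard_normal((T - 1, k))
for _ in range(50):
    W = np.linalg.solve(S.T @ S + lam * np.eye(k), S.T @ Y)
    S = np.linalg.solve(W @ W.T + lam * np.eye(k), W @ Y.T).T

print(np.linalg.norm(Y - S @ W) / np.linalg.norm(Y))     # small relative residual
```

In the actual formulation the objective is set up to be jointly convex and is solved with the analytically derived proximal mapping; plain alternating least squares here only illustrates why the stacked two-view matrix admits a low-rank factorization.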

Wireless Communications

Education