An Introduction to Vision-Language Modeling
Plus, more links to make you a little bit smarter today.
I Made A Fundraiser For My Upcoming Book
In Defense of the Jumpscare
One of the most controversial elements of horror in the past few decades has been the use of jumpscares: sudden, loud attacks on the audience. They have been deemed cheap, lazy, and overall just a nuisance. But as a diehard horror fan myself, I’ve realized the error lies not with the jumpscare itself, but with how people use it.
Quantifying Price Improvement in Order Flow Auctions
This work introduces a framework for evaluating on-chain order flow auctions (OFAs), emphasizing the metric of price improvement. Using a set of open-source tools, our methodology systematically attributes price improvements to specific modifiable inputs of the system, such as routing efficiency, gas optimization, and priority fee settings. When applied to leading Ethereum-based trading interfaces such as 1inch and Uniswap, the results reveal that auction-enhanced interfaces can provide statistically significant improvements in trading outcomes, averaging 4-5 basis points in our sample. We further identify the source of these price improvements as added liquidity for large swaps. This research lays a foundation for future innovations in blockchain-based trading platforms.
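The paper's full attribution pipeline isn't reproduced here, but its headline metric is easy to sketch: price improvement measured in basis points against a baseline quote. The function name and example numbers below are illustrative, not taken from the paper.

```python
def price_improvement_bps(executed_out: float, baseline_out: float) -> float:
    """Price improvement of an executed swap vs. a baseline quote,
    in basis points (1 bp = 0.01%). Positive means the auction
    delivered more output tokens than the baseline route would have.
    (Illustrative sketch, not the paper's implementation.)"""
    return (executed_out - baseline_out) / baseline_out * 10_000

# e.g. a swap returning 10_004.5 USDC vs. a baseline quote of 10_000 USDC
print(round(price_improvement_bps(10_004.5, 10_000.0), 1))  # 4.5 bps
```

A 4-5 bp average, as reported in the abstract, corresponds to roughly 4-5 extra output tokens per 10,000 traded.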
Music Genre Classification: Training an AI model
Music genre classification is an area that applies machine learning models and techniques to the processing of audio signals, with applications such as content and music recommendation systems. In this research I explore various machine learning algorithms for music genre classification, using features extracted from audio signals. The systems are a Multilayer Perceptron (built from scratch), a k-Nearest Neighbours classifier (also built from scratch), a Convolutional Neural Network, and a Random Forest model. To process the audio signals, feature extraction methods such as the Short-Time Fourier Transform and Mel-Frequency Cepstral Coefficients (MFCCs) are applied. Through this research, I aim to assess the robustness of machine learning models for genre classification and to compare their results.
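As a rough illustration of the feature-extraction step, here is a minimal Short-Time Fourier Transform in plain NumPy. This is my own sketch, not the research's code; a real pipeline would likely use a library implementation, and MFCC extraction would add a mel filterbank and a DCT on top of this magnitude spectrogram.

```python
import numpy as np

def stft_mag(signal, frame_len=512, hop=256):
    """Magnitude STFT: slide a Hann window over the signal and take the
    FFT of each frame, keeping only the non-negative frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape (n_frames, frame_len//2 + 1)

# A 440 Hz tone sampled at 16 kHz: energy should concentrate near
# frequency bin 440 / (16000 / 512) ≈ 14.
sr, f = 16000, 440.0
t = np.arange(sr) / sr
spec = stft_mag(np.sin(2 * np.pi * f * t))
print(spec.shape, int(spec.mean(axis=0).argmax()))  # (61, 257) 14
```

Feeding such time-frequency features (rather than raw waveforms) is what lets models like the CNN or k-NN compare audio clips meaningfully.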
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion) by injecting the encoded target text prompt into the gradually denoised diffusion image generator. Despite the success of DPM in practice, the mechanism behind it remains to be explored. To fill this gap, we begin by examining the intermediate states during the gradual denoising process in DPM. Our empirical observations indicate that the shape of the image is reconstructed within the first few denoising steps, after which the image is filled in with details (e.g., texture). This happens because the low-frequency (shape-relevant) signal of the noisy image is not corrupted until the final stage of the forward noising process in DPM, which corresponds to the initial stage of generation. Inspired by these observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of T2I generation experiments conditioned on a set of text prompts, we conclude that in the earlier generation stage the image is mostly determined by the special token [EOS] in the text prompt, and that the information in the text prompt is already conveyed in this stage. After that, the diffusion model completes the details of the generated images using information from the images themselves. Finally, we propose to apply this observation to accelerate T2I generation by properly removing text guidance, which speeds up sampling by up to 25%.
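The shape-then-detail observation hinges on low-frequency content surviving additive noise better than high-frequency content. A 1-D toy (my own construction, not the paper's code) makes the point with a 1/f signal, whose power spectrum loosely mimics that of natural images: white noise adds roughly equal energy at every frequency, so the relative corruption is far larger where the signal is weak, i.e., at high frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
freqs = np.fft.rfftfreq(n)  # normalized frequencies in [0, 0.5]

# Synthesize a 1/f "natural image"-like signal: strong low frequencies.
amp = np.where(freqs > 0, 1.0 / np.maximum(freqs, 1e-3), 0.0)
clean = np.fft.irfft(amp * np.exp(2j * np.pi * rng.random(len(freqs))), n)
noisy = clean + rng.normal(0.0, clean.std(), n)  # heavy white noise

C = np.abs(np.fft.rfft(clean))
N = np.abs(np.fft.rfft(noisy))
low, high = freqs < 0.05, freqs >= 0.05
rel_err = lambda band: np.abs(N[band] - C[band]).sum() / C[band].sum()
print(rel_err(low) < rel_err(high))  # low frequencies survive the noise better
```

The same asymmetry in the forward diffusion process is why coarse shape is fixed first during generation and texture only later.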
Asset and Factor Risk Budgeting: A Balanced Approach
Portfolio optimization methods have evolved significantly since Markowitz introduced the mean-variance framework in 1952. While the theoretical appeal of this approach is undeniable, its practical implementation poses important challenges, primarily revolving around the intricate task of estimating expected returns. As a result, practitioners and scholars have explored alternative methods that prioritize risk management and diversification. One such approach is Risk Budgeting, where portfolio risk is allocated among assets according to predefined risk budgets. The effectiveness of Risk Budgeting in achieving true diversification can, however, be questioned, given that asset returns are often influenced by a small number of risk factors. From this perspective, one question arises: is it possible to allocate risk at the factor level using the Risk Budgeting approach? First, we address this question by introducing risk measures directly associated with risk-factor exposures and demonstrating that these measures have desirable mathematical properties, making them suitable for optimization. Then, we propose a novel framework to find portfolios that effectively balance the risk contributions from both assets and factors. Leveraging standard stochastic algorithms, our framework enables the use of a wide range of risk measures to construct diversified portfolios.
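Before the factor-level extension, it helps to recall the asset-level identity that Risk Budgeting builds on: Euler's decomposition of portfolio volatility into per-asset risk contributions that sum to the total. The covariance matrix and weights below are illustrative numbers of mine, not from the paper.

```python
import numpy as np

def risk_contributions(w, cov):
    """Euler decomposition of portfolio volatility:
    RC_i = w_i * (Σ w)_i / σ_p, with sum_i RC_i = σ_p.
    Risk Budgeting picks w so each RC_i matches a target budget."""
    sigma_p = np.sqrt(w @ cov @ w)
    return w * (cov @ w) / sigma_p

cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])  # illustrative 2-asset covariance
w = np.array([0.6, 0.4])

rc = risk_contributions(w, cov)
# Contributions sum exactly to total portfolio volatility; here these
# particular weights even equalize the two contributions (equal risk budgets).
print(np.allclose(rc.sum(), np.sqrt(w @ cov @ w)))  # True
```

The paper's question is whether the same budgeting logic can be applied when the contributions are defined over factor exposures rather than individual assets.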
An Introduction to Vision-Language Modeling
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.