Why is it so hard to find a job now? Enter Ghost Jobs
Plus, more links to make you a little bit smarter today.
Making a new video every week until I make $5 million - Week 8
Designing and Building Scalable Data Solutions with Snowflake and Databricks
Whether you’re building a modern Data Lake or a robust Data Warehouse, selecting the right platform is key. Two industry leaders—Snowflake and Databricks—offer powerful, scalable, cloud-native architectures that enable organizations to harness the full potential of their data. In this post, we’ll be doing a bit of a quadruple feature, covering data lakes and data warehouses as well as how you can use Snowflake and Databricks to your advantage!
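To make the pairing concrete before you click through (this is my own rough sketch, not code from the post), here is what landing the same raw extract can look like on each side: a Delta table on Databricks via PySpark for the lake, and a staged COPY INTO load via the Snowflake Python connector for the warehouse. Paths, table names, and connection details are hypothetical placeholders.

```python
# Minimal sketch: loading one CSV extract into Databricks (Delta) and Snowflake.
# Paths, table names, and credentials below are hypothetical placeholders.

from pyspark.sql import SparkSession
import snowflake.connector

# --- Databricks / data lake side: write the raw file as a Delta table ---
spark = SparkSession.builder.appName("orders_ingest").getOrCreate()

orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/mnt/raw/orders.csv")          # hypothetical lake path
)
orders.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")

# --- Snowflake / data warehouse side: stage and load the same file ---
conn = snowflake.connector.connect(
    account="my_account",                # hypothetical connection details
    user="etl_user",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS ORDERS (ORDER_ID INT, AMOUNT FLOAT, TS TIMESTAMP)")
cur.execute("PUT file:///data/orders.csv @%ORDERS")  # upload to the table stage
cur.execute("COPY INTO ORDERS FROM @%ORDERS FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
conn.close()
```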
Why I love Bluey (and hate Cocomelon)
These are the two most-streamed children's shows. Joe Brumm's personal touch for Bluey trumps Cocomelon's engagement-hacking approach.
Why is it so hard to find a job now? Enter Ghost Jobs
This study investigates the emerging phenomenon of "ghost hiring" or "ghost jobs", where employers advertise job openings without intending to fill them. Using a novel dataset from Glassdoor and employing an LLM-BERT technique, I find that up to 21% of job ads may be ghost jobs, a practice particularly prevalent in specialized industries and in larger firms. The trend could be due to the low marginal cost of posting additional job ads and the desire to maintain a pipeline of talent. After adjusting for yearly trends, I find that ghost jobs can explain the recent disconnect in the Beveridge Curve over the past fifteen years. The results show that policy-makers should be aware of such a practice, as it causes significant job-search fatigue and distorts market signals.
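The abstract doesn't spell out the classification pipeline, but "an LLM-BERT technique" suggests a fine-tuned BERT-style text classifier over job-ad text. As a loose, hypothetical sketch of that idea using Hugging Face transformers (the toy data, base model, and labels are my own placeholders, not the study's setup):

```python
# Hypothetical sketch of a BERT-style ghost-job classifier; not the paper's code.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy labelled job ads: 1 = suspected ghost posting, 0 = genuine opening.
ads = Dataset.from_dict({
    "text": [
        "Hiring immediately: warehouse associate, start Monday, $22/hr.",
        "Always accepting applications for future Senior Data Scientist openings.",
    ],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

ads = ads.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ghost-job-clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ads,
)
trainer.train()
```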
Why Property Testing Finds Bugs Unit Testing Does Not
I intended this newsletter to be my thoughts without editing, and I have a new thought, so here goes. I want to respond to this discussion:
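For readers who haven't used property-based testing, here is a small illustration of the gap the title points at (my example, not the post's): a unit test pins one hand-picked input, while a property test has Hypothesis generate inputs and check an invariant, which surfaces a bug the unit test happens to miss.

```python
# Illustrative example (not from the linked post): unit test vs. property test.
from hypothesis import given, strategies as st

def my_sort(xs):
    # Buggy sort: silently drops duplicates because of the set() call.
    return sorted(set(xs))

# Unit test: one hand-picked input with no duplicates, so it passes.
def test_my_sort_unit():
    assert my_sort([3, 1, 2]) == [1, 2, 3]

# Property test: Hypothesis generates many lists and checks a general property,
# "sorting preserves length", which fails as soon as a generated list repeats a value.
@given(st.lists(st.integers()))
def test_my_sort_property(xs):
    assert len(my_sort(xs)) == len(xs)
```

Run under pytest, Hypothesis finds a failing list containing a duplicate and shrinks it to a minimal counterexample such as `[0, 0]`.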
Wikipedia in the Era of LLMs: Evolution and Risks
In this paper, we present a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing page views and article content to study Wikipedia's recent changes and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models might shift as well. Moreover, the effectiveness of RAG might decrease if the knowledge base becomes polluted by LLM-generated content. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks.
Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents
The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (the generation of false information) and by the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in fine-tuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models' outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized "I don't know" responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.
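To ground the terminology (a schematic sketch under my own assumptions, not the paper's code): self-training and knowledge distillation differ only in who writes the fine-tuning targets, the student model itself versus a stronger teacher such as GPT-4o; the retrieval-augmented prompt construction stays the same.

```python
# Schematic sketch of the two fine-tuning data recipes compared in the paper.
# `student_generate` / `teacher_generate` are stand-ins for real model calls
# (e.g. the open-source student model vs. GPT-4o); they are hypothetical here.
from typing import Callable, Dict, List

def build_finetune_set(
    questions: List[str],
    retrieve: Callable[[str], str],        # retrieval step of the RAG pipeline
    answer: Callable[[str, str], str],     # whoever produces the target answer
) -> List[Dict[str, str]]:
    """Pair each question + retrieved context with a target answer."""
    examples = []
    for q in questions:
        context = retrieve(q)
        examples.append({
            "prompt": f"Context: {context}\nQuestion: {q}",
            "target": answer(q, context),
        })
    return examples

questions = ["How do I enable Game Mode?", "Can the TV record to USB?"]
retrieve = lambda q: "(manual excerpt retrieved for: " + q + ")"
student_generate = lambda q, ctx: "(student model's own draft answer)"
teacher_generate = lambda q, ctx: "(stronger teacher model's answer)"

# Self-training: fine-tune the student on its *own* outputs.
self_training_set = build_finetune_set(questions, retrieve, student_generate)

# Knowledge distillation: fine-tune the student on the *teacher's* outputs.
distillation_set = build_finetune_set(questions, retrieve, teacher_generate)

print(self_training_set[0]["target"], "|", distillation_set[0]["target"])
```

Both datasets would then go through an identical fine-tuning step; per the abstract, the self-training variant matched distillation on hallucination reduction.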