NLP for Cannabis Text Data

Partner: Heather Haveman, UC Berkeley, Academic

Overview

Project Description

In this project, research apprentices will write Python code to scrape, munge, and classify product data to better understand the dynamics of the United States cannabis industry. Apprentices will apply their programming skills to (1) scrape product data from publicly available websites; (2) turn messy, unstructured data sets into clean data sets suitable for reproducible research; and (3) apply current natural language processing techniques to find trends and patterns in product description data. These data science techniques will help us uncover the political and cultural elements that affect market competition in the US cannabis industry.
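As a rough illustration of the cleaning step, the sketch below normalizes a single scraped product description using only the Python standard library. The function name clean_description and the sample input are hypothetical and are not part of the project's actual pipeline; the real data will likely need additional, source-specific handling.

```python
import html
import re

# Patterns for leftover markup and runs of whitespace in scraped text.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def clean_description(raw: str) -> str:
    """Return a normalized version of one scraped product description.

    Hypothetical helper for illustration only: it decodes HTML entities,
    drops HTML tags, collapses whitespace, and lowercases the result.
    """
    text = html.unescape(raw)      # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)   # replace tags with a space
    text = WS_RE.sub(" ", text)    # collapse newlines, tabs, repeated spaces
    return text.strip().lower()


if __name__ == "__main__":
    sample = "<p>Earthy &amp; sweet</p>\n  Hybrid "
    print(clean_description(sample))
```

Keeping each cleaning rule in one small, deterministic function makes it easier to rerun the whole pipeline and get the same data set, which is the point of reproducible research.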

Professor Heather Haveman and her co-author, Cyrus Dioun, will supervise students and train them in best practices for developing a data pipeline and for applying machine learning techniques to natural language processing. We will apply these techniques to explore how variation in state and local laws regarding cannabis’s legal status affects the way that cannabis retailers market their products. Are certain merchandising and pricing norms more likely to emerge in certain areas, depending on the form of government regulation and on variation in types of property rights? Do companies strategically describe their products in similar or different ways to maximize revenue? Participants will help us answer these questions by writing code to clean and analyze product descriptions and user reviews collected from the United States cannabis industry. Students will work with machine learning packages and text analysis techniques (convolutional neural networks, cosine similarity, etc.) to help analyze millions of observations collected via our web scrapers.
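To make the cosine similarity idea concrete, here is a minimal sketch that compares two product descriptions using raw word counts and the Python standard library. The tokenizer, the function names, and the sample descriptions are assumptions made for illustration; the project may well use TF-IDF weights, embeddings, or a dedicated library instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts.

    Returns a value in [0, 1]; 0.0 is returned if either text has no tokens.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the words the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up descriptions, not real project data.
    d1 = "sweet citrus aroma with a relaxing finish"
    d2 = "relaxing hybrid with sweet citrus notes"
    print(round(cosine_similarity(d1, d2), 3))
```

A score near 1 means two retailers describe a product with nearly the same vocabulary, while a score near 0 means they share almost no words. Comparing scores across jurisdictions is one possible way to approach the question of whether descriptions converge or diverge.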

Expected Deliverable

Clean data set with variables constructed using NLP.

What would a successful semester look like to you?

Engaged students, reproducible code, robust findings.

Data

Models

Conclusion