Navigating Price Elastic Modelling (Part-1): A Journey of Data Cleaning and Insights
Welcome to the enthralling world of Price Elastic Modelling, where I share my exhilarating experience during my summer internship at GlobalLogic, a Hitachi Group company. Our journey revolves around exploring the intriguing realm of price elasticity and its real-world implications.
Join me as I unravel the intricacies of data cleaning in this blog, and stay tuned for upcoming parts, where I’ll delve into the next phases of this fascinating project.
Understanding Price Elasticity of Products
Before diving into data cleaning, it’s essential to revisit the fundamental concepts of price elasticity of products. In my previous blog, I covered the basics of price elasticity and its significance for businesses. Understanding this economic metric is crucial as it helps manufacturers and retailers set optimal product prices to maximize profitability. If you haven’t explored it yet, I recommend you take a look here.
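As a quick refresher before the cleaning work: price elasticity of demand is the percentage change in quantity demanded divided by the percentage change in price. The tiny helper below is purely illustrative (it uses the midpoint formula and made-up numbers) and is not part of the project code:
# Illustrative only: midpoint (arc) elasticity between two price/quantity points
def price_elasticity(q1, q2, p1, p2):
    pct_change_qty = (q2 - q1) / ((q1 + q2) / 2)
    pct_change_price = (p2 - p1) / ((p1 + p2) / 2)
    return pct_change_qty / pct_change_price

# A price rise from 100 to 110 that cuts demand from 50 to 45 units
print(price_elasticity(50, 45, 100, 110))  # ≈ -1.1, i.e. elastic demand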
Exploring the Kaggle Dataset
Our journey begins with the “Electronic Product with category” dataset from Kaggle. This rich and diverse dataset contains information about various electronic products, their prices, availability, condition, and more. By leveraging this dataset, we aim to extract valuable insights about product demand and the impact of pricing strategies.
Applying Data Cleaning Techniques
Data cleaning is a critical step to ensure the accuracy and consistency of our dataset. In this blog, we will walk through the key data cleaning techniques employed for this dataset, including price availability categorization, condition standardization, date seen and impression count calculation, merchant imputation and categorization, weight standardization, and country engineering. These techniques will help us prepare the dataset for analysis and uncover meaningful patterns.
Price Availability Categorization:
We categorize product availability into “Yes” (in stock), “No” (not present), and “Special” (special order) using powerful regular expressions. This classification helps us understand how product availability affects consumer demand.
# Imports used throughout the cleaning code
import re
import numpy as np
import pandas as pd

# Price availability has been categorized in 3 terms -
# ->> Present or in stock - Yes,
# ->> Not present - No,
# ->> Special order - Special
# using regex to sort them out (na=False treats missing values as non-matches)
df['prices.availability'] = np.where(df['prices.availability'].str.contains('Yes|TRUE|In Stock|available', flags=re.IGNORECASE, na=False), "Yes",
                            np.where(df['prices.availability'].str.contains('No|sold|FALSE|Retired|Discontinued', flags=re.IGNORECASE, na=False), "No",
                            np.where(df['prices.availability'].str.contains('Special Order|More on the Way|More Coming Soon', flags=re.IGNORECASE, na=False), "Special", np.nan)))
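A quick, optional sanity check after this step is to look at how the rows fall into the three buckets:
# Optional sanity check: distribution of the availability buckets
print(df['prices.availability'].value_counts(dropna=False))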
Condition Standardization:
To enhance comparability, we standardize product conditions as “New,” “Refurbished,” and “Used”. This step allows us to analyze how the condition of products influences their pricing and demand.
# Collapse free-text conditions into three standard buckets
# (na=False skips rows where the condition is missing)
df.loc[df['prices.condition'].str.contains('new', flags=re.IGNORECASE, na=False), "prices.condition"] = 'New'
df.loc[df['prices.condition'].str.contains('refurbished', flags=re.IGNORECASE, na=False), "prices.condition"] = 'Refurbished'
df.loc[df['prices.condition'].str.contains('pre-owned|used', flags=re.IGNORECASE, na=False), "prices.condition"] = 'Used'
Date Seen and Impression Count Calculation:
By converting the “prices.dateSeen” column to a datetime format, we can gauge customer interest in products. The impression count reflects the number of times a product was potentially viewed by customers, providing valuable insights into product popularity.
# Each row stores every date the listing was seen, separated by commas;
# the number of dates serves as an impression (view) count.
Impression_count = []
for i in df['prices.dateSeen']:
    time_ = i.split(",")
    Impression_count.append(len(time_))
df['Impression'] = Impression_count

# Keep only the first recorded date and convert it to a datetime
df['prices.dateSeen'] = df['prices.dateSeen'].apply(lambda x: x.split(",")[0])
df['prices.dateSeen'] = pd.to_datetime(df['prices.dateSeen'])
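As a side note, the loop can be replaced by a vectorized expression; this variant would have to run before the column is reduced to its first date, and is offered only as an equivalent alternative:
# Equivalent vectorized version (run before the datetime conversion above):
# number of comma-separated dates per row
df['Impression'] = df['prices.dateSeen'].str.count(",") + 1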
Sales and Discounts:
In the realm of price elasticity, “Sales” and “Discounts” are crucial factors that significantly impact consumer behavior. To better relate customer impressions to changes in product prices, we validate the “isSale” flag against the listed prices. The logic is simple: if “isSale” is true, the actual price (“prices.amountMax”) should not equal the discounted price (“prices.amountMin”), since some discount was applied; conversely, when “isSale” is false, the two prices should be equal because no discount exists. The code below corrects rows where the flag and the prices disagree.
df.loc[((df['prices.amountMax'] != df['prices.amountMin']) & (df['prices.isSale'] == False)),"prices.isSale"] = True
df.loc[((df['prices.amountMax'] == df['prices.amountMin']) & (df['prices.isSale'] == True)),"prices.isSale"] = False
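Since both price columns are available, it can also help the later elasticity analysis to carry the discount depth along as a feature. The column name “Discount_pct” below is an illustrative choice of mine, not something defined in the original dataset:
# Hypothetical derived feature: relative discount depth (0 when there is no sale)
df['Discount_pct'] = (df['prices.amountMax'] - df['prices.amountMin']) / df['prices.amountMax']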
Merchant Imputation and Categorization:
For missing merchant values, we extract information from the “prices.sourceURLs” column and categorize merchants into larger groups like Amazon, eBay, Walmart, and others. Understanding the influence of different merchants on product demand is crucial for retailers and manufacturers.
# Extract the merchant name from a source URL,
# e.g. "https://www.walmart.com/..." -> "walmart.com"
def Source_to_Merchant(x):
    try:
        return x.split("www.")[1].split("/")[0]
    except Exception:
        # Fallback for URLs without "www."
        if len(x) > 0:
            return x.split("//")[1].split(".com")[0]
        return x

df['prices.sourceURLs'] = df['prices.sourceURLs'].apply(Source_to_Merchant)
# Fill missing merchants with the name extracted from the source URL
df.loc[df['prices.merchant'].isnull(), 'prices.merchant'] = df['prices.sourceURLs']
# Group the raw merchant strings into a handful of major retailers
df['prices.merchant'] = np.where(df['prices.merchant'].str.contains('bhphotovideo', flags=re.IGNORECASE), "bhphotovideo.com",
                         np.where(df['prices.merchant'].str.contains('eBay|e bay', flags=re.IGNORECASE), "ebay.com",
                         np.where(df['prices.merchant'].str.contains('Amazon', flags=re.IGNORECASE), "Amazon.com",
                         np.where(df['prices.merchant'].str.contains('Bestbuy', flags=re.IGNORECASE), "Bestbuy.com",
                         .....
                         np.where(df['prices.merchant'].str.contains('Wal-mart|Walmart', flags=re.IGNORECASE), "Walmart.com", "Other")))))))))))))))
Weight Standardization:
Various weight formats (pounds, ounces, kilograms, and grams) will be converted to kilograms for uniformity. This step allows us to analyze how product weight impacts pricing and demand.
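Since this step is not shown in code above, here is a minimal sketch of what the conversion can look like. It assumes the raw weights live in a column named “weight” holding strings such as “4.2 pounds”, “12 ounces” or “340 grams”; the names “weight” and “Weight_kg” are assumptions for illustration and may differ from the actual dataset:
# Minimal sketch: normalize mixed weight strings to kilograms.
# Column names 'weight' and 'Weight_kg' are illustrative assumptions.
UNIT_TO_KG = {
    'pound': 0.453592, 'pounds': 0.453592, 'lb': 0.453592, 'lbs': 0.453592,
    'ounce': 0.0283495, 'ounces': 0.0283495, 'oz': 0.0283495,
    'kilogram': 1.0, 'kilograms': 1.0, 'kg': 1.0,
    'gram': 0.001, 'grams': 0.001, 'g': 0.001,
}

def weight_to_kg(raw):
    try:
        value, unit = str(raw).lower().split()[:2]
        return float(value) * UNIT_TO_KG[unit]
    except (ValueError, KeyError):
        return np.nan  # unparseable value or unknown unit

df['Weight_kg'] = df['weight'].apply(weight_to_kg)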
Country Engineering:
We will derive the country of sale based on the currency mentioned in the “prices.currency” column. Analyzing sales trends across different countries will help us identify regional demand patterns.
df['Country'] = np.where(df['prices.currency'] == 'USD', 'USA',
                np.where(df['prices.currency'] == 'CAD', "Canada",
                np.where(df['prices.currency'] == 'SGD', "Singapore",
                np.where(df['prices.currency'] == 'EUR', "EUROPE",
                np.where(df['prices.currency'] == 'GBP', "UK", "Other")))))
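As a design note, the same currency-to-country mapping can be written with a lookup dictionary and pandas’ map, which keeps the table easy to extend; this is an equivalent alternative rather than the code used above:
# Equivalent alternative using a lookup dictionary
currency_to_country = {'USD': 'USA', 'CAD': 'Canada', 'SGD': 'Singapore',
                       'EUR': 'EUROPE', 'GBP': 'UK'}
df['Country'] = df['prices.currency'].map(currency_to_country).fillna('Other')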
Upcoming Sequels
- Exploratory Data Analysis (EDA): In the next sequel, we will perform in-depth EDA to uncover valuable patterns, correlations, and trends within the dataset. EDA will offer insights that aid in making informed business decisions. Stay tuned for the link to this sequel.
- Price Elasticity Model: In the final sequel of this series, we will build the Price Elasticity Model, a powerful tool to understand how price changes impact product demand. This model will help businesses optimize their pricing strategies and maximize profitability. Keep an eye out for the link to this exciting sequel.
Conclusion
Embarking on this Price Elastic Modelling journey has been an enlightening experience. Data cleaning has laid a robust foundation for our analysis, ensuring the accuracy and consistency of our dataset. As we progress, the upcoming sequels of Exploratory Data Analysis and the Price Elasticity Model promise to reveal even more valuable insights. Together, we will unravel the potential of Price Elastic Modelling to empower businesses in a dynamic market. The world of data-driven decision-making awaits — let’s explore its wonders together! Stay tuned for more exciting discoveries in the upcoming sequels.