Grocery Store Transaction Data Database Design and Creation

Database diagramming

From a grocery store transaction dataset that was available in one Excel spreadsheet, I created a diagram of the columns from the dataset and how they would relate to each other in a database. From there, I separated each of the tables into .CSV files by saving each of the columns needed from the transactions.exl file to .CSV files with the same names as the tables in the diagram on the previous page. No rows were excluded.

Screen Shot 2021-04-13 at 9.42.30 PM.png

Database creation in Python and Jupyter Notebook

To create the database, the following commands were completed in Jupyter Notebook.

Import SQL:

<import sqlite3
con = sqlite3.connect('Sky_Stage_Project2.sqlite')
cur = con.cursor()

Create the tables in the diagram above:

# Create table - order_item
cur.execute('''CREATE TABLE order_item
             ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [units_sold] integer)''')                
con.commit()
# Create table - orders
cur.execute('''CREATE TABLE orders
             ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [customer_id] integer, [purchase_date] integer)''')                
con.commit()
# Create table - customers
cur.execute('''CREATE TABLE customers
             ([generated_id] INTEGER PRIMARY KEY, [customer_id] integer, [gender] text, [marital_status] text, [homeowner] text, [children] text, [annual_income] text)''')                
con.commit()
# Create table - products
cur.execute('''CREATE TABLE products
             ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [product_family] text, [purchase_department] text, [product_category] text, [revenue] integer, [purchase_date] integer)''')                
con.commit()
# Create table - locations
cur.execute('''CREATE TABLE locations
             ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [customer_id] integer, [city] text, [state] text, [country] text)''')                
con.commit()

Import Pandas library:

import pandas as pd

Insert the values from the table .CSV files into the tables made in the database:

con = sqlite3.connect('Sky_Stage_Project2.sqlite')  
cur = con.cursor()

read_order_item = pd.read_csv ('order_items.csv')
read_order_item.to_sql('order_item', con, if_exists='append', index = False) 
# Insert the values from the csv file into the table 'order_items' 

read_orders = pd.read_csv ('orders.csv')
read_orders.to_sql('orders', con, if_exists='replace', index = False) 
# Replace the values from the csv file into the table 'orders'

read_customers = pd.read_csv ('customers.csv')
read_customers.to_sql('customers', con, if_exists='replace', index = False) 
# Replace the values from the csv file into the table 'customers'

read_products = pd.read_csv ('products.csv')
read_products.to_sql('products', con, if_exists='replace', index = False) 
# Replace the values from the csv file into the table 'products'

read_locations = pd.read_csv ('locations.csv')
read_locations.to_sql('locations', con, if_exists='replace', index = False) 
# Replace the values from the csv file into the table 'locations'

View database through running SQL queries in Python

Look to see if the tables were imported correctly by running some SQL commands:

con=sqlite3.connect('Sky_Stage_Project2.sqlite')
def x(q):
    return pd.read_sql_query(q,con)

q='''SELECT * FROM order_item '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.53.04 PM.png
con=sqlite3.connect('Sky_Stage_Project2.sqlite')
def x(q):
    return pd.read_sql_query(q,con)

q='''SELECT * FROM orders '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.54.15 PM.png
con=sqlite3.connect('Sky_Stage_Project2.sqlite')
def x(q):
    return pd.read_sql_query(q,con)

q='''SELECT * FROM customers '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.55.34 PM.png
con=sqlite3.connect('Sky_Stage_Project2.sqlite')
def x(q):
    return pd.read_sql_query(q,con)

q='''SELECT * FROM products '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.56.18 PM.png
con=sqlite3.connect('Sky_Stage_Project2.sqlite')
def x(q):
    return pd.read_sql_query(q,con)

q='''SELECT * FROM locations '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.57.02 PM.png

Conduct some simple analyses with SQL

con=sqlite3.connect('Sky_Stage_Project2.sqlite')
def x(q):
    return pd.read_sql_query(q,con)

q='''SELECT * FROM products ORDER BY revenue DESC LIMIT 10; '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.58.56 PM.png
q='''SELECT product_category, COUNT(*) FROM orders NATURAL JOIN customers NATURAL JOIN products NATURAL JOIN order_item GROUP BY product_category ORDER BY COUNT(*) ASC LIMIT 30; '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 9.59.41 PM.png
q='''SELECT product_category, COUNT(*) FROM orders NATURAL JOIN customers NATURAL JOIN products NATURAL JOIN order_item GROUP BY product_category ORDER BY COUNT(*) DESC LIMIT 30; '''
   
df=x(q)
df
Screen Shot 2021-04-13 at 10.00.28 PM.png