OR & Data Science Stories

Solving Single Depot Capacitated Vehicle Routing Problem Using Column Generation with Python

2019-12-15T00:00:00+00:00

The vehicle routing problem (VRP) is the problem of identifying the optimal set of routes for a fleet of vehicles to deliver to a given set of customers. When vehicles have limited carrying capacity and customers have time windows within which deliveries must be made, the problem becomes the capacitated vehicle routing problem with time windows (CVRPTW). In this post, we discuss how to tackle CVRPTW with column generation to get a fast and robust solution.

Problem

We consider a pizza restaurant chain, PPizza, in the Los Angeles, CA area with 34 stores. Each store operates from 10am to 1am every day. PPizza offers three pizza sizes (small, medium, large) with various toppings and soft drinks. Pizzas are prepared with fresh ingredients and baked in store on demand.

PPizza forecasts weekly demand of food items for each store and identifies the required ingredients and soft drinks. Fresh ingredients are delivered to each store once a day from the main depot. Soft drinks are delivered and replenished directly by suppliers.

Figure 1 shows the locations of the stores and the depot. Each store has a time window between 9am and 3pm within which the delivery must be made. Unloading time varies by store depending on location and parking availability.


Figure 1: PPizza depot and stores

Trucks can leave the depot at 6am and need to return to the depot by 5pm. Each truck can be used once and has a limited capacity of $60$ lbs.

Since delivery cost is a function of the number of trucks used, minimizing the number of trucks minimizes total cost. We want to identify truck operating schedules that deliver fresh ingredients to each store within its time window while minimizing the number of trucks used.
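Capacity alone gives a quick lower bound on the number of trucks: total demand divided by truck capacity, rounded up. The sketch below illustrates this with hypothetical store demands (the `store_demand` values are made up for illustration; only the capacity of 60 comes from the problem statement). Time windows can force more trucks than this bound suggests.

```python
import math

# Hypothetical daily demand per store, in the same units as truck capacity.
store_demand = {"store_1": 25, "store_2": 40, "store_3": 30, "store_4": 20}
truck_capacity = 60  # capacity stated in the problem description


def min_trucks_by_capacity(demand, capacity):
    """Lower bound on trucks needed, ignoring time windows and routing."""
    return math.ceil(sum(demand.values()) / capacity)


lower_bound = min_trucks_by_capacity(store_demand, truck_capacity)
print(lower_bound)
```

Any feasible schedule needs at least this many trucks, so it is a natural starting point for the range of fleet sizes to test.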

Analysis

We first formulate the problem as a mixed integer program. Then, we solve the problem for a range of numbers of available trucks using this formulation. Since CVRPTW is NP-hard, we expect model run time to increase as the number of available trucks decreases.

We also develop a column generation based algorithm to solve the problem.

Finally, we compare the performance of the two solution methodologies: the mixed integer program and column generation.

General Formulation

We develop a mixed integer model for the PPizza delivery problem as follows.

We solve the mixed integer model using Python with PuLP. The following is the implementation.

import pandas as pd
import timeit
import time
from threading import Thread
import queue
from cvrptw_optimization import single_depot_general_model_pulp as sm

# Read input data
customers = pd.read_pickle(r'data/customers.pkl')
depots = pd.read_pickle(r'data/depots.pkl')
transportation_matrix= pd.read_pickle(r'data/transportation_matrix.pkl')
vehicles = pd.read_pickle(r'data/vehicles.pkl')

# Model parameters
bigm_input=transportation_matrix.DRIVE_MINUTES.max()*20
solver_time_limit_minutes_input = 480

# Calculate range for vehicles
min_vehicles = int(round(customers.DEMAND.sum()/60)+2)
max_vehicles = len(vehicles)+1

# Define functions
def run_single_depot_general_model(vehicle,
                                   depots, 
                                   customers, 
                                   transportation_matrix, 
                                   vehicles,
                                   bigm_input,
                                   solver_time_limit_minutes_input):
    """
    Run general model
    """
    start = timeit.default_timer()
    vehicles_sub = vehicles.head(int(vehicle))
    print(len(vehicles_sub))
    solution_objective, solution_paths = sm.run_single_depot_general_model(depots, 
                                                                           customers, 
                                                                           transportation_matrix, 
                                                                           vehicles_sub,
                                                                           bigm=bigm_input,
                                                                           solver_time_limit_minutes=solver_time_limit_minutes_input)
    solution_paths['OBJECTIVE'] = solution_objective
    solution_paths['NUMBER_OF_VEHICLES'] = vehicle
    stop = timeit.default_timer()
    # default_timer() returns seconds, so divide by 60 to get minutes
    solution_paths['MODEL_RUN_TIME_MINUTES'] = (stop - start) / 60
    solution_paths.to_csv(r'general model solutions/{}_.csv'.format(str(vehicle)), index=False)
    return 'ok'
    
q = queue.Queue()
def worker():
    """
    Worker function to process vehicles from a queue (q).
    """
    while True:
        print("Start thread worker.")
        vehicle = q.get()
        print("Starting vehicle: {}".format(str(vehicle)))
        run_single_depot_general_model(vehicle,
                                       depots, 
                                       customers, 
                                       transportation_matrix, 
                                       vehicles,
                                       bigm_input,
                                       solver_time_limit_minutes_input)
        print("Finishing vehicle: {}".format(str(vehicle)))
        q.task_done()
        print("End thread worker")
        
def create_and_process_queue(vehicle_range_list, max_num_threads):
    """
    Creates a queue of vehicles to process. Creates threads to process the queue. 
    The number of threads are limited by max_num_threads.
    """
    # add the vehicles to the queue
    for vehicle in vehicle_range_list:
        print("Adding vehicle {} to queue".format(str(vehicle)))
        q.put(vehicle)
    print("Create threads")
    for i in range(max_num_threads):
        time.sleep(10)
        t = Thread(target=worker)
        t.daemon = True
        t.start()
    q.join()  # blocks until all queue items have been processed
    

vehicle_range_list = []
for vehicle in range(min_vehicles, max_vehicles):
    vehicle_range_list.append(vehicle)
vehicle_range_list.reverse()

create_and_process_queue(vehicle_range_list, 5)

You can install the cvrptw_optimization package into your conda environment using the following command.

pip install cimren-cvrptw-optimization

We ran the model with the number of available vehicles, $|K|$, ranging from 30 down to 11. We set the maximum model run time to 480 minutes (8 hours). No solution is found for $|K|=11$ and $|K|=12$ because the maximum model run time is reached.

Figure 2 illustrates the routing, model objective, and run time in minutes for each number of available vehicles.


Figure 2: General model solution

As we use fewer vehicles, total delivery hours are reduced by about an hour per vehicle removed.

The best solution is obtained when $|K|=13$ (Figure 3). Model run time is approximately 6 hours. The model objective is $16.8$, which is the total drive hours.


Figure 3: Best general model solution

We now implement the column generation methodology.

Column Generation

We develop a column generation approach based on Dantzig-Wolfe decomposition. CVRPTW is decomposed into two problems, the master problem and the subproblem, to provide a better bound when the linear relaxation of the problem is solved.

The master problem considers only a subset of the variables from the original problem, while the subproblem identifies new variables. The objective function of the subproblem is the reduced cost of the new variables with respect to the current dual variables. The outline of the algorithm is illustrated in Figure 4.


Figure 4: Column generation algorithm

In the column generation algorithm, the master problem is first solved using an initial solution. This can be any feasible solution that meets all constraints; in this case, we start with the depot-store-depot routes. From this step, the dual prices of each constraint in the master problem are obtained. These dual prices are used to compute reduced costs, which form the objective function of the subproblem. After solving the subproblem, the variables (called columns in the master problem) with negative reduced cost are identified and added to the master problem, which is then re-solved. The process is repeated until the subproblem yields only columns with non-negative reduced cost. Theoretically, at that point, the solution of the linear relaxation of the master problem is optimal.
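The pricing step described above can be sketched in a few lines. This is a minimal illustration, assuming the master problem has one covering constraint per store so that the reduced cost of a route is its cost minus the dual prices of the stores it visits. The dual values, route data, and function names are hypothetical and are not taken from the cvrptw_optimization package.

```python
# Hypothetical dual prices of the master problem's store-covering constraints.
duals = {"store_1": 1.2, "store_2": 0.9, "store_3": 0.4}

# Hypothetical candidate routes returned by the subproblem.
candidate_routes = [
    {"stores": ["store_1", "store_2"], "drive_hours": 1.6},
    {"stores": ["store_3"], "drive_hours": 0.8},
    {"stores": ["store_1", "store_2", "store_3"], "drive_hours": 2.1},
]


def reduced_cost(route, duals):
    """Route cost minus the dual prices of the stores the route covers."""
    return route["drive_hours"] - sum(duals[s] for s in route["stores"])


def columns_to_add(routes, duals, tolerance=1e-6):
    """Keep only routes whose reduced cost is negative beyond a tolerance."""
    return [r for r in routes if reduced_cost(r, duals) < -tolerance]


new_columns = columns_to_add(candidate_routes, duals)
for route in new_columns:
    print(route["stores"], round(reduced_cost(route, duals), 2))
if not new_columns:
    print("No improving column found: stop iterating.")
```

When `new_columns` is empty, no route can improve the relaxed master problem and the loop stops.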

Master Problem

We consider the set of all feasible single vehicle routes, $L$, that respect vehicle capacity and start and end at the same depot. The master problem selects the set of routes that minimizes total transportation cost.
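A standard way to write such a master problem is the set-partitioning form below. This is a sketch under assumptions, not necessarily the exact formulation used in this post: $c_r$ (cost of route $r$), $a_{ir}$ (1 if route $r$ visits store $i$, 0 otherwise), $y_r$ (1 if route $r$ is selected), and $N$ (the set of stores) are notation introduced here for illustration.

```latex
\begin{aligned}
\min \quad & \sum_{r \in L} c_r \, y_r \\
\text{s.t.} \quad & \sum_{r \in L} a_{ir} \, y_r = 1 && \forall i \in N \\
& y_r \in \{0, 1\} && \forall r \in L
\end{aligned}
```

In column generation, the binary requirement is relaxed to $y_r \ge 0$ and only a subset of $L$ is kept in the model at any time. The dual prices of the store constraints are then passed to the subproblem.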

Subproblem

The subproblem attempts to generate feasible routes with negative reduced cost to be added to the master problem. As the capacity, $q_k=q$ $\forall k\in K$, is the same for all vehicles, we solve the problem for $K=\{1\}$. The explicit formulation of the subproblem is given as follows.
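To make "feasible route" concrete, the sketch below checks a single route against truck capacity, store time windows, and the depot's 6am to 5pm operating window. It is an illustration under assumptions: times are minutes after midnight, unloading must start within the store's window, a truck that arrives early waits, and all store names and numbers are hypothetical.

```python
def is_route_feasible(route, demand, time_window, service_minutes,
                      drive_minutes, capacity, depot="depot",
                      depot_open=360, depot_close=1020):
    """Check capacity and time windows for one depot-to-depot route."""
    if sum(demand[s] for s in route) > capacity:
        return False
    clock, previous = depot_open, depot
    for store in route:
        earliest, latest = time_window[store]
        # Drive to the store; wait if the truck arrives before the window opens.
        clock = max(clock + drive_minutes[previous, store], earliest)
        if clock > latest:
            return False
        clock += service_minutes[store]  # unloading time
        previous = store
    # The truck must be back at the depot before it closes.
    return clock + drive_minutes[previous, depot] <= depot_close


# Hypothetical data for three stores.
demand = {"A": 30, "B": 25, "C": 20}
time_window = {"A": (540, 600), "B": (570, 720), "C": (540, 560)}
service_minutes = {"A": 20, "B": 15, "C": 30}
drive_minutes = {("depot", "A"): 40, ("A", "B"): 30, ("B", "depot"): 50,
                 ("A", "C"): 25, ("C", "depot"): 45}

routes = [["A", "B"], ["A", "C"], ["A", "B", "C"]]
results = {tuple(r): is_route_feasible(r, demand, time_window, service_minutes,
                                       drive_minutes, capacity=60)
           for r in routes}
print(results)
```

Here the first route is feasible, the second misses store C's time window, and the third exceeds the truck capacity.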

Column Generation Algorithm Implementation

We run the column generation in Python as follows.

import pandas as pd
from cvrptw_optimization import single_depot_column_generation_pulp as cg

# read input data
customers = pd.read_pickle(r'data/customers.pkl')
depots = pd.read_pickle(r'data/depots.pkl')
transportation_matrix= pd.read_pickle(r'data/transportation_matrix.pkl')
vehicles = pd.read_pickle(r'data/vehicles.pkl')
vehicle_capacity = 60

# run column generation
solution, iteration_statistics = cg.run_single_depot_column_generation(depots,
                                                                       customers,
                                                                       transportation_matrix,
                                                                       vehicles,
                                                                       vehicle_capacity,
                                                                       mip_gap=0.001,
                                                                       solver_time_limit_minutes=10,
                                                                       enable_solution_messaging=0,
                                                                       solver_type='PULP_CBC_CMD',
                                                                       max_iteration=150)

In the solution, we deliver with $12$ trucks driving a total of $16.4$ hours (Figure 5). Algorithm run time is less than 2 minutes.


Figure 5: Column generation solution

Figure 6 shows algorithm convergence. Note that the subproblem objective rose above $-1$ in $75$ iterations.


Figure 6: Column generation algorithm convergence

PPizza Solution

As a result, column generation uses fewer trucks than the general mixed integer formulation: 12 trucks versus 13.

Figure 7 illustrates the solution for PPizza with 12 trucks. Each truck leaves the depot at 6am and returns by 5pm.


Figure 7: PPizza solution

References

Desrochers, M., Lenstra, J.K., Savelsbergh, M.W.P., Soumis, F. (1988). Vehicle routing with time windows: Optimization and approximation. In: Golden, B.L., Assad, A.A. (Eds.), Vehicle Routing: Methods and Studies. North-Holland, Amsterdam, pp. 65–84.

2019 INFORMS Annual Conference

2019-10-20T00:00:00+00:00

The 2019 INFORMS Annual Meeting was held in Seattle from October 20 to October 23. There were over 7,000 attendees, which was a record.

I organized the OMS/Practice Curated: Contemporary Scheduling session at the conference. The session was about Operations Research applications and attracted about forty people.

I presented my work, Network Design with Routing Consideration, where we developed a network design algorithm to identify the locations of distribution facilities by considering store deliveries.

The following is the slide deck.

Predicting Short Term Trucking Rates with Random Forests

2019-08-30T00:00:00+00:00

In this post, we present a random forest model to predict short term trucking rates using Python.

Transportation rates are driven by different modes of transportation (air, road, rail, and ocean). In this work, we focus on trucking related transportation modes, Full Truckload (FTL) and Less than Truckload (LTL). We describe FTL and LTL modes, trailer types, and transportation services as follows.

FTL: An entire truck is used for transportation. In the FTL market, a dedicated truck delivers directly from the shipper’s location to the destination. The rate for the full truck is the same whether the truck is 100% full or 25% full, and may differ depending on where the shipment starts and ends. Capacity of a truck can be measured in total weight, total cube, or total number of pallets. Various truck types are used such as regular dry van, refrigerated, flatbed, tanker, and 48- and 53-foot trailers. An FTL carrier may hold 45,000 pounds of product.

LTL: Companies use LTL when they have a small load to ship to a destination. In this case, hiring an entire truck to make the delivery is not economical. The trucking company picks up the load, combines it with other companies’ pickups or deliveries, and completes a route of deliveries to customer locations. An LTL carrier may hold up to 15,000 pounds of product.

Trailer Types: Different trailer types carry different products. The main types include dry van, flatbed, refrigerated/temperature-controlled trailer, and tank. Dry van is the most popular one. The flatbed trailer does not have side walls or a ceiling and is used to carry construction materials or large machinery. Food and medicines are normally hauled in temperature-controlled trailers. Tanks are used to haul refined oil products or chemicals in liquid form.

Transportation Services: Two types of services exist to cover freight transportation requirements: long-term contracts and the spot market. Contract carriers are used most frequently. The contract is typically a one-year commitment, which specifies origin/destination, service requirement, volume, and any other factors that affect the price. The spot market is used to obtain a rate when there is no availability in the contract market (the lane does not exist or the rate is not accepted).

Problem

Consider a network of customers and distribution centers (see Figure 1). Products are delivered from distribution centers to customers using trucks.


Figure 1: Distribution centers and customers

Our objective is to determine FTL and LTL rates for each distribution center to each customer. We develop a model to predict transportation rates.

Analysis

The following are the steps in the analysis, summarized by Figure 2.

1. Clean data, remove outliers
2. Create features (feature engineering)
3. Create train and test data
4. Develop model baseline
5. Fit model and measure performance
6. Interpret model and report results
7. Persist model


Figure 2: Analysis steps

We now explain each analysis step as follows.

1. Clean Data

The dataset consists of European long-haul truckload data, including mode, distance covered per shipment, average shipment weight per truck, average shipment cost per truck, and trailer type (see Figure 3).


Figure 3: Transportation data set

Figure 4 shows the transportation data profiling by distance miles, average shipment per truck, transportation cost per truck, mode, and trailer type. We provide statistics by number of data points, minimum value, 25% quantile, mean, median, 75% quantile, and maximum value. The following Python code can be used to generate those statistics.


Figure 4: Transportation data set profile

def quantile_25pct(x):
    '''
    Get 25% quantile
    '''
    return x.quantile(0.25)

def quantile_75pct(x):
    '''
    Get 75% quantile
    '''
    return x.quantile(0.75)

trans_costs_melted.groupby(['VARIABLE', 'MODE', 'TRAILER_TYPE'], as_index=False).agg({'VALUE': ['count', 'min', quantile_25pct, 'mean', 'median', quantile_75pct, 'max']})

We first analyze relationships between distance miles, shipment weight per truck, transportation cost per truck, and transportation cost per truck per mile for each mode and trailer type. Then, we identify and remove outliers from the data set. The interquartile range (IQR) rule is applied to detect outliers.

The following Python functions are used to generate plots and to detect outliers.

import seaborn as sns

def plot_scatter(figure_data, x_axis_column, y_axis_column, legend_column, fig_width, fig_height, font_scale, grid_column, grid_row=None):
    '''
    Create a scatter plot
    '''
    sns.set(font_scale=font_scale)
    sns.set_style("white")
    if grid_row is None:
        g = sns.FacetGrid(figure_data, col=grid_column, hue=legend_column)
    else:
        g = sns.FacetGrid(figure_data, col=grid_column, row=grid_row, hue=legend_column)
    g = (g.map(sns.scatterplot, x_axis_column, y_axis_column, edgecolor="w").add_legend())
    g.fig.set_size_inches(fig_width, fig_height)

import numpy as np

def detect_outlier(data):
    '''
    Detect outliers using the IQR rule
    '''
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    
    outliers = np.full(len(data), False)
    outliers[data < (Q1 - 1.5 * IQR)] = True
    outliers[data > (Q3 + 1.5 * IQR)] = True
    
    return outliers

Figure 5 shows relationships for FTL. We observe a linear relationship between distance traveled and transportation cost per truck, which is consistent with the FTL rate being the same whether the truck is 100% full or 25% full and varying with where the shipment starts and ends. We also see that FTL shipments in temperature-controlled trailers are $0.7 more expensive per mile than shipments in dry vans (Figure 4).





Figure 5: FTL profiles

There are FTL shipments where shipment weight per truck is less than 10,000 LBS. We also identified points with a large transportation cost per truck per mile using the IQR rule. We consider those points outliers and remove them from the data set (see Figure 6).



Figure 6: FTL outliers

Figure 7 shows relationships between distance miles, shipment weight per truck, transportation cost per truck, and transportation cost per truck per mile for LTL. There is no clear relationship between any of those variables.





Figure 7: LTL profiles

We treat shipments with a distance of more than 5,000 miles as outliers, since firms prefer FTL for long distances because it is more economical than LTL. After removing outlier points, we see a clear relationship between distance miles and transportation cost per truck for LTL (Figure 8).


Figure 8: LTL outliers

As with FTL, there are LTL shipments with a large transportation cost per mile. We use the IQR rule to detect those outliers (see Figure 9).



Figure 9: LTL outliers

2. Create Features (Feature Engineering)

Feature engineering is the process of using domain knowledge of the data to create features for machine learning algorithms.

We use one-hot encoding to convert categorical data into a numerical format without losing any information. Figure 10 shows how FTL data is transformed as an example. We also provide Python code below for one-hot encoding.



Figure 10: FTL one-hot encoding

import pandas as pd

# Drop columns that are not model features, then one-hot encode the categorical columns
trans_cost_ftl_with_one_hot = pd.get_dummies(trans_cost_ftl.drop(['MODE', 'OUTLIER', 'TRANS_COST_PER_TRUCK_USD_PER_MILE'], axis=1))

3. Create Train and Test Data

At this step, we split data into training and testing sets to evaluate performance of the model. We randomly select 75% of the data for training and 25% of data for testing using the following Python function.

from sklearn.model_selection import train_test_split

def create_train_test_splits(labels_data, features_data, test_size):
    '''
    Create train and test data
    '''
    labels = np.array(labels_data)
    features_list = list(features_data.columns)
    features = np.array(features_data)

    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = test_size, random_state = 42)

    return train_features, test_features, train_labels, test_labels

4. Develop Model Baseline

Before making predictions, we need to develop a model baseline to set a reference level for model performance. If the model cannot improve on the baseline, we need to try a new model.

In our problem, we set the baseline prediction as the average transportation cost per mile by mode and trailer type. We calculate Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and accuracy as performance metrics.

We now define MAE, MAPE, and accuracy. Let $y_i$ be the prediction and $x_i$ be the actual value for $i=1, \dots, n$ .

$MAE = \frac{1}{n}\sum_{i=1}^n|y_i-x_i|$

$MAPE = 100\frac{1}{n}\sum_{i=1}^n\frac{|y_i-x_i|}{x_i}$

$Accuracy = 100 - MAPE$
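As a small worked example of these definitions, the sketch below computes the three metrics in plain Python. The prediction and actual values are made up for illustration and are not from the transportation data set.

```python
def mae(predictions, actuals):
    """Mean absolute error."""
    return sum(abs(y - x) for y, x in zip(predictions, actuals)) / len(actuals)


def mape(predictions, actuals):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(y - x) / x for y, x in zip(predictions, actuals)) / len(actuals)


# Hypothetical actual and predicted costs per truck (USD).
actuals = [1000.0, 2000.0, 500.0]
predictions = [1100.0, 1800.0, 550.0]

mae_value = mae(predictions, actuals)
mape_value = mape(predictions, actuals)
accuracy = 100 - mape_value
print(round(mae_value, 2), round(mape_value, 2), round(accuracy, 2))
```

Each prediction here is off by 10% of its actual value, so MAPE is 10% and accuracy is 90%, while MAE averages the absolute errors of 100, 200, and 50.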

The following Python code is used to calculate baseline and accuracy metrics.

 
def apply_baseline_and_calculate_performance(trans_cost_with_one_hot, test_features, features_list, test_labels):
    '''
    Calculate baseline
    '''
    baseline = trans_cost_with_one_hot.groupby(['MODE', 'TRAILER_TYPE_DRY VAN', 'TRAILER_TYPE_TEMPERATURE CONTROLLED'], as_index=False).agg({'TRANS_COST_PER_TRUCK_USD': 'sum', 'DISTANCE_MILES': 'sum'})
    baseline['TRANS_COST_PER_TRUCK_USD_PER_MILE'] = baseline['TRANS_COST_PER_TRUCK_USD'] / baseline['DISTANCE_MILES']
    baseline.drop(['DISTANCE_MILES'], axis=1, inplace=True)
    
    baseline_costs = pd.DataFrame(test_features)
    baseline_costs.columns = features_list
    
    baseline_costs = baseline_costs.merge(baseline, how='left', on=['TRAILER_TYPE_DRY VAN', 'TRAILER_TYPE_TEMPERATURE CONTROLLED'])
    baseline_costs['BASELINE_TRANS_COST_PER_TRUCK_USD'] = baseline_costs['DISTANCE_MILES'] * baseline_costs['TRANS_COST_PER_TRUCK_USD_PER_MILE']
    baseline_costs['ACTUAL_TRANS_COST_PER_TRUCK_USD'] = test_labels
    baseline_costs['ABSOLUTE_ERROR'] = abs(baseline_costs['BASELINE_TRANS_COST_PER_TRUCK_USD'] - baseline_costs['ACTUAL_TRANS_COST_PER_TRUCK_USD'])
    baseline_costs['MAPE_PCT'] = 100 * (baseline_costs['ABSOLUTE_ERROR'] / baseline_costs['ACTUAL_TRANS_COST_PER_TRUCK_USD'])
    
    mean_absolute_error = round(np.mean(baseline_costs['ABSOLUTE_ERROR']), 2)
    accuracy = round(100 - np.mean(baseline_costs['MAPE_PCT']), 2)
    
    return baseline_costs, mean_absolute_error, accuracy

The baseline performance metrics for FTL and LTL are as follows. They set the goal: model performance should be better than baseline performance.

	FTL Baseline	LTL Baseline
MAE	434.46	352.96
MAPE	12.38%	66.9%
Accuracy	87.62%	33.1%

5. Fit Model and Measure Performance

We use the train data to fit the random forest model and the test data to measure model performance. We calculate MAE, MAPE, and Accuracy similar to the baseline analysis.

The following functions are used to fit the random forest model and measure the model performance.

 
from sklearn.ensemble import RandomForestRegressor

def fit_random_forest_model(train_features, train_labels):
    '''
    Fit a random forest model
    '''
    random_forest = RandomForestRegressor(n_estimators = 1000, random_state = 42)
    random_forest.fit(train_features, train_labels)
    return random_forest

def prediction_and_metrics(random_forest, test_features, test_labels):
    '''
    Predict and measure the model performance
    '''
    predictions = random_forest.predict(test_features)
    
    # Per-observation absolute errors, so that MAPE matches its definition
    errors = abs(predictions - test_labels)
    mae = round(np.mean(errors))
    mape = np.mean(100 * (errors / test_labels))
    accuracy = 100 - mape
    
    return predictions, mae, mape, accuracy

The following table provides baseline and random forest model performance metrics for FTL and LTL.

	FTL Baseline	LTL Baseline	FTL Random Forest Model	LTL Random Forest Model
MAE	434.46	352.96	188.0	123.0
MAPE	12.38%	66.9%	7.12%	33.10%
Accuracy	87.62%	33.1%	92.88%	66.90%

Both FTL and LTL models beat the baseline prediction. The FTL model has about 93% accuracy. However, the LTL model’s accuracy is 67%, which is lower than the FTL model’s.

One way to improve LTL model’s performance is hyperparameter tuning where the model settings are adjusted to improve performance. Another way is to add more features to the data set to capture behavior better.

6. Interpret Model and Report Results

We can use two methods to understand how the model calculates its predictions:

Visualizing a random forest tree
Understanding feature importance of variables

Visualizing A Random Forest Tree

The following Python code is used to visualize one of the random forest trees in the model.

 
from IPython.display import SVG
from graphviz import Source
from sklearn.tree import export_graphviz 

def create_random_forest_tree_image(random_forest_model, tree_number, features_list, tree_file_name):
    '''
    Create and save random forest tree image
    '''
    tree_in_forest = random_forest_model.estimators_[tree_number]
    graph = Source(export_graphviz(tree_in_forest, out_file=None
       , feature_names=features_list
       , filled = True))

    png_bytes = graph.pipe(format='png')
    with open('{}.png'.format(tree_file_name),'wb') as f:
        f.write(png_bytes)
        
    return 'tree is created'

Figure 11 shows one of the random forest trees in the model. Since the tree is very large, it is hard to visualize the relationships between model variables.


Figure 11: Random forest tree

We limit the maximum depth to 3 to be able to visualize a tree (see Figure 12).


Figure 12: Random forest tree where maximum depth = 3

Let’s assume that we want to predict transportation cost per shipment with the following shipment features.

Mode	Trailer Type	Weight Load (LBS)	Shipment Distance (Miles)
FTL	Dry Van	17,404	341

We follow the path in Figure 13 and predict FTL transportation cost per shipment as $1,277.


Figure 13: Predicting for FTL rate for a Dry Van, 341 miles distance, and 17,404 LBS weight with the model where maximum depth = 3

Understanding Feature Importance of Variables

The relative importance is defined as how much including a particular variable improves the prediction. We use the following code to calculate importance of model variables.

 
def calculate_feature_importance(random_forest_model, feature_list):
    '''
    Calculate feature importance
    '''
    importances = list(random_forest_model.feature_importances_)
    feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
    feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
    
    feature_importances = pd.DataFrame(feature_importances)
    feature_importances.columns = ['FEATURE', 'IMPORTANCE']
    
    return feature_importances

Figure 14 shows feature importance for FTL and LTL. For FTL, miles travelled is the most important factor for predicting transportation cost per shipment. This is aligned with our earlier observation (see Figure 5). Miles travelled and weight per shipment are the two most important variables affecting transportation cost per shipment for LTL.



Figure 14: Feature importance for FTL and LTL

7. Persist Model

Model persistence is a technique where a trained model is written (persisted) to disk. Once the model is saved, you can load the file later to get the trained model back and use it for making predictions. This is powerful because you don’t have to retrain the model every time you want to use it. You can also share your trained model with others without sharing the training data and all of the steps to train the model (see Figure 15).


Figure 15: Model persistence

We use the following Python code to persist the random forest model.

 
import pickle

def persist_model(random_forest_model, feature_list, train_features, train_labels, test_features, test_labels, filename):
    '''
    Persist model
    '''
    tuple_objects = (random_forest_model, feature_list, train_features, train_labels, test_features, test_labels)
    # Use a context manager so the file is closed after writing
    with open(filename, 'wb') as f:
        pickle.dump(tuple_objects, f)
    return 'model saved to {}'.format(filename)

We load the saved model using the following Python function.

 
def load_model(filename):
    '''
    Load persisted model
    '''
    with open(filename, 'rb') as f:
        random_forest_model, feature_list, train_features, train_labels, test_features, test_labels = pickle.load(f)
    return random_forest_model, feature_list, train_features, train_labels, test_features, test_labels
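The save and load round trip can be illustrated with a stand-in object. This is a minimal sketch: the placeholder dictionary takes the place of a trained model, and the feature name `WEIGHT` and the file name are hypothetical. Since unpickling can execute arbitrary code, only load pickle files from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in for the persisted tuple: a placeholder "model" and a feature list.
saved = ({"model": "placeholder"}, ["DISTANCE_MILES", "WEIGHT"])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.pkl")
    # Persist the objects to disk.
    with open(path, "wb") as f:
        pickle.dump(saved, f)
    # Load them back, as a later session or another user would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

model, feature_list = restored
print(feature_list)
```

The restored objects compare equal to the originals, so predictions made with a reloaded model match those of the model that was saved.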

Visualizing Network Optimization Model Results Using Python

2019-08-25T00:00:00+00:00

A quick way to validate network optimization model results is to create an optimal flows map, which shows flows between sources and destinations. This post explains how to create such visualizations using Python.

The Greenfield Algorithm uses customer locations with annual demand as an input and calculates allocation of distribution centers to customers. Distribution center and customer locations, and optimal flows maps can be used to visualize model inputs and outputs. Figures 1 and 2 show these maps.


Figure 1: Locations map


Figure 2: Optimal flows map

Python can be used to create location and optimal flows maps quickly. This helps modelers to validate model inputs and results.

Application

We use results from the Greenfield analysis to build two maps:

Locations map: Consists of distribution center and customer locations.
Optimal flows map: Consists of distribution center, customer locations, and flows between those points.

In the Python code, we first initiate libraries and define colors and shapes lists. Then, we read the results data.

import pandas as pd
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()

colors = ['rgb(0, 128, 155)', 'rgb(255, 128, 0)', 'rgb(191, 2, 2)', 'rgb(0, 175, 181)', 'rgb(0, 181, 78)', 'rgb(181, 175, 0)', 'rgb(130, 0, 181)', 'rgb(230, 0, 195)', 'rgb(201, 67, 0)']
shapes = ['circle', 'triangle-down', 'square', 'diamond', 'square', 'cross']

algorithm_results = pd.read_csv(r'https://raw.githubusercontent.com/emrahcimren/Greenfield_Bluefield_With_Weighted_Kmeans/v1.1/data/results/customers_with_clusters.csv')
algorithm_results = algorithm_results[(algorithm_results['NUMBER_OF_CLUSTERS']==9) & (algorithm_results['ITERATION']==10)]
filter_paths = (algorithm_results['CLUSTER'] == 3) & (algorithm_results['CUSTOMER_NAME'] == 'Customer 87')
algorithm_results = algorithm_results[~filter_paths]

Locations Map

The locations map consists of a base map plus distribution center and customer locations. We create location points as follows in Python.

point_locations_customers = algorithm_results[['CUSTOMER_NAME', 'LATITUDE', 'LONGITUDE', 'DEMAND']].drop_duplicates().rename(columns={'CUSTOMER_NAME': 'LOCATION_NAME', 'DEMAND': 'LOCATION_WEIGHT'})
point_locations_customers['LOCATION_TYPE'] = 'CUSTOMER'
point_locations_customers['LOCATION_WEIGHT_FACTOR'] = 30
point_locations_customers['ADJUST_MARKER_SIZE'] = True

point_locations_warehouses = algorithm_results.groupby(['CLUSTER', 'CLUSTER_LATITUDE', 'CLUSTER_LONGITUDE'], as_index=False).agg({'DEMAND': 'sum'}).rename(columns={'CLUSTER': 'LOCATION_NAME', 'CLUSTER_LATITUDE': 'LATITUDE', 'CLUSTER_LONGITUDE': 'LONGITUDE', 'DEMAND': 'LOCATION_WEIGHT'})
point_locations_warehouses['LOCATION_TYPE'] = 'DISTRIBUTION CENTER'
point_locations_warehouses['LOCATION_WEIGHT_FACTOR'] = 50
point_locations_warehouses['ADJUST_MARKER_SIZE'] = False

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
point_locations = pd.concat([point_locations_customers, point_locations_warehouses])

Figure 3 shows point locations data.


Figure 3: Point locations data from Python

The following function adds marker size, color, and shape to each location point.

def add_shapes_and_colors_to_locations_for_visualization(locations, colors, shapes):
    '''
    Function to add marker sizes, colors, and shapes to locations
    :param locations:
    :param colors: List of colors
    :param shapes: List of shapes
    :return: Updated locations
    '''

    location_types = locations['LOCATION_TYPE'].unique()

    locations_list = []
    for idx_loc, location_type in enumerate(location_types):

        by_location_type = locations[locations['LOCATION_TYPE'] == location_type]
        maximum_weight_factor = by_location_type['LOCATION_WEIGHT_FACTOR'].mean() / by_location_type[
            'LOCATION_WEIGHT'].max()

        for _, location in by_location_type.iterrows():

            if location['ADJUST_MARKER_SIZE']:
                marker_size = location['LOCATION_WEIGHT'] * maximum_weight_factor
            else:
                marker_size = location['LOCATION_WEIGHT_FACTOR']

            locations_list.append({
                'LOCATION_NAME': location['LOCATION_NAME'],
                'LOCATION_TYPE': location['LOCATION_TYPE'],
                'HOVER_TEXT': location['LOCATION_NAME'],
                'LATITUDE': location['LATITUDE'],
                'LONGITUDE': location['LONGITUDE'],
                'LOCATION_WEIGHT': location['LOCATION_WEIGHT'],
                'MARKER_SIZE': marker_size,
                'MARKER_COLOR': colors[idx_loc],
                'MARKER_SHAPE': shapes[idx_loc]
            })

    return pd.DataFrame.from_records(locations_list)

point_locations = add_shapes_and_colors_to_locations_for_visualization(point_locations, colors, shapes)

We use the following visualization function to create maps from the location points. In the function, we define the point locations from their latitudes and longitudes, and then the map layout. The resulting location map is saved to an .html file. Figure 4 shows the location map output.

def visualize_points_and_flows(point_locations, paths, map_title, scope, output_html_file):
    '''
    Function to visualize points and flows
    :param point_locations: Point to be visualized with latitude and longitude
    :param paths: From to flows
    :param map_title: Title
    :param scope: Region name; europe, north america
    :param output_html_file: name of the output file
    :return:
    '''

    locations = [dict(
        type='scattergeo',
        locationmode='country names',
        lon=point_locations['LONGITUDE'],
        lat=point_locations['LATITUDE'],
        hoverinfo='text',
        text=point_locations['HOVER_TEXT'],
        mode='markers',
        marker=dict(
            size=point_locations['MARKER_SIZE'],
            color=point_locations['MARKER_COLOR'],
            symbol=point_locations['MARKER_SHAPE'],
            line=dict(
                width=5,
                color='rgba(68, 68, 68, 0)'
            ),
        ))]

    layout = dict(
        title=dict(text=map_title, font=dict(size=30)),  # replaces the deprecated `titlefont` attribute
        showlegend=False,
        autosize=True,
        hovermode='closest',
        geo=dict(
            scope=scope,
            showframe=False,
            projection=go.layout.geo.Projection(type='azimuthal equal area', scale=15),
            center={'lat': point_locations['LATITUDE'].mean(), 'lon': point_locations['LONGITUDE'].mean()},
            showland=True,
            landcolor='rgb(243, 243, 243)',
            countrycolor='rgb(204, 204, 204)',
            showcountries=True
        ),

    )

    if paths is None:
        return plotly.offline.plot({"data": locations, "layout": layout}, filename='{}.html'.format(output_html_file))
    else:
        return plotly.offline.plot({"data": locations + paths, "layout": layout},
                                   filename='{}.html'.format(output_html_file))

visualize_points_and_flows(point_locations, None, 'Distribution Center and Customer Locations with Demand', 'europe', 'point_visualization')


Figure 4: Location map

Optimal Flows Map

We visualize source-destination flows using the optimal flows map. The source-destination flows are created from the Greenfield analysis results, as in Figure 5.

flows = algorithm_results[['CLUSTER', 'CLUSTER_LATITUDE', 'CLUSTER_LONGITUDE', 'CUSTOMER_NAME', 'LATITUDE', 'LONGITUDE', 'WEIGHTED_DISTANCE']].rename(columns={'CUSTOMER_NAME': 'DESTINATION_NAME', 'LATITUDE': 'DESTINATION_LATITUDE', 'LONGITUDE': 'DESTINATION_LONGITUDE', 'CLUSTER': 'SOURCE_NAME', 'CLUSTER_LATITUDE': 'SOURCE_LATITUDE', 'CLUSTER_LONGITUDE': 'SOURCE_LONGITUDE', 'WEIGHTED_DISTANCE': 'PATH_WEIGHT'})


Figure 5: Flows data

Marker colors in the location data are updated using the following function, so that each customer takes the color of the distribution center that serves it.

def update_locations_colors_for_flow_visualization(flows, locations, colors):
    '''
    Function to update colors for the flow map
    :param flows: From to locations
    :param locations: Point locations
    :param colors: Plot colors
    :return: Update locations and mapped colors to sources
    '''

    color_base_column = 'LOCATION_TYPE'
    color_base_value = 'DISTRIBUTION CENTER'

    color_bases = locations[locations[color_base_column] == color_base_value]
    color_bases = color_bases.sort_values(['LOCATION_NAME'])
    color_bases = pd.DataFrame(
        {'SOURCE_NAME': color_bases['LOCATION_NAME'], 'MARKER_COLOR': colors[:len(color_bases['LOCATION_NAME'])]})

    for _, color_base in color_bases.iterrows():
        by_flow = flows[flows['SOURCE_NAME'] == color_base['SOURCE_NAME']]
        by_flow_list = by_flow['DESTINATION_NAME'].tolist() + [color_base['SOURCE_NAME']]
        filter_paths = locations['LOCATION_NAME'].isin(by_flow_list)
        locations.loc[filter_paths, 'MARKER_COLOR'] = color_base['MARKER_COLOR']

    # locations is already a DataFrame, so it is returned directly
    return locations, color_bases

point_locations, color_bases = update_locations_colors_for_flow_visualization(flows, point_locations, colors)

After updating marker colors in the location data, we also add marker colors to the flows data. The flows data is used to generate the path layer of the optimal flows map.

flows['SOURCE_NAME'] = flows['SOURCE_NAME'].astype(str)
color_bases['SOURCE_NAME'] = color_bases['SOURCE_NAME'].astype(str)
flows = flows.merge(color_bases, how='left', on=['SOURCE_NAME'])
flows['PATH_WEIGHT_FACTOR'] = 5

def create_paths(flows):
    '''
    Path layer to visualize source-destination flows on the map
    :param flows:
    :return: paths
    '''

    maximum_weight_factor = flows['PATH_WEIGHT_FACTOR'].mean() / flows['PATH_WEIGHT'].max()

    paths = []
    for _, from_to_flow in flows.iterrows():
        paths.append(
            dict(
                type='scattergeo',
                locationmode='country names',
                text='from {} to {}'.format(from_to_flow['SOURCE_NAME'], from_to_flow['DESTINATION_NAME']),
                lon=[from_to_flow['SOURCE_LONGITUDE'], from_to_flow['DESTINATION_LONGITUDE']],
                lat=[from_to_flow['SOURCE_LATITUDE'], from_to_flow['DESTINATION_LATITUDE']],
                mode='lines',
                line=dict(
                    width=from_to_flow['PATH_WEIGHT'] * maximum_weight_factor,
                    color=from_to_flow['MARKER_COLOR'],
                ),
                opacity=0.5,
            )
        )

    return paths

paths = create_paths(flows)

Finally, the following call creates the optimal flows map shown in Figure 6.

visualize_points_and_flows(point_locations, paths, 'Optimal Flows', 'europe', 'flow_visualization')


Figure 6: Optimal flows map

Weighted Clustering with Minimum-Maximum Cluster Sizes, Greenfield Analysis

2019-08-23T00:00:00+00:00

This post provides a center-of-gravity-based algorithm for greenfield analysis. The algorithm is based on k-means clustering enhanced with optimization.

Greenfield Analysis

Greenfield analysis is a quick way to identify optimal distribution center locations for a given demand network. The analysis answers the following questions:

Where should distribution centers be geographically located to minimize cost?
Which customers will be supplied from each distribution center?

Problem

We consider a network of customers where demand for each customer is satisfied by a distribution center. Figure 1 shows the customer locations and corresponding annual demand.


Figure 1: Customer demand map

It is assumed that any location can be selected for a distribution center location.

We answer the following questions as a result of the analysis:

How many distribution centers are required?
Where should each distribution center be located?
How should customers be allocated to the distribution centers?

Algorithm

The algorithm is based on the center of gravity approach and selects the distribution center locations such that the total weighted distance to customers is minimized.

We first provide the following definitions related to the distribution network.

Let $n$ be the total number of distribution centers. Let $D=\{1,\dots,n\}$ be the set of distribution centers and $\hat{D}_i$ be the set of customers allocated to distribution center $i\in D$. Each distribution center $i\in D$ has a latitude and a longitude, $\phi^{d}_i$ and $\lambda^{d}_i$, respectively. Let $\alpha_i$ be the maximum distance covered by center $i \in D$. Let $u^{-}_i$ and $u^{+}_i$ be the minimum and maximum number of customers that can be allocated to center $i \in D$, respectively.

Let $C$ be the set of customers. Similar to distribution centers, each customer $j\in C$ has latitude and longitude, $\phi^{c}_j$ and $\lambda^{c}_j$ , respectively. Let $d_j$ be the total demand for the customer $j\in C$ .

Let $\Delta_k$ be the total weighted distance at iteration $k$ .

Let $a_{ij}$ be the distance from center $i \in D$ to customer $j \in C$. Each $a_{ij}$ is computed with the Haversine formula, which gives the shortest distance between two points on a sphere, measured along the surface, from their latitudes and longitudes. You can find a more detailed explanation on the Wikipedia page.
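As an illustration of how each $a_{ij}$ could be computed, here is a minimal Haversine sketch in Python. The function name, the argument order, and the Earth radius in miles are assumptions made for this example; the repository may implement the distance differently.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; not taken from the post


def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_MILES):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))
```

Calling `haversine_distance(phi_d, lambda_d, phi_c, lambda_c)` for every center and customer pair would fill the distance matrix used by the allocation model.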

The algorithm consists of the following steps:

Step 0: Set $k=0$ . For given $D=\{1, \dots, n\}$ , create $\hat{D}_i$ randomly. Let $\Delta_0 = \sum_{i\in D}\sum_{j\in \hat{D}_i}d_ja_{ij}$ .

Step 1: Set $k=k+1$. Given $u^{-}_i$, $u^{+}_i$, $\alpha_i$, and $a_{ij}$ for all $i \in D$ and $j \in C$, run the binary allocation model to determine $\hat{D}_i$. Let $\Delta_k = \sum_{i\in D}\sum_{j\in \hat{D}_i}d_ja_{ij}$.

Step 2: If $\Delta_k \ge \Delta_{k-1}$, then stop. Otherwise, set $\phi^{d}_i = \frac{\sum_{j \in \hat{D}_i} \phi^{c}_j}{|\hat{D}_i|}$ and $\lambda^{d}_i = \frac{\sum_{j \in \hat{D}_i} \lambda^{c}_j}{|\hat{D}_i|}$ $\forall i\in D$, and return to Step 1.
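The loop in Steps 0 to 2 can be sketched in a few lines of Python. This is a simplified illustration, not the repository code: it assumes planar coordinates instead of the Haversine distance, and it swaps the binary allocation model for a nearest-center rule that ignores the size and distance limits. All function names are hypothetical.

```python
import math
import random


def distance(p, q):
    """Planar stand-in for a_ij (assumption; the post uses the Haversine distance)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def nearest_center_allocation(centers, customers):
    """Stand-in for the binary allocation model: assign each customer to its closest center."""
    return [min(range(len(centers)), key=lambda i: distance(centers[i], c)) for c in customers]


def total_weighted_distance(centers, customers, demand, allocation):
    """Delta_k: sum of d_j * a_ij over the current allocation."""
    return sum(demand[j] * distance(centers[allocation[j]], customers[j])
               for j in range(len(customers)))


def greenfield(customers, demand, centers, allocate=nearest_center_allocation,
               max_iterations=100, seed=0):
    rng = random.Random(seed)
    centers = list(centers)
    # Step 0: random allocation and Delta_0
    allocation = [rng.randrange(len(centers)) for _ in customers]
    delta_prev = total_weighted_distance(centers, customers, demand, allocation)
    history = [delta_prev]
    for _ in range(max_iterations):
        # Step 1: re-allocate customers for the current centers and compute Delta_k
        candidate = allocate(centers, customers)
        delta = total_weighted_distance(centers, customers, demand, candidate)
        history.append(delta)
        # Step 2: stop as soon as the total weighted distance no longer improves
        if delta >= delta_prev:
            break
        allocation, delta_prev = candidate, delta
        # Otherwise move each center to the mean coordinates of its customers
        for i in range(len(centers)):
            members = [customers[j] for j in range(len(customers)) if allocation[j] == i]
            if members:  # an empty cluster keeps its previous location
                centers[i] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, allocation, history
```

Passing a different `allocate` callable, for example one that solves the binary program with the cluster size limits, would bring the sketch closer to the algorithm described above.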

Allocation of Customers to Distribution Centers

Once cluster centers are determined, we apply a binary model to allocate customers to distribution centers.

In addition to the parameters defined in the Algorithm section, let $y_{ij}$ be a binary variable that equals $1$ if customer $j \in C$ is allocated to center $i \in D$, and $0$ otherwise.

The following is the binary program for allocating customers to distribution centers.
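Written with the symbols defined above, a formulation consistent with the description of $(1)$ to $(6)$ given next would look as follows. It is reconstructed from that prose, so the exact form of each constraint, in particular the distance constraint $(3)$, is an assumption.

$$
\begin{aligned}
\min \quad & \sum_{i\in D}\sum_{j\in C} d_j a_{ij} y_{ij} && && (1)\\
\text{s.t.} \quad & \sum_{i\in D} y_{ij} = 1 && \forall j\in C && (2)\\
& a_{ij}\, y_{ij} \le \alpha_i && \forall i\in D,\ j\in C && (3)\\
& \sum_{j\in C} y_{ij} \ge u^{-}_i && \forall i\in D && (4)\\
& \sum_{j\in C} y_{ij} \le u^{+}_i && \forall i\in D && (5)\\
& y_{ij}\in\{0,1\} && \forall i\in D,\ j\in C && (6)
\end{aligned}
$$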

The objective $(1)$ minimizes the total weighted distance. Constraint $(2)$ ensures that each customer is allocated to exactly one center. Constraint $(3)$ limits the distance that each center can cover. Each center has a minimum and a maximum number of allocated customers, as in $(4)$ and $(5)$, respectively. The allocation variables are binary, as in $(6)$.

Implementation

The algorithm is implemented in Python. Google OR-Tools is used to solve the allocation problem.

You can find the source code at the Greenfield_With_Weighted_Kmeans repository on GitHub.

Application

The algorithm is applied to the given problem. We run the algorithm for $n=3,\dots,19$. Figure 2 shows the results.


Figure 2: Results for $n=3,\dots,19$

Total weighted distance decreases as the number of clusters increases. More specifically, from $n=3$ to $n=7$, the total weighted distance is reduced by 39%. We find that $n=9$ is the best configuration, since the objective function improves only slightly for $n>9$ and opening a new distribution center is costly.

Figure 3 shows the iteration steps for $n=9$. The algorithm converges quickly: within the first three iterations, the total weighted distance falls by $62\%$.


Figure 3: Algorithm iterations for $n=9$

The allocation of customers to distribution centers is shown in Figure 4 for $n=9$ and $k=10$. On average, each distribution center covers customers within a radius of $300$ miles.


Figure 4: Optimal flow map for $n=9$ and $k=10$

Figure 5 shows cluster statistics for $n=9$. We bound the cluster size to $\pm 20\%$ of the average cluster size, which gives a minimum of $16$ and a maximum of $25$ customers per cluster for $n=9$. Cluster 8 has the lowest average weighted distance of all clusters, as well as the lowest average distance.
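The cluster size bounds could be derived along the following lines. The function name and the rounding rule (round the lower bound down and the upper bound up) are assumptions, and the customer count of 184 in the usage line is a hypothetical value picked only because it gives bounds of 16 and 25; the post does not state the actual count.

```python
import math


def cluster_size_bounds(number_of_customers, number_of_clusters, tolerance=0.20):
    """Return (minimum, maximum) cluster size within +/- tolerance of the average size."""
    average_size = number_of_customers / number_of_clusters
    # Assumed rounding: widen the interval so an integer-feasible split always exists
    minimum_size = math.floor((1 - tolerance) * average_size)
    maximum_size = math.ceil((1 + tolerance) * average_size)
    return minimum_size, maximum_size


# Hypothetical customer count, for illustration only
minimum_size, maximum_size = cluster_size_bounds(184, 9)
```

These two values would play the role of $u^{-}_i$ and $u^{+}_i$ in the allocation model when every center shares the same limits.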


Figure 5: Cluster statistics for $n=9$

Future Work

The algorithm can be generalized to start with a given $\hat{D}_i$ $\forall i\in D$ instead of creating $\hat{D}_i$ randomly. This would help improve solution quality and would also cover broader network design problems than greenfield analysis.

Hello

2019-08-22T00:00:00+00:00

First post…