Transforming Geospatial Data for Better Model Performance

An experimental approach in feature-engineering latitude and longitude

keanshern kok
Geek Culture



Introduction

I recall being completely stumped during my undergraduate years when given a dataset that contained longitude and latitude. My youthful brain simply could not fathom the utility of coordinates. Perhaps it was the insurmountable backlog of assignments and assessments that was clouding my mind. Fast forward a couple of years and I’m now ready to wring out every last drop of value these variables have to offer.

Experimental Approach

Eliminating confounding factors to protect experimental accuracy.

The Data

I downloaded the Airbnb listings dataset from “Inside Airbnb”. The first thing I did was to import the data into a Pandas dataframe and drop all columns except for price (target variable), latitude/longitude (predictor variables under scrutiny) and neighbourhood_group (predictor variable for comparison).

Coordinates are inherently domain-dependent attributes. What I mean is that the extent to which I can conduct feature engineering is bound by how much I know about the location the coordinates refer to. As such, I opted for the Singapore Airbnb listings dataset, as Singapore was the country I was most familiar with among those available.

Exploratory Data Analysis (EDA) & Data Cleaning

The next thing I did was to examine the distribution for the variable price. Using the code below gave me the report as shown in the image.

# Define the quantiles for analysis.
quantile_range = [0.8, 0.85, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]

print("Price (SGD) and skewness from 80th percentile onwards:")

# Each iteration prints the price (rounded) and skewness for the
# given quantile, q.
for q in quantile_range:
    listings_q = listings.loc[(listings['price'] < listings['price'].quantile(q)) & (listings['price'] != 0)]
    quantile = str(q)
    skewness = str(listings_q['price'].skew())
    price_quantile = str(round(listings['price'].quantile(q)))
    print("\t".join([quantile, price_quantile, skewness]))
Distribution report for price (made by the author)

I wanted to eliminate the listings that were significantly more expensive than the rest, yet preserve as much data as possible. I decided to keep only the listings up to the 98th percentile for price as a good trade-off between data quantity and quality. None of the three columns contained missing values.

Lastly, I wanted to use neighbourhood_group as a point of comparison. As shown in the bar chart, a large majority of listings were located in Singapore’s central region. This isn’t surprising, given that it houses the Central Business District. As part of the test, I also created a derived variable that groups neighbourhood_group into two classes: Central Region and Other. These nominal categorical variables were one-hot-encoded in preparation for modeling.

Bar Chart depicting distribution of listings by region (made by the author)

Identifying Point of Interest (POI)

Instead of using latitude and longitude as independent attributes, I used them to measure the distance between the listing and some point of interest (POI). Selecting the right POIs was a critical factor in how well the distance metrics could help in predicting accurate prices. One idea I had was that I could use a consumer’s reason for stay as a way of pinpointing POIs.

I did some reading and came up with two general reasons consumers use Airbnb:

  1. Holiday
  2. Business trip

Popular travel website Tripadvisor ranked Gardens by the Bay as the top tourist spot in Singapore (under “Things to do”) based on a number of metrics. This forms our first POI which has the following coordinates: (1.2815737, 103.8614245).

Things to do ranked using Tripadvisor data including reviews, ratings, photos, and popularity.

Shenton Way is a financial and business hub in Singapore. This location serves as our second POI with the coordinates (1.2760995,103.8460012). However, these two locations are only a 9-minute drive away from each other. It will be interesting to see if there are significant differences in model accuracy when using one over the other.

Algorithm and Evaluation Metrics

I used Simple Linear Regression to model the relationship between each predictor (distance or neighbourhood group) and price.

For model evaluation, I used Mean Absolute Error (MAE) and the Coefficient of Determination (R-squared). Training data was sampled using k-fold cross-validation with 5 splits.

from sklearn.model_selection import KFold, cross_validate
from sklearn.linear_model import LinearRegression

X = df[[<predictor>]]
y = df[[<target>]]

kf = KFold(n_splits=5, random_state=None)
model = LinearRegression()
scores = cross_validate(model, X, y, scoring=['neg_mean_absolute_error', 'r2'], cv=kf)

print('Average MAE: ', sum(-1 * scores['test_neg_mean_absolute_error']) / 5)
print('Average R-squared: ', sum(scores['test_r2']) / 5)

Feature Engineering the Predictor Variables

Code-based explanation for feature transformations.

One-hot-encoding and Grouping

I one-hot-encoded neighbourhood_group using Pandas’ get_dummies() method.

# One-hot-encode neighbourhood_group, then drop the original column.
# Use concat to join the new columns with the original dataframe.
listings = pd.concat([listings, pd.get_dummies(listings['neighbourhood_group'], prefix='ng_')], axis=1)
listings = listings.drop(columns=['neighbourhood_group'])

As shown previously, the class “Central Region” in neighbourhood_group clearly dominates the dataset. I created a derived variable “isCentral” which has the values 1 (listing is in Central Region) and 0 (listing is not in Central Region).
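A minimal sketch of how such a flag could be derived (assuming the original neighbourhood_group column is still present; the toy dataframe here stands in for the real listings data):

```python
import pandas as pd

# Toy frame standing in for the listings data.
listings = pd.DataFrame({
    'neighbourhood_group': ['Central Region', 'North Region', 'Central Region']
})

# 1 if the listing sits in the Central Region, 0 otherwise.
listings['isCentral'] = (listings['neighbourhood_group'] == 'Central Region').astype(int)
print(listings['isCentral'].tolist())  # [1, 0, 1]
```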

Deriving Distance Between Locations

I was interested in testing two types of distances. Mathematical distances are the simplest way to measure the separation between two geolocations, as all you need is readily available Python libraries. However, getting from point A to point B rarely follows a straight line, so the actual road distance between two locations can differ significantly from the mathematical distance. Why, then, would one use mathematical distance instead of road distance? Road distance requires calling an API provided as a service (e.g. the Google Maps Distance Matrix API or the Distance Matrix API), which might incur costs depending on the number of requests you need to make.

The first distance metric that I derived from the coordinates was Euclidean Distance, which is simply the straight-line distance between two points in vector space. The implementation is shown below:

import numpy as np

# Coordinate points for the listing, Gardens and Shenton.
point_listing = np.array((row['ls_lat'], row['ls_long']))
point_gardens = np.array((1.2815737, 103.8614245))
point_shenton = np.array((1.2760995, 103.8460012))

# Calculate Euclidean Distance.
euc_dist_gardens = np.linalg.norm(point_listing - point_gardens)
euc_dist_shenton = np.linalg.norm(point_listing - point_shenton)

Haversine Distance is a measure of distance between two locations on the surface of a sphere. My code is adapted from this article.

# Calculating Haversine Distance.
def haversine_array(lat1, lng1, lat2, lng2):
    # Convert coordinates to radians.
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    avg_earth_radius = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    # Compute distance.
    d = np.sin(lat * 0.5)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5)**2
    h = 2 * avg_earth_radius * np.arcsin(np.sqrt(d))
    return h

Manhattan Distance is another popular distance metric in Machine Learning. In contrast to Euclidean Distance, it measures the distance between two points as the sum of the absolute differences of their Cartesian coordinates. This resource shows how to implement the formula; my version below uses plain Python built-ins.

# Calculate Manhattan Distance.
def manhattan_distance(coord_ori, coord_dest):
    return sum(abs(a - b) for a, b in zip(coord_ori, coord_dest))

Finally, I used the Distance Matrix API service to compute the road distance between locations. They have enterprise-level and personal-level subscription options. After subscribing, I got my token key which I saved in a text file. To send a GET request, you can populate the following code with your values.

import requests

# Retrieve distance matrix token key.
with open(<path to token key text file>, "r") as api_file:
    api_key = api_file.read().strip()

# Define API endpoint.
url = 'https://api.distancematrix.ai/maps/api/distancematrix/json?'
param_1 = 'origins='
ori = ','.join([<latitude>, <longitude>])
param_2 = '&destinations='
dest = ','.join([<latitude>, <longitude>])
param_3 = '&key='
endpoint = ''.join([url, param_1, ori, param_2, dest, param_3, api_key])

# Send GET request and retrieve response.
response = requests.get(endpoint)
response = response.text

The JSON structure of a response is shown below. The API computed the road distance in meters and duration in seconds.

{
  "destination_addresses": [
    "Central Area, Singapore"
  ],
  "origin_addresses": [
    "Block 745, 745 Woodlands Cir, Singapore 730745"
  ],
  "rows": [
    {
      "elements": [
        {
          "distance": {
            "text": "30 km",
            "value": 30032
          },
          "duration": {
            "text": "33 min",
            "value": 1985
          },
          "status": "OK"
        }
      ]
    }
  ],
  "status": "OK"
}

Two routes of the same distance might not always have the same travel duration. The number of traffic lights, type of road and other physical factors might influence travel duration. As such, I created a derived variable distance_over_time which is simply the distance (in meters) between a listing and a POI divided by the duration of travel (in seconds).
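As a sketch, the derived variable can be pulled out of a response like the sample above (the field paths follow that sample; a real pipeline should also check each element's status before dividing):

```python
import json

# Trimmed-down version of the sample response shown earlier.
response_text = '''{"rows": [{"elements": [{"distance": {"value": 30032},
    "duration": {"value": 1985}, "status": "OK"}]}], "status": "OK"}'''

payload = json.loads(response_text)
element = payload['rows'][0]['elements'][0]

# Distance (m) divided by travel time (s): effectively an average speed.
distance_over_time = element['distance']['value'] / element['duration']['value']
print(round(distance_over_time, 2))  # 15.13
```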

Experimental Results

Experimental results (by the author)

Right off the bat, it was surprising to see that none of the geospatial predictor variables helped predict Airbnb prices, as shown by their negative R-squared values. R-squared is negative when a model does not follow the trend of the data (i.e. fits worse than a horizontal line).
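To see why R-squared can go below zero (toy numbers here, not the Airbnb data): a model that always predicts the target's mean scores exactly 0, while a model that predicts the opposite trend fits worse than that horizontal line and scores negative:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

# Always predicting the mean of the target gives R-squared of exactly 0.
y_mean = [2.5, 2.5, 2.5, 2.5]

# Predicting the opposite trend fits worse than a horizontal line,
# so R-squared goes negative.
y_bad = [4.0, 3.0, 2.0, 1.0]

print(r2_score(y_true, y_mean))  # 0.0
print(r2_score(y_true, y_bad))   # -3.0
```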

Benchmarked against neighbourhood_group, none of the mathematical distances provided any significant improvement (or deterioration). Similarly, no improvement in model performance was observed for road distance or road distance over time. Road distance measures did not provide any significant advantage over mathematical distance measures. Finally, isCentral did not improve model performance.

Conclusion

One theory that could explain the results is that Singapore is such a small country, and most of the listings were located in the Central Region anyway (which hosts the financial hubs and tourist attractions), that distance was simply a negligible factor.

Another possibility is that I chose the wrong POIs. Distance might be useful if I were to measure it to other POIs such as train stations or bus stops. Singaporeans are heavily dependent on public transportation, so this may be a worthwhile experiment to attempt.

All in all, this article provides some ideas on how to conduct feature engineering on coordinates. Negative results are still results so I hope you found it insightful in some way. For those well-versed in Machine Learning, leave some comments!
