Assessment 2 - Weighted Redistribution

In which you apply the skills that you have learned to reproduce an algorithm from the literature

Introduction

Ever since the emergence of large scale web services for geocoding and plotting data, there has been a proliferation of ‘heatmaps’ seeking to reveal the spatial distribution of pretty much any pnenomena that mappers could get their hands on. You can find these maps everywhere!

You should always be wary of heatmaps for a number of reasons, chief amongst which is that changing the settings of the heatmap can quite dramatically change the results, and there is rarely any justification for the choice of those settings. However, the worst heatmaps occur when the underlying data are at multiple scales (i.e, multiple levels of generalisation), which often happens when they are passively geocoded (meaning that the locations were derived from place names in the dataset, which were not originally intended to be used to locate the data on a map).

The variation in scale in passively geocoded datasets means that they tend to suffer from an issue called false hotspots, in which the heatmaps unexpectedly give false patterns. You can see this effect in the below image, where a suffers from false hotspots; and b is the same dataset, restored to a pattern more reflective of the ‘true’ location of each data point.

Weighted Redistribution

This problem is discussed (in context of Twitter data relating to the Royal Wedding) of Prince William and Kate Middleton by a paper by Dr. Jonny Huck in the Journal of Spatial Information Science. The paper includes a simple algorithm, which allows false hotspots to be dissipated into realistic patterns based on a weighting surface (population density in this case). In general terms, Weighted Redistribution could be considered an example of a spatial disaggregation algorithm (of which there are many examples), but this one is quite unusual in that it disaggregates data into a surface, rather than smaller aerial units.

Download and read the article, making sure that you understand the problem of False Hotspots and the principles of the Weighted Redistribution algorithm that may be used to solve it.

The Question

Produce your own implementation of the weighted redistribution algorithm and apply it to the provided ‘tweets’ dataset in order to determine which parts of Greater Manchester were most interested in the Royal Wedding as part of a scheme to target advertising during such events.

You will have a random sample of 1,023 level 3 tweets relating to the Royal Wedding that have been geocoded to districts in Greater Manchester. Using these and the weighted redistribution algorithm, you should work out a likely spatial distirbution for Twitter activity across Manchester relating to the Royal Wedding.

Produce a 1,000 word report explaining and justifying your implementation of the weighted redistribution algorithm to the level 3 Twitter data in order to produce your map.

The analysis and outputs should be completed entirely in Python and the report should include at least one map. You must submit both your code and report.

Make sure that you read this whole document before you start!

The datasets for the Assessment are available here. This contains:

  • level3-tweets-subset the tweets themselves
  • 100m_pop_2019: a weighting surface (simple population density surface at 100m resolution)
  • gm-districts: polygons representing the districts (level 3) of Greater Manchester to which these tweets were geocoded.

Please note that you are not limited to the data provided, nor are you required to use the provided weighting surface - I would encourage you to include any data that you like to embellish the analytical or cartographic quality of your report.

Contents

Rules of Engagement

Submission

This work must be submitted by 14:00 on Thursday 19th December 2024 (Week 13).

You must submit your code and a report of up to up to 1000 words in length (±10%) describing the choices that you made in order to implement this algorithm in an elegant, efficient and robust manner.

Code snippets can be used in the report where appropriate (and are not included in the word count) - but they should be styled using an appropriate font (I recommend courier). Remember that I do not just want a line by line description of your code (that is what your comments are for)!

Note that you must submit your assessment as described here - not having read the below instructions will not be accepted as a reason for late or incorrect submission

  1. The Code part of your submission must be compressed into a .zip (not *.rar or any other format) file and submitted via a Dropbox File Request: - links provided below. The filename should be named using your student number in the format 123456789.zip.
    1. Undergraduates should submit here.
    2. Postgraduates should submit here.
  2. The Report part of your submission must be submitted as usual via the Turnitin on the Assessment page of the Blackboard site. The report should contain the map(s) that you have produced. Submitted files should be named in the format: 123456789.docx or 123456789.pdf etc.

Code Template

In order to assess the efficiency of your algorithm, I need to be able to time how long your code takes to execute the task. Accordingly, you must use the below template for your code, and ALL of your code must be between the # NO CODE ABOVE HERE and # NO CODE BELOW HERE comments.

"""
Understanding GIS: Assessment 2
@author [INSERT STUDENT NUMBER HERE]

An Implementation Weighted Redistribution Algorithm (Huck et al. 2015)
"""
# import the time library
from time import time

# set start time
start_time = time()	# NO CODE ABOVE HERE

'''ALL CODE MUST BE INSIDE HERE'''

# report runtime
print(f"completed in: {time() - start_time} seconds")	# NO CODE BELOW HERE

In addition to this, please do not use any libraries that are not already available within the understandinggis anaconda environment.

Zip File Structure

Remember that your submission must include all files that are required for your code to run successfully must be included (otherwise your code won’t work!).

I would recommend the following file structure within your main Assessment folder:

- 123456789/
	- 123456789.py
	- data/
		- 100m_pop_2019.*
		- gm-districts.*
		- level3-tweets-subset.*
		- [any other data files that you need]
	- out/
		- manchester_tweets.png

Please make sure that you name both your directory and python file using your student number!

It is also important that all of your file paths are relative file paths (e.g. ./data/my_file.shp). If you use absolute file paths (e.g. C:/jimbob/understandinggis/assessment1/data/my_file.shp) then your code will not work when I run it! Please also make sure that the paths link to the data files inside your file, and not to elsewhere on your computer!

A good way to ensure that this has worked is to either:

  • write your code in a new directory that is not inside your understandinggis directory - this will avoid accidentally pointing your to the wrong version of your datasets (i.e. the one that you use in the practicals)

or:

  • write your code inside your understandinggis directory as normal, then just before submission extract your zip file to a location elsewhere on your machine and test to see if it still works

If you don’t know how to zip a folder, you can see how to do so here:

Please make sure that you use .zip format (as per the instructions) - and not other forms of compression (e.g. .rar, .tar, .gz, .7z etc.).

Some Pointers

Remember the golden rule: Don’t Panic! You have done almost every part of this assessment in the course already, this is just a matter of finding the right bits and putting them together! The Pseudocode for this assessment is given in the paper, so all you have to do is implement it.

As with Assessment 1 - the key is to plan what you are going to do before you start coding - don’t just dive in and hope for the best! All you need to do is keep breaking down each stage into smaller and smaller jobs until you have a clear idea of how the program should look (think back to our session on computational thinking). As with Assessment 1, everything that you need to be able to do has been covered in the course before, so if you find yourself heading too far off the beaten track, then this is a good clue that you might be overcomplicating things!

As with before, you can reproduce this algorithm entirely using things that you have already done in this course, and additional help is available in the Hints and Tips section below. Remember - break each part of the algorithm down into small steps again and again until each step equates to approximately one line of code.

Note that you are only undertaking this analysis for a single level, so you can ignore the outer (first) loop described in the pseudocode in the paper (which loops through multiple levels).

The eagle-eyed of you will note from the paper that the original software is Open Source (written in Java) and published here. You are welcome to look at this to help you (this is the file with the algorithm in it if you want it) - but please don’t be too reliant on this code, or feel bad if you struggle to read it (much of it probably isn’t that useful to you, and it is in a completely different and much lower level language). Focus on the approaches that you have learned in this course, and remember to maximise efficiency wherever possible! Crucially, do not attempt to simply copy and paste this code and convert it to Python - this will not result in a good mark.

Report

In the report you should briefly explain why the algorithm is necessary and how it works, before providing justification for the approaches that you have taken in your implementation, and a discussion around the limitations of the algorithm and your implementation of it. In all cases, we are looking for you to demonstrate that you have understood both the algorithm and what you have done to implement it.

Creating a Random Point (Minimum Bounding Radius)

In the original version, random points are generated using a random distance and direction from the centroid of polygon. You will notice that this is quite different to the approach given in Understanding Distortion, which simply generates a random x an y coordinate based on the bounding box of the feature. Both are equally effective, and there are no more or less marks for either approach - though the bounding box-based approach from Understanding Distortion is likely easier, and perhaps slightly more efficient.

If you do want to use the minimal bounding radius approach, it is achieved simply by the following steps:

  1. Get the centroid of the polygon
  2. Loop through each node in the polygon to find the distance from the centroid to the furthest away node
  3. Generate a random number between 0 and 360 (direction) and between 0 and the distance that you calculated in step 2
  4. Calculate the location of the new point at the specified distance and direction from the centroid - you can do this using the offset() function that we used in week 5

Code Presentation

Broadly speaking, your code should be presented in the following form (all of which should be contained in the appropriate part of the template, marked by '''ALL CODE MUST BE INSIDE HERE'''):

  1. import statements
  2. functions (if any)
  3. main code

You code should be well commented (remember the rule of thirds!) and do not leave in any unnecessaary testing material, such as print() statements that do not contribute to the user experience.

Also make sure that you are using all of the libraries that you have imported - if you load a lot of unnecessary libraries, this demonstrates poor understanding (as well as slowing down your code)!

Error Messages

Remember, when you get an error message in the console, it is not the computer telling you off or making fun of you - it is trying to help you! If you read it carefully it will tell you both what the problem is and even which line of code is causing it - so don’t get upset when you get a lot of errors, find the line of code that is causing the problem, read the description (Paste it into Google if you don’t understand it) and see if you can solve it - you’ll probably find that in most cases you can!

It is also worth remembering that the error message is extremely unlikely to be incorrect (though it can have been caused by another problem just above it); and it is extremely unlikely that there is an error in one of the libraries in the understandinggis environment. If, for example, it says it can’t find a file and you are sure that it’s there, then either your file path or the working directory (top right hand corner of Spyder) is wrong.

The fact is, programming is 20% writing code and 80% debugging it. This doesn’t change as you get more experienced, the errors just get more complicated!!

Hints & Tips

  1. This is not like an essay that you can bash out the night before - the longer you give yourself to complete this, the easier you will make it for yourself.
  2. Be sure to select your parameters (such as \(n\) and \(r\)) for the algorithm carefully - the paper explains how they affect the analysis, so use this knowledge and some experimentation to get the best result for your client - you will definitely want to explain the variables and your selected values to your client.
  3. Note that you are only undertaking this analysis for a single level, so you can ignore the outer (first) loop described in the pseudocode in the paper (which loops through multiple levels).
  4. Look back to the lectures and practical material, and make use of the Hints and Solutions pages.
  5. If you do feel like you are starting to panic - don’t suffer in silence - let me know!

Marking Criteria

Marks will be given based upon:

  • The quality of the report (35% of assignment grade), including:
    • A clear and concise explanation of what you have done and why, demonstrating an understanding of the process and why it is necessary. Consider if you might illustrate the process most effectively using more than one map. You should cover both:
      • The algorithm itself (presented in the paper)
      • Your implementation of it (how you achieved this in Python)
    • A detailed explanation of the limitations of your analysis, including a clear demonstration of the understanding of the Geographical Information Science issues involved.
    • In both of the above, you would expect clear links to the material that we have covered in class as well as the Huck et al. 2015 paper.
    • A brief justification of your chosen CRS.
  • The quality of the algorithm (35% of assignment grade), including:
    • Efficiency (the speed with which your algorithm resolves the problem)
    • Robustness (the script will not fail if it encounters ’normal’ problems, such as incorrect file paths, missing data, unexpected inputs, and so on).
  • The quality of the code (20% of assignment grade), including:
    • Neatness and Elegance (the script is well written and well presented)
    • Comments (demonstrating a thorough understanding of the approaches that you have used)
  • The cartographic quality of the resulting map(s) (10% of assignment grade), including:
    • Aesthetic quality and readability
    • Selection (and justification) of a suitable projection

The key to this is in the name of the course: In both your code and report I want you to demonstrate a clear Understanding of what you have done - this is why comments are so important in your code!

Finished!