Preparation

Overview

This project focuses on developing an application that generates one or more visualisations to simplify the presentation of large volumes of rapidly changing data, allowing the user to easily select, contextualise and analyse both live and historical data.

Specification

Below is a list of all of the requirements for this project, which have been split into the following sections:

Overview
Data collection
Data processing and storage
Output
Graphs
Other

Overview (based on the project description)

An extremely dynamic data source must be used.
The data must be processed in a meaningful way.
A visualisation or multiple visualisations must be produced.
The visualisations must maintain a view of the current data while also updating to show the newly collected data.

Data collection

Twitter's API will be used as the data source for this project.
A library will be used to connect to the API, reducing the risk of connection issues both initially and if the API changes.
Any filters used when collecting the data from Twitter will be easily modifiable so that the content of the dashboard output can be easily adjusted as and when required.
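One way to keep the filters easy to modify is to hold them in a small JSON document that is read when the collector starts. The sketch below is an assumption rather than the project's actual design: the keys `track` and `languages`, the file name in the comment and the function `load_filters` are hypothetical names chosen for illustration.

```python
import json

# Hypothetical filter settings; in practice this text would live in a file
# (for example filters.json) so it can be edited without touching the code.
EXAMPLE_FILTERS = """
{
    "track": ["Python", "data visualisation"],
    "languages": ["en"]
}
"""

DEFAULT_FILTERS = {"track": [], "languages": []}


def load_filters(text):
    """Parse filter settings, falling back to defaults for missing keys."""
    loaded = json.loads(text)
    # Copy the defaults so the module-level lists are never shared or mutated.
    filters = {key: list(value) for key, value in DEFAULT_FILTERS.items()}
    for key in DEFAULT_FILTERS:
        if key in loaded:
            # Normalise to a sorted list of lower-case strings, no duplicates.
            filters[key] = sorted({str(item).lower() for item in loaded[key]})
    return filters


filters = load_filters(EXAMPLE_FILTERS)
```

Changing what the dashboard shows would then only require editing the JSON text, not the collection code.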

Data processing and storage

Data collection and processing will be handled separately to ensure that data collection is not impacted by data processing.
The large volume of data gathered will be processed to reduce the amount of storage needed, while still retaining all of the data required by every visualisation that can be produced.
The data will be stored efficiently, so that excessive storage is not required to keep the application running.
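The separation of collection from processing, and the reduction of raw tweets to only what the graphs need, can be sketched with a queue between two threads. This is a minimal illustration under assumptions: the tweet dictionaries, the `timestamp` field and the function names `collect` and `process` are hypothetical, and a fixed list stands in for the live Twitter stream.

```python
import queue
import threading
from collections import Counter

_STOP = object()  # sentinel telling the processor that collection has ended


def collect(raw_tweets, out_queue):
    """Collector: hand each tweet over immediately and do no processing."""
    for tweet in raw_tweets:
        out_queue.put(tweet)
    out_queue.put(_STOP)


def process(in_queue, counts):
    """Processor: keep only a tweets-per-second tally, discarding the rest."""
    while True:
        tweet = in_queue.get()
        if tweet is _STOP:
            break
        counts[tweet["timestamp"]] += 1


# Stand-in data; a real collector would read from the Twitter stream instead.
sample = [
    {"timestamp": 100, "text": "a"},
    {"timestamp": 100, "text": "b"},
    {"timestamp": 101, "text": "c"},
]

buffer = queue.Queue()
per_second = Counter()
worker = threading.Thread(target=process, args=(buffer, per_second))
worker.start()
collect(sample, buffer)
worker.join()
```

Because the collector only enqueues, slow processing delays the tally but does not block the intake of new tweets. Storing one count per second instead of every tweet is one example of the storage reduction described above.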

Output

All data used for the visualisations will be retrieved from the application's own data store rather than directly from Twitter, so that data collection is not affected when many dashboards are running at once.
It should be possible to add each graph to a cell and update it using a single function call.
It should be possible to remotely access the dashboard.
It should be possible to have multiple dashboards running at once.
A visualisation container library will be created so that the output of the application is easily and highly configurable using a JSON file. A possible structure is shown below.
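As a hypothetical illustration of the kind of JSON configuration such a container library might read, the sketch below parses a layout of rows and cells. The field names `title`, `rows`, `cells`, `graph` and `width`, the graph identifiers and the function `parse_layout` are all assumptions made for this example and are not the project's actual schema. The 12-column limit mirrors Bootstrap's grid.

```python
import json

# Hypothetical layout description: rows of cells, each naming a graph type.
EXAMPLE_LAYOUT = """
{
    "title": "Twitter dashboard",
    "rows": [
        {"cells": [{"graph": "tweets_per_second", "width": 8},
                   {"graph": "tweets_per_location", "width": 4}]},
        {"cells": [{"graph": "topic_comparison", "width": 12}]}
    ]
}
"""

GRID_COLUMNS = 12  # assumed grid width, as in Bootstrap's 12-column grid


def parse_layout(text):
    """Return a list of rows, each a list of (graph, width) tuples."""
    layout = json.loads(text)
    rows = []
    for row in layout["rows"]:
        cells = [(cell["graph"], cell["width"]) for cell in row["cells"]]
        if sum(width for _, width in cells) > GRID_COLUMNS:
            raise ValueError("row is wider than the grid")
        rows.append(cells)
    return rows


rows = parse_layout(EXAMPLE_LAYOUT)
```

Rearranging the dashboard would then mean editing the JSON file only, with no change to the code.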

Graphs

Number of tweets per second (early deliverable).
Number of tweets per location, shown on a heat map overlaid on a world map.
Compare number of tweets about multiple topics at once.
Attempt to predict the rise and fall of currently trending Twitter topics.
Create a Venn diagram between the followers, retweets and favourites of a user or tweet.
Show increase or decrease of followers including current trend, i.e. rising or falling.
A graph or tree diagram showing the interaction of users with a tweet, with the root node being the initial tweet and the child nodes being retweets and favourites.
(Other visualisations could be added if any interesting ideas arise while designing and implementing the dashboard.)

Other

The project will be split into modular systems so that, if required, any one module can be swapped out without changes to the rest of the application and without harming its functionality. For example, the data store module could be switched to use Google DataStore rather than a local database.
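The swappable data store can be sketched as an interface with interchangeable implementations. This is an illustration under assumptions, not the project's design: the class names `DataStore`, `MemoryStore` and `SqliteStore`, the method names and the table layout are hypothetical, and SQLite stands in for "a local database".

```python
import sqlite3
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Interface that the rest of the application depends on."""

    @abstractmethod
    def add_count(self, second, count):
        """Add `count` tweets to the total for the given second."""

    @abstractmethod
    def get_count(self, second):
        """Return the total for the given second (0 if unknown)."""


class MemoryStore(DataStore):
    def __init__(self):
        self._counts = {}

    def add_count(self, second, count):
        self._counts[second] = self._counts.get(second, 0) + count

    def get_count(self, second):
        return self._counts.get(second, 0)


class SqliteStore(DataStore):
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(second INTEGER PRIMARY KEY, n INTEGER)"
        )

    def add_count(self, second, count):
        # Insert the row if it is new, then add to the running total.
        self._db.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (second,))
        self._db.execute(
            "UPDATE counts SET n = n + ? WHERE second = ?", (count, second)
        )
        self._db.commit()

    def get_count(self, second):
        row = self._db.execute(
            "SELECT n FROM counts WHERE second = ?", (second,)
        ).fetchone()
        return row[0] if row else 0


def record_batch(store, batch):
    """Caller code is identical whichever DataStore implementation is used."""
    for second, count in batch:
        store.add_count(second, count)
```

A store backed by a hosted service would implement the same two methods, and `record_batch` would not need to change.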

Risk Analysis

Risk	Possible Consequences	Probability	Severity	Overall Risk	Preventative Measures
Twitter disables public access to their API and data	The dashboard would be unable to function without the Twitter API as its data source.	Low	High	Medium	A few hours of data will be recorded in the format that Twitter provides so that if they decide to disable access then the dashboard can still function from stored data.
Twitter changes their API methods	A substantial amount of connection code may have to be rewritten to use the new methods.	Low	Medium	Medium	The application will connect to Twitter through an established library (the Tweepy Python library). If Twitter updates their API methods, Tweepy should be updated quickly, as many programs depend on it.
Twitter changes their API response format	A substantial amount of code may have to be rewritten to accommodate the changes.	Low	Medium	Medium	The application will be built as multiple modular sections connected through APIs, so that if one part has to be rewritten the changes will be contained to a single section of the code base rather than spread across all or most of the files.
Tweepy stops updating their repository and the library stops functioning correctly.	Either a different Python Twitter library would have to be adopted or the Twitter connection would have to be managed manually.	Low	Low	Low	As the library is already downloaded (and is stored in this GitHub repository), the current files would still be usable unless Twitter also changed their API. A backup of this repository will be kept (by both myself and GitHub) to ensure that no files are lost.
Chart.js stops updating their repository and the library stops functioning correctly.	Either a different JavaScript graphing library would have to be adopted or the visualisations would have to be created manually.	Low	Medium	Medium	As the library is already downloaded (and is stored in this GitHub repository), the current files would still be usable. A backup of this repository will be kept (by both myself and GitHub) to ensure that no files are lost.
jQuery stops updating their repository and the library stops functioning correctly.	The application would have to be implemented using vanilla JavaScript.	Very Low	Medium	Low	The library is extremely popular, so this is highly unlikely. As the library is already downloaded (and is stored in this GitHub repository), the current files may still be usable. A backup of this repository will be kept (by both myself and GitHub) to ensure that no files are lost.
My development machine breaks and all data is unrecoverable.	All data would be lost, and I may be unable to work on the project until I was able to purchase a new computer.	Low	High	Medium	All files for this project are stored on GitHub and/or Dropbox (excluding the database created by the application). The only work that would be lost is work that had not yet been committed, which is normally at most one hour of progress. I would be able to get hold of a new computer within a maximum of 72 hours, so at most three days would be lost, although it is likely that some progress could be made without a computer (for example, designing a part of the project that is soon to be implemented).
Internet connection is unavailable for a few hours.	Unable to access live data from Twitter as well as retrieve any new libraries required.	Medium	Low	Low	This would have no long term effects on the project, and although short term it would not be possible to work on certain areas of the project, there would almost certainly be progress that could be made without internet access.
Internet access is unavailable for a number of days or weeks.	Without internet access for longer periods of time the rate of development would decrease, and the project specification may have to be re-evaluated.	Extremely Low	High	Low	There are multiple locations available to me where it would be possible for me to stay, in the highly unlikely event that internet access was unavailable on a university campus. If internet becomes unavailable at all of these locations for a long period of time there will be bigger issues than the completion of this project.
System requirements are not adequately identified	Application does not meet the expected requirements due to the interpretation of the requirements.	Low	Medium	Low	Requirements will be defined specifically so that little or no interpretation is required.
Project involves the use of technology that hasn't been used in a prior project	Extra time will be spent learning how to use these new technologies.	Medium	Medium	Medium	Popular libraries and technologies will be chosen so that there are likely to be plenty of resources available to provide assistance if required, which will minimise the amount of time that is wasted.
Inadequate estimation of time to complete tasks.	The list of requirements would not be fully implemented to the level that was initially planned.	Medium	Low	Low	When estimating the time to complete tasks, if unsure the expected time will be overestimated slightly rather than underestimated so that if anything, additional time is available to improve the application.
Source code is lost.	A substantial amount of progress that had been made on the project is lost and needs to be completed again.	Low	High	Medium	Files for this project are stored on GitHub and/or Dropbox (excluding the database created by the application). The only work that would be lost is work that had not yet been committed, which is normally at most one hour of progress.
Some of the source code is accidentally overwritten	A substantial amount of progress that had been made on the project is lost and needs to be completed again.	Low	High	Medium	Files for this project are stored on GitHub and/or Dropbox (excluding the database created by the application). The only work that would be lost is work that had not yet been committed, which is normally at most one hour of progress.
Some of the source code is corrupted	A substantial amount of progress that had been made on the project is lost and needs to be completed again.	Low	High	Medium	Files for this project are stored on GitHub and/or Dropbox (excluding the database created by the application). The only work that would be lost is work that had not yet been committed, which is normally at most one hour of progress.
There is poor visibility of project progress	Important tasks within the project could be missed out due to the lack of visibility.	Very Low	High	Medium	A project management system will be used to keep track of both the project as a whole as well as individual parts of the project. This will ensure that there is high visibility of the project's status as long as this system is kept up to date.
The project developer becomes seriously ill or injured.	The project would almost certainly not produce the expected application in the same timeframe; either the project specification would have to be re-evaluated or the deadline would have to be extended.	Low	Very High	Medium	Activities that increase the chance of illness or injury, such as extreme sports, will be avoided until the completion of this project.
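The first preventative measure in the table, recording a few hours of data in the format Twitter provides so the dashboard can run from stored data, could be sketched as a record-and-replay pair. This is only an illustration under assumptions: the function names `record` and `replay` are hypothetical, the tweets are reduced to two made-up fields, and an in-memory buffer stands in for a file on disk.

```python
import io
import json


def record(tweets, sink):
    """Write each tweet as one JSON document per line, unmodified."""
    for tweet in tweets:
        sink.write(json.dumps(tweet) + "\n")


def replay(source):
    """Yield the recorded tweets back in the order they were written."""
    for line in source:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory buffer stands in for a file on disk in this sketch.
archive = io.StringIO()
record([{"id": 1, "text": "first"}, {"id": 2, "text": "second"}], archive)
archive.seek(0)
replayed = list(replay(archive))
```

Because the replayed items have the same shape as live ones, the rest of the application would not need to know which source it is reading from.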

Proof of Concept

Overview

The proof of concept is a basic, dynamically updating visualisation that makes use of the services likely to be used for the main project. It is referred to as the "Proof of Concept" in the rest of the documentation, and includes the following:

Twitter API
Google App Engine
A JavaScript library for graphs (d3.js or Chart.js are possible choices)
Bootstrap

Application

Below is a screenshot of the proof of concept that was created:

Screenshot of proof of concept graph.

A video of the proof of concept running is available at this link.

The source files for the proof of concept are available in the 'masters-early' directory that was submitted for the early deliverable.

Proof of Concept Review

Important note: Google App Engine cannot be used with the Tweepy library (or with multiple other Twitter API libraries) due to port restrictions put in place by Google.

Overall the proof of concept was a success. It showed that it was possible to retrieve tweet data from Twitter using an API, read specific data contained within the tweets, process and store that data on a server, and produce a dynamic graph displaying the processed data on a web application.

Although the application worked as a proof of concept, the following major issues were discovered, which will all be fixed in the implementation of the full application:

Twitter data is ignored when not received chronologically. Due to the nature of the Twitter API, tweets occasionally arrive out of chronological order when their timestamps are measured in seconds. In the proof of concept, tweets that arrive out of order are ignored and not processed, so the visualisation shows incomplete data. The next version of the application needs to address this.
Front end does not update when no data is received from the back end. On each request, the back end sends the most recent data. If that data is new, the front end fills in any empty time with zero tweets and then updates the graph. If the front end is sent data that it has already received, it continues to wait for new data. This means that x seconds with no data is treated as x seconds with no update, which makes the graph look as if it has broken and is no longer updating.
Data appears to move vertically, not horizontally. As the graph updates, each data point is moved along one space as expected, but the graph does not make it clear that this is happening. To an unfamiliar user, it would probably look as if the whole graph is redrawn each time it changes, rather than what is actually happening: one data point is added to the left of the graph, and each of the other points is shifted right one space.
Graph labels do not give the user enough information. Currently the graph is missing the following:
- Both the x and y axes are missing axis titles.
- Both the x and y axes are missing units of scale.
- The x axis labels are extremely long and show far more information than is useful. The label currently displayed is "Mon, 27 Jun 2016 05:38:18"; the day, month and year are all redundant and should be removed. A better axis label would be "05:38:18". Shorter labels would be easier to differentiate because they could use a bigger font size, and they would also leave more room for the graph itself.
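Three of the issues above (out-of-order tweets, no update when no data arrives, and overly long labels) could be addressed together on the back end. The sketch below is one possible approach under stated assumptions, not the fix that was implemented: the function names `count_per_second` and `build_series` are hypothetical, and the timestamps are made up around the example label quoted above.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_second(timestamps):
    """Tally tweets by second; arrival order does not affect the result."""
    return Counter(ts.replace(microsecond=0) for ts in timestamps)


def build_series(counts, end, seconds):
    """Return (label, count) pairs for the last `seconds` seconds up to `end`.

    Seconds with no tweets appear with a count of zero, so the graph keeps
    advancing even when nothing new has arrived.
    """
    start = end - timedelta(seconds=seconds - 1)
    series = []
    for offset in range(seconds):
        moment = start + timedelta(seconds=offset)
        # Label with the time only; the date is redundant on a live graph.
        series.append((moment.strftime("%H:%M:%S"), counts.get(moment, 0)))
    return series


# Tweets arriving out of chronological order (the 05:38:18 tweet comes last).
arrivals = [
    datetime(2016, 6, 27, 5, 38, 19),
    datetime(2016, 6, 27, 5, 38, 21),
    datetime(2016, 6, 27, 5, 38, 21),
    datetime(2016, 6, 27, 5, 38, 18),
]
series = build_series(
    count_per_second(arrivals), datetime(2016, 6, 27, 5, 38, 22), 5
)
```

Here the late tweet is still counted under 05:38:18, and the seconds with no tweets (05:38:20 and 05:38:22) are reported as zero rather than omitted.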