Load database from CSV with columns and tables read from file.

The issue at hand is to load a database from CSV files in a generic way, without having to write a separate script for each table.

This short Groovy script works based on two conventions:

  • The name of the CSV-file matches the name of the table in the database.
  • The name of the CSV-header matches the name of the column in the table.

The script expects the CSV files to be placed in a folder named “to_digest”.
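
As an example of the two conventions (the table and column names here are hypothetical), a file to_digest/EMPLOYEES.csv would be loaded into the table EMPLOYEES, with the header row mapped to its columns. Note that the script reads semicolon-separated files:

EMPLOYEE_ID;FIRST_NAME;LAST_NAME
1;Ada;Lovelace
2;Alan;Turing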

The connection details are placed in a property file, which can be added to the .gitignore file if the script is kept under Git version control. In this example the configuration variables are stored in the file ‘configuration.properties’, which needs to be in the folder the script is executed from.

The property-file contains the following lines:

db_url=jdbc:oracle:thin:@yourdbserver:1521/schema
db_user=yourusername
db_password=yourpassword

Please also note that this script is written for Oracle, and that the Oracle drivers in this example are fetched from a local Nexus repository, as they are not available on the open Internet. This can easily be changed to another database vendor. If you choose an open-source database such as MySQL or PostgreSQL, the drivers are available in public Maven repositories and you will not need the reference to a local repository through the @GrabResolver annotation.


@GrabResolver(name='nexus', root='http://yourorganizationsnexusinstance.you', m2Compatible = 'true')
@GrabConfig(systemClassLoader=true)
@Grapes([
        @Grab(group='com.oracle', module='ojdbc8', version='12.2.0.1'),
        @Grab(group='com.opencsv', module='opencsv', version='3.9')
])


import groovy.sql.*
import com.opencsv.*

import static groovy.io.FileType.FILES


def path = new File("to_digest")
path.traverse(type : FILES, nameFilter: ~/\w*\.csv/) { it ->
    List entries = digestFile(it)
    insertFile(entries.subList(0,1)[0],entries.subList(1,entries.size()),it.name.take(it.name.lastIndexOf('.')))

}


private List digestFile(def path){

    CSVReader reader = new CSVReader(new FileReader(path),(char)';')
    List myEntries = reader.readAll()
    return myEntries
}


private void insertFile(def header, def data, def name){
    Properties properties = new Properties()
    File propertiesFile = new File('configuration.properties')
    propertiesFile.withInputStream {
        properties.load(it)
    }
    println name
    Sql conn = Sql.newInstance(properties.db_url, properties.db_user, properties.db_password)
    data.each { rows ->
        String columns = String.join(",", header)
        String values = rows.collect {"'$it'"}.join(",")
        String query = "INSERT INTO ${name} (${columns}) VALUES (${values})"
        conn.execute(query)
    }
}

Work programmatically with Google Spreadsheets Part 2

A while back I wrote a short post on how you can read from and write to Google Spreadsheets programmatically using Python and the package ‘gspread’.

Last time the reading was done by first creating arrays with the addresses of the cells where the values could be found in the spreadsheet, and then running through all the addresses and replacing them with the values. It worked fine, but it is neither best practice nor very efficient, as it makes many single requests to the API. In part two, I will share a short tip on how to read the values in one go instead of iterating through a range of cells.

Here is the excerpt dealing with retrieving values. (NB: see the original blog post for the gspread initialisation.)

# Running through the values
get_val = []
set_name = []
set_country = []
for a in range(2,5124):
    v = "B%s" % a
    sn = "H%s" % a
    sc = "G%s" % a
    get_val.append(v)
    set_name.append(sn)
    set_country.append(sc)

for a in range(2,5124):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e

In a recent script we only wanted to download values from a Google spreadsheet (yes, we could have exported the files to .csv with a similar result, but with a script we can expand and parse as needed), and this also gave some time for refactoring the code.

The gspread function worksheet.get_all_values() returns a list of lists with the values. The outer list contains the rows, and each row list contains the column values, indexed by column position. In this example num_streams is the second column, and its position is hence [1], as list indices start at zero.
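
A minimal sketch of that structure (assuming worksheet is already initialised as in the script below, and with hypothetical column names):

[sourcecode language="python"]
all_val = worksheet.get_all_values()   # list of rows; each row is a list of column values

first_data_row = all_val[1]            # all_val[0] would be the header row, if the sheet has one
num_streams = first_data_row[1]        # second column, index 1, since list indices start at zero
[/sourcecode]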

Also note the nifty way of writing UTF-8 formatted strings to the file. UTF-8 can often cause a headache, but put a “u” character before the string literal and open the stream with codecs.open(“filename”, “mode”, “encoding”).

The new way of retrieving data from a Google Docs Spreadsheet:

[sourcecode language="python"]
# -*- coding: UTF-8 -*-
import gspread
import codecs

# Global variables for accessing the resource
G_USERNAME = 'user_email'
G_PASSWORD = 'password'
G_IDENTIFIER = 'spreadsheet_identifier'

# Connecting to the data source
gc = gspread.login(G_USERNAME, G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)

all_val = worksheet.get_all_values()

output = codecs.open('output_norwegian_artists.csv', 'wb', "utf-8-sig")

for l in all_val:
    num_streams, artistid, name = (l[1], l[2], l[3])
    norwegian = l[4]
    if len(norwegian) < 3:
        norwegian = 'NULL'

    string = u"{0};{1};{2};{3}\n".format(num_streams, artistid, name, norwegian)
    output.write(string)

output.close()
[/sourcecode]

Picture licensed under a creative commons attribution license by the brilliant organization Open Knowledge Foundation. Picture retrieved through a CC-search on Flickr

API-scrape images from Instagram

An image can say more than a thousand words, especially when you add a retro filter and a score of hashtags to go with it. That is a basic explanation of the functionality of Instagram, the power app which revolutionised people's creativity when it came to documenting dietary habits… and popularised images in social media.

Instagram brought social into photography in a way other, more desktop-oriented photo sharing applications like Picasa and Flickr never managed. It is social, and users can like and comment on each other's pictures. Instagram also enhances images by reducing their photographic clarity (let's emulate cameras far less technologically advanced by adding a filter), but then again, this adds style to the images and makes some of them look cool. However, I will let the filters and pastulation (the digital emulation of an analogue past – coined by moi, but please let me know if there is a buzzword for this and I may conform) rest for now. Let us instead focus – pun intended – on something else: the tagging of images, and how to retrieve images with a certain tag.

Adding context and indices with tags

An Instagram picture may be tagged with several hashtags; these are words or concatenated words prepended with a hash sign #. The great thing with # is that it 1) provides a social signifier to the users that this is a tag, and hence the users can use it to organise their content and create a social context in which the photo exists, e.g. #instafood (picture of a meal), #selfie (a person taking a picture of him/herself), #duckface (quack quack pouting) and #onedirection (popular teenage idols). Tags can be of any kind, from current affairs to more general stuff. 2) It provides a token declaring something indexable for the Instagram servers and other technical resources. Once the computer system knows it is a tag, it can group tags together, perform analysis on the tag and the users associated with it, aggregate statistics on the tag, and do other things to enhance the user experience. In our case the tagging is great, as we want to retrieve images with a given tag.

The #InstagramTagGetterScript

Below is a script which takes a tag name as an argument and downloads the images and the metadata associated with these images. To get it to work you will need to obtain an API key from Instagram's developer pages. This key goes into the initial request sent to the server (the one stored in the next_url variable). We are using the tags endpoint to download the images.

The rough outline of the script is as follows: 

First we define a class to store each InstaEntry. This class comes with the functionality to retrieve and store the image and metadata, as well as to dump the data to disk and load them back. The class holds all the variables we are interested in collecting, and once instantiated these variables are set, unless they do not exist for the image.

Once the structure is created, some initial parameters are set: the tag, our initial URL request, and the folders into which we will store the data. When everything is set up we run a loop which continues as long as there is data available and we get responses with HTTP 200 status (OK). The loop instantiates an InstaEntry for each image, which then downloads the image as well as the metadata on the fly. The objects are retained until the program has finished, but all large data (read: the images) are written directly to disk and not kept in memory.
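
Assuming the script is saved as e.g. instagetter.rb (a hypothetical filename), it is run with the tag as the only argument:

ruby instagetter.rb instafood

This creates the folders images_instafood and md_instafood, plus output.json, instadump_instafood and output_instafood.csv in the working directory.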

Please contact me if you want to use this script, tailor it, or have any questions related to it.

#!/usr/bin/ruby
# encoding: UTF-8

require 'active_support'
require 'restclient'
require 'csv'
require 'open-uri'
require 'fileutils'

class InstaEntry
  attr_accessor :id, :username, :picture_url, :likes, :filter, :location, :type, :caption, :tags, :fullname, :user_id, :created_time, :link

  def initialize(id)
    @id = id
    @@threads = []
  end

  def marshal_dump
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def marshal_load(variables)
    @id = variables[0]
    @username = variables[1]
    @picture_url = variables[2]
    @likes = variables[3]
    @filter = variables[4]
    @location = variables[5]
    @type = variables[6]
    @caption = variables[7]
    @tags = variables[8]
    @fullname = variables[9]
    @user_id = variables[10]
    @created_time = variables[11]
    @link = variables[12]
  end

  def to_arr
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def self.get_image(obj,tag)
    @@threads << Thread.new(obj,tag) {
      begin
        open("images_#{tag}/#{obj.id}_#{obj.username}_.#{obj.picture_url.match('\.(jpe?g|gif|png)')[1]}","wb") do |file|
          file << open("#{obj.picture_url}").read
        end
      rescue
        puts "ERROR: #{obj.id} triggered an Exception in get_image method"
      end
    }
  end

  def self.print_metadata(obj,tag)
    # Use the attributes of the passed object, not (nil) class-level instance variables
    open("md_#{tag}/#{obj.id}_#{obj.username}.txt","wb") do |file|
      file.print(obj.to_arr)
    end
  end

end #end InstaEntry class

#
# This block sets the parameters and reads the tag name from the first command-line argument
#

raise ArgumentError, "Missing name of tag to download" if ARGV.length < 1

$tag = ARGV[0]

output = open("output.json","wb")
next_url = URI::encode("https://api.instagram.com/v1/tags/#{$tag}/media/recent?access_token=51998418.d146264.e77441adc4a04399874a19b48bb91e71f&min_id=1")
# NB: The access token above only looks like a real token; it is obfuscated. Get your own by registering a developer account at Instagram.
puts next_url

unless File.directory?("md_#{$tag}")
  FileUtils.mkdir_p("md_#{$tag}")
end

unless File.directory?("images_#{$tag}")
  FileUtils.mkdir_p("images_#{$tag}")
end

count = 0
instas = {}

#
# This block runs through all the subsequent pagination pages. It stops when it encounters an HTTP status code other than 200, or when there is no next_url left.
#
begin
  response = RestClient.get(next_url)
  json = ActiveSupport::JSON.decode(response)
  pretty_json = JSON.pretty_generate(json)
  puts "Status code #{json['meta']['code']} for URL #{next_url}.. Fetching"
  next_url = json['pagination']['next_url']
  sleep 2

  # loop through the data elements
  json['data'].each do |item|
    puts item['link']
    puts item['user']['full_name']
    ie = InstaEntry.new(item['id'])
    instas[item['id']] = ie

    ie.username = item['user']['username']
    ie.picture_url = item['images']['standard_resolution']['url']
    ie.likes = item['likes']['count']
    ie.filter = item['filter']
    ie.location = item['location']
    ie.type = item['type']
    ie.caption = item['caption']['text'] unless item['caption'].nil? or item['caption']['text'].nil?
    ie.tags = item['tags']
    ie.fullname = item['user']['full_name']
    ie.user_id = item['user']['id']
    ie.created_time = item['created_time']
    ie.link = item['link']

    InstaEntry.get_image(ie,$tag)
    InstaEntry.print_metadata(ie,$tag)
  end

  count += 1

  output << pretty_json

  puts "Now checked __ #{count} __ files and __#{instas.length}__ number of instas"
  puts "*****Ending with #{count} __ files and __#{instas.length}__ number of instas****" if next_url.nil?

end while not next_url.nil?

output.close

File.open("instadump_#{$tag}",'wb') do |f|
  f.write Marshal.dump(instas)
end

CSV.open("output_#{$tag}.csv", "wb", {:col_sep => "\t"}) do |csv|
  instas.each do |k,v|
    csv << instas[k].to_arr
  end
end


Disclaimer: Enabling you to download images associated with tags does not mean you can do whatever you want with them. First, please refer to the Instagram guidelines to confirm that you are actually allowed to download the images. Second, respect the individual users' privacy and intellectual property rights; do not use images in a publishing context without the user's consent. Generally: be nice, and do good.

Prepare Your Data For Analysis: Load Data With Spoon

Note: The instructions described here were learnt and utilised while working on the Clouds & Concerts project at the University of Oslo, in close collaboration with Xin Jian, Johannes Bjelland and Pål Roe Sundsøy, under the supervision of Arnt Maasø. The credit for methodology and SQL ninja tricks is therefore to be attributed to all members of the fantastic stream team. Inconveniences or mistakes are solely my own.

Technical set up

We chose to use Amazon as the solution for our data analysis. Amazon Web Services provides virtual computers, databases and storage in the cloud. All services are scalable, so the computer used for data loading and manipulation can be put dormant when idle, and the database can be scaled up when doing complex and data-intensive operations.

DSV files – choosing delimiter

An excerpt from a CSV file; note the tab character between an occurrence's variables and the line shift between occurrences.

When moving large database files between two servers in different organisations, the standard way of exchanging data is through flat text files. The flat text files may have a structure, e.g. XML or JSON, but in the case of working with relational databases, where all entries, or rows, have the same number of attributes, or columns, DSV is a convenient format. DSV is the abbreviation for delimiter-separated values, and is a format where the attributes and the entries are separated by assigned delimiter tokens. The most frequent delimiters are comma between attributes and newline between entries; this format is known as CSV.

When choosing delimiters, it is important to find a character that is not found in the data values. We chose to use tab as the delimiter. To save space and transfer cost the files should be zipped (NB: the .zip container only supports files up to 4 GB. If you have any larger files, use gzip or .tar.gz).

Data delivery S3 bucket

Amazon's Simple Storage Service (S3) is an easy way to transfer data, either by using a client or through the web interface accessible from the AWS console. All kinds of files may be uploaded to S3, including zipped or tar.gz-ed files. Files can easily be uploaded and downloaded to and from S3 by those who have access, and the web uploader supports large files. You don't have to define storage space in advance either, since this is allocated on the fly.

Using Pentaho Spoon for loading

The software company Pentaho provides good tools for working with data analysis, and part of their suite is the ETL tool (Extract, Transform, Load) Spoon. This also comes in a community edition which can be downloaded from their homepage. The tool is built on Java, so it can be run on different platforms. Once installed, use the terminal to navigate into the ‘data-integration’ folder, then set execution rights and run ‘spoon.sh’. This should open the Spoon ETL tool.

The Pentaho Spoon workbench lets you transfer data between “modules”. In this picture a CSV input module is passing data to a table output module.

In Pentaho you make a transformation for each table you need to transfer; alternatively, you can join several transformations together in jobs. To do a basic transfer of data from CSV, or other delimited files, to a database server you will need two modules: CSV file input and Table output.

Set up a super-easy transformation to load csv to table

In Pentaho, the convenient CSV file input module can be used for reading data from delimited files. 1: Use the file finder to select the file, and set the delimiter and enclosure (e.g. if the strings are prepended and appended by quote marks). If you have a file dump where the header row is present, tick off this option. 2: The CSV file input module can try to determine the format of the various fields based on the data in the files. Use the ‘Get Fields’ option, and select a good sample size to determine the format.

The Pentaho CSV input options let you define the data type of each field; it can also detect this automatically.
Loading tip: represent values as strings.

If you have a file with a high risk of data errors, for example a file where the columns overflow, I usually set the fields to text, using the functionality where I can define the field data type (3), and make sure the length is large enough to encompass the data. This lets us do the changes and alterations of data format on the database server once the data is loaded. Most data can be represented in a string: “12354”, “182,54” and “01/08/1986 00:00:00,000000” can all be represented as strings, but the first example can also be an integer, the second a real or float, and the third a timestamp. We can parse the data once we have them loaded into our SQL database. You can adjust the type in the CSV file input module.

Use Table output to write the rows read to the SQL server

The other Pentaho module needed for this basic transformation is Table output. Once Pentaho is configured with a valid database configuration, this module can access the database directly, so you can select which table to write the data to. First drag a line from the CSV file input module to the Table output module to establish the data connection. Then select the output database/target schema and the target table. If you have set up the table structure through a DDL (Data Definition Language – the structure of the tables), you can map the fields from the CSV file input. In the database field tab you first select ‘Get fields’ to get the fields found or created in the CSV input module. Then select either ‘Enter field mapping’ if you have the full DDL loaded on the database, or ‘SQL’ to create the DDL, followed by ‘Enter field mapping’ (see tip below). Once the table is connected and the fields from the input data are mapped to the fields of the database, you are ready to run the transformation. Return to the workbench and press the play symbol at the top of the transformation tab.

Table loading tip: Let Pentaho deal with DDL
Use Pentaho to help you fill out the table structure

Once you have defined a connection by drawing a line between the CSV file input and the Table output, you can load the fields as they are stored in the flat CSV file using the ‘Get fields’ button (1). If you already have a table definition you can go straight to the mapping (3), but if you don't have a DDL, or you want the database table structure to follow the field definitions you declared in the CSV file input, you can use the SQL button (2) to get Pentaho to create the DDL and execute it on the database. To be able to do this you first have to create a table for the ‘target table’ field to point to. The trick is to know that you only have to fill in one column when you create the target table, e.g. execute the command “CREATE TABLE ALMOST_EMPTY_TABLE (ID_COLUMN INTEGER)” on the database, and let Pentaho deal with creating the rest. After running the SQL, you can map the fields (3) and then run the transformation.

Validate that the data is loaded into the database.

Check the number of loaded rows

In Pentaho Spoon, the transformation has a step matrix while running where you can monitor the progress. During the load, and upon completion, the total number of rows is shown here. To be sure that all data is loaded, or to find out how many rows were dropped, you can read this from the step matrix, but also double-check by comparing the number of lines in the CSV file with the number of rows in the database table. The number of lines in the CSV file can be found by running “wc -l filename.csv” in the folder where the CSV files are stored (if the file has a header, subtract one from the result), and compare this to the number of rows in the loaded table using “SELECT count(*) FROM tablename”.

Your data is now loaded into the database.

Useful Terminal Commands for work with data

Please, see my terminal-tools tag for more articles on nifty terminal tools.

Enclose values in quotes, and add comma

Say you work with a large data file where all values are printed on consecutive lines, and you want to copy these lines into an array, list or other data structure where strings need to be quoted and values need to be separated by commas. Here is a sed script that may help you with that task.

sed 's/\(.*\)/"\1",/g' all_uuid.csv > all_ready.txt

What this CLI snippet does is use sed, the stream editor, and pass it the s – substitute – command, followed first by a slash delimiter and then by a catch-all regular expression enclosed in a group. This reads the whole line and is finished by yet another slash delimiter. What comes between the first and second delimiter is the input, and then comes the output: we write the initial quote mark, then a \1 which refers to the regular expression group in the input, followed by a closing quote and a comma. We add the g flag to continue this operation for all matches, in other words all lines. Then we pass in the file we want to alter, and use the > operator to print the output of this command into the file all_ready.txt.

Copy a specific number of lines from one file to another

While working with large data you may want to try something new. The problem is that it takes too much time to run the process on too large a data selection. How can you cut down on the length without opening a text editor? With the head command you can print the beginning of a file, and by adding a number as an argument you can define exactly how many lines to keep. The lines are printed to stdout, but by using the greater-than operator you can route this output into a new file.

head -50000 oldfile > newfile

The head command reads the file from the beginning and by default prints a fixed number of lines. We pass in an argument asking for 50,000 lines, then the file we want to read from, and use the greater-than operator to write the output into a new file.

Getting the total number of lines of a textfile

When working with data, and transferring it from one system to another, you will often find that the data is saved in flat files. Flat files can be comma-separated (the well known CSV files), tab-separated or separated by other means. The general pattern is that the attributes of the data (the columns) are separated by a predefined delimiter, and the occurrences (the rows) are separated by a newline. For the command below to work you need the file to be of this type. Use e.g. head to verify that the occurrences are separated by newlines. When that is done, run the command below, and if you have a header line in your flat file, subtract 1 from the number of lines. This can easily be compared with the result of count(*) on the SQL database after the data is imported.

wc -l filename

Getting the total number of files in a directory

In this example we use the command from above, but this time we pipe in the output of the directory listing (ls), with each entry on its own line. With piping we can send the result of one command as the input to a second. This kind of chaining makes *nix systems very convenient, as you can do complex stuff by combining simple commands. Anyhow, here wc -l counts the lines of the listing of files and directories in a directory (note that the long format of ls -l also prints a “total” line, so you may want to subtract one or use ls -1 instead).

ls -l [directory-name] | wc -l

Illustrational image by Travis Isaacs, licensed under a Creative Commons attribution licence. Found on Flickr.

Real-time at home: Follow Ruter from home with Arduino and Python

The real-time displays Ruter has installed at several metro, tram and bus stops have become a welcome addition to the information you as a traveller get about departures. Not only do you have full control over which departures are coming, and in which order, you are also updated with the estimated arrival time, so you know whether you missed the bus you ran to catch, or whether the bus has been delayed in rush-hour traffic.

In addition to the displays at the stops showing real-time information, you can get apps for your mobile phone, and you can also see the data in various third-party solutions, for example on the information screens at the Department of Informatics at the University of Oslo. The data can be shown in many different places because Ruter publishes all its route information for free use (some restrictions apply) on the web. This is what we will take advantage of when building our own real-time display.

An information screen with real-time information

Our finished product will be a real-time display, not unlike the ones found at the stops. Of course our product will be less robust and reliable, but in the prototyping phase that is perfectly fine. The display will show which line the next departure belongs to, the name of that line's final stop, and how many minutes are left until arrival/departure. To achieve this we use a “dumb” display and software that writes the text to it. The display is connected to the computer through an Arduino board. We use the computer to do most of the work, and put as much of the complexity as possible there.

The information display

Arduino is a very good place to start learning electronics and how to combine it with computers. Arduino has many pedagogical resources available, and is made to be easy to get started with, while flexible enough to be tailored with your own components.

All the components we need are included in the Arduino Starter Kit. We take project 11, where you build a crystal ball, as our starting point, and use its wiring diagram to connect an LCD display to the Arduino board.

On the Arduino board we upload the following sketch:

#include <LiquidCrystal.h>

LiquidCrystal lcd(12,11,5,4,3,2);

char a;

void setup() {
  Serial.begin(9600);
  lcd.begin(16,2);
  lcd.print("Ready for");
  lcd.setCursor(0,1);
  lcd.print("duty!");
}

void loop() {
  while (Serial.available() > 0 ){
    a = Serial.read();
    if (a == '!') {
      lcd.clear();
      lcd.setCursor(0,0);
    } else if (a == '?') {
      lcd.setCursor(0,1);
    } else {
      lcd.print(a);
    }
  }
}


The Arduino editor

We use the LiquidCrystal library that ships with Arduino to make it easier to write to the board. We set up a serial connection and write a welcome message to the LCD display. Once the LCD is initialised there are essentially three methods we use: print, clear and setCursor. The first writes a message to the display, clear removes all text, and setCursor is used to jump between the display's two lines. The call setCursor(0,0) places the cursor at the top left, and setCursor(0,1) places the cursor at the left of the bottom row.

There is very little complexity at this level; all we want to do is to write, to move between the lines, and to remove all text from the display. As long as there is data in the serial buffer we read a character and write it to the display, unless it is a special character. In this program “!” and “?” are defined to have special functions: the exclamation mark clears all text and resets the cursor to the top-left position, and the question mark moves the cursor to the bottom-left position.

Writing to the display

Once the LCD display is set up and the Arduino code has been uploaded, you can write to the display. Use CoolTerm (which can be downloaded here) and connect it to the Arduino board (leave the board connected via the USB port). Now you can type text that appears on the display, and check that the special characters work. Does it work? Cool, let us connect to Trafikanten/Ruter.

Traffic data from Ruter

Ruter has made much of its travel information public, including the real-time data, the journey planner, deviations and lots of other goodies. Read more on Ruter's API pages.

We will use two datasets from Ruter. First we need to find our station ID with the help of a comma-separated file listing all the stations. This semi-static file holds the metadata we need to find the right station in the REST JSON real-time API.

Finding the station ID

To find the right station I made the pragmatic decision not to parse the CSV file with all the station IDs. I just opened the file in TextMate (which I use for editing code) and searched for Blindern T, which is close to both work and home. The station ID for Blindern T turned out to be: 3010360

Fetching and parsing the traffic information

A tidy, but machine-readable, JSON stream with all the data we need

By going to the real-time endpoint for this ID, I get a JSON stream/file back with all the departures from this station: the Blindern endpoint.

In this stream we get many departure objects, and every departure carries lots of data we can use. The JSON stream is not easy to read and understand, but then again it is not made for humans. I use the application VisualJSON to read JSON files.

VisualJSON is a handy tool for getting a more human-readable representation of the data in the JSON stream

In VisualJSON we can see which values in each entry we want to keep. DestinationDisplay contains the name of the final stop, and LineRef contains the line number; both are handy to keep. In addition we need to read ExpectedArrivalTime, which has a somewhat peculiar time format: “/Date(1378070709000+0200)/”.
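
A single departure object in the stream looks roughly like this (a trimmed, hypothetical excerpt; only the fields used by the script below are shown, and the real objects contain many more):

{
  "LineRef": "5",
  "DestinationDisplay": "Vestli",
  "ExpectedArrivalTime": "/Date(1378070709000+0200)/",
  "Monitored": true,
  "DeparturePlatformName": "1"
}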

Want to see the whole implementation? Towards the end of the page you will find the script that fetches, processes, caches and sends the data to the display.

Processing the dates

Most of the data in each departure object is ready for display, but the dates need some work. The format we get them in is hard for humans to understand. Computers can easily calculate how many seconds have passed since Thursday 1 January 1970 (the Epoch time format), but for humans this is harder. We therefore parse the epoch format into a datetime, and from this we find the time delta (the time difference) between now and when the departure is expected. Converting this time difference to minutes gives us the number of minutes until departure, which is what we display.
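
A minimal sketch of that conversion, lifted from the full script further down (the example timestamp is the one shown above):

import re, datetime

timecode = "/Date(1378070709000+0200)/"
m = re.split("(\d{13})\+(\d{4})", timecode)               # m[1] holds the 13-digit millisecond epoch
dt = datetime.datetime.fromtimestamp(float(m[1]) / 1000)  # milliseconds to a datetime
delta = dt - datetime.datetime.now()
minutes = (delta.seconds // 60) % 60                      # the minute part of the time difference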

Caching the data

I have chosen to create a class holding the necessary data for each departure. The objects we create from this class are put into a list, from which I pop one element at a time. When the list is empty I go back to the web service and fetch new data. It is nice not having to hit the service all the time, but at the same time we also need to be updated about delays, or if the expected arrival time is pushed forward in time. Storing all the departures in a queue and fetching new data once it is emptied seems like a good solution to this trade-off.

Formatting the data for the LCD display

For the data to be shown correctly we have to format it. The way I have chosen to do this is by building a string for each of the display's two lines. First I send a message telling the display to clear and reset, then the top line with the line number and the name of the final stop. Once that is sent, I wait 0.2 seconds before sending the line-shift command (“?”) and the second line. I wait 0.2 seconds between each message so as not to feed the display too much information at once. When all the text is showing, I leave it on screen for 4 seconds before moving on to the next departure.

Implementing the software

Below you will find the whole script, which fetches data from Ruter and sends it to the Arduino board.

import urllib2, json, datetime, re, serial, time
from pprint import pprint

class TrafficInfo:
  def __init__(self, line_ref, destination, arrival_minutes, monitored, platform):
    self.line_ref = line_ref
    self.destination = destination
    self.arrival_minutes = arrival_minutes
    self.monitored = monitored
    self.platform = platform


def days_hours_minutes(td):
  return td.days, td.seconds//3600, (td.seconds//60)%60

def fetchTrafficFromStation(station_id):
  url_loc = "http://reis.trafikanten.no/reisrest/realtime/getrealtimedata/%s" % station_id
  response = urllib2.urlopen(url_loc)
  data = response.read()
  return data

station_id = "3010360"
ser = serial.Serial("/dev/tty.usbmodemfa131", 9600)
arr = []

while(True):
  # Fetch a fresh batch of departures only when the queue is empty
  if len(arr) == 0:
    print "Fetching new data"
    data = json.loads(fetchTrafficFromStation(station_id))

    for el in data:
      timecode = el["ExpectedArrivalTime"]
      m = re.split("(\d{13})\+(\d{4})",timecode)
      dt = datetime.datetime.fromtimestamp(float(m[1])/1000)
      tn = datetime.datetime.now()
      d = dt - tn
      td_human = days_hours_minutes(d)
      minutes = False
      if (d.seconds < (3600 - 1)):
        minutes = td_human[2]
      ti = TrafficInfo(el["LineRef"], el["DestinationDisplay"], minutes, el["Monitored"], el["DeparturePlatformName"])
      arr.append(ti)

  time.sleep(1)
  ar = arr.pop(0)
  time.sleep(1)
  ser.write("!")
  time.sleep(0.3)
  over_str = "%s : %s" % (ar.line_ref,ar.destination)
  ser.write(over_str)
  time.sleep(0.3)
  ser.write("?")
  time.sleep(0.3)
  under_str = "%s %s" % (str(ar.arrival_minutes), "Minutes")
  ser.write(under_str)
  time.sleep(4)


You do not necessarily need an LCD display to fetch traffic data from Ruter, and you do not necessarily have to show traffic data just because you have an LCD display. Only your imagination sets the limits. What about weather data from yr.no, or the temperature in your fridge? Arduino is a great kit for playing with some of the possibilities electronics and informatics give you.

Apropos Arduino: last week I attended a workshop with Tom Igoe, who has written the book Making Things Talk. I have not read the book myself, but I have heard it is a good source of knowledge and inspiration. Check it out!

Screen scraping the Øya programme

The research project Sky & Scene, where I work, looks, among many other things, into the streaming numbers from WiMP before, during and after the Øya festival. To do this we need a list of which artists are playing, on which day they play and at what time of day. Before the data can be analysed, they have to be available in Excel sheets, in CSV format, and in the database holding the streaming numbers. The data must be fetched and structured in a specific format.

A good starting point is to gather the data in a CSV format. CSV stands for comma-separated values, and is a list where the values of one occurrence are gathered on one line, and where the occurrence's data attributes, also called variables, are separated by – you guessed it – commas. A similar format can be found in Excel, where one occurrence sits on one line and its variables are given in columns.

Finding the data

OK, enough about formatting. Where can we find the data? A natural starting point is the festival's website. On oyafestivalen.com (the festival's English site) we find a menu item called “program”, and here we also find the programme.

The Chrome developer tools can be found in the menu. This is a very useful tool for both web development and screen scraping.

Screen scraping the programme based on the visual presentation of the page will not get us far, so we have to look at the HTML source. In Google Chrome you find this by right-clicking in the browser window and then clicking “view page source”; here we can find the HTML code. Alternatively, you can copy this link into your Chrome browser: “view-source:http://oyafestivalen.com/program/#all”

If you went into the source code you will have seen that the list of artists is missing. Why? Because the list is quite long and is used by several sources, it is not loaded by the programme page itself. It is loaded asynchronously with AJAX (Asynchronous JavaScript and XML). Open the Chrome Developer Tools, found in the menu, and go to the Network tab. Reload the page by clicking the circular arrow to the left of the URL field.

Here you can see that a file called getArtists.php has been loaded (picture 1), and that this file is not loaded as part of our original request to the web server, but is instead loaded via JavaScript. If we click to see what this URL delivers, we can see that the artist list comes from here. You can find the URL of the page by right-clicking the name getArtists.php and selecting “copy link address”.

Once you have the URL (http://oyafestivalen.com/wp-content/themes/oya13_new/includes/ajax/program/getArtists.php) you can paste it into your browser window. You should now get a list without any particular formatting that looks roughly like this:

The Øya festival's artist list is fetched from the server asynchronously to save time when the main page loads. Now we have found the data we need.

OK, now we have found the data we need. We just have to find a good way of extracting it from the page. Let us take a look at the source behind the concert list. Here we find both the data and the structure we need:

The data we need, but with a different formatting. Anyway, all that remains now is fetching and reformatting.

Here we can see that:

  1. At the outer level we have a div tag with the class “table title”. This introduces the explanation shown above the column in the presentation.
  2. We have an unordered list (ul tag) with the class “table”.
  3. The unordered list has several children placed in list elements (li). These use HTML5 data attributes, but we will not be using those this time.
  4. Each list element has a span element with the class “name”, whose content is the name of the artist.
  5. The list element also has a “scene” class with the stage name as its content.
  6. Finally, the list element also has a “date” class containing the first three letters of the day, three non-breaking spaces (HTML syntax: &nbsp;) and the start time of the concert.

Here we find all the data we need, and the formatting is the same for all the elements in the list with the class “table”. A sketch of what one list item might look like is shown below.
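
Based on the structure described above, a single list item could look roughly like this (a hypothetical reconstruction; the artist, stage and time are made up):

[sourcecode language="html"]
<ul class="table">
  <li>
    <span class="name"><a href="#">Some Artist</a></span>
    <span class="scene">Enga</span>
    <span class="date">Thu&nbsp;&nbsp;&nbsp;18.15</span>
  </li>
</ul>
[/sourcecode]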

Now that we have found the data source, we can start extracting the data for further use.

Screen scraping with Ruby and Nokogiri

We have found the source, and now we can make use of Ruby and the library (Ruby term: gem) Nokogiri.

Before we start fetching the data, we prepare the script that will retrieve it from the festival's website. We include nokogiri, which will help us parse the data source. We also load the csv library for writing the output file, and open-uri so that we can read the URI source as if it were a file.

[sourcecode language="ruby"]
#!/usr/bin/ruby
# -*- encoding : utf-8 -*-

require 'nokogiri'
require 'open-uri'
require 'csv'
[/sourcecode]

The Concert class

To store and manipulate the data we create a class holding the four values we need: artist, scene, date and datetime. The source gives us the first three values, and datetime we construct from date.

In the class we declare all the variables we will use with attr_accessor. This makes Ruby generate getter and setter methods for all the variables listed after the keyword, which means we can freely read and set these variables on instances of the class.

We write an initialize method, a constructor, which is called when an instance is created. Since we already get artist, scene and date from the data source, we call the constructor with these variables so that they are set. To translate date into datetime, we create a dictionary with the days and the corresponding ISO dates.

Notice that when the instance variable @date is set, a fair amount of formatting is done. The source gives us the date in a slightly different format, so we remove the non-breaking spaces, replace the full stop with a colon, and make sure there is a space between the three letters denoting the day and the time. Once this is done we call a method that generates the datetime value based on the date value. We use @ in front of the variable name to mark it as an instance variable.

The method add_datetime looks up the day letters in date_dict and replaces them with the ISO date, then it picks the time out of the @date variable and interpolates the two values into a datetime string.

The last method we create, to_arr, takes all the instance variables and returns them as an array. Since the CSV library we included earlier can create a CSV line from an array, this is a handy way of getting the values out of the object.

[sourcecode language="ruby"]
class Concert
  attr_accessor :artist, :scene, :date, :datetime

  def initialize(artist, scene, date)
    @date_dict = {'wed' => '2013-08-07', 'thu' => '2013-08-08', 'fri' => '2013-08-09', 'sat' => '2013-08-10'}
    @artist = artist.strip
    @scene = scene.strip
    @date = date.gsub(/\u00a0/, '').gsub('.', ':').gsub(/([a-zA-Z]{3})(.)/, '\1 \2').strip
    self.add_datetime
  end

  def to_arr
    return [self.artist, self.scene, self.date, self.datetime]
  end

  def add_datetime
    @datetime = "#{@date_dict[@date[0,3].downcase]} #{@date[4..9]}"
  end

end
[/sourcecode]

Reading the document, extracting the data and creating the objects

Now that we have a data structure in which to store the information, we can start fetching it from the internet. First we create an empty dictionary where we will store our concert objects as we create them.

We use Nokogiri's HTML class and store the result in the doc variable. To it we pass a text stream fetched from the URL; in other words, we send the same text we got from the getArtists.php source to Nokogiri.

Nokogiri has an excellent method called css. This method takes a CSS (Cascading Style Sheets) selector and finds the matching elements in the DOM (Document Object Model) held by Nokogiri. We want to iterate over all the “.table li” nodes (all the li nodes under the table class), and we do this simply with the .each method.

For each “.table li” we iterate over, we extract the content of the elements with the classes .name, .scene and .date, and create an object of the Concert class. The last thing we do in each iteration is to store the object in our concerts dictionary, with the artist as the key.

[sourcecode language="ruby"]
concerts = {}

doc = Nokogiri::HTML(open('http://oyafestivalen.com/wp-content/themes/oya13_new/includes/ajax/program/getArtists.php'))
doc.css('.table li').each do |el|
  a = Concert.new(el.css('.name a').first.content,
                  el.css('.scene').first.content,
                  el.css('.date').first.content)
  concerts[a.artist] = a
end
[/sourcecode]

Printing the objects as CSV

When all the objects are created we want to write their variables to a file. We do this by opening a file called output.csv with write access. Then we iterate through all the objects, using the key in the k variable to look up each object in our concerts dictionary. To get only the Øya festival's own concerts (not klubb-Øya) we check that the concert takes place on one of the stages “Enga”, “Klubben”, “Sjøsiden” or “Vika” (Sjøsiden has the wrong format here, something we correct later in Excel). For each object whose scene is among the Øya stages, a line is written to the CSV file from an array of values. This array comes from the to_arr method we wrote in the Concert class.

[sourcecode language="ruby"]
CSV.open("output.csv", "wb") do |csv|
  concerts.each do |k,v|
    csv << concerts[k].to_arr if ['Enga','Klubben','Sjøsiden','Vika'].include? concerts[k].scene
  end
end
[/sourcecode]

There you go. You should now have a CSV file with all the Øya artists, which you can either import into a database or open in Excel.

The whole script:

[sourcecode language="ruby"]

#!/usr/bin/ruby
# -*- encoding : utf-8 -*-

require 'nokogiri'
require 'open-uri'
require 'csv'

class Concert
  attr_accessor :artist, :scene, :date, :datetime

  def initialize(artist, scene, date)
    @date_dict = {'wed' => '2013-08-07', 'thu' => '2013-08-08', 'fri' => '2013-08-09', 'sat' => '2013-08-10'}
    @artist = artist.strip
    @scene = scene.strip
    @date = date.gsub(/\u00a0/, '').gsub('.', ':').gsub(/([a-zA-Z]{3})(.)/, '\1 \2').strip
    self.add_datetime
  end

  def to_arr
    return [self.artist, self.scene, self.date, self.datetime]
  end

  def add_datetime
    @datetime = "#{@date_dict[@date[0,3].downcase]} #{@date[4..9]}"
  end

end

concerts = {}

doc = Nokogiri::HTML(open('http://oyafestivalen.com/wp-content/themes/oya13_new/includes/ajax/program/getArtists.php'))
doc.css('.table li').each do |el|
  a = Concert.new(el.css('.name a').first.content,
                  el.css('.scene').first.content,
                  el.css('.date').first.content)
  concerts[a.artist] = a
end

CSV.open("output.csv", "wb") do |csv|
  concerts.each do |k,v|
    csv << concerts[k].to_arr if ['Enga','Klubben','Sjøsiden','Vika'].include? concerts[k].scene
  end
end

[/sourcecode]

Descriptive Statistical Methods in Code

Statistics can be mighty useful, and a little programming makes it even better. I often find that through code I can grasp the fundamental workings of how things function, and I have tried to apply this to statistics as well, in this case to descriptive statistics.

Join me in this short attempt to let statistics unconceal itself.

Getting ready: Extremistan and Mediocristan

This little text looks closer at some of the descriptive methods of statistics, and to do that I want to use proper datasets. I have collected two datasets following two different paradigms. The first dataset follows the rule of the normal distribution/bell curve, derives from the physical world, and is little susceptible to what Nassim Nicholas Taleb has identified as the domain of Black Swans. In Mediocristan values seldom deviate, and when they do, not extremely.

Our second dataset, on the other hand, is from the social world, where values do not conform to the bell curve, and where single values may change the whole dataset. The domain of Extremistan includes many values from the socio-cultural sphere, such as the number of cultural items sold, wealth, and the number of Twitter followers.

Heights of 10 people = 176, 182, 170, 186, 179, 183, 190, 183, 168, 180

Wealth of 10 people (imagined denomination) = 20, 100, 15, 5, 100, 10000, 1000, 30, 5, 200

To compute upon our values, we will save them into arrays.

[sourcecode language="python"]

heights = [176, 182, 170, 186, 179, 183, 190, 183, 168, 180]
wealth = [20, 100, 15, 5, 100, 10000, 1000, 30, 5, 200]

[/sourcecode]

The count

This method is as simple as it gets. The count does not look at any of the numeric values stored; it just counts the number of occurrences. In Python we have the built-in function len(), to which we can pass the array and get the length returned.

[sourcecode language="python"]

def count(data_arr):
    return len(data_arr)

[/sourcecode]

The sum

To find the sum of the elements, we need to look at the values and add them together. This is done by iterating through all the elements and adding their values to a single variable, which we then return. Python already has a sum method that does exactly what we need, but instead of using it we make our own: ownsum.

[sourcecode language="python"]

def ownsum(data_arr):
    i = 0
    for val in data_arr:
        i += val
    return i

[/sourcecode]

The min

All the values in both our arrays are comparable: the heights are measured in centimetres and the wealth in a common denomination. A higher number means more, and a lower number less, of that measure. The min, or minimum, finds the smallest value and returns it. To do this we start with a large number and always retain the smallest value we iterate over. In the implementation below the start value is set to the maximum value available to the system; to get this value we need to import the sys (system) library. Python already has a min method, so we implement our own: ownmin.

[sourcecode language="python"]

import sys

def ownmin(data_arr):
    i = sys.maxsize
    for val in data_arr:
        if val < i:
            i = val
    return i
[/sourcecode]

The max

The max method is somewhat of a reversed min method. It looks at all the values and finds the largest number. To get there we always retain the largest number seen as we iterate over the array. We start with the smallest possible value and then iterate. The max function is already implemented in Python, so we write our own: ownmax.

[sourcecode language="python"]

def ownmax(data_arr):
    i = -sys.maxsize
    for val in data_arr:
        if val > i:
            i = val
    return i

[/sourcecode]

The range

The range is the numerical difference between the smallest and the largest value. This can give us a quick indication of the spread of the data. The range is fairly easy to find using subtraction: if we subtract the min from the max we are left with the range. We could write a whole function that first performs the calculations for min and max and then does the subtraction, but since we have already implemented our own ownmin and ownmax methods, why not take advantage of work already done. Python already has a function named range, so to avoid namespace problems we implement the method ownrange.

[sourcecode language="python"]
def ownrange(data_arr):
    return ownmax(data_arr) - ownmin(data_arr)
[/sourcecode]

The mean/arithmetic average

The mean, or arithmetic average, tells us something about the central tendency. The mode and the median do so as well, but where the mode looks at the frequency of occurrences and the median looks at the central element of an ordered list, the arithmetic average takes the sum of all elements and divides it by the count. This is the most common way of finding the central tendency, but it is also easily distorted by large deviations in the dataset.

[sourcecode language="python"]
def mean(data_arr):
    # float() keeps the division from being truncated under Python 2
    return ownsum(data_arr) / float(count(data_arr))
[/sourcecode]

The median average

For the median the values of the elements matter less; what is paramount is the position at which the elements are found relative to each other. The median is basically the middle element of a sorted array. If the array has an even number of elements, the median is the arithmetic average of the two middle values. In the implementation of the method we first sort the array, then we get the length. If the length of the array is divisible by two (an even number of elements) we return the average of the two middle values; if not, we return the value of the middle element.

[sourcecode language="python"]
def median(data_arr):
    sorted_arr = sorted(data_arr)
    length = len(sorted_arr)
    if not length % 2:
        return (sorted_arr[length / 2] + sorted_arr[length / 2 - 1]) / 2.0
    return sorted_arr[length / 2]
[/sourcecode]

The mode average

The mode is the most frequent value in the dataset. In our case this may not be representative, but if you have large datasets with many data points it makes more sense. The mode can be used when trying to identify the distribution of the data. Often we have a mental anchor of the dataset having a unimodal distribution, with a central peak holding the most frequent values and descending frequencies as we move away from the centre in both directions (e.g. the empirical rule and bell curves), but datasets can also be bimodal, with two centres towards each end of the spectrum and lower frequencies towards the middle. In a large dataset the mode can help us spot these tendencies, as it allows for more than one mode. For example, a dataset with a range from 10 to 30 may have a unimodal distribution with a mode at 20, or a bimodal distribution with modes at 15 and 25; this would be hard to spot with either the median or the arithmetic average.

To find the mode we need to create a dictionary. In this dictionary we create a key for each of the distinct values and count their frequencies. Once we have the dictionary, we iterate over it to find the most frequent values. A mode can consist of more than one value, so we need to check the frequency of every dictionary item against the maximum frequency and return the element, or all elements, with that number of occurrences. Be aware that if none of the distinct values occur more than once, all values are returned.

[sourcecode language="python"]
def mode(data_arr):
    frequencies = {}
    for i in data_arr:
        frequencies[i] = frequencies.get(i, 0) + 1
    max_freq = ownmax(frequencies.itervalues())
    modes = [k for k, v in frequencies.iteritems() if v == max_freq]
    return modes
[/sourcecode]

The variance of a population

The variance is used for finding the spread of the dataset. In the average methods above we found the central tendency of the dataset, and now we want to see how much the values of the dataset conform to this number. From these values you can tell whether your dataset is more extreme or mediocre (these are not binary groups, but gravitating centres). Once presented with an average from a dataset, a good follow-up question is to ask for the standard deviation (next method). The variance is the average of each value's squared difference from the mean. To implement this method we first find the arithmetic mean of the dataset, and then iterate over all values comparing them to the average. Since the difference can be both higher and lower than the average, and we want to operate with positive numbers, we square each difference.

[sourcecode language="python"]
def variance(data_arr):
    variance = 0
    average = mean(data_arr)
    for i in data_arr:
        variance += (average - i) ** 2
    return variance / count(data_arr)
[/sourcecode]

The standard deviation

From the variance the way to the standard deviation is fairly simple: the standard deviation is the square root of the variance. It has the same unit as the average and is easier to compare. Here we take advantage of our last method and run a square root on its result to get the standard deviation. The sqrt method needs to be imported from the math library. Be aware that standard deviation and variance are computed differently depending on whether you are working with a sample or with a population; the difference is outside the scope of this article.

[sourcecode language="python"]
import math

def stddev(data_arr):
    return math.sqrt(variance(data_arr))
[/sourcecode]

The final code

We have now implemented some of the basic descriptive methods in statistics. The whole project can be found below. To run the functions, call them by their method name and pass in an array. Two 10-element arrays can be found in the script, but the methods are implemented in such a way that you can pass them an array of any length.

[sourcecode language=”python”]

import sys
import math

heights = [176, 182, 170, 186, 179, 183, 190, 183, 168]
wealth = [20, 100, 15, 5, 100, 10000, 1000, 30, 5, 200]

def count(data_arr):
    return len(data_arr)

def ownsum(data_arr):
    i = 0
    for val in data_arr:
        i += val
    return i

def ownmin(data_arr):
    i = sys.maxsize
    for val in data_arr:
        if val < i:
            i = val
    return i

def ownmax(data_arr):
    i = -sys.maxsize
    for val in data_arr:
        if val > i:
            i = val
    return i

def ownrange(data_arr):
    return ownmax(data_arr) - ownmin(data_arr)

def mean(data_arr):
    # float() avoids integer division under Python 2
    return float(ownsum(data_arr)) / count(data_arr)

def median(data_arr):
    sorted_arr = sorted(data_arr)
    length = len(sorted_arr)
    if not length % 2:
        # Even number of values: average the two middle values
        return (sorted_arr[length / 2] + sorted_arr[length / 2 - 1]) / 2.0
    return sorted_arr[length / 2]

def mode(data_arr):
    # Count the occurrences of each distinct value, then return all values with the highest count
    frequencies = {}
    for i in data_arr:
        frequencies[i] = frequencies.get(i, 0) + 1
    max_freq = ownmax(frequencies.itervalues())
    modes = [k for k, v in frequencies.iteritems() if v == max_freq]
    return modes

def variance(data_arr):
    # Population variance: average of the squared differences from the mean
    variance = 0
    average = mean(data_arr)
    for i in data_arr:
        variance += (average - i) ** 2
    return variance / count(data_arr)

def stddev(data_arr):
    return math.sqrt(variance(data_arr))

[/sourcecode]
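For example, using the two arrays included in the script:

[sourcecode language=”python”]
print count(heights)     # 9
print median(heights)    # 182
print mode(heights)      # [183]
print ownrange(wealth)   # 9995
print mean(wealth), stddev(wealth)   # the wealth values are far more spread out than the heights
[/sourcecode]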

Work programmatically with Google Spreadsheets

Some time back I authored a script that reads through a CSV-formatted list and, based on the artists' names, tries to decide the nationality of each artist by querying the last.FM search engine and parsing the XML-structured result.

The script worked, finding and returning about 80% of the artists, and in around 80% of those cases the returned artist was actually the one we were looking for. What if the changes could be made in the document itself? For the CSV-based script the artists had to be extracted from the database, then parsed, and the results then put back into the document. With this approach we can skip the middle step and let the script run while the data remains available for the users to view.

If you haven't already used Google Docs, you should consider it, as it is a very convenient way of working with documents, especially when there are multiple users. Through Google Docs you can collaborate on documents, and since they are stored in the cloud all users instantly have the latest version. Another advantage is that when you are working on the document simultaneously, everyone is editing the same document, so merging different versions is not a problem.

A great tool for combining Google Spreadsheets with the Python programming environment is the gspread module. After installing this package you only need to import it, and with very few lines of code you can retrieve and update the information in the spreadsheet cells.

The specifics for this task are pretty much the few lines below: import the package, log in, find the correct document and do whatever you need to. (This excerpt won't run on its own; check the whole script further down. NB: mind the indentation, it may not be displayed correctly in the browser.)

[sourcecode language=”python”]
import gspread

G_USERNAME = 'your@gmail.com'
G_PASSWORD = 'yourPassword'
G_IDENTIFIER = 'document_identifier_checkdocument_url_in_your_browser'

# Connecting to the data source
gc = gspread.login(G_USERNAME, G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)

for a in range(len(get_val)):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e
[/sourcecode]

Above are the lines related to connecting Python to Google Docs; below you can see the whole script and how the method I mentioned in an earlier post is used in this setting.

[sourcecode language=”python”]
#!/usr/bin/python
# -*- coding:utf-8 -*-

"""
Clouds &amp; Concerts – 2012
Ola Loevholm

Initialized from the commandline. Runs through The Google doc spreadsheet with topp 5000 artists, and
runs the parsing query against the Last.FM browser then enters the country and search string (for validation)
into the google docs.

"""

G_USERNAME = ‘your@gmail.com’
G_PASSWORD = ‘yourPassword’
G_IDENTIFIER = ‘document_identifier_checkdocument_url_in_your_browser’

import sys, urllib, string, csv, time
import xml.etree.ElementTree as ET
import gspread

# Loads a dictionary with ISO 3166-1 abbreviations and codes
COUNTRIES = {"AF":"AFGHANISTAN","AX":"ÅLAND ISLANDS","AL":"ALBANIA","DZ":"ALGERIA","AS":"AMERICAN SAMOA","AD":"ANDORRA","AO":"ANGOLA","AI":"ANGUILLA","AQ":"ANTARCTICA","AG":"ANTIGUA AND BARBUDA","AR":"ARGENTINA","AM":"ARMENIA","AW":"ARUBA","AU":"AUSTRALIA","AT":"AUSTRIA","AZ":"AZERBAIJAN","BS":"BAHAMAS","BH":"BAHRAIN","BD":"BANGLADESH","BB":"BARBADOS","BY":"BELARUS","BE":"BELGIUM","BZ":"BELIZE","BJ":"BENIN","BM":"BERMUDA","BT":"BHUTAN","BO":"BOLIVIA, PLURINATIONAL STATE OF","BQ":"BONAIRE, SINT EUSTATIUS AND SABA","BA":"BOSNIA AND HERZEGOVINA","BW":"BOTSWANA","BV":"BOUVET ISLAND","BR":"BRAZIL","IO":"BRITISH INDIAN OCEAN TERRITORY","BN":"BRUNEI DARUSSALAM","BG":"BULGARIA","BF":"BURKINA FASO","BI":"BURUNDI","KH":"CAMBODIA","CM":"CAMEROON","CA":"CANADA","CV":"CAPE VERDE","KY":"CAYMAN ISLANDS","CF":"CENTRAL AFRICAN REPUBLIC","TD":"CHAD","CL":"CHILE","CN":"CHINA","CX":"CHRISTMAS ISLAND",
"CC":"COCOS (KEELING) ISLANDS","CO":"COLOMBIA","KM":"COMOROS","CG":"CONGO","CD":"CONGO, THE DEMOCRATIC REPUBLIC OF THE","CK":"COOK ISLANDS","CR":"COSTA RICA","CI":"CÔTE D’IVOIRE","HR":"CROATIA","CU":"CUBA","CW":"CURAÇAO","CY":"CYPRUS","CZ":"CZECH REPUBLIC","DK":"DENMARK","DJ":"DJIBOUTI","DM":"DOMINICA","DO":"DOMINICAN REPUBLIC","EC":"ECUADOR","EG":"EGYPT","SV":"EL SALVADOR","GQ":"EQUATORIAL GUINEA","ER":"ERITREA","EE":"ESTONIA","ET":"ETHIOPIA","FK":"FALKLAND ISLANDS (MALVINAS)","FO":"FAROE ISLANDS","FJ":"FIJI","FI":"FINLAND","FR":"FRANCE","GF":"FRENCH GUIANA","PF":"FRENCH POLYNESIA","TF":"FRENCH SOUTHERN TERRITORIES","GA":"GABON","GM":"GAMBIA","GE":"GEORGIA","DE":"GERMANY","GH":"GHANA","GI":"GIBRALTAR","GR":"GREECE","GL":"GREENLAND","GD":"GRENADA","GP":"GUADELOUPE","GU":"GUAM","GT":"GUATEMALA","GG":"GUERNSEY","GN":"GUINEA","GW":"GUINEA-BISSAU","GY":"GUYANA","HT":"HAITI","HM":"HEARD ISLAND AND MCDONALD ISLANDS",
"VA":"HOLY SEE (VATICAN CITY STATE)","HN":"HONDURAS","HK":"HONG KONG","HU":"HUNGARY","IS":"ICELAND","IN":"INDIA","ID":"INDONESIA","IR":"IRAN, ISLAMIC REPUBLIC OF","IQ":"IRAQ","IE":"IRELAND","IM":"ISLE OF MAN","IL":"ISRAEL","IT":"ITALY","JM":"JAMAICA","JP":"JAPAN","JE":"JERSEY","JO":"JORDAN","KZ":"KAZAKHSTAN","KE":"KENYA","KI":"KIRIBATI","KP":"KOREA, DEMOCRATIC PEOPLE’S REPUBLIC OF","KR":"KOREA, REPUBLIC OF","KW":"KUWAIT","KG":"KYRGYZSTAN","LA":"LAO PEOPLE’S DEMOCRATIC REPUBLIC","LV":"LATVIA","LB":"LEBANON","LS":"LESOTHO","LR":"LIBERIA","LY":"LIBYA","LI":"LIECHTENSTEIN","LT":"LITHUANIA","LU":"LUXEMBOURG","MO":"MACAO","MK":"MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF","MG":"MADAGASCAR","MW":"MALAWI","MY":"MALAYSIA","MV":"MALDIVES","ML":"MALI","MT":"MALTA","MH":"MARSHALL ISLANDS","MQ":"MARTINIQUE","MR":"MAURITANIA","MU":"MAURITIUS","YT":"MAYOTTE","MX":"MEXICO","FM":"MICRONESIA, FEDERATED STATES OF",
"MD":"MOLDOVA, REPUBLIC OF","MC":"MONACO","MN":"MONGOLIA","ME":"MONTENEGRO","MS":"MONTSERRAT","MA":"MOROCCO","MZ":"MOZAMBIQUE","MM":"MYANMAR","NA":"NAMIBIA","NR":"NAURU","NP":"NEPAL","NL":"NETHERLANDS","NC":"NEW CALEDONIA","NZ":"NEW ZEALAND","NI":"NICARAGUA","NE":"NIGER","NG":"NIGERIA","NU":"NIUE","NF":"NORFOLK ISLAND","MP":"NORTHERN MARIANA ISLANDS","NO":"NORWAY","OM":"OMAN","PK":"PAKISTAN","PW":"PALAU","PS":"PALESTINIAN TERRITORY, OCCUPIED","PA":"PANAMA","PG":"PAPUA NEW GUINEA","PY":"PARAGUAY","PE":"PERU","PH":"PHILIPPINES","PN":"PITCAIRN","PL":"POLAND","PT":"PORTUGAL","PR":"PUERTO RICO","QA":"QATAR","RE":"RÉUNION","RO":"ROMANIA","RU":"RUSSIAN FEDERATION","RW":"RWANDA","BL":"SAINT BARTHÉLEMY","SH":"SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA","KN":"SAINT KITTS AND NEVIS","LC":"SAINT LUCIA","MF":"SAINT MARTIN (FRENCH PART)","PM":"SAINT PIERRE AND MIQUELON","VC":"SAINT VINCENT AND THE GRENADINES",
"WS":"SAMOA","SM":"SAN MARINO","ST":"SAO TOME AND PRINCIPE","SA":"SAUDI ARABIA","SN":"SENEGAL","RS":"SERBIA","SC":"SEYCHELLES","SL":"SIERRA LEONE","SG":"SINGAPORE","SX":"SINT MAARTEN (DUTCH PART)","SK":"SLOVAKIA","SI":"SLOVENIA","SB":"SOLOMON ISLANDS","SO":"SOMALIA","ZA":"SOUTH AFRICA","GS":"SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS","SS":"SOUTH SUDAN","ES":"SPAIN","LK":"SRI LANKA","SD":"SUDAN","SR":"SURINAME","SJ":"SVALBARD AND JAN MAYEN","SZ":"SWAZILAND","SE":"SWEDEN","CH":"SWITZERLAND","SY":"SYRIAN ARAB REPUBLIC","TW":"TAIWAN, PROVINCE OF CHINA","TJ":"TAJIKISTAN","TZ":"TANZANIA, UNITED REPUBLIC OF","TH":"THAILAND","TL":"TIMOR-LESTE","TG":"TOGO","TK":"TOKELAU","TO":"TONGA","TT":"TRINIDAD AND TOBAGO","TN":"TUNISIA","TR":"TURKEY","TM":"TURKMENISTAN","TC":"TURKS AND CAICOS ISLANDS","TV":"TUVALU","UG":"UGANDA","UA":"UKRAINE","AE":"UNITED ARAB EMIRATES","GB":"UNITED KINGDOM","US":"UNITED STATES",
"UM":"UNITED STATES MINOR OUTLYING ISLANDS","UY":"URUGUAY","UZ":"UZBEKISTAN","VU":"VANUATU","VE":"VENEZUELA, BOLIVARIAN REPUBLIC OF","VN":"VIET NAM","VG":"VIRGIN ISLANDS, BRITISH","VI":"VIRGIN ISLANDS, U.S.","WF":"WALLIS AND FUTUNA","EH":"WESTERN SAHARA","YE":"YEMEN","ZM":"ZAMBIA","ZW":"ZIMBABWE"}

# Connecting to the data source
gc = gspread.login(G_USERNAME, G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)

# Iterates through the XML structure and removes the namespace, for easier navigation in getCountry()'s ElementTree.findall()
def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.getiterator():
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]
# getCountry - where the magic happens. Encodes the artist name for the URL, then queries the Musicbrainz search engine,
# parses the XML answer and gets the name, id and country of the first returned element (the one with the highest weight).
# Returns the country name if a) the artist is found through the search engine and b) the artist has a country associated
# with the profile, otherwise returns False.
def getCountry(name):
    query = urllib.quote_plus(name)
    BASE_URL = "http://musicbrainz.org/ws/2/artist/?query=%s&format=xml&method=advanced" % (query)
    print "Querying: %s" % (BASE_URL)
    try:
        search_input = urllib.urlopen(BASE_URL)
        # Checks whether the HTTP response code is 200 - if not, sleeps for 5 seconds and retries // Intended for 503 codes
        http_code = search_input.code
        if http_code != 200:
            # print "Could not access: %s \t Got HTTP Code: %s. 5 second cool-down" % (name, http_code)
            time.sleep(5)
            return getCountry(name)
    except Exception:
        print "GETTING_ERROR: Something went wrong while getting HTTP"
        return False
    #search_xml = search_input.read()
    #print search_xml
    try:
        tree = ET.parse(search_input)
        remove_namespace(tree, u'http://musicbrainz.org/ns/mmd-2.0#')
        feed = tree.getroot()
        elem = feed.findall("./artist-list/")
        #print elem[0].find('name').text
        #print elem[0].get('id')
    except Exception:
        print "PARSE_ERROR: Something went wrong while parsing HTTP"
        return False
    try:
        if elem[0].find('country') is not None:
            # print COUNTRIES[elem[0].find('country').text]
            try:
                country = COUNTRIES[elem[0].find('country').text]
            except Exception:
                print "Could not find key in countrylist error"
                return False
            return [country, elem[0].get('id'), elem[0].find('name').text]
        else:
            print elem[0].find('name').text + " has no country associated\n"
            return False
    except (IndexError, ValueError):
        print "ERROR - COULD NOT GET DATA FROM %s\n" % (name)
        return False

# Running through the values: build the cell addresses first
get_val = []
set_name = []
set_country = []
for a in range(2, 5124):
    v = "B%s" % a      # column B holds the artist name to look up
    sn = "H%s" % a     # column H receives the name parsed by Musicbrainz
    sc = "G%s" % a     # column G receives the country
    get_val.append(v)
    set_name.append(sn)
    set_country.append(sc)

# Then iterate over the addresses and update the spreadsheet
for a in range(len(get_val)):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e

[/sourcecode]

Data-wrangling: find country based on artist name

At the Clouds & Concerts project at the University of Oslo we are working on really interesting topics, based on rich empirical data. Through our collaboration with the Norwegian streaming service provider WiMP, we are analysing a vast collection of data together with Telenor and WiMP. More about the data side of the project (the 'Clouds' in the project's name) can be found on the project's website.

Artist and Country

One of the tasks at hand was to find out which country an artist came from, and whether they came from Norway or not. One way of doing this is to manually go through each artist and use preexisting knowledge about music to determine their country of origin and, if stuck, use online services (aren't we all mostly using Google as our initial source of wisdom?). Another alternative is to use online services first and then use human preexisting knowledge to quality-assure the final result.

On the Internet, a vast number of sources can be found. However, if you want to get the data without too much fitting, testing and nitty-gritty adaptation for every source, it is an advantage to have a consistently structured repository you can tap from. Luckily, the Metabrainz foundation maintains a large repository of musical metadata known as Musicbrainz.

Below you will find a script which should (partially) solve our problem by combining the data from Musicbrainz with data exported from our research data. That being said, this script is more a method than a finished product. It should be easily adaptable, but it is an advantage if you know Python and how to handle CSV files. Codecademy has a good introduction to Python.

The core idea of the script is to take input rows with a name and a number of streams and turn them into output rows with the name from the original data source, the number of streams from the original data source, as well as the country of origin, the Musicbrainz ID, and the name as parsed by the Musicbrainz search engine (for initial quality assurance).
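To make this concrete, here is a purely hypothetical example (the rank and stream count are made up, and the Musicbrainz ID is only a placeholder): an input row such as '42,Radiohead,18234' should ideally come out as '42,Radiohead,18234,UNITED KINGDOM,&lt;musicbrainz-id&gt;,Radiohead', while a row for an artist without a country match instead gets the text 'No Country Found or fail occurred' appended.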

To keep things simple there is only one successful outcome: the name sent to the Musicbrainz search engine returns an answer, and that answer has a country associated with it. Be advised (that is also why I have marked the title with 'try') that the search engine may not return the artist you meant. For that reason we also print the name of the artist we find, so it can later be compared with the original name in the Excel spreadsheet (you are going to convert the CSV to Excel before reviewing, aren't you? A good tool is Google Refine). Another problem is that popular cultural phenomena, common nouns and tribute bands (probably in that order, descending) may share the same name. This is why a human is always needed, or a semantically absolute URI associated with each phenomenon. This leads me on to the last step before the code.

Other ways this could have been solved (let me know if you solve the problem in any of these ways)

The semantic way:

The data found in the Musicbrainz database is also made available through a SPARQL endpoint named LinkedBrainz. If you know the right ontologies and are comfortable with performing triplestore queries, this is perhaps the most innovative and forward-thinking way to solve the problem.

The Virtual Machine Postgres way:

Instead of querying their server, you can be a gentleman and download the server onto your own machine. If you have VirtualBox (if you don't, it can be downloaded for free) you can run the server locally. An image file with the complete Musicbrainz database can be found on their web pages.

The code:

Here is the code used to solve this task. It can also be cloned from the Clouds & Concerts GitHub page.

[sourcecode language=”python”]
#!/usr/bin/python
# -*- coding:utf-8 -*-

"""
Clouds & Concerts – 2012
Ola Loevholm

Called from command line:
The script reads a file named "topp1000_artister.csv" consisting of a list of artists and then tries to find out which country each artist comes from based on the name.
The name is given in the second column of the CSV file.

Called as a module:
The method getCountry() takes an artist name and checks this with musicbrainz seach engine. Returns the country if a) artist is found through the search engine b) artist has a country associated to the profile

"""

import sys, urllib, string, csv, time
import xml.etree.ElementTree as ET

# Loads a dictionary with ISO 3166-1 abbreviations and countries
COUNTRIES = {"AF":"AFGHANISTAN","AX":"ÅLAND ISLANDS","AL":"ALBANIA","DZ":"ALGERIA","AS":"AMERICAN SAMOA","AD":"ANDORRA","AO":"ANGOLA","AI":"ANGUILLA","AQ":"ANTARCTICA","AG":"ANTIGUA AND BARBUDA","AR":"ARGENTINA","AM":"ARMENIA","AW":"ARUBA","AU":"AUSTRALIA","AT":"AUSTRIA","AZ":"AZERBAIJAN","BS":"BAHAMAS","BH":"BAHRAIN","BD":"BANGLADESH","BB":"BARBADOS","BY":"BELARUS","BE":"BELGIUM","BZ":"BELIZE","BJ":"BENIN","BM":"BERMUDA","BT":"BHUTAN","BO":"BOLIVIA, PLURINATIONAL STATE OF","BQ":"BONAIRE, SINT EUSTATIUS AND SABA","BA":"BOSNIA AND HERZEGOVINA","BW":"BOTSWANA","BV":"BOUVET ISLAND","BR":"BRAZIL","IO":"BRITISH INDIAN OCEAN TERRITORY","BN":"BRUNEI DARUSSALAM","BG":"BULGARIA","BF":"BURKINA FASO","BI":"BURUNDI","KH":"CAMBODIA","CM":"CAMEROON","CA":"CANADA","CV":"CAPE VERDE","KY":"CAYMAN ISLANDS","CF":"CENTRAL AFRICAN REPUBLIC","TD":"CHAD","CL":"CHILE","CN":"CHINA","CX":"CHRISTMAS ISLAND",
"CC":"COCOS (KEELING) ISLANDS","CO":"COLOMBIA","KM":"COMOROS","CG":"CONGO","CD":"CONGO, THE DEMOCRATIC REPUBLIC OF THE","CK":"COOK ISLANDS","CR":"COSTA RICA","CI":"CÔTE D’IVOIRE","HR":"CROATIA","CU":"CUBA","CW":"CURAÇAO","CY":"CYPRUS","CZ":"CZECH REPUBLIC","DK":"DENMARK","DJ":"DJIBOUTI","DM":"DOMINICA","DO":"DOMINICAN REPUBLIC","EC":"ECUADOR","EG":"EGYPT","SV":"EL SALVADOR","GQ":"EQUATORIAL GUINEA","ER":"ERITREA","EE":"ESTONIA","ET":"ETHIOPIA","FK":"FALKLAND ISLANDS (MALVINAS)","FO":"FAROE ISLANDS","FJ":"FIJI","FI":"FINLAND","FR":"FRANCE","GF":"FRENCH GUIANA","PF":"FRENCH POLYNESIA","TF":"FRENCH SOUTHERN TERRITORIES","GA":"GABON","GM":"GAMBIA","GE":"GEORGIA","DE":"GERMANY","GH":"GHANA","GI":"GIBRALTAR","GR":"GREECE","GL":"GREENLAND","GD":"GRENADA","GP":"GUADELOUPE","GU":"GUAM","GT":"GUATEMALA","GG":"GUERNSEY","GN":"GUINEA","GW":"GUINEA-BISSAU","GY":"GUYANA","HT":"HAITI","HM":"HEARD ISLAND AND MCDONALD ISLANDS",
"VA":"HOLY SEE (VATICAN CITY STATE)","HN":"HONDURAS","HK":"HONG KONG","HU":"HUNGARY","IS":"ICELAND","IN":"INDIA","ID":"INDONESIA","IR":"IRAN, ISLAMIC REPUBLIC OF","IQ":"IRAQ","IE":"IRELAND","IM":"ISLE OF MAN","IL":"ISRAEL","IT":"ITALY","JM":"JAMAICA","JP":"JAPAN","JE":"JERSEY","JO":"JORDAN","KZ":"KAZAKHSTAN","KE":"KENYA","KI":"KIRIBATI","KP":"KOREA, DEMOCRATIC PEOPLE’S REPUBLIC OF","KR":"KOREA, REPUBLIC OF","KW":"KUWAIT","KG":"KYRGYZSTAN","LA":"LAO PEOPLE’S DEMOCRATIC REPUBLIC","LV":"LATVIA","LB":"LEBANON","LS":"LESOTHO","LR":"LIBERIA","LY":"LIBYA","LI":"LIECHTENSTEIN","LT":"LITHUANIA","LU":"LUXEMBOURG","MO":"MACAO","MK":"MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF","MG":"MADAGASCAR","MW":"MALAWI","MY":"MALAYSIA","MV":"MALDIVES","ML":"MALI","MT":"MALTA","MH":"MARSHALL ISLANDS","MQ":"MARTINIQUE","MR":"MAURITANIA","MU":"MAURITIUS","YT":"MAYOTTE","MX":"MEXICO","FM":"MICRONESIA, FEDERATED STATES OF",
"MD":"MOLDOVA, REPUBLIC OF","MC":"MONACO","MN":"MONGOLIA","ME":"MONTENEGRO","MS":"MONTSERRAT","MA":"MOROCCO","MZ":"MOZAMBIQUE","MM":"MYANMAR","NA":"NAMIBIA","NR":"NAURU","NP":"NEPAL","NL":"NETHERLANDS","NC":"NEW CALEDONIA","NZ":"NEW ZEALAND","NI":"NICARAGUA","NE":"NIGER","NG":"NIGERIA","NU":"NIUE","NF":"NORFOLK ISLAND","MP":"NORTHERN MARIANA ISLANDS","NO":"NORWAY","OM":"OMAN","PK":"PAKISTAN","PW":"PALAU","PS":"PALESTINIAN TERRITORY, OCCUPIED","PA":"PANAMA","PG":"PAPUA NEW GUINEA","PY":"PARAGUAY","PE":"PERU","PH":"PHILIPPINES","PN":"PITCAIRN","PL":"POLAND","PT":"PORTUGAL","PR":"PUERTO RICO","QA":"QATAR","RE":"RÉUNION","RO":"ROMANIA","RU":"RUSSIAN FEDERATION","RW":"RWANDA","BL":"SAINT BARTHÉLEMY","SH":"SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA","KN":"SAINT KITTS AND NEVIS","LC":"SAINT LUCIA","MF":"SAINT MARTIN (FRENCH PART)","PM":"SAINT PIERRE AND MIQUELON","VC":"SAINT VINCENT AND THE GRENADINES",
"WS":"SAMOA","SM":"SAN MARINO","ST":"SAO TOME AND PRINCIPE","SA":"SAUDI ARABIA","SN":"SENEGAL","RS":"SERBIA","SC":"SEYCHELLES","SL":"SIERRA LEONE","SG":"SINGAPORE","SX":"SINT MAARTEN (DUTCH PART)","SK":"SLOVAKIA","SI":"SLOVENIA","SB":"SOLOMON ISLANDS","SO":"SOMALIA","ZA":"SOUTH AFRICA","GS":"SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS","SS":"SOUTH SUDAN","ES":"SPAIN","LK":"SRI LANKA","SD":"SUDAN","SR":"SURINAME","SJ":"SVALBARD AND JAN MAYEN","SZ":"SWAZILAND","SE":"SWEDEN","CH":"SWITZERLAND","SY":"SYRIAN ARAB REPUBLIC","TW":"TAIWAN, PROVINCE OF CHINA","TJ":"TAJIKISTAN","TZ":"TANZANIA, UNITED REPUBLIC OF","TH":"THAILAND","TL":"TIMOR-LESTE","TG":"TOGO","TK":"TOKELAU","TO":"TONGA","TT":"TRINIDAD AND TOBAGO","TN":"TUNISIA","TR":"TURKEY","TM":"TURKMENISTAN","TC":"TURKS AND CAICOS ISLANDS","TV":"TUVALU","UG":"UGANDA","UA":"UKRAINE","AE":"UNITED ARAB EMIRATES","GB":"UNITED KINGDOM","US":"UNITED STATES",
"UM":"UNITED STATES MINOR OUTLYING ISLANDS","UY":"URUGUAY","UZ":"UZBEKISTAN","VU":"VANUATU","VE":"VENEZUELA, BOLIVARIAN REPUBLIC OF","VN":"VIET NAM","VG":"VIRGIN ISLANDS, BRITISH","VI":"VIRGIN ISLANDS, U.S.","WF":"WALLIS AND FUTUNA","EH":"WESTERN SAHARA","YE":"YEMEN","ZM":"ZAMBIA","ZW":"ZIMBABWE"}

# Iterates through the XML structure and removes the namespace, for easier navigation in getCountry()'s ElementTree.findall()
def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.getiterator():
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]

# getCountry - where the magic happens. Encodes the artist name for the URL, then queries the Musicbrainz search engine,
# parses the XML answer and gets the name, id and country of the first returned element (the one with the highest weight).
# Returns the country name if a) the artist is found through the search engine and b) the artist has a country associated
# with the profile, otherwise returns False.
def getCountry(name):
    query = urllib.quote_plus(name)
    BASE_URL = "http://musicbrainz.org/ws/2/artist/?query=%s&format=xml&method=advanced" % (query)
    print "Querying: %s" % (BASE_URL)
    try:
        search_input = urllib.urlopen(BASE_URL)
        # Checks whether the HTTP response code is 200 - if not, sleeps for 5 seconds and retries // Intended for 503 codes
        http_code = search_input.code
        if http_code != 200:
            # print "Could not access: %s \t Got HTTP Code: %s. 5 second cool-down" % (name, http_code)
            time.sleep(5)
            return getCountry(name)
    except Exception:
        print "GETTING_ERROR: Something went wrong while getting HTTP"
        return False
    #search_xml = search_input.read()
    #print search_xml
    try:
        tree = ET.parse(search_input)
        remove_namespace(tree, u'http://musicbrainz.org/ns/mmd-2.0#')
        feed = tree.getroot()
        elem = feed.findall("./artist-list/")
        #print elem[0].find('name').text
        #print elem[0].get('id')
    except Exception:
        print "PARSE_ERROR: Something went wrong while parsing HTTP"
        return False
    try:
        if elem[0].find('country') is not None:
            # print COUNTRIES[elem[0].find('country').text]
            try:
                country = COUNTRIES[elem[0].find('country').text]
            except Exception:
                print "Could not find key in countrylist error"
                return False
            return [country, elem[0].get('id'), elem[0].find('name').text]
        else:
            print elem[0].find('name').text + " has no country associated\n"
            return False
    except (IndexError, ValueError):
        print "ERROR - COULD NOT GET DATA FROM %s\n" % (name)
        return False

# If the script is called from the terminal: iterates through the top 1000 artists contained in a CSV file in the same directory.
if __name__ == "__main__":
    #name = sys.argv[1]
    csvfile = open("topp1000_artister.csv")
    outfile = open("topp1000_output.csv", "w")
    artistlist = csv.reader(csvfile, delimiter=',', quotechar='"')
    for line in artistlist:
        result = getCountry(line[1])
        try:
            if result != False:
                result_string = "%s,%s,%s,%s,%s,%s\n" % (line[0], line[1], line[2], result[0], result[1], result[2])
                # print result_string
            else:
                result_string = "%s,%s,%s,%s\n" % (line[0], line[1], line[2], "No Country Found or fail occurred")
                # print result_string
        except (IndexError, ValueError) as e:
            print e
            result_string = "Error on element: %s\n" % line[1]
        try:
            outfile.write(result_string)
        except Exception:
            print "Write error happened with %s" % line[1]
    csvfile.close()
    outfile.close()
[/sourcecode]
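If you want to reuse getCountry() from another script instead of running the file from the terminal, a minimal sketch could look like the one below. The module name artist_country is only an assumption for illustration; use whatever you saved the script as.

[sourcecode language=”python”]
# Minimal usage sketch - assumes the script above was saved as artist_country.py (hypothetical name)
from artist_country import getCountry

result = getCountry("Kings of Convenience")
if result:
    country, mbid, parsed_name = result
    print "%s -> %s (%s)" % (parsed_name, country, mbid)
else:
    print "No country found for this artist"
[/sourcecode]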

And as always, I am most grateful for feedback! I hope this may come in handy!