Browsed by
Tag: Python

A short script for testing writing many files to a folder

The challenge: We want to see when the number of files in a folder starts to degrade the performance of adding new files to the same folder. Two examples of where we may need to do this are: to get an overview of the performance of the file system's node structure, or to test the Windows function for 8dot3 name compatibility.

The solution: We create a script that writes a large number of files to the folder in question and logs the time taken at specific milestones. The records logged from the execution of this script tell us how long it takes to write files until each milestone is reached, and from this we can infer how efficient the file system is at writing files between the different milestones.

Example of output
A graph representing the number of files created over time. The X axis shows the number of seconds elapsed, and the Y axis the number of files created. What does your function look like?

The implementation: I've chosen to put the creation of new files in a for loop which runs N times based on user input. The loop will start, open a new file with an incremental file name, write the payload to the file, and finally close the file and increment the loop counter.

Wrapped around this core functionality, we need to define which folder the files will be created in and what data is to be read and written. We need to read the defined data into a variable (we don't want to add overhead by reading the data-to-write on each iteration) and create a test folder if it does not already exist. In addition we need a function to write the timestamp and the iteration number to a file.

To allow for multi-process testing I've also added a loop for spawning new processes and passing on the number of files each should create, and to cover more scenarios, e.g. renaming and deleting files, more actions have been added.

The actions, the test folder path, the input file, and the number of files and processes are things the user will most likely change frequently, so instead of hard coding them they are provided by the user as command line arguments. As always when dealing with command line arguments: provide good defaults, as the user is unlikely to set every available parameter.

From description to code this will look something like this:

import time
import os
import string
import random
from multiprocessing import Process
import multiprocessing
import optparse
import os.path

def main(files_each=100, processes=10, actions="a", log_interval=100, temp_path="temp_files", infile="infile.txt"):
  path = temp_path
  check_and_create_folder_path(path)
  for i in range(processes):
    p = Process(target=spawnTask, args=(path, files_each, actions, log_interval, infile))
    p.start()

def print_time_delta(start_time, comment, outfile=False):
  if not outfile:
    print(comment," | ",time.time() - start_time, " seconds")
  else:
    with open(outfile, 'a+') as out:
      out.write("{0} | {1} \n".format(time.time() - start_time, comment))

def spawnTask(path,files_each, actions,log_interval, infile):
  start_time = time.time()
  content = read_file_data(infile)

  print_time_delta(start_time,"creating files for process: "+str(os.getpid()))
  created_files = createfiles(files_each, content,path,start_time, log_interval)
  if(actions == 'a' or actions == 'cr'):
    print_time_delta(start_time,"renaming files for process: " +str(os.getpid()))
    renamed_files = rename_files(created_files,path,start_time, log_interval)
  if(actions == 'a'):
    print_time_delta(start_time,"deleting files for process: "+str(os.getpid()))
    delete_files(renamed_files,path,start_time, log_interval)

  print_time_delta(start_time,"operations have ended. Terminating process:"+str(os.getpid()))

def createfiles(number_of_files, content,path,start_time, log_interval):
  own_pid = str(os.getpid())
  created_files = []
  for i in range(number_of_files):
    # Log a milestone every log_interval files; a file is created on every iteration
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"create","prod_log.txt")
    filename = "wordfile_test_"+"_"+own_pid+"_"+str(i)+".docx"
    created_files.append(filename)
    with open(path+"\\"+filename,"wb") as print_file:
      print_file.write(content)

  print_time_delta(start_time, str(number_of_files) +" | "+own_pid+" | "+"create","prod_log.txt")

  return created_files

def rename_files(filenames,path,start_time, log_interval):
  new_filenames = []
  own_pid = str(os.getpid())
  i = 0
  for file in filenames:
    # Log a milestone every log_interval files; every file is renamed
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"rename","prod_log.txt")
    lst = [random.choice(string.ascii_letters + string.digits) for n in range(30)]
    text = "".join(lst)
    os.rename(path+"\\"+file,path+"\\"+text+".docx")
    new_filenames.append(text+".docx")
    i += 1

  print_time_delta(start_time, str(len(new_filenames))+" | "+own_pid+" | "+"rename","prod_log.txt")

  return new_filenames

def delete_files(filenames,path,start_time, log_interval):
  num_files = len(filenames)
  own_pid = str(os.getpid())
  i = 0
  for file in filenames:
    # Log a milestone every log_interval files; every file is deleted
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"delete","prod_log.txt")
    os.remove(path+"\\"+file)
    i += 1

  print_time_delta(start_time, str(num_files)+" | "+own_pid+" | "+"delete","prod_log.txt")

def check_and_create_folder_path(path):
  if not os.path.exists(path):
    os.makedirs(path)

def read_file_data(infile):
  with open(infile,"rb") as content_file:
    content = content_file.read()
  return content

if __name__ == "__main__":
  multiprocessing.freeze_support()
  parser = optparse.OptionParser()
  parser.add_option('-f', '--files', default=100, help="The number of files each process should create. Default is 100")
  parser.add_option('-p', '--processes', default=10, help="The number of processes the program should create. Default is 10")
  parser.add_option('-a', '--action', default='a', help="The action which the program should perform. The default is a.\n Options include a (all), c (create), cr (create and rename)")
  parser.add_option('-l', '--log_interval', default=100, help="The interval between when a process is logging files created. Default is 100")
  parser.add_option('-t', '--temp_path', default="temp_files", help="Path where the file processes will be done")
  parser.add_option('-i', '--infile', default="infile.txt", help="The file which will be used in the test")

  options, args = parser.parse_args()
  main(int(options.files), int(options.processes), options.action, int(options.log_interval), options.temp_path, options.infile)

 

 

Sample from the output log

The output from running this script is a pipe-separated ('|') list with seconds, number of files, the process ID (since we let the program spawn and run several similar processes simultaneously we need a way to identify them) and the action. It will look like the sample shown above, and from these numbers you can create statistics on performance at different folder sizes.
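If you want to compute those statistics straight from the log, a small parsing sketch along these lines can be used. It is not part of the original script; it only assumes the prod_log.txt format shown above ("seconds | files | pid | action"):

# Sketch: read prod_log.txt and print how long each process spent between consecutive milestones
from collections import defaultdict

milestones = defaultdict(list)
with open("prod_log.txt") as log:
  for line in log:
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
      continue
    # store (files, seconds) per process and action
    milestones[(parts[2], parts[3])].append((int(parts[1]), float(parts[0])))

for (pid, action), points in milestones.items():
  points.sort()
  for (f1, t1), (f2, t2) in zip(points, points[1:]):
    print("{0} {1}: files {2}-{3} took {4:.2f} seconds".format(pid, action, f1, f2, t2 - t1))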

The idea of performing this analysis, and valuable feedback along the way, came from great colleagues at Steria AS. Any issues, problems, responsibilities etc. with the code or text are solely my own. Whatever you choose to do or try out with this information is solely your own responsibility.

The folder image is by Erik Yeoh and is released under a Creative Commons Attribution-NonCommercial-ShareAlike License. The image can be found on Flickr.

Work programmatically with Google Spreadsheets Part 2

A while back I wrote a short post on how you can read from and write to Google Spreadsheets programmatically using Python and the package 'gspread'.

Last time the reading was done by first creating arrays with the addresses where the values could be found in the spreadsheet, and then running through all the addresses and replacing them with the values. It worked fine, but it's not best practice or very efficient, as it makes many single requests to the API. In part two, I will share a short tip on how to read the values in one go instead of iterating through a range of cells.

Here is the excerpt dealing with retrieving values. (NB: see original blogpost for gspread initialization).

#Running through the values
get_val = []
set_name = []
set_country = []
for a in range(2,5124):
    v = "B%s" % a
    sn = "H%s" % a
    sc = "G%s" % a
    get_val.append(v)
    set_name.append(sn)
    set_country.append(sc)

for a in range(2,5124):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e

In a recent script we only wanted to download values from a Google spreadsheet (yes, we could have exported the files to .csv with a similar result, but with a script we can extend and parse as needed), and this gave some time to refactor the code as well.

The gspread function worksheet.get_all_values() returns a list of lists with the values. The outer list contains the rows, and each row list contains the values indexed by column position. In this example num_streams is the second column, and the position is hence [1], as the list starts at zero.

Also note the nifty way of writing UTF-8 formatted strings to the file. UTF-8 can often cause a headache, but put a "u" prefix before the string and open the stream with codecs.open("filename", "mode", "encoding").

The new way of retrieving data from a Google Docs Spreadsheet:

# -*- coding: UTF-8 -*-
import gspread
import codecs

# Global variables for accessing resource
G_USERNAME = 'user_email'
G_PASSWORD = 'password'
G_IDENTIFIER = 'spreadsheet_identifier'

# Connecting to the data source
gc = gspread.login(G_USERNAME,G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)

all_val = worksheet.get_all_values()

output = codecs.open('output_norwegian_artists.csv','wb', "utf-8-sig")

for l in all_val:
    num_streams, artistid, name = (l[1],l[2],l[3])
    norwegian = l[4]
    if len(norwegian) < 3:
        norwegian = 'NULL'

    string = u"{0};{1};{2};{3}\n".format(num_streams, artistid, name, norwegian)
    output.write(string)

output.close()

 

Picture licensed under a creative commons attribution license by the brilliant organization Open Knowledge Foundation. Picture retrieved through a CC-search on Flickr

Real-time at home: follow Ruter departures from home with Arduino and Python

The real-time displays that Ruter has installed at many metro, tram and bus stops have become a welcome addition to the information you get as a traveller about departures. Not only can you keep full control of which departures are coming up, and in which order, you are also updated on the estimated arrival time, so you know whether you just missed the bus you were running for or whether it has been delayed in rush-hour traffic.

In addition to the real-time displays at the stops, you can get apps for your mobile phone, and the data also shows up in various third-party solutions, such as the information screens at the Department of Informatics at the University of Oslo. The data can be displayed in so many different places because Ruter publishes all of its route information for free use (some restrictions apply) on the web. This is what we will take advantage of when building our own real-time display.

An information display showing real-time departure data

Our finished product will be a real-time display, not unlike the ones found at the stops. Naturally, our version will be less robust and reliable, but in the prototyping phase that is perfectly fine. The display will show which line the next departure belongs to, the name of that line's final stop, and how many minutes remain until arrival/departure. To achieve this we use a "dumb" display and software that writes the text to it. The display is connected to the computer through an Arduino board, and we let the computer do most of the work, keeping as much of the complexity as possible on that side.

The information display

Arduino is a great place to start learning electronics and how to combine it with computers. Arduino has plenty of educational resources available, and it is designed to be easy to get started with while remaining flexible enough to be tailored with your own components.

All the components we need are included in the Arduino Starter Kit. We take project 11, the crystal ball, as our starting point, and use its wiring diagrams to connect an LCD display to the Arduino board.

We upload the following sketch to the Arduino board:

#include <LiquidCrystal.h>

LiquidCrystal lcd(12,11,5,4,3,2);

char a;

void setup() {
  Serial.begin(9600);
  lcd.begin(16,2);
  lcd.print("Ready for");
  lcd.setCursor(0,1);
  lcd.print("duty!");
}

void loop() {
  while (Serial.available() > 0 ){
    a = Serial.read();
    if (a == '!') {
      lcd.clear();
      lcd.setCursor(0,0);
    } else if (a == '?') {
      lcd.setCursor(0,1);
    } else {
      lcd.print(a);
    }
  }
}

 

 

The Arduino editor

We use the LiquidCrystal library that ships with Arduino to make writing to the board easier. We set up a serial connection and print a welcome message on the LCD. Once the LCD is initialised we mainly use three methods: print, clear and setCursor. The first writes a message to the display, clear removes all text, and setCursor is used to jump between the display's two lines. setCursor(0,0) places the cursor at the top left, and setCursor(0,1) places it at the left of the bottom row.

There is very little complexity at this level; all we want to do is write, move between the lines and clear the display. As long as there is data in the serial buffer we read one character and print it to the display, unless it is a special character. In this program "!" and "?" are given special functions: the exclamation mark clears all text and resets the cursor to the top-left position, while the question mark moves the cursor to the bottom-left position.

Writing to the display

Once the LCD is wired up and the Arduino code has been uploaded, you can write to the display. Use CoolTerm (can be downloaded here) and connect it to the Arduino board (leave the board connected via the USB port). You can now type text that shows up on the display, and check that the special characters work. Does it work? Cool, let's connect to Trafikanten/Ruter.
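Before moving on to Ruter: if you would rather drive the test from Python instead of CoolTerm, a minimal pyserial sketch like the one below does the job. The device path is an assumption and must match the port your own board shows up on:

# Quick test of the display protocol from Python instead of CoolTerm.
# The device path is an assumption -- adjust it to your own Arduino port.
import time
import serial

ser = serial.Serial("/dev/tty.usbmodemfa131", 9600)
time.sleep(2)           # give the board a moment to reset after the port is opened

ser.write(b"!")         # clear the display and move the cursor to the top-left corner
time.sleep(0.3)
ser.write(b"Hello from")
time.sleep(0.3)
ser.write(b"?")         # jump to the second line
time.sleep(0.3)
ser.write(b"Python!")
ser.close()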

Traffic data from Ruter

Ruter has made much of its travel information public, publishing the real-time data as well as the journey planner, deviations and plenty of other goodies. Read more on Ruter's API pages.

We will use two datasets from Ruter. First we need to find our station ID with the help of a comma-separated file listing all the stations. This semi-static file holds the metadata we need in order to find the right station in the REST JSON real-time API.

Finding the station ID

To find the right station I made the pragmatic decision not to parse the CSV file with all the station IDs. I simply opened the file in TextMate (which I use for editing code) and searched for Blindern T, which is close to both work and home. It turns out the station ID for Blindern T is 3010360.
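If you would rather do the lookup in code, a rough sketch like the one below searches the stops file. The file name and the column layout are assumptions on my part, so the sketch simply prints every row mentioning the stop name and leaves picking out the ID to you:

# Rough sketch: search the downloaded stops CSV for a stop name.
# The file name and column layout are assumptions -- print whole matching rows
# and read off the station ID manually.
import csv

with open("stasjoner.csv") as f:
  for row in csv.reader(f):
    if any("Blindern" in field for field in row):
      print(row)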

Fetching and parsing the traffic information

A clear, machine-readable JSON stream with all the data we need

By requesting the real-time endpoint for this ID I get a JSON stream/file back with all departures from this station: the Blindern endpoint.

The stream contains many departure objects, and every departure carries plenty of data we can use. The JSON stream is not easy to read and understand, but then again it is not made for humans. I use the application VisualJSON to read JSON files.

VisualJSON is a handy tool for getting a more human-readable representation of the data in the JSON stream

In VisualJSON we can see which values in each entry we want to keep. DestinationDisplay contains the name of the final stop and LineRef contains the line number, so those are handy to hold on to. In addition we need to read ExpectedArrivalTime, which uses a somewhat odd time format: "/Date(1378070709000+0200)/".

Want to see the whole implementation? Towards the end of this post you will find the script that fetches, processes, caches and sends the data to the display.

Processing the dates

Most of the data in each departure object is ready for display, but the dates need some work, since the format we receive them in is hard for humans to understand. Computers can easily count how many seconds have passed since Thursday 1 January 1970 (epoch time), but for humans this is harder. We therefore parse the epoch format into a datetime, and from this we compute the time delta (the time difference) between when the departure is expected and now. Converting this difference to minutes gives us the number of minutes until the departure, which is what we display.
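Isolated from the full script further down, that conversion looks roughly like this (the timestamp is the example value from above):

# Convert Ruter's "/Date(1378070709000+0200)/" format into minutes from now
import re
import datetime

timecode = "/Date(1378070709000+0200)/"
m = re.search(r"(\d{13})", timecode)                  # the 13-digit value is epoch time in milliseconds
expected = datetime.datetime.fromtimestamp(float(m.group(1)) / 1000)
delta = expected - datetime.datetime.now()
minutes = delta.days * 24 * 60 + delta.seconds // 60  # total minutes until (or since) the departure
print("Minutes until departure: {0}".format(minutes))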

Caching the data

I have chosen to create a class holding the necessary data for each departure. The objects created from this class are put into a list from which I pop one element at a time. When the list is empty I go back to the web service for fresh data. It is nice not to hit the service constantly, but at the same time we need to pick up any delays or changes to the expected arrival time. Storing all the departures in a queue and fetching new data once it is emptied felt like a good answer to that trade-off.

Formatting the data for the LCD display

For the data to be displayed correctly we need to format it. The way I have chosen to do this is to build a string for each line: first send the command telling the display to clear and reset, then the top line with the line number and the name of the final stop. Once that has been sent I wait 0.2 seconds before sending the line-shift command ("?") followed by the second line. I wait 0.2 seconds between each message so as not to feed the display too much information at once. When all the text is showing I leave it up for 4 seconds before moving on to the next departure.

Implementing the software

Below you will find the whole script that fetches data from Ruter and sends it to the Arduino board:

import urllib2, json, datetime, re, serial, time
from pprint import pprint

class TrafficInfo:
  def __init__(self, line_ref, destination, arrival_minutes, monitored, platform):
    self.line_ref = line_ref
    self.destination = destination
    self.arrival_minutes = arrival_minutes
    self.monitored = monitored
    self.platform = platform


def days_hours_minutes(td):
  return td.days, td.seconds//3600, (td.seconds//60)%60

def fetchTrafficFromStation(station_id):
  url_loc = "http://reis.trafikanten.no/reisrest/realtime/getrealtimedata/%s" % station_id
  response = urllib2.urlopen(url_loc)
  data = response.read()
  return data

station_id = "3010360"
ser = serial.Serial("/dev/tty.usbmodemfa131", 9600)
arr = []

while(True):
  # When the queue is empty, fetch a fresh set of departures from the web service
  if len(arr) == 0:
    print "Fetching new data"
    data = json.loads(fetchTrafficFromStation(station_id))

    for el in data:
      timecode = el["ExpectedArrivalTime"]
      m = re.split("(\d{13})\+(\d{4})",timecode)
      dt = datetime.datetime.fromtimestamp(float(m[1])/1000)
      tn = datetime.datetime.now()
      d = dt - tn
      td_human = days_hours_minutes(d)
      minutes = False
      if (d.seconds < (3600 - 1)):
        minutes = td_human[2]
      ti = TrafficInfo(el["LineRef"], el["DestinationDisplay"], minutes, el["Monitored"], el["DeparturePlatformName"])
      arr.append(ti)

  time.sleep(1)
  ar = arr.pop(0)
  time.sleep(1)
  ser.write("!")
  time.sleep(0.3)
  over_str = "%s : %s" % (ar.line_ref,ar.destination)
  ser.write(over_str)
  time.sleep(0.3)
  ser.write("?")
  time.sleep(0.3)
  under_str = "%s %s" % (str(ar.arrival_minutes), "Minutes")
  ser.write(under_str)
  time.sleep(4)

 

 

You don't necessarily need an LCD display to fetch traffic data from Ruter, and you don't necessarily need to show traffic data just because you have an LCD display. Only your imagination sets the limits. What about weather data from yr.no, or the temperature in your fridge? Arduino is a great kit for playing with some of the possibilities electronics and informatics give you.

Speaking of Arduino: last week I took part in a workshop with Tom Igoe, who wrote the book Making Things Talk. I haven't read it myself, but I hear it is a good source of knowledge and inspiration. Check it out!

Descriptive Statistical Methods in Code

Statistics can be mighty useful, and a little programming makes it even better. I often find that code helps me grasp the fundamentals of how things work, and I have tried to apply this to statistics as well, in this case to descriptive statistics.

Join me in this short attempt to let statistics unconceal itself.

Getting ready: Extremistan and Mediocristan

This little text looks closer at some of the descriptive methods of statistics, and to do that I want to use proper datasets. I have collected two datasets following two different paradigms. The first dataset follows the normal distribution/bell curve, derives from the physical world and is little susceptible to what Nassim Nicholas Taleb has identified as the domain of Black Swans. In Mediocristan values seldom deviate, and when they do, not by much.

Our second dataset, on the other hand, is from the social world, where values do not conform to the bell curve and where single values may change the whole dataset. The domain of Extremistan covers many values from the socio-cultural sphere, such as the number of cultural items sold, wealth, and the number of Twitter followers.

Heights of 10 people = 176, 182, 170, 186, 179, 183, 190, 183, 168, 180

Wealth of 10 people (imagined denomination) = 20, 100, 15, 5, 100, 10000, 1000, 30,  5, 200

To compute upon our values, we will save them into arrays.

[sourcecode language="python"]

heights = [176, 182, 170, 186, 179, 183, 190, 183, 168, 180]
wealth = [20, 100, 15, 5, 100, 10000, 1000, 30, 5, 200]

[/sourcecode]

The count

This method is as simple as it gets. The count does not look at any of the numeric values stored, it just counts the number of occurrences. In Python we have the built-in function len(), to which we can pass the array to get its length.

[sourcecode language="python"]

def count(data_arr):
    return len(data_arr)

[/sourcecode]

The sum

To find the sum of the elements, we need to look at the values and add them together. This is done by iterating through all the elements and adding each value to a single variable, which we then return. Python already has a sum function which does exactly what we need, but instead of using it we make our own: ownsum.

[sourcecode language="python"]

def ownsum(data_arr):
    i = 0
    for val in data_arr:
        i += val
    return i

[/sourcecode]

The min

All the values we have in both our arrays are comparable: the heights are measured in centimetres and the wealth in a denomination. A higher number means more, and a lower number less, of that measure. The min, or minimum, finds the smallest value and returns it. To do this we always retain the smallest value seen so far as we iterate. In the implementation below the start value is set to the maximum value available to the system; to get this value we need to import the sys (system) library. Python already has a min function, so we implement our own: ownmin.

[sourcecode language="python"]

import sys

def ownmin(data_arr):
    i = sys.maxsize
    for val in data_arr:
        if val < i:
            i = val
    return i

[/sourcecode]

The max

The max method is essentially the min method reversed. It looks at all the values and finds the largest number. To get there we always retain the largest number seen so far as we iterate over the array, starting with the smallest possible value. The max function is already implemented in Python, so we write our own: ownmax.

[sourcecode language="python"]

def ownmax(data_arr):
    i = -sys.maxsize
    for val in data_arr:
        if val > i:
            i = val
    return i

[/sourcecode]

The range

The range is the numerical difference between the smallest and the largest value. This can give us a quick indication of the spread of the data. The range is fairly easy to find using subtraction: if we subtract the min from the max we are left with the range. We could write a whole function that first performs the calculations for min and max and then does the subtraction, but since we have already implemented our own ownmin and ownmax methods, why not take advantage of the work already done. Python already has a function named range, so to avoid namespace problems we implement the method ownrange.

[sourcecode language="python"]
def ownrange(data_arr):
    return ownmax(data_arr) - ownmin(data_arr)
[/sourcecode]

The mean/arithmetic average

The mean, or arithmetic average, tells us something about the central tendency. The mode and the median do so as well, but where the mode looks at occurrences and the median looks at the central element of an ordered list, the arithmetic average takes the sum of all elements and divides it by the count. This is the most common way of finding the central tendency, but it is also sensitive to large deviations in the dataset.

[sourcecode language="python"]
def mean(data_arr):
    return ownsum(data_arr) / count(data_arr)
[/sourcecode]

The median average

For the median the value of each element is of less importance; what matters is the position of the elements in relation to each other. The median is basically the middle element of a sorted array. If the array has an even number of elements, the median is the arithmetic average of the two middle values. In the implementation we first sort the array and get its length. If the length is divisible by two (an even number of elements) we return the average of the two middle values, otherwise we return the value of the middle element.

[sourcecode language="python"]
def median(data_arr):
    sorted_arr = sorted(data_arr)
    length = len(sorted_arr)
    if not length % 2:
        return (sorted_arr[length / 2] + sorted_arr[length / 2 - 1]) / 2.0
    return sorted_arr[length / 2]
[/sourcecode]

The mode average

The mode is the most frequent value in the dataset. In our case this may not be representative, but with large datasets containing many datapoints it makes more sense. The mode can be used to help identify the distribution of the data. Often we have a mental anchor of the dataset having a unimodal distribution with a central peak: the most frequent values in the centre and descending frequencies as we move away from the centre in both directions (e.g. the empirical rule and bell curves). But datasets can also be bimodal, with two centres towards each end of the spectrum and low frequencies towards the centre. In a large dataset the mode can help us spot these tendencies, since there can be more than one mode. For example, a dataset ranging from 10 to 30 may have a unimodal distribution with a mode at 20, or a bimodal distribution with modes at 15 and 25; this would be hard to spot with either the median or the arithmetic average.

To find the mode we create a dictionary with a key for each of the distinct values and its frequency as the value. Once we have the dictionary, we iterate over it to find the most frequent values. A mode can consist of more than one value, so we check every dictionary item against the maximum frequency and return the element, or all elements, with that number of occurrences. Be aware that if none of the distinct values occurs more than once, all values are returned.

[sourcecode language="python"]
def mode(data_arr):
    frequencies = {}
    for i in data_arr:
        frequencies[i] = frequencies.get(i, 0) + 1
    max_freq = ownmax(frequencies.itervalues())
    modes = [k for k, v in frequencies.iteritems() if v == max_freq]
    return modes
[/sourcecode]

The variance of a population

The variance is used for finding the spread in the dataset. With the average methods above we found the central tendency of the dataset, and now we need to see how closely the values of the dataset conform to that number. From these values you can tell whether your dataset leans more towards the extreme or the mediocre (these are not binary groups, but gravitating centres). Once presented with an average from a dataset, a good follow-up question is to ask for the standard deviation (next method). The variance is the average of each value's squared difference from the mean. To implement this method we first find the arithmetic mean of the dataset, and then iterate over all values comparing them to the average. Since the difference can be both above and below the average, and we want to operate with positive numbers, we square each difference.

[sourcecode language="python"]
def variance(data_arr):
    variance = 0
    average = mean(data_arr)
    for i in data_arr:
        variance += (average - i) ** 2
    return variance / count(data_arr)
[/sourcecode]

The standard deviation

From the variance the way to the standard deviation is fairly simple: the standard deviation is the square root of the variance. It has the same unit as the average and is easier to compare. Here we take advantage of our last method and apply a square root to it to get the standard deviation. The sqrt function needs to be imported from the math library. Be aware that standard deviation and variance are computed differently depending on whether you are working with a sample or with a population; the difference is outside the scope of this article.

[sourcecode language="python"]
import math
def stddev(data_arr):
    return math.sqrt(variance(data_arr))
[/sourcecode]

The final code

We have now implemented some of the basic descriptive methods in statistics. The whole project can be found below. To run the functions, call them by name and pass in an array. Two ten-element arrays can be found in the script, but the methods are implemented in such a way that you can pass an array of any length.

[sourcecode language="python"]

import sys
import math

heights = [176, 182, 170, 186, 179, 183, 190, 183, 168, 180]
wealth = [20, 100, 15, 5, 100, 10000, 1000, 30, 5, 200]

def count(data_arr):
    return len(data_arr)

def ownsum(data_arr):
    i = 0
    for val in data_arr:
        i += val
    return i

def ownmin(data_arr):
    i = sys.maxsize
    for val in data_arr:
        if val < i:
            i = val
    return i

def ownmax(data_arr):
    i = -sys.maxsize
    for val in data_arr:
        if val > i:
            i = val
    return i

def ownrange(data_arr):
    return ownmax(data_arr) - ownmin(data_arr)

def mean(data_arr):
    return ownsum(data_arr) / count(data_arr)

def median(data_arr):
    sorted_arr = sorted(data_arr)
    length = len(sorted_arr)
    if not length % 2:
        return (sorted_arr[length / 2] + sorted_arr[length / 2 - 1]) / 2.0
    return sorted_arr[length / 2]

def mode(data_arr):
    frequencies = {}
    for i in data_arr:
        frequencies[i] = frequencies.get(i, 0) + 1
    max_freq = ownmax(frequencies.itervalues())
    modes = [k for k, v in frequencies.iteritems() if v == max_freq]
    return modes

def variance(data_arr):
    variance = 0
    average = mean(data_arr)
    for i in data_arr:
        variance += (average - i) ** 2
    return variance / count(data_arr)

def stddev(data_arr):
    return math.sqrt(variance(data_arr))

[/sourcecode]
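As a quick sanity check, the functions can be called like this (a minimal usage sketch; note that iteritems and itervalues tie the mode function, and therefore the script, to Python 2):

[sourcecode language="python"]
# Example calls against the two arrays defined in the script
print count(heights)                    # 10
print ownrange(heights)                 # 22 (190 - 168)
print median(wealth)                    # 65.0 (average of the two middle values)
print mean(heights), mean(wealth)       # arithmetic averages of the two datasets
print stddev(heights), stddev(wealth)   # the spread differs by orders of magnitude
[/sourcecode]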

Work programmatically with Google Spreadsheets

Some time back I authored a script which read through a CSV formatted list and, based on the artists' names, tried to decide the nationality of the artists by querying the last.FM search engine and parsing the XML structured result.

The script worked: it found and returned about 80% of the artists, and around 80% of those again matched the intended artist. What if the alteration could be done in the document itself? For the CSV-based script the artists had to be extracted from the database, then parsed and then put back into the document. With this approach we can skip the middle step and let the script run while the data remains available for the users to view.

If you haven't already used Google Docs, you should consider it, as it is a very convenient way of working with documents, especially when there are several users. Through Google Docs you can collaborate on documents, and since they are stored in the cloud all users will instantly have the latest version. Another advantage is that when you are working on the document simultaneously each user is editing the same document, so merging different versions is not a problem.

A great tool for combining Google Spreadsheets with the Python programming environment is the gspread module. After installing this package you only need to import it, and with very few lines of code you can retrieve and update information in the spreadsheet cells.

The specifics for doing this task are pretty much these few lines: import the package, log in, find the correct document and do whatever you need to. (This excerpt won't run on its own; check the whole script below. NB: mind the indentation, which may not be displayed correctly in the browser.)

[sourcecode language="python"]
import gspread

G_USERNAME = 'your@gmail.com'
G_PASSWORD = 'yourPassword'
G_IDENTIFIER = 'document_identifier_checkdocument_url_in_your_browser'

# Connecting to the data source
gc = gspread.login(G_USERNAME,G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)

for a in range(2,5124):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e

[/sourcecode]

Above are the lines related to connecting Python to Google Docs; below you can see the whole script and how the method I mentioned in an earlier post is used in this setting.

[sourcecode language="python"]
#!/usr/bin/python
# -*- coding:utf-8 -*-

"""
Clouds & Concerts - 2012
Ola Loevholm

Initialized from the commandline. Runs through the Google doc spreadsheet with topp 5000 artists, and
runs the parsing query against the Last.FM browser then enters the country and search string (for validation)
into the google docs.

"""

G_USERNAME = 'your@gmail.com'
G_PASSWORD = 'yourPassword'
G_IDENTIFIER = 'document_identifier_checkdocument_url_in_your_browser'

import sys, urllib, string, csv, time
import xml.etree.ElementTree as ET
import gspread

# Loads a dictionary with ISO 3166-1 abbreviations and codes
COUNTRIES = {"AF":"AFGHANISTAN","AX":"ÅLAND ISLANDS","AL":"ALBANIA","DZ":"ALGERIA","AS":"AMERICAN SAMOA","AD":"ANDORRA","AO":"ANGOLA","AI":"ANGUILLA","AQ":"ANTARCTICA","AG":"ANTIGUA AND BARBUDA","AR":"ARGENTINA","AM":"ARMENIA","AW":"ARUBA","AU":"AUSTRALIA","AT":"AUSTRIA","AZ":"AZERBAIJAN","BS":"BAHAMAS","BH":"BAHRAIN","BD":"BANGLADESH","BB":"BARBADOS","BY":"BELARUS","BE":"BELGIUM","BZ":"BELIZE","BJ":"BENIN","BM":"BERMUDA","BT":"BHUTAN","BO":"BOLIVIA, PLURINATIONAL STATE OF","BQ":"BONAIRE, SINT EUSTATIUS AND SABA","BA":"BOSNIA AND HERZEGOVINA","BW":"BOTSWANA","BV":"BOUVET ISLAND","BR":"BRAZIL","IO":"BRITISH INDIAN OCEAN TERRITORY","BN":"BRUNEI DARUSSALAM","BG":"BULGARIA","BF":"BURKINA FASO","BI":"BURUNDI","KH":"CAMBODIA","CM":"CAMEROON","CA":"CANADA","CV":"CAPE VERDE","KY":"CAYMAN ISLANDS","CF":"CENTRAL AFRICAN REPUBLIC","TD":"CHAD","CL":"CHILE","CN":"CHINA","CX":"CHRISTMAS ISLAND",
"CC":"COCOS (KEELING) ISLANDS","CO":"COLOMBIA","KM":"COMOROS","CG":"CONGO","CD":"CONGO, THE DEMOCRATIC REPUBLIC OF THE","CK":"COOK ISLANDS","CR":"COSTA RICA","CI":"CÔTE D’IVOIRE","HR":"CROATIA","CU":"CUBA","CW":"CURAÇAO","CY":"CYPRUS","CZ":"CZECH REPUBLIC","DK":"DENMARK","DJ":"DJIBOUTI","DM":"DOMINICA","DO":"DOMINICAN REPUBLIC","EC":"ECUADOR","EG":"EGYPT","SV":"EL SALVADOR","GQ":"EQUATORIAL GUINEA","ER":"ERITREA","EE":"ESTONIA","ET":"ETHIOPIA","FK":"FALKLAND ISLANDS (MALVINAS)","FO":"FAROE ISLANDS","FJ":"FIJI","FI":"FINLAND","FR":"FRANCE","GF":"FRENCH GUIANA","PF":"FRENCH POLYNESIA","TF":"FRENCH SOUTHERN TERRITORIES","GA":"GABON","GM":"GAMBIA","GE":"GEORGIA","DE":"GERMANY","GH":"GHANA","GI":"GIBRALTAR","GR":"GREECE","GL":"GREENLAND","GD":"GRENADA","GP":"GUADELOUPE","GU":"GUAM","GT":"GUATEMALA","GG":"GUERNSEY","GN":"GUINEA","GW":"GUINEA-BISSAU","GY":"GUYANA","HT":"HAITI","HM":"HEARD ISLAND AND MCDONALD ISLANDS",
"VA":"HOLY SEE (VATICAN CITY STATE)","HN":"HONDURAS","HK":"HONG KONG","HU":"HUNGARY","IS":"ICELAND","IN":"INDIA","ID":"INDONESIA","IR":"IRAN, ISLAMIC REPUBLIC OF","IQ":"IRAQ","IE":"IRELAND","IM":"ISLE OF MAN","IL":"ISRAEL","IT":"ITALY","JM":"JAMAICA","JP":"JAPAN","JE":"JERSEY","JO":"JORDAN","KZ":"KAZAKHSTAN","KE":"KENYA","KI":"KIRIBATI","KP":"KOREA, DEMOCRATIC PEOPLE’S REPUBLIC OF","KR":"KOREA, REPUBLIC OF","KW":"KUWAIT","KG":"KYRGYZSTAN","LA":"LAO PEOPLE’S DEMOCRATIC REPUBLIC","LV":"LATVIA","LB":"LEBANON","LS":"LESOTHO","LR":"LIBERIA","LY":"LIBYA","LI":"LIECHTENSTEIN","LT":"LITHUANIA","LU":"LUXEMBOURG","MO":"MACAO","MK":"MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF","MG":"MADAGASCAR","MW":"MALAWI","MY":"MALAYSIA","MV":"MALDIVES","ML":"MALI","MT":"MALTA","MH":"MARSHALL ISLANDS","MQ":"MARTINIQUE","MR":"MAURITANIA","MU":"MAURITIUS","YT":"MAYOTTE","MX":"MEXICO","FM":"MICRONESIA, FEDERATED STATES OF",
"MD":"MOLDOVA, REPUBLIC OF","MC":"MONACO","MN":"MONGOLIA","ME":"MONTENEGRO","MS":"MONTSERRAT","MA":"MOROCCO","MZ":"MOZAMBIQUE","MM":"MYANMAR","NA":"NAMIBIA","NR":"NAURU","NP":"NEPAL","NL":"NETHERLANDS","NC":"NEW CALEDONIA","NZ":"NEW ZEALAND","NI":"NICARAGUA","NE":"NIGER","NG":"NIGERIA","NU":"NIUE","NF":"NORFOLK ISLAND","MP":"NORTHERN MARIANA ISLANDS","NO":"NORWAY","OM":"OMAN","PK":"PAKISTAN","PW":"PALAU","PS":"PALESTINIAN TERRITORY, OCCUPIED","PA":"PANAMA","PG":"PAPUA NEW GUINEA","PY":"PARAGUAY","PE":"PERU","PH":"PHILIPPINES","PN":"PITCAIRN","PL":"POLAND","PT":"PORTUGAL","PR":"PUERTO RICO","QA":"QATAR","RE":"RÉUNION","RO":"ROMANIA","RU":"RUSSIAN FEDERATION","RW":"RWANDA","BL":"SAINT BARTHÉLEMY","SH":"SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA","KN":"SAINT KITTS AND NEVIS","LC":"SAINT LUCIA","MF":"SAINT MARTIN (FRENCH PART)","PM":"SAINT PIERRE AND MIQUELON","VC":"SAINT VINCENT AND THE GRENADINES",
"WS":"SAMOA","SM":"SAN MARINO","ST":"SAO TOME AND PRINCIPE","SA":"SAUDI ARABIA","SN":"SENEGAL","RS":"SERBIA","SC":"SEYCHELLES","SL":"SIERRA LEONE","SG":"SINGAPORE","SX":"SINT MAARTEN (DUTCH PART)","SK":"SLOVAKIA","SI":"SLOVENIA","SB":"SOLOMON ISLANDS","SO":"SOMALIA","ZA":"SOUTH AFRICA","GS":"SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS","SS":"SOUTH SUDAN","ES":"SPAIN","LK":"SRI LANKA","SD":"SUDAN","SR":"SURINAME","SJ":"SVALBARD AND JAN MAYEN","SZ":"SWAZILAND","SE":"SWEDEN","CH":"SWITZERLAND","SY":"SYRIAN ARAB REPUBLIC","TW":"TAIWAN, PROVINCE OF CHINA","TJ":"TAJIKISTAN","TZ":"TANZANIA, UNITED REPUBLIC OF","TH":"THAILAND","TL":"TIMOR-LESTE","TG":"TOGO","TK":"TOKELAU","TO":"TONGA","TT":"TRINIDAD AND TOBAGO","TN":"TUNISIA","TR":"TURKEY","TM":"TURKMENISTAN","TC":"TURKS AND CAICOS ISLANDS","TV":"TUVALU","UG":"UGANDA","UA":"UKRAINE","AE":"UNITED ARAB EMIRATES","GB":"UNITED KINGDOM","US":"UNITED STATES",
"UM":"UNITED STATES MINOR OUTLYING ISLANDS","UY":"URUGUAY","UZ":"UZBEKISTAN","VU":"VANUATU","VE":"VENEZUELA, BOLIVARIAN REPUBLIC OF","VN":"VIET NAM","VG":"VIRGIN ISLANDS, BRITISH","VI":"VIRGIN ISLANDS, U.S.","WF":"WALLIS AND FUTUNA","EH":"WESTERN SAHARA","YE":"YEMEN","ZM":"ZAMBIA","ZW":"ZIMBABWE"}


# Connecting to the data source
gc = gspread.login(G_USERNAME,G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)
# Iterates through XML-structure and removes the namespace, for easier navigation in getCountry()s ElementTree.findall()
def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.getiterator():
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]
# getCountry - where the magic happens. Encodes string with artistname to url, then query musicbrainz search engine.
# parses the XML-answer and get the name, id and country of the first returned element (with highest weight)
# returns country name if a) artist is found through the search engine b) artist has a country associated to the profile, otherwise returns False
def getCountry(name):
    name = urllib.quote_plus(name)
    BASE_URL = "http://musicbrainz.org/ws/2/artist/?query=%s&format=xml&method=advanced" % (name)
    print "Querying: %s" % (BASE_URL)
    try:
        search_input = urllib.urlopen(BASE_URL)
        # Checks whether HTTP Request Code is 200 - if not goes to sleep for 5 seconds // Intended for 503 Code
        http_code = search_input.code
        if http_code != 200:
            # print "Could not access: %s \t Got HTTP Code: %s. 5 second cool-down" % (name, http_code)
            time.sleep(5)
            return getCountry(name)
    except Exception:
        print "GETTING_ERROR: Something went wrong while getting HTTP"
        return False
    #search_xml = search_input.read()
    #print search_xml
    try:
        tree = ET.parse(search_input)
        remove_namespace(tree, u'http://musicbrainz.org/ns/mmd-2.0#')
        feed = tree.getroot()
        elem = feed.findall("./artist-list/")
        #print elem[0].find('name').text
        #print elem[0].get('id')
    except Exception:
        print "PARSE_ERROR: Something went wrong while parsing HTTP"
        return False
    try:
        if elem[0].find('country') != None:
            # print COUNTRIES[elem[0].find('country').text]
            try:
                country = COUNTRIES[elem[0].find('country').text]
            except Exception:
                print "Could not find key in countrylist error"
                return False
            return [country, elem[0].get('id'), elem[0].find('name').text]
        else:
            print elem[0].find('name').text + " has not any country associated\n"
            return False
    except (IndexError, ValueError):
        print "ERROR - COULD NOT GET DATA FROM %s\n" % (name)
        return False

#Running through the values
get_val = []
set_name = []
set_country = []
for a in range(2,5124):
    v = "B%s" % a
    sn = "H%s" % a
    sc = "G%s" % a
    get_val.append(v)
    set_name.append(sn)
    set_country.append(sc)

for a in range(2,5124):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e

[/sourcecode]

Data-wrangling: find country based on artist name

At the Clouds & Concerts project at the University of Oslo we are working with really interesting topics, based on interesting empirical data. Through our collaboration with the Norwegian streaming service provider WiMP, we are, together with Telenor and WiMP, analysing a vast collection of data. More about the project's data part, the 'Clouds' part of the project's name, can be found on the project's web sites.

Artist and Country

One of the tasks at hand was to find out which country an artist came from, and whether they came from Norway or not. One way of doing this is to manually go through each artist and use preexisting knowledge about music to determine their country of origin and, if stuck, use online services (aren't we all mostly using Google as our initial source of wisdom). Another alternative is to use online services first and then use human preexisting knowledge to quality-assure the final result.

On the Internet a vast number of sources can be found. However, if you want to get the data without too much fitting, testing and nitty-gritty adaptation for every source, it is an advantage if there is a consistently structured repository you can tap from. Luckily, the MetaBrainz Foundation has a large repository of musical metadata known as MusicBrainz.

Below you find a script which should (partially) solve our problem by combining the data from MusicBrainz with data exported from our research data. That being said, this script is more a method than a finished product. It should be easily adaptable, but it is an advantage if you know Python and how to handle CSV files. Codecademy has a good introduction to Python.

The core idea of the script is to take input with name and number of streams and turn it into output with the name from the original data source, the number of streams from the original data source, as well as the country of origin, the MusicBrainz ID, and the name returned by the MusicBrainz search engine (for initial quality assurance).

To keep things simple there is only one successful output, and that is if the name sent to the MusicBrainz search engine returns an answer, and if that answer has a country associated with it. Be advised (that is also why I have marked the title with 'try') that the search engine may not return a matching result. For that reason we also print the name of the artist we find, so it can later be juxtaposed with the original name in the Excel spreadsheet (you are going to transform the CSV to Excel before reviewing, aren't you? A good tool is Google Refine). Another problem is that popular cultural phenomena, common nouns and tribute bands (probably in that order, descending) share the same name. This is why a human is always needed, or a semantic absolute URI associated with each phenomenon. This leads me on to the last step before the code.

Other ways this could have been solved (let me know if you solve the problem in any of these ways)

The semantic way:

The data found in the MusicBrainz database is made available through a SPARQL endpoint named LinkedBrainz. If you know the right ontologies and are comfortable performing triplestore queries, this is perhaps the most innovative and forward-looking way to solve the problem.

The Virtual Machine Postgres way:

Instead of doing a query on the server, you can be a gentleman and download the server onto your own machine. If you have VirtualBox (if you don’t have it, download it for free) you can run the server locally. An image file with the complete Musicbrainz database can be found on their webpages.

The code:

Here is the code used to solve this task. It can also be cloned from the Clouds & Concerts GitHub page.

[sourcecode language="python"]
#!/usr/bin/python
# -*- coding:utf-8 -*-

"""
Clouds & Concerts – 2012
Ola Loevholm

Called from command line:
The script reads a file named "topp1000_artister.csv" consisting of a list of artists and then tries to find out which country each artist comes from based on the name.
The name is given in the second column of the CSV file.

Called as a module:
The method getCountry() takes an artist name and checks this with the musicbrainz search engine. Returns the country if a) artist is found through the search engine b) artist has a country associated to the profile

"""

import sys, urllib, string, csv, time
import xml.etree.ElementTree as ET

# Loads a dictionary with ISO 3166-1 abbreviations and countries
COUNTRIES = {"AF":"AFGHANISTAN","AX":"ÅLAND ISLANDS","AL":"ALBANIA","DZ":"ALGERIA","AS":"AMERICAN SAMOA","AD":"ANDORRA","AO":"ANGOLA","AI":"ANGUILLA","AQ":"ANTARCTICA","AG":"ANTIGUA AND BARBUDA","AR":"ARGENTINA","AM":"ARMENIA","AW":"ARUBA","AU":"AUSTRALIA","AT":"AUSTRIA","AZ":"AZERBAIJAN","BS":"BAHAMAS","BH":"BAHRAIN","BD":"BANGLADESH","BB":"BARBADOS","BY":"BELARUS","BE":"BELGIUM","BZ":"BELIZE","BJ":"BENIN","BM":"BERMUDA","BT":"BHUTAN","BO":"BOLIVIA, PLURINATIONAL STATE OF","BQ":"BONAIRE, SINT EUSTATIUS AND SABA","BA":"BOSNIA AND HERZEGOVINA","BW":"BOTSWANA","BV":"BOUVET ISLAND","BR":"BRAZIL","IO":"BRITISH INDIAN OCEAN TERRITORY","BN":"BRUNEI DARUSSALAM","BG":"BULGARIA","BF":"BURKINA FASO","BI":"BURUNDI","KH":"CAMBODIA","CM":"CAMEROON","CA":"CANADA","CV":"CAPE VERDE","KY":"CAYMAN ISLANDS","CF":"CENTRAL AFRICAN REPUBLIC","TD":"CHAD","CL":"CHILE","CN":"CHINA","CX":"CHRISTMAS ISLAND",
"CC":"COCOS (KEELING) ISLANDS","CO":"COLOMBIA","KM":"COMOROS","CG":"CONGO","CD":"CONGO, THE DEMOCRATIC REPUBLIC OF THE","CK":"COOK ISLANDS","CR":"COSTA RICA","CI":"CÔTE D’IVOIRE","HR":"CROATIA","CU":"CUBA","CW":"CURAÇAO","CY":"CYPRUS","CZ":"CZECH REPUBLIC","DK":"DENMARK","DJ":"DJIBOUTI","DM":"DOMINICA","DO":"DOMINICAN REPUBLIC","EC":"ECUADOR","EG":"EGYPT","SV":"EL SALVADOR","GQ":"EQUATORIAL GUINEA","ER":"ERITREA","EE":"ESTONIA","ET":"ETHIOPIA","FK":"FALKLAND ISLANDS (MALVINAS)","FO":"FAROE ISLANDS","FJ":"FIJI","FI":"FINLAND","FR":"FRANCE","GF":"FRENCH GUIANA","PF":"FRENCH POLYNESIA","TF":"FRENCH SOUTHERN TERRITORIES","GA":"GABON","GM":"GAMBIA","GE":"GEORGIA","DE":"GERMANY","GH":"GHANA","GI":"GIBRALTAR","GR":"GREECE","GL":"GREENLAND","GD":"GRENADA","GP":"GUADELOUPE","GU":"GUAM","GT":"GUATEMALA","GG":"GUERNSEY","GN":"GUINEA","GW":"GUINEA-BISSAU","GY":"GUYANA","HT":"HAITI","HM":"HEARD ISLAND AND MCDONALD ISLANDS",
"VA":"HOLY SEE (VATICAN CITY STATE)","HN":"HONDURAS","HK":"HONG KONG","HU":"HUNGARY","IS":"ICELAND","IN":"INDIA","ID":"INDONESIA","IR":"IRAN, ISLAMIC REPUBLIC OF","IQ":"IRAQ","IE":"IRELAND","IM":"ISLE OF MAN","IL":"ISRAEL","IT":"ITALY","JM":"JAMAICA","JP":"JAPAN","JE":"JERSEY","JO":"JORDAN","KZ":"KAZAKHSTAN","KE":"KENYA","KI":"KIRIBATI","KP":"KOREA, DEMOCRATIC PEOPLE’S REPUBLIC OF","KR":"KOREA, REPUBLIC OF","KW":"KUWAIT","KG":"KYRGYZSTAN","LA":"LAO PEOPLE’S DEMOCRATIC REPUBLIC","LV":"LATVIA","LB":"LEBANON","LS":"LESOTHO","LR":"LIBERIA","LY":"LIBYA","LI":"LIECHTENSTEIN","LT":"LITHUANIA","LU":"LUXEMBOURG","MO":"MACAO","MK":"MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF","MG":"MADAGASCAR","MW":"MALAWI","MY":"MALAYSIA","MV":"MALDIVES","ML":"MALI","MT":"MALTA","MH":"MARSHALL ISLANDS","MQ":"MARTINIQUE","MR":"MAURITANIA","MU":"MAURITIUS","YT":"MAYOTTE","MX":"MEXICO","FM":"MICRONESIA, FEDERATED STATES OF",
"MD":"MOLDOVA, REPUBLIC OF","MC":"MONACO","MN":"MONGOLIA","ME":"MONTENEGRO","MS":"MONTSERRAT","MA":"MOROCCO","MZ":"MOZAMBIQUE","MM":"MYANMAR","NA":"NAMIBIA","NR":"NAURU","NP":"NEPAL","NL":"NETHERLANDS","NC":"NEW CALEDONIA","NZ":"NEW ZEALAND","NI":"NICARAGUA","NE":"NIGER","NG":"NIGERIA","NU":"NIUE","NF":"NORFOLK ISLAND","MP":"NORTHERN MARIANA ISLANDS","NO":"NORWAY","OM":"OMAN","PK":"PAKISTAN","PW":"PALAU","PS":"PALESTINIAN TERRITORY, OCCUPIED","PA":"PANAMA","PG":"PAPUA NEW GUINEA","PY":"PARAGUAY","PE":"PERU","PH":"PHILIPPINES","PN":"PITCAIRN","PL":"POLAND","PT":"PORTUGAL","PR":"PUERTO RICO","QA":"QATAR","RE":"RÉUNION","RO":"ROMANIA","RU":"RUSSIAN FEDERATION","RW":"RWANDA","BL":"SAINT BARTHÉLEMY","SH":"SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA","KN":"SAINT KITTS AND NEVIS","LC":"SAINT LUCIA","MF":"SAINT MARTIN (FRENCH PART)","PM":"SAINT PIERRE AND MIQUELON","VC":"SAINT VINCENT AND THE GRENADINES",
"WS":"SAMOA","SM":"SAN MARINO","ST":"SAO TOME AND PRINCIPE","SA":"SAUDI ARABIA","SN":"SENEGAL","RS":"SERBIA","SC":"SEYCHELLES","SL":"SIERRA LEONE","SG":"SINGAPORE","SX":"SINT MAARTEN (DUTCH PART)","SK":"SLOVAKIA","SI":"SLOVENIA","SB":"SOLOMON ISLANDS","SO":"SOMALIA","ZA":"SOUTH AFRICA","GS":"SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS","SS":"SOUTH SUDAN","ES":"SPAIN","LK":"SRI LANKA","SD":"SUDAN","SR":"SURINAME","SJ":"SVALBARD AND JAN MAYEN","SZ":"SWAZILAND","SE":"SWEDEN","CH":"SWITZERLAND","SY":"SYRIAN ARAB REPUBLIC","TW":"TAIWAN, PROVINCE OF CHINA","TJ":"TAJIKISTAN","TZ":"TANZANIA, UNITED REPUBLIC OF","TH":"THAILAND","TL":"TIMOR-LESTE","TG":"TOGO","TK":"TOKELAU","TO":"TONGA","TT":"TRINIDAD AND TOBAGO","TN":"TUNISIA","TR":"TURKEY","TM":"TURKMENISTAN","TC":"TURKS AND CAICOS ISLANDS","TV":"TUVALU","UG":"UGANDA","UA":"UKRAINE","AE":"UNITED ARAB EMIRATES","GB":"UNITED KINGDOM","US":"UNITED STATES",
"UM":"UNITED STATES MINOR OUTLYING ISLANDS","UY":"URUGUAY","UZ":"UZBEKISTAN","VU":"VANUATU","VE":"VENEZUELA, BOLIVARIAN REPUBLIC OF","VN":"VIET NAM","VG":"VIRGIN ISLANDS, BRITISH","VI":"VIRGIN ISLANDS, U.S.","WF":"WALLIS AND FUTUNA","EH":"WESTERN SAHARA","YE":"YEMEN","ZM":"ZAMBIA","ZW":"ZIMBABWE"}

# Iterates through XML-structure and removes the namespace, for easier navigation in getCountry()s ElementTree.findall()
def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.getiterator():
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]

# getCountry - where the magic happens. Encodes string with artistname to url, then query musicbrainz search engine.
# parses the XML-answer and get the name, id and country of the first returned element (with highest weight)
# returns country name if a) artist is found through the search engine b) artist has a country associated to the profile, otherwise returns False
def getCountry(name):
    name = urllib.quote_plus(name)
    BASE_URL = "http://musicbrainz.org/ws/2/artist/?query=%s&format=xml&method=advanced" % (name)
    print "Querying: %s" % (BASE_URL)
    try:
        search_input = urllib.urlopen(BASE_URL)
        # Checks whether HTTP Request Code is 200 - if not goes to sleep for 5 seconds // Intended for 503 Code
        http_code = search_input.code
        if http_code != 200:
            # print "Could not access: %s \t Got HTTP Code: %s. 5 second cool-down" % (name, http_code)
            time.sleep(5)
            return getCountry(name)
    except Exception:
        print "GETTING_ERROR: Something went wrong while getting HTTP"
        return False
    #search_xml = search_input.read()
    #print search_xml
    try:
        tree = ET.parse(search_input)
        remove_namespace(tree, u'http://musicbrainz.org/ns/mmd-2.0#')
        feed = tree.getroot()
        elem = feed.findall("./artist-list/")
        #print elem[0].find('name').text
        #print elem[0].get('id')
    except Exception:
        print "PARSE_ERROR: Something went wrong while parsing HTTP"
        return False
    try:
        if elem[0].find('country') != None:
            # print COUNTRIES[elem[0].find('country').text]
            try:
                country = COUNTRIES[elem[0].find('country').text]
            except Exception:
                print "Could not find key in countrylist error"
                return False
            return [country, elem[0].get('id'), elem[0].find('name').text]
        else:
            print elem[0].find('name').text + " has not any country associated\n"
            return False
    except (IndexError, ValueError):
        print "ERROR - COULD NOT GET DATA FROM %s\n" % (name)
        return False

# If method is called from terminal. Iterates through topp1000 artists contained in a CSV-file in same directory.
if __name__ == "__main__":
    #name = sys.argv[1]
    csvfile = open("topp1000_artister.csv")
    outfile = open("topp1000_output.csv","w")
    artistlist = csv.reader(csvfile, delimiter=',', quotechar='"')
    for line in artistlist:
        result = getCountry(line[1])
        try:
            if result != False:
                result_string = "%s,%s,%s,%s,%s,%s\n" % (line[0],line[1],line[2],result[0],result[1],result[2])
                # print result_string
            else:
                result_string = "%s,%s,%s,%s\n" % (line[0],line[1],line[2],"No Country Found or failure occurred")
                # print result_string
        except (IndexError, ValueError) as e:
            print e
            result_string = "Error on element: %s\n" % line[1]
        try:
            outfile.write(result_string)
        except:
            print "Write error happened with %s" % line[1]
[/sourcecode]

And as always, I am most grateful for feedback! Hope this may come in handy!

Create Thumbnails Programmatically

If you have several images following a certain structure on a web page and want them as thumbnails, it can be useful to create these programmatically. The manual way of creating thumbnails (using Photoshop or similar) can often be time consuming, while the execution time for a script resizing an image is a matter of fractions of a second. If you already have a script, it can also be adapted for other similar situations. With the Python Imaging Library (PIL), thumbnails can be created in no time.

In this real-world example a Python script runs through the folder in which it is located and resizes each image of type png and jpg (hopefully the filenames correspond to the format); the resized images are then put inside a new folder.

Note that PIL preserves the aspect ratio and that the size variable just sets the boundary. In the code below the important constraint is that images do not exceed 150px in width (the usual convention is width, then height). Passing the size to the PIL Image object's thumbnail method transforms the object into a thumbnail. The filter to use for the transformation can be passed as an optional parameter. From the documentation: "The filter argument can be one of NEAREST, BILINEAR, BICUBIC, or ANTIALIAS (best quality). If omitted, it defaults to NEAREST."

#!/usr/bin/python
# -*- coding: utf-8 -*-
import Image
import os
#Imports the Image and os libraries. os is part of standard libraries. Image is part of Python Image Library (PIL)
#PIL can be downloaded from: http://www.pythonware.com/products/pil/

#Sets and creates the directory where you want your files to be saved
outdir = "150images/"
os.mkdir("./"+outdir)
#If the directory already exists this will cause an OSError.

size = 150, 400 #Set the size that you want to resize your image to.
#Thumbnail automatically checks for ratio consistency so alter the important variable (height or width)

for files in os.listdir("."):
    #Builds the output filename with the appropriate suffix.
    outfile = os.path.splitext(files)[0] + "_thumbnail.jpg"
    #Transforms JPG formatted files
    if files.endswith(".jpg"):
        im = Image.open(files)
        try:
            im.thumbnail(size, Image.ANTIALIAS)
            im.save(outdir+outfile, "JPEG")
            print "Saved the file: %s" % (outdir+outfile)
        except IOError:
            print "cannot create thumbnail for '%s'" % files
    #Transforms PNG formatted files
    if files.endswith(".png"):
        im = Image.open(files)
        try:
            im.thumbnail(size, Image.ANTIALIAS)
            im.save(outdir+outfile, "JPEG")
            print "Saved the file: %s" % outfile
        except IOError:
            print "cannot create thumbnail for '%s'" % files

The script will execute once, but if you run it a second time you will get an OSError, because the folder is created at the beginning of the script. I have chosen to name this folder after the significant size boundary in the script (maximum 150 px width). PIL is not part of the standard library, but it can easily be installed through pip or easy_install. Python is available on almost every platform and comes preinstalled on the Mac.
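If you want the script to be rerunnable, one option is to create the folder only when it is missing. A minimal sketch of that guard (not part of the original script):

[sourcecode language="python"]
# A minimal sketch: only create the output folder if it does not already exist,
# so the script can be run repeatedly without raising OSError.
import os

outdir = "150images/"
if not os.path.exists(outdir):
    os.mkdir(outdir)
[/sourcecode]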

Open Storting Data

A while back I wrote a short post about how data from the Storting could be retrieved programmatically from its web pages. As a starting point for working with data, whether for play or for serious work, I think public data works very well. We live in a democratic society, and no matter how much (or how little) trust you have in politicians or the media, you have a right as a citizen to do your own analyses of public information, and hopefully you will find something interesting that helps you form an opinion on a concrete issue, or decide whom to vote for at the next election. It also helps that the Freedom of Information Act lets you do what you want with the data without having to ask for permission first.

The last time we used data from the Storting we had to resort to a number of techniques to get hold of the data we wanted. We parsed HTML pages and used regular expressions to reverse engineer the Storting's pages in order to extract the representatives' IDs, names, dates of birth and gender. Since then, the Storting has chosen to publish its data in a machine-readable format, making it easier to explore. This initiative joins a new but steadily growing tradition of open data (or Open Public Data, as I chose to call it in my master's thesis).

At data.stortinget.no you can find data about cases, sessions, representatives, topics, committees and other things relevant to the day-to-day workings of the country's legislative assembly. This information has always been available to the public, but now it is also available in a machine-readable format. On the new site you will find an overview of the data and examples of use, as well as a data builder.

If you want to get started using the data, or just want a local copy of some of the core data (by which I mean structural data such as categories, representatives, committees, counties, sessions and periods), you can use the Python script below as a starting point. It downloads some of the data and stores it in an SQLite database, which should make a decent basis for further experimentation.

Script "getBasicData.py":

[sourcecode language="python"]
# -*- coding: UTF-8 -*-
import sqlite3
import httplib
import urllib2
import os
from xml.dom import minidom, Node
from xml.etree import ElementTree

SITE = "http://data.stortinget.no/eksport/"
DATA = "data.db"

def get_perioder(cur):
    DOK = "stortingsperioder"
    page = None
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK
    if page:
        tree = ElementTree.parse(page)
        root = tree.getroot()
        top = list(root)[2]
        elements = list(top)
        for el in elements:
            fra = el.find('{http://data.stortinget.no}fra').text
            per_id = el.find('{http://data.stortinget.no}id').text
            til = el.find('{http://data.stortinget.no}til').text
            print "id: %s fra: %s til: %s" % (per_id, fra, til)
            cur.execute("""INSERT INTO perioder(fra, id, til) VALUES('%s','%s','%s')""" % (fra, per_id, til))
    else:
        print "Could not load page: "+DOK
    return cur

def get_sesjoner(cur):
    DOK = "sesjoner"
    page = None
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK
    if page:
        tree = ElementTree.parse(page)
        root = tree.getroot()
        top = list(root)[2]
        elements = list(top)
        for el in elements:
            fra = el.find('{http://data.stortinget.no}fra').text
            ses_id = el.find('{http://data.stortinget.no}id').text
            til = el.find('{http://data.stortinget.no}til').text
            print "id: %s fra: %s til: %s" % (ses_id, fra, til)
            cur.execute("""INSERT INTO sesjoner(fra, id, til) VALUES('%s','%s','%s')""" % (fra, ses_id, til))
    else:
        print "Could not load page: "+DOK
    return cur

def get_emner(cur):
    DOK = "emner"
    page = None
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK

    if not page:
        print "Could not load page: "+DOK
        return cur
    tree = ElementTree.parse(page)
    root = tree.getroot()
    top = list(root)[1]
    elements = list(top)
    for el in elements:
        navn = el.find('{http://data.stortinget.no}navn').text
        main_emne_id = el.find("{http://data.stortinget.no}id").text
        print "HOVED: %s %s" % (navn, main_emne_id)
        cur.execute("""INSERT INTO hovedemner(id, navn) VALUES('%s','%s');""" % (main_emne_id, navn))
        if("true" in el.find("{http://data.stortinget.no}er_hovedemne").text):
            for uel in el.find("{http://data.stortinget.no}underemne_liste"):
                navn = uel.find("{http://data.stortinget.no}navn").text
                emne_id = uel.find("{http://data.stortinget.no}id").text
                print "UNDER: %s %s, horer til: %s" % (navn, emne_id, main_emne_id)
                cur.execute("""INSERT INTO underemner(id, navn, hovedemne_id) VALUES('%s', '%s', '%s');""" % (emne_id, navn, main_emne_id))
    return cur

def get_fylker(cur):
    DOK = "fylker"
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK
        return cur

    tree = ElementTree.parse(page)
    root = tree.getroot()
    top = list(root)[1]
    elements = list(top)
    for el in elements:
        fylke_id = el.find("{http://data.stortinget.no}id").text
        navn = el.find("{http://data.stortinget.no}navn").text
        print "id: %s, navn: %s" % (fylke_id, navn)
        cur.execute("""INSERT INTO fylker(id, navn) VALUES('%s','%s');""" % (fylke_id, navn))

    return cur

def get_partier(cur):
    DOK = "allepartier"
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK
        return cur

    tree = ElementTree.parse(page)
    root = tree.getroot()
    top = list(root)[1]
    elements = list(top)
    for el in elements:
        parti_id = el.find("{http://data.stortinget.no}id").text
        navn = el.find("{http://data.stortinget.no}navn").text
        print "id: %s, navn: %s" % (parti_id, navn)
        cur.execute("""INSERT INTO partier(id, navn) VALUES('%s','%s');""" % (parti_id, navn))

    return cur

def get_komiteer(cur):
    DOK = "allekomiteer"
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK
        return cur

    tree = ElementTree.parse(page)
    root = tree.getroot()
    top = list(root)[1]
    elements = list(top)
    for el in elements:
        kom_id = el.find("{http://data.stortinget.no}id").text
        navn = el.find("{http://data.stortinget.no}navn").text
        print "id: %s navn: %s" % (kom_id, navn)
        cur.execute("""INSERT INTO komiteer(id, navn) VALUES('%s','%s');""" % (kom_id, navn))
    return cur

def get_representanter(cur):
    DOK = "dagensrepresentanter"
    try:
        page = urllib2.urlopen(SITE+DOK)
    except:
        print "Failed to fetch item "+DOK
        return cur

    tree = ElementTree.parse(page)
    root = tree.getroot()
    top = list(root)[1]
    elements = list(top)
    for el in elements:
        doedsdato = el.find("{http://data.stortinget.no}doedsdato").text
        etternavn = el.find("{http://data.stortinget.no}etternavn").text
        foedselsdato = el.find("{http://data.stortinget.no}foedselsdato").text
        fornavn = el.find("{http://data.stortinget.no}fornavn").text
        repr_id = el.find("{http://data.stortinget.no}id").text
        kjoenn = el.find("{http://data.stortinget.no}kjoenn").text
        fylke = el.find("{http://data.stortinget.no}fylke/{http://data.stortinget.no}id").text
        parti = el.find("{http://data.stortinget.no}parti/{http://data.stortinget.no}id").text
        #komiteer = el.find("{http://data.stortinget.no}komiteer_liste/{http://data.stortinget.no}komite/{http://data.stortinget.no}id").text
        print "repr: %s, %s %s, parti: %s, fylke: %s" % (repr_id, fornavn, etternavn, parti, fylke)
        cur.execute("""INSERT INTO representanter(doedsdato, etternavn, foedselsdato, fornavn, id, kjoenn, fylke, parti) VALUES('%s','%s','%s','%s','%s','%s','%s','%s');""" % (doedsdato, etternavn, foedselsdato, fornavn, repr_id, kjoenn, fylke, parti))

    return cur

def create_schema(cur):
    cur.execute("DROP TABLE IF EXISTS perioder")
    perioder = "CREATE TABLE perioder(fra varchar(255), id varchar(255), til varchar(255))"
    cur.execute("DROP TABLE IF EXISTS sesjoner")
    sesjoner = "CREATE TABLE sesjoner(fra varchar(255), id varchar(255), til varchar(255))"
    cur.execute("DROP TABLE IF EXISTS hovedemner")
    hovedemner = "CREATE TABLE hovedemner(id int, navn varchar(255));"
    cur.execute("DROP TABLE IF EXISTS underemner")
    underemner = "CREATE TABLE underemner(id int, navn varchar(255), hovedemne_id int)"
    cur.execute("DROP TABLE IF EXISTS fylker")
    fylker = "CREATE TABLE fylker(id varchar(255), navn varchar(255));"
    cur.execute("DROP TABLE IF EXISTS partier")
    partier = "CREATE TABLE partier(id varchar(255), navn varchar(255));"
    cur.execute("DROP TABLE IF EXISTS komiteer")
    komiteer = "CREATE TABLE komiteer(id varchar(255), navn varchar(255));"
    cur.execute("DROP TABLE IF EXISTS representanter")
    representanter = "CREATE TABLE representanter(doedsdato varchar(255), etternavn varchar(500), foedselsdato varchar(255), fornavn varchar(500), id varchar(255), kjoenn varchar(255), fylke varchar(255), parti varchar(255));"
    cur.execute(perioder)
    cur.execute(sesjoner)
    cur.execute(hovedemner)
    cur.execute(underemner)
    cur.execute(fylker)
    cur.execute(partier)
    cur.execute(komiteer)
    cur.execute(representanter)
    return cur

if __name__ == "__main__":
    conn = sqlite3.connect(DATA)
    cur = conn.cursor()
    cur = create_schema(cur)
    cur = get_perioder(cur)
    cur = get_sesjoner(cur)
    cur = get_emner(cur)
    cur = get_fylker(cur)
    cur = get_partier(cur)
    cur = get_komiteer(cur)
    cur = get_representanter(cur)
    conn.commit()
    conn.close()

[/sourcecode]
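Once the script has run, the database can be queried right away. A minimal sketch of such a query (the table and column names are those defined in create_schema above):

[sourcecode language="python"]
# -*- coding: UTF-8 -*-
# A minimal sketch of querying the resulting data.db database.
import sqlite3

conn = sqlite3.connect("data.db")
cur = conn.cursor()

# Count how many current representatives each party has.
for parti, antall in cur.execute(
        "SELECT parti, COUNT(*) FROM representanter GROUP BY parti ORDER BY COUNT(*) DESC"):
    print "%s: %d" % (parti, antall)

conn.close()
[/sourcecode]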

The picture was taken by Kjell Jøran Hansen and is licensed under a Creative Commons licence. The picture was found through Flickr.

Screen Scraping with Python

Tim Berners-Lee, the father of the Web, talks in an inspiring TED talk about the web of data and encourages everyone to share their data. The idea behind the web of data is that the web has so far mainly been document-based and is now changing to become more data-based. The web's documents contain a lot of underlying information whose semantic content can be challenging to extract, especially if you are a computer. Even though, ideally, all information published on the Internet should be readable by everyone, that is not the case, so sometimes we have to be pragmatic and make the best of the situation. We simply have to scrape the information, the data we need, out of the web page, hence the name screen scraping.

One of the most fun aspects of screen scraping, I think, is understanding how the content server is put together and deconstructing it. Since few organisations (and few people, for that matter) hand-code every HTML page they show to the public, there is a system behind the pages that are published. How the whole site is organised and how each individual page is laid out is a good starting point for understanding the structure. Many sites today also follow a REST architecture, which means we can learn a lot just by looking at the URL we use to reach the site's different functions. Many such sites are built with popular frameworks that implement REST, such as Ruby on Rails and Django, which also make it easy to expose a data interface, so a query can return a data-formatted response such as XML or JSON instead of a template-based HTML response. If you can ask for a data response, you may be able to get your data without scraping at all. In other cases the content is a copy-and-paste or a file upload from Excel or Word pasted into a web interface or converted programmatically to HTML, or data from another system fed to the publishing portal, again as HTML. In those cases we need to tweak and fiddle a bit to extract the data.
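To make the first option concrete, here is a minimal sketch of asking for a data response before falling back to scraping. The endpoint URL is purely hypothetical; the point is only the pattern of preferring a machine-readable format when one exists:

[sourcecode language="python"]
# A minimal sketch, assuming a hypothetical REST endpoint that can return JSON.
# If the server offers a data format, prefer it over parsing HTML.
import json
import urllib2

url = "http://example.com/api/representatives.json"  # hypothetical URL
try:
    response = urllib2.urlopen(url)
    data = json.loads(response.read())
    print "Got %d records without scraping" % len(data)
except urllib2.HTTPError:
    print "No data interface available, falling back to screen scraping"
[/sourcecode]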

A blow for democracy (?)

Francis Boulle, perhaps best known from the TV series Made in Chelsea, has created the site sexymp. On this site you are presented with pictures of two members of the British House of Commons. Your task is to click on the more attractive person, and the members are ranked based on your clicks and everyone else's. The system is almost identical to the site Mark Zuckerberg puts together in the opening scene of the 2010 film The Social Network. It can be entertaining enough to look at the weighting algorithm and try to understand how Chris Evans from Islwyn is considered the most attractive and Steve McCabe from Birmingham the least, but for our purposes it is more interesting to see how we can collect names and pictures if we want to build a similar site for Norway.

Stortinget.no

A good place to start if we want to build a Norwegian sexymp (or rank some other attribute, such as trustworthiness, honesty, classiness and so on) is Stortinget.no. If we open the list of representatives we find that the URL has the following structure: http://stortinget.no/no/Representanter-og-komiteer/Representantene/Representantfordeling/Representant/?perid=SMY (for Sverre Myrli of Arbeiderpartiet) and correspondingly for another representative: http://stortinget.no/no/Representanter-og-komiteer/Representantene/Representantfordeling/Representant/?perid=IME (Ine Marie Eriksen of Høyre). If you study these URLs closely, what is worth noticing? … Exactly: the representatives have an ID.

Let us now look at what Ine Marie Eriksen's profile actually looks like. Since your browser, when you open a page, only downloads a document with references, nothing magical happens (okay, with web 2.0, Ajax and lots of JavaScript, or Flash, Silverlight and other plug-ins, it can seem more mysterious or be quite hard to peek inside the box). In every browser you can access the raw text that later renders the page, and by looking at it you can find out how the page is built and which other resources (such as images, stylesheets and JavaScript code) are called. Now try to view the HTML source of Eriksen's Storting bio and see if you can find the image resource.

There we go: on line 472 we find the following HTML snippet:
<img id="ctl00_MainRegion_RepShortInfo_imgRepresentative" src="/Personimages/PersonImages_Large/IME_stort.jpg" alt="Søreide, Ine M. Eriksen" style="border-width:0px;" />
If you follow this file path on the Storting domain, so that the URL you use is http://stortinget.no/Personimages/PersonImages_Large/IME_stort.jpg, only the image is opened. You have now found the picture that is loaded into the bio page we opened earlier. Now try to open Sverre Myrli's picture using only his system ID, which is SMY, and Ine Marie Eriksen's URL.

If you have now replaced "IME_stort.jpg" with "SMY_stort.jpg" you have seen how our detective work bears fruit, and that we can get hold of all the pictures as long as we have all the IDs. We now need two things: a list of all the representatives with their respective IDs, and a way to download the images to our local machine. It is time to do a bit of scripting.

Python

What we are about to do can be done in any general-purpose programming language, but I have chosen Python. I find Python a great language for solving small tasks that involve a lot of trial and error. There is also plenty of documentation and literature on Python in its different roles, it comes with many useful libraries that work right out of the box, and it is already installed in most Linux distributions and on every Mac.

I have published various code examples on GitHub, and you can find the source code used in the rest of this little article there. If you do not have Python installed, or lack some of the libraries I use, there are many good guides on the web for getting that sorted.

As mentioned earlier, one of the first things we need is a list of all the representatives. Luckily for us, there is a page with such a list. At the URL http://stortinget.no/no/Representanter-og-komiteer/Representantene/Representantfordeling/ we find a complete list of all the representatives with name, party, seat number and ID. If we open the page in a browser the ID is not shown, but we find it if we look at the source code instead. This is probably because the ID is not very important to most visitors, but it is essential to the Storting's own system, and therefore important to us too. To find the information we need before moving on, we have to take a closer look at the HTML source.

Here you can see that the list of representatives is placed inside a table structure. The tr tag is a table row and the td tag is table data. Further down we can see that the table structure is consistent, so we could use that as our starting point, but in the script findPerId.py we have used another recurring pattern instead: every representative entry has a link structure that follows a specific pattern.
Amundsen, Per-Willy

The fact that we have a specific pattern, consisting of "Representanter-og-komiteer/Representantene/Representantfordeling/Representant/" followed by a question mark, which in turn is followed by an HTML closing bracket ("&gt;) and then two text strings separated by a comma (,), and that this pattern does not occur anywhere other than exactly where we want to find data, is good news for us. It means we can express it with a regular expression:

(r"Representanter-og-komiteer/Representantene/Representantfordeling/Representant/\?perid=(\w*)\">([a-å]*), ([a-å]*)", re.IGNORECASE)

We could of course fetch the whole HTML file from the Storting's web pages and run the script over all of the text, but hey, your browser parses HTML, so we can do that too. After downloading the document over HTTP with urllib2 we pass the HTML text to BeautifulSoup, which lets us extract all the anchor tags and limit our regex check to those. Many of the links will fail our pattern, and that is rather the point, since neither "back to the front page" nor "copyright information" are representatives we want to extract from the system. Since we have grouped the matches into three groups using parentheses, we can pull out the user ID, surname and first name respectively. Once we have all the information we need, we write it to a comma-separated file (commonly referred to as CSV) which we can then load in our next script.
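A minimal sketch of this approach is shown below. It is not the exact findPerId.py from GitHub; it assumes the old BeautifulSoup 3 API (from the same era as the rest of the code), reuses the regular expression from above, and invents the output filename representatives.csv:

[sourcecode language="python"]
# A minimal sketch of the approach described above; not the exact findPerId.py script.
import csv
import re
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

URL = "http://stortinget.no/no/Representanter-og-komiteer/Representantene/Representantfordeling/"
pattern = re.compile(
    r"Representanter-og-komiteer/Representantene/Representantfordeling/Representant/"
    r"\?perid=(\w*)\">([a-å]*), ([a-å]*)",
    re.IGNORECASE)

html = urllib2.urlopen(URL).read()
soup = BeautifulSoup(html)

outfile = open("representatives.csv", "w")
writer = csv.writer(outfile)
for anchor in soup.findAll("a"):
    # Run the regex against the raw markup of each anchor tag only.
    match = pattern.search(str(anchor))
    if match:
        per_id, lastname, firstname = match.groups()
        writer.writerow([firstname, lastname, per_id])
outfile.close()
[/sourcecode]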

Downloading the pictures programmatically

After running findPerId.py from the previous section we now have a saved CSV file with the first name, surname and ID of the members of the Storting. Now we just need to iterate through this list and download pictures of our elected representatives. The script that does this job is called downloadPictures.py, and it uses the urlretrieve method from the urllib library to request each picture, download it and place it in a folder we create with os.makedirs (if the folder does not already exist). The pictures we are downloading have the following text before and after the representative's ID:

path_big_pre = "http://stortinget.no/Personimages/PersonImages_Large/"
path_big_post = "_stort.jpg"
path_little_pre = "http://stortinget.no/Personimages/PersonImages_Small/"
path_little_post = "_lite.jpg"

As you can probably tell from these variable names and strings, each representative has both a large and a small picture. The urlretrieve method takes two arguments: the first has to be the absolute path, that is the URL with the ID and the text before and after it, while the output can use a relative path, so you can change the filename here if you want to.
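A minimal sketch of that download loop is shown below. It is not the exact downloadPictures.py; it assumes the CSV layout from the previous step (first name, surname, ID per row) and an output folder name chosen for illustration:

[sourcecode language="python"]
# A minimal sketch of the download loop described above; not the exact downloadPictures.py.
import csv
import os
import urllib

path_big_pre = "http://stortinget.no/Personimages/PersonImages_Large/"
path_big_post = "_stort.jpg"
outfolder = "images"  # relative output folder (name is an assumption)

if not os.path.exists(outfolder):
    os.makedirs(outfolder)

for firstname, lastname, per_id in csv.reader(open("representatives.csv")):
    url = path_big_pre + per_id + path_big_post
    target = os.path.join(outfolder, per_id + ".jpg")
    print "Downloading %s %s -> %s" % (firstname, lastname, target)
    urllib.urlretrieve(url, target)
[/sourcecode]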

The road ahead

When you have run both scripts you have the names, IDs and pictures of all the members of the Storting, and if you have read the text and googled unfamiliar names and techniques along the way, you have also learned something new. Good luck with further use and other experiments.

This post was written at the University of Oslo stand during The Gathering 2012 in Hamar, a hive of creativity.

Analysing the Bible

The computer is a good tool in many areas, but within its defining field, computation, it is great. With over a million computations per second, even a large and heavy book (in its physical manifestation) can be sorted in the blink of an eye. A while ago I tried to sort the King James version of the Bible.

Inspiration

Over the last few years you may have encountered Jonathan Feinberg's Wordle. This visualisation of word frequency in text has been popular for conveying writing patterns and showing established key terms, especially for texts where users express themselves in single words (describe this BRAND with five adjectives). The words we use when we express ourselves matter: statistical frequency can give us an indication of important topics, trends and values, and it can also convey how languages change over time.

The Process

I chose to use a language that is relatively new, both in terms of computer history and my own skills: Python. Python is a flexible language which is said to come with "batteries included"; in other words, much functionality is available in the standard library. Python also comes with a live interpreter, and many different frameworks are supported through ports. The logic of my little program is quite simple. It can very crudely be divided into five steps: 1) read the text file, 2) for each word, create a counter if no previous occurrence is found, otherwise increment it, 3) sort the occurrences by frequency, 4) write each word with its frequency, separating frequency and word with a comma and words with newlines, 5) print the total number of words.

[cc lang="python"]

#!/usr/bin/python

from string import maketrans
import operator
import sys

if len(sys.argv) < 2:
    print "Error: Please provide a textfile as argument"
    sys.exit(1)
else:
    textfile = sys.argv[1]

words = {}
intab = ",.;:#[]()?!0123456789&<>-'\n\t\""
outtab = " " * len(intab)
transtab = maketrans(intab, outtab)

try:
    linestring = open(textfile, 'r').read()
    linestring = linestring.translate(transtab).lower()
    items = linestring.split(' ')

except Exception:
    print "Error: Could not open file."
    sys.exit(1)

for item in items:
    if item in words:
        words[item] = words[item] + 1
    else:
        words[item] = 1

sorted_words = sorted(words.iteritems(), key=operator.itemgetter(1))
f = open(textfile+"out.txt", "w")
t = open("testfile.test", "w")

for k, v in sorted_words:
    print k, v
    t.write(k+" "+str(v)+"\n")
    f.write(k+","+str(v)+"\n")

print "The total amount of words in "+ textfile +" is "+str(len(words))

[/cc]

The code is more complex than the five steps explained above. It gets the path to the text file from an argument following the program name in the terminal, and it also prints simple error messages in case anything goes wrong.

Findings

The Swiss linguist Ferdinand de Saussure (1857-1913) divided language into langue and parole, French for language and speech, where the first is the impersonal, social structure of signs, and the latter the personal phenomenon of language as speech acts. An example can be found in the game of chess: the simple structures defining the rules of the game can easily be understood, but the usage of those rules is what gives the game its complexity. Let us use this distinction while analysing the output file of the program above.

Parole: The Bible is an interesting text. For the last two thousand years the book has been taken as law and a guide to life by many millions of people, and even today religious texts are used as legislation in a few countries and as a rule for how some live and organise their lives. The whole tradition of hermeneutics began with the interpretation of religious texts, and wars have been fought over the analysis and the subsequent execution of actions described explicitly or implicitly. Our little test does not rely on semantic interpretation, but see what you will interpret from these words:

Love: 318
Hate: 87
Jesus: 990
God: 4531
Satan: 57
Jerusalem: 816

Langue: When Samuel Morse tried to make an efficient language for transferring messages over the wire in the 19th century, he looked to the English language and its use to find out how a message could be sent efficiently. To do this he went to typographers to see which letter types they had the most of. Morse code (so popular that we today use the name generically) is constructed with a short dot corresponding to 'e' and a long dash corresponding to 't', the most frequent letters in the English language. So how do you write 'z' or 'y', letters that are used less frequently? 'Y' is dash-dot-dash-dash, and 'z' is dash-dash-dot-dot. You may at this point guess what the most frequent occurrences found by this little program were. Here are the 20 most frequent words:

them,6514
him,6695
not,6727
is,7119
be,7188
they,7490
lord,7990
a,8438
his,8563
i,8868
unto,9041
for,9130
shall,9851
he,10517
in,12891
that,13229
to,14048
of,35312
and,52167
the,64926

Some of the largest occurrences were removed since they had no semantic value. Before counting and sorting, several characters were replaced with whitespace and everything was lowercased.

This shows us that the most frequent words are in fact the small words with a structuring function: prepositions, articles and conjunctions. We can also see that the word 'lord' is on the top-20 list, which may be related to the subject role the lord plays in many biblical sentences, e.g. 'the lord said', 'the lord told', etc.

Program-wise there is still potential for improvement in the program I wrote. There seems to be a parsing error causing a small group of the occurrences to be printed in a non-standard format: they are written with a comma before the words.
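One possible culprit (an assumption, not verified against the original output) is punctuation that is not listed in intab, such as typographic quotes or carriage returns, surviving as part of a "word". A more robust variant is to extract words with a regular expression instead of a fixed translation table, as in this sketch:

[cc lang="python"]
# A sketch of a more robust word extraction, assuming the issue is punctuation
# missing from intab. Instead of translating a fixed set of characters to
# spaces, pull out runs of letters (and internal apostrophes) directly.
import re

def extract_words(text):
    # Lowercase first, then keep only alphabetic runs such as "god's".
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

words = {}
for item in extract_words(open("bible.txt").read()):  # filename is an example
    words[item] = words.get(item, 0) + 1
[/cc]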

Yesterday I received the book Visualizing Data by Ben Fry, one of the creators of Processing, so hopefully I will get some visual representations of data up and running soon.

If you want a copy of the counted and sorted file, it can be found here.


The article picture is named Bibles and is the property of GeoWombats. It is licensed under Creative Commons and was acquired through Flickr. Please refer here for more information.