API-scrape images from Instagram

An image can say more than a thousand words, especially when you add a retro filter and a score of hashtags to go with it. That, in a nutshell, is Instagram: the powerful app which revolutionised people's creativity when it came to documenting dietary habits, and which popularised images in social media.

Instagram brought the social into photography in a way other, more desktop-oriented photo sharing applications like Picasa and Flickr never managed. It is social: users can like and comment on each other's pictures. Instagram also alters images by reducing their photographic clarity (let's emulate far less technologically advanced cameras by adding a filter), but then again, this adds style to the images and makes some of them look cool. However, I will let the filters and pastulation (the digital emulation of an analogue past; a term coined by moi, but please let me know if there is a buzzword for this and I may conform) rest for now. Let us instead focus, pun intended, on something else: the tagging of images, and how to retrieve images with a certain tag.

Adding context and indices with tags

An Instagram picture may be tagged with several hashtags: words or concatenated words prepended with a hash sign #. Hashtags are great for two reasons. First, they provide a social signifier telling users that this is a tag, so users can organise their content with it and create a social context in which the photo exists, e.g. #instafood (picture of a meal), #selfie (a person taking a picture of him- or herself), #duckface (quack quack pouting) and #onedirection (popular teenage idols). Tags can be of any kind, from current affairs to more general stuff. Second, they provide a token declaring something indexable for the Instagram server and other technical resources. Once the computer system knows something is a tag, it can group tags together, analyse a tag and the users associated with it, aggregate statistics on it and do other things to enhance the user experience. In our case the tagging is great because we want to retrieve images with a given tag.
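As an aside, pulling the hashtags out of a caption string is a one-liner in Ruby. A minimal sketch (the caption text is invented for illustration):

```ruby
# An invented caption with a few hashtags.
caption = "Lunch with the band #instafood #selfie #onedirection"

# Scan for '#' followed by word characters; the capture group drops the '#'.
tags = caption.scan(/#(\w+)/).flatten
puts tags.inspect  # => ["instafood", "selfie", "onedirection"]
```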

The #InstagramTagGetterScript

Below is a script which takes a tag name as an argument and downloads the images and the metadata associated with them. To get it to work you will need to obtain an API key from Instagram's developer pages. This key goes into the initial request sent to the server (the one stored in the next_url variable). We are using the tags endpoint to download the images.

The rough outline of the script is as follows: 

First we define a class to store each InstaEntry. This class comes with the functionality to retrieve and store the image and metadata, as well as to dump the data to disk and load them from disk again. The class holds all the variables we are interested in collecting, and once an instance is created these variables are set, provided they exist for the image.

Once the structure is created, some initial parameters are set: the tag and our initial URL request, and the folders into which we will store the data are created. When everything is set up, we run a loop which continues as long as there are data available and we get responses with HTTP status 200 (OK). The loop instantiates an InstaEntry for each image, which then downloads the image as well as the metadata on the fly. The objects are retained until the program has finished, but all large data (i.e. the images) are written directly to disk and not kept in memory.

Please contact me if you want to use this script, tailor it, or have any questions related to it.

#!/usr/bin/ruby
# encoding: UTF-8

require 'active_support'
require 'restclient'
require 'csv'
require 'open-uri'
require 'fileutils'

class InstaEntry
  attr_accessor :id, :username, :picture_url, :likes, :filter, :location, :type, :caption, :tags, :fullname, :user_id, :created_time, :link

  def initialize(id)
    @id = id
    @@threads = []
  end

  def marshal_dump
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def marshal_load(variables)
    @id = variables[0]
    @username = variables[1]
    @picture_url = variables[2]
    @likes = variables[3]
    @filter = variables[4]
    @location = variables[5]
    @type = variables[6]
    @caption = variables[7]
    @tags = variables[8]
    @fullname = variables[9]
    @user_id = variables[10]
    @created_time = variables[11]
    @link = variables[12]
  end

  def to_arr
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def self.get_image(obj,tag)
    @@threads << Thread.new(obj,tag) {
      begin
        open("images_#{tag}/#{obj.id}_#{obj.username}_.#{obj.picture_url.match('\.(jpe?g|gif|png)')[1]}","wb") do |file|
          file << open("#{obj.picture_url}").read
        end
      rescue
        puts "ERROR: #{obj.id} triggered an Exception in get_image method"
      end
    }
  end

  def self.print_metadata(obj,tag)
    open("md_#{tag}/#{obj.id}_#{obj.username}.txt","wb") do |file|
      file.print(obj.to_arr)
    end
  end

end #end InstaEntry class

#
# This block sets the parameters and reads the tag name from the first command-line argument
#

raise ArgumentError, "Missing name of tag to download" if ARGV.length < 1

$tag = ARGV[0]

output = open("output.json","wb")
next_url = URI::encode("https://api.instagram.com/v1/tags/#{$tag}/media/recent?access_token=51998418.d146264.e77441adc4a04399874a19b48bb91e71f&min_id=1")
# NB: The access token above is obfuscated. Get your own by signing up for a developer account at Instagram.
puts next_url

unless File.directory?("md_#{$tag}")
  FileUtils.mkdir_p("md_#{$tag}")
end

unless File.directory?("images_#{$tag}")
  FileUtils.mkdir_p("images_#{$tag}")
end

count = 0
instas = {}

#
# This block runs through all the subsequent pagination pages. It stops when it encounters an HTTP status other than 200 or when no next_url is returned.
#
begin
  response = RestClient.get(next_url)
  json = ActiveSupport::JSON.decode(response)
  pretty_json = JSON.pretty_generate(json)
  puts "Status code #{json['meta']['code']} for URL #{next_url}.. Fetching"
  next_url = json['pagination']['next_url']
  sleep 2

  # loop through the data elements
  json['data'].each do |item|
    puts item['link']
    puts item['user']['full_name']
    ie = InstaEntry.new(item['id'])
    instas[item['id']] = ie

    ie.username = item['user']['username']
    ie.picture_url = item['images']['standard_resolution']['url']
    ie.likes = item['likes']['count']
    ie.filter = item['filter']
    ie.location = item['location']
    ie.type = item['type']
    ie.caption = item['caption']['text'] unless item['caption'].nil? or item['caption']['text'].nil?
    ie.tags = item['tags']
    ie.fullname = item['user']['full_name']
    ie.user_id = item['user']['id']
    ie.created_time = item['created_time']
    ie.link = item['link']

    InstaEntry.get_image(ie,$tag)
    InstaEntry.print_metadata(ie,$tag)
  end

  count += 1

  output << pretty_json

  puts "Now checked __ #{count} __ files and __#{instas.length}__ number of instas"
  puts "*****Ending with #{count} __ files and __#{instas.length}__ number of instas****" if next_url.nil?

end while not next_url.nil?

output.close

File.open("instadump_#{$tag}",'wb') do |f|
  f.write Marshal.dump(instas)
end

CSV.open("output_#{$tag}.csv", "wb", {:col_sep => "\t"}) do |csv|
  instas.each do |k,v|
    csv << instas[k].to_arr
  end
end

Disclaimer: The fact that this script enables you to download images associated with tags does not mean you can do whatever you want with them. First, please refer to the Instagram guidelines to confirm that you are actually allowed to download images. Second, respect the individual user's privacy and intellectual property rights; do not use images in a publishing context without the user's consent. Generally: be nice, and do good.

Create an online webservice fast with Sinatra and Heroku

So you want to explore the possibilities of the web, and move beyond static HTML files served from a web server. Here is a short introduction to getting a dynamic web server up and running as fast as possible, using the Sinatra web framework and the hosting provider Heroku. Heroku has gained a lot of credibility in the developer community, and Facebook now recommends Heroku as a PaaS (Platform as a Service) when you create services. It is super easy to get started with, and comes with lots of instant gratification.

Sinatra and Rails

Sinatra is a lightweight web framework using Ruby with Rack. I have earlier looked at Ruby on Rails, which is also a great framework based on the same platform. The main difference is the complexity of the two systems. Rails comes with a great Model-View-Controller structure, a database mapper and lots of neat stuff, but sometimes you just want simplicity, and Sinatra has simplicity written all over it. Yes, you can do a lot of advanced things in Sinatra, and you can make something really simple in Rails, but why not harness the nature of these two frameworks: if you are going to make something complex and want a strict convention-over-configuration approach, go for Rails; if you just want something really simple, clean and clear, let's get started with Sinatra.

Installation

I expect you to have Ruby and its gem package manager installed. If not, you will have to install these. I would also recommend looking into RVM, so that you can have multiple Ruby versions installed with their gems, and in this manner retain the software versions of your platform and gems when you upgrade other applications.

All you have to do is download the package 'sinatra'. Just type "gem install sinatra" into your shell, et voilà: you have installed Sinatra. You can verify this by opening Ruby in an interactive shell with the command 'irb' and then requiring the package with "require 'sinatra'" in the irb terminal.

You will also need to install heroku. The process is as easy as installing sinatra. Just type “gem install heroku” into the shell.

You will also need to install Git, the version control system, and sign up for Heroku.

Installing Git

The version control system Git is gaining a lot of momentum. The website GitHub is built around Git, and Heroku uses it for pushing local files to the web server. A fun fact: Git has also inspired a TED speech by Clay Shirky. You can download Git from here.

Sign up for Heroku

Just like installing Git is easy, signing up for Heroku is even easier. Sign up through Heroku’s website and follow the instructions. You will also need to download a heroku gem which makes interaction through the terminal quick and easy.

You also need to create a pair of SSH-keys and exchange these with the server for authentication when you are uploading files. Here you find a tutorial for creating the keys.

Creating a Sinatra application skeleton

What makes Sinatra a very convenient tool is how lightweight, yet extensible, it is. When you start you only need a couple of files, and from there you expand the application to accommodate the level of complexity you need. I like to use Sinatra for dead-simple small applications.

To begin creating a Sinatra app, first create an empty folder. I use the computer's command-line interface for this (on a Mac this is called Terminal and can be found in the Utilities folder). You can create a new folder with the mkdir command, with the name of the folder-to-be as the only argument: 'mkdir myfirstsinatraapp'

Navigate into this folder by using the cd command. For the most convenient *nix commands please see this introduction.

Put the Sinatra application under Git version control

In the folder, initialise git with the command “git init”. You now have an empty git repository.

Create the app

The only things you need to begin with are a config.ru file, a Gemfile and an app.rb file. An easy way to create these is with the "touch filename" command, where filename is the name of the desired file.

In the config.ru file, require the app.rb file, and call the “run Sinatra::Application” command. The file should look like this:

[sourcecode language="ruby"]
require './app'
run Sinatra::Application
[/sourcecode]

The other file you need to create is app.rb. This file acts as the main file of your application, where you define routes and fill them with logic.

At the top of the file, require sinatra and create a route to the index path:

[sourcecode language="ruby"]
#!/usr/bin/env ruby
# encoding: UTF-8

require 'sinatra'

get '/' do
  "Hello world"
end
[/sourcecode]

The '/' represents the root of your web app, and once you go to this page you will be greeted with the text "Hello world". Don't take my word for it, try it yourself.

You will also need a Gemfile, in which you list the gems used by your application and where to get them. Since our application is super simple, we only need to list Sinatra. After writing and saving the Gemfile, install the gems by typing 'bundle install'.

[sourcecode language="ruby"]
source 'https://rubygems.org'
gem 'sinatra'
[/sourcecode]

In the folder with your files, type "ruby app.rb"; this will start a web server with your application. The default location is localhost:4567, but the actual address and port are printed in the terminal. Open a web browser and type in the address and port number of your web server. You should now see the plain text "Hello world".

A very simple output

Commit and push your application to Heroku

To publish your application to Heroku you will need to add all files to the empty git repository we created above. Add all files with the command “git add .” (The dot represents the current directory). Check that all files are staged with “git status“. Commit the changes with “git commit -m ‘wohoo. going live’ “.

Now all the code is in the git repository and ready to be uploaded to Heroku. Type "heroku create" into the shell. Once the command has executed, it should have created and displayed a URL to your service on Heroku, and it should also have added a remote repository to your git configuration. This is transparent to you as a user, but you can verify it by typing 'more .git/config': a line with '[remote "heroku"]', and a couple of lines under it, should have been added.
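The added section looks something like this (the repository name after heroku.com: is generated by Heroku; the one below is made up for illustration):

```ini
[remote "heroku"]
	url = git@heroku.com:myfirstsinatraapp.git
	fetch = +refs/heads/*:refs/remotes/heroku/*
```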

When this is done, you are ready to push your code onto Heroku. Type ‘git push heroku master‘ to make your local code available from the internet. As you add files and alter code, add these to the git, commit and push them to Heroku.

Screen scraping the Øya festival programme

The research project Sky & Scene, where I work, looks among many other things at the streaming numbers from WiMP before, during and after the Øya festival. To do this we need a list of which artists are playing, which day they play and at what time of day. Before the data can be analysed, they have to be available as Excel sheets, in CSV format and in the database holding the streaming numbers. The data must therefore be retrieved and structured in a particular format.

A good starting point is to collect the data in CSV format. CSV stands for comma-separated values and is a list where the values for one record are gathered on one line, and where the record's data attributes, also called variables, are separated by (you guessed it) commas. You find a similar format in Excel, where one record sits on one line and its variables are given in columns.

Finding the data

OK, enough about formatting. Where can we find the data? A natural starting point is the festival's home page. On oyafestivalen.com (the festival's English home page) we find a menu item called "program", and there we find the programme.

The Chrome developer tools can be found in the menu. They are very useful for both web development and screen scraping.

The visual presentation of the page is of little help when screen scraping the programme, so we have to look at the HTML source. In Google Chrome you find it by right-clicking in the browser window and selecting "view page source". Alternatively, you can paste this link into your Chrome browser: "view-source:http://oyafestivalen.com/program/#all"

If you looked at the source, you will have noticed that the list of artists is missing. Why? Because the list is fairly long and is used by several sources, it is not loaded by the program page itself. It is loaded asynchronously with AJAX (Asynchronous JavaScript and XML). Open the Chrome Developer Tools from the menu and go to the Network tab. Reload the page by clicking the circular arrow to the left of the URL field.

Here you can see that a file called getArtists.php has been loaded (picture 1), and that this file is not loaded as part of our original request to the web server but is instead fetched via JavaScript. If we click to see what this URL delivers, we can see that the artist list comes from here. You find the URL of the page by right-clicking the name getArtists.php and selecting "copy link address".

Once you have the URL (http://oyafestivalen.com/wp-content/themes/oya13_new/includes/ajax/program/getArtists.php), you can paste it into your browser window. You should now get a list without any particular formatting that looks roughly like this:

The Øya festival's artist list is fetched from the server asynchronously to save time when the main page loads. Now we have found the data we need.

OK, now we have found the data we need. We just have to find a good way of extracting it from the page. Let us take a look at the source behind the concert list. Here we find both the data and the structure we need:

The data we need, but with a different formatting. All that remains is the extraction and reformatting.

Here we can see that:

  1. At the outermost level there is a div tag with the class "table title". This introduces the description shown above the column in the presentation.
  2. We have an unordered list (ul tag) with the class "table".
  3. The unordered list has several children placed in list elements (li). These use HTML5 data attributes, but we will not need those this time.
  4. Each list element has a span element with the class "name", whose content is the name of the artist.
  5. The list element also has a "scene" class with the stage name as its content.
  6. Finally, the list element has a "date" class containing the first three letters of the day, three non-breaking spaces (HTML syntax: &nbsp;) and the start time of the concert.

Here we find all the data, and the formatting is the same for all the elements in the list with the class "table".
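Put together, the markup we are traversing looks roughly like this (reconstructed from the description above and the selectors in the script; the tag names, attribute values and artist name are my guesses, not copied from the page):

```html
<div class="table title">...</div>
<ul class="table">
  <li data-artist-id="...">
    <span class="name"><a href="...">Some Artist</a></span>
    <span class="scene">Enga</span>
    <span class="date">fri&nbsp;&nbsp;&nbsp;22.00</span>
  </li>
  <!-- one li element per concert -->
</ul>
```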

Now that we have found the data source, we can start extracting the data for further use.

Screen scraping with Ruby and Nokogiri

We have now found the source, and we can make use of Ruby and the library (Ruby term: gem) Nokogiri.

Before we start fetching the data, we prepare the script which will retrieve them from the festival's home page. We include Nokogiri, which will help us parse the data source. We also load the CSV library to write the output files, and open-uri so we can read the URI source as a file.

[sourcecode language="ruby"]
#!/usr/bin/ruby
# -*- encoding : utf-8 -*-

require 'nokogiri'
require 'open-uri'
require 'csv'
[/sourcecode]

The Concert class

To store and manipulate the data we create a class holding the four values we need: artist, scene, date and datetime. The source gives us the first three values, and datetime we construct from date.

In the class we declare all the variables we are going to use with attr_accessor. This makes Ruby generate get and set methods for every variable listed after the keyword, so that we can freely read and set the variables on instances of the class.

We write an initialize method, a constructor, which is called when the instance is created. Since we already get artist, scene and date from the data source, we call the constructor with these variables so that they are set. To translate date into datetime, we create a dictionary mapping the days to the corresponding ISO date format.

Note that when the instance variable @date is set, some reformatting is done. The source gives us the date format in a slightly different shape, so we remove the non-breaking spaces, replace the full stop with a colon and make sure there is a space between the three letters denoting the day and the time. Once this is done, we call a method which generates the datetime value based on the date value. We use @ in front of the variable name to mark it as an instance variable.

The method add_datetime looks up date_dict and replaces the day letters with the ISO date, then extracts the time from the @date variable and interpolates these two values into a datetime string.

The last method we create, to_arr, takes all the instance variables and returns them as an array. Since the CSV library we included earlier can create a CSV line from an array, this is a handy way of getting the values out of the object.

[sourcecode language="ruby"]
class Concert
  attr_accessor :artist, :scene, :date, :datetime

  def initialize(artist, scene, date)
    @date_dict = {'wed' => '2013-08-07', 'thu' => '2013-08-08', 'fri' => '2013-08-09', 'sat' => '2013-08-10'}
    @artist = artist.strip
    @scene = scene.strip
    @date = date.gsub(/\u00a0/, '').gsub('.', ':').gsub(/([a-zA-Z]{3})(.)/, '\1 \2').strip
    self.add_datetime
  end

  def to_arr
    return [self.artist, self.scene, self.date, self.datetime]
  end

  def add_datetime
    @datetime = "#{@date_dict[@date[0,3].downcase]} #{@date[4..9]}"
  end
end
[/sourcecode]

Reading the document, extracting the data and creating the objects

Now that we have a data structure in which to store the information, we can start fetching it from the web. First we create an empty dictionary in which to store our concert objects as we create them.

We use Nokogiri's HTML class and store the result in the doc variable. We pass it a text stream fetched from the URL; in other words, we send Nokogiri the same text we got from the getArtists.php source.

Nokogiri has an excellent method called css. This method takes a CSS (Cascading Style Sheets) selector and finds the matching elements in the DOM (Document Object Model) Nokogiri holds. We want to iterate over all the ".table li" nodes (all li nodes under the table class), and do so simply with the .each method.

For each ".table li" we iterate over, we extract the content of the elements with the classes .name, .scene and .date and create a Concert object. The last thing we do in each iteration is to store the object, with the artist as key, in our concerts dictionary.

[sourcecode language="ruby"]
concerts = {}

doc = Nokogiri::HTML(open('http://oyafestivalen.com/wp-content/themes/oya13_new/includes/ajax/program/getArtists.php'))
doc.css('.table li').each do |el|
  a = Concert.new(el.css('.name a').first.content,
                  el.css('.scene').first.content,
                  el.css('.date').first.content)
  concerts[a.artist] = a
end
[/sourcecode]

Printing the objects as CSV

Once all the objects are created, we want to write their variables to file. We do this by opening a file called output.csv with write access. We then iterate through all the objects, using the key in the k variable to fetch each object from our concerts dictionary. To get only the Øya festival's own concerts (not klubb-Øya), we check that the concert took place on one of the stages "Enga", "Klubben", "Sjøsiden" or "Vika" (Sjøsiden has the wrong format here, which we later correct in Excel). For each object whose scene is among the Øya stages, a line is written to the CSV file from an array of values. This array comes from the to_arr method we wrote in the Concert class.

[sourcecode language="ruby"]
CSV.open("output.csv", "wb") do |csv|
  concerts.each do |k,v|
    csv << concerts[k].to_arr if ['Enga','Klubben','Sjøsiden','Vika'].include? concerts[k].scene
  end
end
[/sourcecode]

That's it. You should now have a CSV file with all the Øya artists, which you can either import into a database or open in Excel.

The full script:

[sourcecode language="ruby"]
#!/usr/bin/ruby
# -*- encoding : utf-8 -*-

require 'nokogiri'
require 'open-uri'
require 'csv'

class Concert
  attr_accessor :artist, :scene, :date, :datetime

  def initialize(artist, scene, date)
    @date_dict = {'wed' => '2013-08-07', 'thu' => '2013-08-08', 'fri' => '2013-08-09', 'sat' => '2013-08-10'}
    @artist = artist.strip
    @scene = scene.strip
    @date = date.gsub(/\u00a0/, '').gsub('.', ':').gsub(/([a-zA-Z]{3})(.)/, '\1 \2').strip
    self.add_datetime
  end

  def to_arr
    return [self.artist, self.scene, self.date, self.datetime]
  end

  def add_datetime
    @datetime = "#{@date_dict[@date[0,3].downcase]} #{@date[4..9]}"
  end
end

concerts = {}

doc = Nokogiri::HTML(open('http://oyafestivalen.com/wp-content/themes/oya13_new/includes/ajax/program/getArtists.php'))
doc.css('.table li').each do |el|
  a = Concert.new(el.css('.name a').first.content,
                  el.css('.scene').first.content,
                  el.css('.date').first.content)
  concerts[a.artist] = a
end

CSV.open("output.csv", "wb") do |csv|
  concerts.each do |k,v|
    csv << concerts[k].to_arr if ['Enga','Klubben','Sjøsiden','Vika'].include? concerts[k].scene
  end
end
[/sourcecode]

What do your Twitter followers look like?

I like Twitter. It's the virtual world's answer to Post-it notes. Well, not really, but the nature of the site keeps people from droning on and on about a topic. The restriction on the number of characters a user may put into a tweet forces brevity, which is ideal when the total number of people you follow is increasing and the live feed of tweets is updated several times per minute. This restriction has also fostered a certain repurposing of signs: the @ (at sign) works as a reference to users, and the # (hash sign) is used to concentrate commentary around a topic.

From a programming perspective, the decoupled nature of Twitter is interesting. While Facebook.com and other Facebook-created applications are the favourite places to access Facebook data, Twitter is built around a more open model. Facebook is complicated, with several ways of interacting, many modalities, third-party applications and a login requirement for access. Twitter is the opposite: it has a very simple, easy-to-understand structure and is open (though users may protect their feed and restrict it to accepted followers, this is not the default and luckily not widespread). These differences lead, at least for me, to Facebook being a personal and private social network and Twitter being used more for interests, both professional and private.

The decoupled nature of Twitter has led to an abundance of third-party applications interacting with the Twitter API. It's easy to access Twitter data and, more importantly, easy to understand what it all does. Twitter is in that sense a good place to start if you want to play around with data, Internet protocols and programming. A couple of posts ago I wrote about Mining the Social Web and about screen scraping with Python (in Norwegian). The post you are now reading goes into the same category. This code example does not, however, mine any of the data from Twitter; it just finds the IDs of your followers, gets data about each of them and downloads their pictures. The code could be used for further data analysis if you save the data gathered by the requests somewhere, instead of, as I do, keeping them on the heap while the script executes so that they are automatically discarded when it terminates. Writing the data to file takes just two lines of code, writing them to a database probably four or five.
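As an illustration of that last point, here is a hedged sketch of those two lines: dumping gathered follower data to a JSON file. The followers array and its fields are made up; the real script below holds Follower objects instead:

```ruby
require 'json'

# Invented sample of what the gathered data could look like.
followers = [
  { 'id' => 1, 'name' => 'Alice' },
  { 'id' => 2, 'name' => 'Bob' }
]

# The two lines: open a file and write the data as JSON.
File.open('followers.json', 'w') do |file|
  file.write(JSON.generate(followers))
end
```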

[sourcecode language="ruby"]
#encoding: utf-8

dependencies = %w(net/http active_support open-uri uri)
dependencies.each {|m| require m}

class Follower
  attr_accessor :name, :created_at, :profile_image_url, :location, :url, :lang, :geo_enabled, :description

  def initialize(id)
    @id = id
  end

  def say_hi
    puts "Hello, my name is #{@name}, and I have the id #{@id}. Oh, by the way. I was created at #{@created_at}"
  end

  def download_picture
    puts "downloading #{@id} : #{@name} from #{@profile_image_url} \n"
    unless @profile_image_url.nil?
      open(URI.escape(@profile_image_url)) {|f|
        File.open("pictures/#{@id}.jpg","wb") do |file|
          file.puts f.read
        end
      }
    end
  end
end

class TwitterGetter

  def initialize(name)
    @name = name
    unless File.directory?("pictures")
      Dir.mkdir("pictures", 0755)
    end
  end

  def get_follower_list
    response = Net::HTTP.get("api.twitter.com", "/1/followers/ids.json?cursor=-1&screen_name=#{@name}")
    @followers = ActiveSupport::JSON.decode(response)
    sleep_time = ((60*60)/ 75) + 2
    puts "set sleeptime to: ", sleep_time, "\n"

    @followers['ids'].each do |id|
      sleep(sleep_time)
      lookup_user(id)
    end
  end

  def lookup_user(id)
    response = Net::HTTP.get("api.twitter.com", "/1/users/show.json?user_id=#{id}&include_entities=true")
    info = ActiveSupport::JSON.decode(response)
    f = Follower.new(id)
    f.name = info["name"]
    f.created_at = info["created_at"]
    f.profile_image_url = info["profile_image_url"]
    f.location = info["location"]
    f.url = info["url"]
    f.lang = info["lang"]
    f.geo_enabled = info["geo_enabled"]
    f.description = info["description"]
    f.say_hi
    f.download_picture
  end
end

tg = TwitterGetter.new("olovholm")
tg.get_follower_list
[/sourcecode]

So, how does this little script work? First it instantiates the work-horse class TwitterGetter, which takes as its argument the name of the user whose followers' pictures you want to download. This class creates the directory into which the pictures will be downloaded. Once it is instantiated, we call the get_follower_list method, which accesses Twitter data through the REST API and parses the JSON stream from response = Net::HTTP.get("api.twitter.com", "/1/followers/ids.json?cursor=-1&screen_name=#{@name}"). Once the list is downloaded, the script runs through the list of followers and gathers data on each of them from the Twitter API. Due to the limitation on how many requests one can make to Twitter each hour, I put the script to sleep for

total time units ÷ requests allowed per hour

(You may request Twitter 150 times per hour, but lookup_user also calls download_picture, which downloads the image; it may be that this is excluded from the restrictions.) The Follower class stores the data for each user and is responsible for downloading the user's picture. It also contains a say_hi method, which would perhaps be better as a slightly more formal to_s method. The code is more a proof of concept and does not handle errors in the JSON data returned by Twitter (which, believe me, happen quite often), so that would be a good place to start if you want to expand on this code.
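The sleep time the script actually uses comes down to this arithmetic (note that it divides the hour by 75, not 150, presumably to leave headroom for the extra picture request per follower):

```ruby
# Seconds in an hour divided by the request budget, plus a two-second buffer.
requests_per_hour = 75
sleep_time = ((60 * 60) / requests_per_hour) + 2
puts sleep_time  # => 50
```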