API-scrape images from Instagram

An image can say more than a thousand words, especially when you add a retro filter and a score of hashtags to go with it. That is a basic explanation of what Instagram does: the power app which revolutionised people’s creativity when it came to documenting dietary habits… and popularised images in social media.

Instagram brought social into photography in a way that more desktop-oriented photo-sharing applications like Picasa and Flickr never managed. It is social: users can like and comment on others’ pictures. Instagram also enhances images by reducing their photographic clarity (let’s emulate far less technologically advanced cameras by adding a filter), but then again, this adds style to the images and makes some of them look cool. However, I will let the filters and pastulation (the digital emulation of an analogue past – coined by moi, but please let me know if there is a buzzword for this and I may conform) rest for now. Let us instead focus – pun intended – on something else: the tagging of images, and how to retrieve images with a certain tag.

Adding context and indices with tags

An Instagram picture may be tagged with several hashtags: words or concatenated words prefixed with a hash sign (#). The great thing about hashtags is that they do two things. 1) They provide a social signifier that tells users this is a tag, so users can use the tag to organise their content and create a social context in which the photo exists, e.g. #instafood (picture of a meal), #selfie (a person taking a picture of him/herself, usually together with..), #duckface (quack quack pouting) and #onedirection (popular teenage idols). Tags can be of any kind, from current affairs to more general stuff. 2) They provide a token that declares something indexable for the Instagram server and other technical resources. Once the computer system knows it’s a tag, it may group the tags together, perform analysis on the tag and the users associated with it, aggregate statistics on the tag, and do other things to enhance the user experience. In our case the tagging is great, as we want to retrieve images with a given tag.
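To make the indexing idea concrete, here is a minimal sketch of how tags could be pulled out of captions and used as an index. The helpers extract_tags and index_by_tag are hypothetical names of my own, and the pattern (a # followed by word characters) is an assumption, not Instagram’s actual tag rules.

```ruby
# Extract hashtags from a caption: a '#' followed by word characters.
# extract_tags is a hypothetical helper, not part of the download script.
def extract_tags(caption)
  caption.to_s.scan(/#(\w+)/).flatten.map(&:downcase).uniq
end

# Build a simple index: tag => list of captions that carry the tag.
def index_by_tag(captions)
  index = Hash.new { |hash, tag| hash[tag] = [] }
  captions.each do |caption|
    extract_tags(caption).each { |tag| index[tag] << caption }
  end
  index
end

captions = ["Lunch! #instafood #yum", "Me again #selfie #duckface", "Dinner #InstaFood"]
index = index_by_tag(captions)
puts index["instafood"].length
```

Tags are lower-cased here so that #InstaFood and #instafood end up in the same group; whether that matches how Instagram groups tags is something I have not verified.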

The #InstagramTagGetterScript

Below is a script which takes the tag name as an argument and downloads the images and metadata associated with that tag. To get it to work you will need to obtain an API key (access token) from Instagram’s developer page. Put this key into the URL of the initial request sent to the server (the URL is stored in the next_url variable). We are using the tags endpoint to download the images.
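As a small sketch of how the first request URL could be put together: build_tag_url is a hypothetical helper of my own, and YOUR_ACCESS_TOKEN is a placeholder. The endpoint path and the access_token and min_id parameters are the ones the script itself uses.

```ruby
require 'uri'

# Hypothetical helper: builds the first request URL for the tags endpoint.
# The path and query parameters mirror the ones used in the script.
def build_tag_url(tag, access_token)
  query = URI.encode_www_form(access_token: access_token, min_id: 1)
  "https://api.instagram.com/v1/tags/#{URI.encode_www_form_component(tag)}/media/recent?#{query}"
end

# "YOUR_ACCESS_TOKEN" is a placeholder; use the token from your developer account.
next_url = build_tag_url("instafood", "YOUR_ACCESS_TOKEN")
puts next_url
```

Escaping the tag keeps unusual characters out of the URL; for plain single-word tags the result is the same as simple string interpolation.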

The rough outline of the script is as follows: 

First we define a class to store each InstaEntry, and this class comes with the functionality to retrieve and store the image and metadata, as well as dump the data to disk and load the data from disk. The class holds all the variables we are interested in collecting, and once instantiated these variables are set unless they do not exist with the image.
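The dump-and-load part can be illustrated with a reduced stand-in. MiniEntry below is a hypothetical three-field version of InstaEntry, written only to show how marshal_dump and marshal_load let Ruby’s Marshal module serialise an object and restore it again.

```ruby
# A reduced, hypothetical stand-in for InstaEntry with only three fields.
class MiniEntry
  attr_accessor :id, :username, :likes

  def initialize(id)
    @id = id
  end

  # Marshal calls this to decide what gets written.
  def marshal_dump
    [@id, @username, @likes]
  end

  # Marshal calls this on a freshly allocated object when reading back.
  def marshal_load(variables)
    @id, @username, @likes = variables
  end
end

entry = MiniEntry.new("42_1")
entry.username = "someuser"
entry.likes = 7

# Round trip: object -> bytes -> object
restored = Marshal.load(Marshal.dump(entry))
puts restored.username
```

Fields that were never set simply come back as nil, which matches the behaviour described above for values that do not exist for an image.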

Once the structure is created, some initial parameters are set: the tag and our initial request URL. The folders in which we will store the data are also created. When everything is set up we run a loop which continues as long as there is data available and we get responses with HTTP status 200 (OK). The loop instantiates an InstaEntry for each image, which then downloads the image as well as the metadata on the fly. The objects are retained until the program has finished executing, but all large data (i.e. the images) is written directly to disk and not kept in memory.
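The pagination logic on its own can be sketched without any network access. In the example below, PAGES and fetch_page are hypothetical stand-ins for the API (the real script calls RestClient.get and decodes the JSON); only the shape of the response, with data and pagination/next_url keys, is taken from the script.

```ruby
# Stand-in for the API: each "URL" maps to one page of results.
PAGES = {
  "page1" => { "data" => [{ "id" => "a" }, { "id" => "b" }],
               "pagination" => { "next_url" => "page2" } },
  "page2" => { "data" => [{ "id" => "c" }],
               "pagination" => {} }
}

# Hypothetical fetcher used instead of a real HTTP request.
def fetch_page(url)
  PAGES.fetch(url)
end

# Follow next_url from page to page until the API stops providing one.
def collect_ids(first_url)
  ids = []
  next_url = first_url
  until next_url.nil?
    page = fetch_page(next_url)
    page["data"].each { |item| ids << item["id"] }
    next_url = page["pagination"]["next_url"]
  end
  ids
end

ids = collect_ids("page1")
puts ids.inspect
```

The script uses a begin ... end while loop instead of until, which behaves the same as long as the first URL is present.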

Please contact me if you want to use this script, tailor it, or have any questions related to it.

#!/usr/bin/ruby
# encoding: UTF-8

require 'json' # JSON.pretty_generate is used further down
require 'active_support'
require 'restclient'
require 'csv'
require 'open-uri'
require 'fileutils'

class InstaEntry
  attr_accessor :id, :username, :picture_url, :likes, :filter, :location, :type, :caption, :tags, :fullname, :user_id, :created_time, :link

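  # Download threads are shared by all entries, so the list is created once
  # at class level rather than being reset every time an entry is created.
  @@threads = []

  def initialize(id)
    @id = id
  end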

  def marshal_dump
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def marshal_load(variables)
    @id = variables[0]
    @username = variables[1]
    @picture_url = variables[2]
    @likes = variables[3]
    @filter = variables[4]
    @location = variables[5]
    @type = variables[6]
    @caption = variables[7]
    @tags = variables[8]
    @fullname = variables[9]
    @user_id = variables[10]
    @created_time = variables[11]
    @link = variables[12]
  end

  def to_arr
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

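  # Downloads the image in a background thread and writes it to images_<tag>/.
  def self.get_image(obj,tag)
    @@threads << Thread.new(obj,tag) { |entry, entry_tag|
      begin
        extension = entry.picture_url.match('\.(jpe?g|gif|png)')[1]
        open("images_#{entry_tag}/#{entry.id}_#{entry.username}_.#{extension}","wb") do |file|
          # URI.open (open-uri, Ruby 2.5+) replaces Kernel#open for URLs, which Ruby 3.0 removed
          file << URI.open(entry.picture_url).read
        end
      rescue StandardError => e
        puts "ERROR: #{entry.id} triggered an Exception in get_image method: #{e.message}"
      end
    }
  end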

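  # Writes the metadata of one entry to md_<tag>/.
  def self.print_metadata(obj,tag)
    # Use the entry's own id and username; @id and @username are nil inside a class method
    open("md_#{tag}/#{obj.id}_#{obj.username}.txt","wb") do |file|
      file.print(obj.to_arr)
    end
  end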

end #end InstaEntry class

#
# This block sets the parameters, and reads the tag name from the first command-line argument
#

raise ArgumentError, "Missing name of tag to download" if ARGV.length < 1

$tag = ARGV[0]

output = open("output.json","wb")
next_url = URI::encode("https://api.instagram.com/v1/tags/#{$tag}/media/recent?access_token=51998418.d146264.e77441adc4a04399874a19b48bb91e71f&min_id=1")
# NB: The access token above resembles a real token, but is obfuscated. Get your own by registering for a developer account at Instagram.
puts next_url

unless File.directory?("md_#{$tag}")
  FileUtils.mkdir_p("md_#{$tag}")
end

unless File.directory?("images_#{$tag}")
  FileUtils.mkdir_p("images_#{$tag}")
end

count = 0
instas = {}

#
# This block runs through all the subsequent pagination pages. It stops when the response no longer contains a next_url.
# RestClient raises an exception on an HTTP error status (4xx or 5xx).
#
begin
  response = RestClient.get(next_url)
  json = ActiveSupport::JSON.decode(response)
  pretty_json = JSON.pretty_generate(json)
  puts "Status code #{json['meta']['code']} for URL #{next_url}.. Fetching"
  next_url = json['pagination']['next_url']
  sleep 2

  # loop through the data elements
  json['data'].each do |item|
    puts item['link']
    puts item['user']['full_name']
    ie = InstaEntry.new(item['id'])
    instas[item['id']] = ie

    ie.username = item['user']['username']
    ie.picture_url = item['images']['standard_resolution']['url']
    ie.likes = item['likes']['count']
    ie.filter = item['filter']
    ie.location = item['location']
    ie.type = item['type']
    ie.caption = item['caption']['text'] unless item['caption'].nil? || item['caption']['text'].nil?
    ie.tags = item['tags']
    ie.fullname = item['user']['full_name']
    ie.user_id = item['user']['id']
    ie.created_time = item['created_time']
    ie.link = item['link']

    InstaEntry.get_image(ie,$tag)
    InstaEntry.print_metadata(ie,$tag)
  end

  # count is the number of result pages fetched so far
  count += 1

  output << pretty_json

  puts "Now checked __ #{count} __ pages and __#{instas.length}__ number of instas"
  puts "*****Ending with #{count} __ pages and __#{instas.length}__ number of instas****" if next_url.nil?

end while not next_url.nil?

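output.close

# Wait for image downloads still running in background threads before the program exits
if InstaEntry.class_variable_defined?(:@@threads)
  InstaEntry.class_variable_get(:@@threads).each(&:join)
end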

File.open("instadump_#{$tag}",'wb') do |f|
  f.write Marshal.dump(instas)
end

# Write one tab-separated row per entry (options passed as keyword arguments, as Ruby 3 requires)
CSV.open("output_#{$tag}.csv", "wb", col_sep: "\t") do |csv|
  instas.each do |k,v|
    csv << v.to_arr
  end
end
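
To check the tab-separated output afterwards, a file written with col_sep set to a tab can be read back the same way. This is only a sketch: the rows and the file name output_example.csv are made up for illustration and do not come from a real download.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample rows standing in for InstaEntry#to_arr output.
rows = [["1", "alice", "12"], ["2", "bob", "3"]]

# Write a tab-separated file to a temporary location.
path = File.join(Dir.tmpdir, "output_example.csv")
CSV.open(path, "wb", col_sep: "\t") do |csv|
  rows.each { |row| csv << row }
end

# Read it back using the same column separator.
read_back = CSV.read(path, col_sep: "\t")
puts read_back.length
```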

Disclaimer: Enabling you to download images associated with tags does not mean that you can do whatever you want with them. First, please refer to the Instagram guidelines to confirm that you are actually allowed to download images. Second, respect the individual users’ privacy and immaterial content rights, and do not use images in a publishing context without the user’s consent. Generally: be nice, and do good.