Load database from CSV with columns and tables read from file.

The issue at hand is to load a database from CSV files in a generic way, without having to write a separate script for each table.

This short Groovy script works based on two conventions:

  • The name of the CSV file matches the name of the table in the database.
  • Each header name in the CSV file matches a column name in the table.

The script expects the CSV files to be placed in the folder “to_digest”.
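As an illustration (the table name EMPLOYEES and its columns are made up for this example), a semicolon-separated file to_digest/EMPLOYEES.csv like the one below would be loaded into the table EMPLOYEES, with the header fields matching the column names (the script further down reads semicolon-separated values):

ID;FIRST_NAME;EMAIL
1;Ada;ada@example.com
2;Linus;linus@example.com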

The connection details are placed in a property file, which can be added to the .gitignore file if the script is kept under Git version control. In this example the configuration variables are stored in the file ‘configuration.properties’, which needs to be in the folder the script is executed from.

The property-file contains the following lines:

db_url=jdbc:oracle:thin:@yourdbserver:1521/schema
db_user=yourusername
db_password=yourpassword

Please also note that this script is written for Oracle, and that the Oracle drivers are fetched from a local Nexus repository since they are not available on the open Internet. The script can easily be changed to another database vendor. If you choose an open source database such as MySQL or PostgreSQL, the drivers are available in public Maven repositories and you will not need the reference to a local repository through the @GrabResolver annotation.

 

@GrabResolver(name='nexus', root='http://yourorganizationsnexusinstance.you', m2Compatible = 'true')
@GrabConfig(systemClassLoader=true)
@Grapes([
        @Grab(group='com.oracle', module='ojdbc8', version='12.2.0.1'),
        @Grab(group='com.opencsv', module='opencsv', version='3.9')
])


import groovy.sql.*
import com.opencsv.*

import static groovy.io.FileType.FILES


def path = new File("to_digest")
// Walk through every .csv file in the to_digest folder
path.traverse(type : FILES, nameFilter: ~/\w*\.csv/) { it ->
    List entries = digestFile(it)
    // The first row is the header, the remaining rows are data,
    // and the table name is the file name without its extension
    insertFile(entries.subList(0,1)[0], entries.subList(1,entries.size()), it.name.take(it.name.lastIndexOf('.')))
}


private List digestFile(def path){
    // Read the whole semicolon-separated CSV file into a list of String arrays
    CSVReader reader = new CSVReader(new FileReader(path), (char)';')
    List myEntries = reader.readAll()
    return myEntries
}


private void insertFile(def header, def data, def name){
    // Read the database connection details from the property file
    Properties properties = new Properties()
    File propertiesFile = new File('configuration.properties')
    propertiesFile.withInputStream {
        properties.load(it)
    }
    println name
    Sql conn = Sql.newInstance(properties.db_url, properties.db_user, properties.db_password)
    data.each { row ->
        // Build one INSERT statement per data row, using the header fields as column names
        String columns = String.join(",", header)
        String values = row.collect {"'$it'"}.join(",")
        String query = "INSERT INTO ${name} (${columns}) VALUES (${values})"
        conn.execute(query)
    }
    conn.close()
}
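Two things are worth noting about this small script: it opens a new database connection for every file, and it builds the INSERT statements by interpolating the values directly into the SQL string. For trusted, internally produced CSV files that is usually acceptable, but a value containing a single quote will break the statement, so for anything less controlled it would be safer to pass the values as parameters to Sql.execute (groovy.sql.Sql supports parameterized statements) and to reuse a single connection across files.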

Lovholm.net back with new layout

In recent years the activity on lovholm.net has decreased, unfortunately. Hopefully I will soon be able to kick-start some new articles, as I have picked up a few things over the last few years that I want to share with the world.

As you may have noticed if you have visited this page before, it does not look quite like it did a few months, or five years, ago. The last major update of the page (WordPress engine updates excluded) was back in 2011 (six years ago!? time goes fast) while I was doing my studies in Edinburgh. As the world moves on, I saw it fit to change the visual template, update a few plugins, and do a clean install of WordPress.

That being said: Welcome to the new page!

There may be some glitches, some source code that is not correctly formatted, and pictures that are missing. Please don’t hesitate to contact me if content you want to read does not render properly or pictures are missing. Although I will try to go through the most vital parts of the page, the updates and checking are usually done sporadically.

SQL: Setup a Learning Workbench with SQLite and SQuirreL

Install SQLite

To install SQLite you only need to download a file from the SQLite project homepage. Select Download, and then the precompiled binaries for your preferred operating system.

Unzip the archive with the binary, and place it in a folder that is accessible from your command line interface (the terminal on Mac/*nix systems, PowerShell or cmd on Windows), i.e. a folder that is in the path variable. Here is how you can add a folder to the path on several operating systems.

Open the terminal, and type sqlite3.exe.

Once you have installed SQLite in a folder accessible from the path, you should get this prompt when starting the program.

 

Download sample dataset

The next step is to download the data we are going to use. Luckily for us, there are many sample databases available online. One of these is the Chinook database. Download the dataset from here. Unzip the archive, and find the ‘Chinook_Sqlite.sqlite’ file. Move this to a folder, and start sqlite3.exe with the path to the database as an argument.

Validate the downloaded database by displaying the tables it contains (and then, if you want, run a query to see some content – just for fun).

Start SQLite with the database, display the tables in the database and run a query (text marked in yellow is input).
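If you prefer to script this check, Python’s built-in sqlite3 module can do the same validation. A minimal sketch, assuming the Chinook file name and tables keep their default names (e.g. the Artist table):

import sqlite3

# Open the sample database downloaded above
conn = sqlite3.connect("Chinook_Sqlite.sqlite")
cur = conn.cursor()

# List the tables (the scripted equivalent of .tables in the sqlite3 shell)
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
print([row[0] for row in cur.fetchall()])

# Run a small query against one of the Chinook tables, just for fun
cur.execute("SELECT Name FROM Artist LIMIT 5")
for (name,) in cur.fetchall():
    print(name)

conn.close()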

Download SQuirreL SQL, a SQL query tool

You have done great! We now have a functional database with both SQLite and some data we can use. We could stop now and do what we want with SQLite, but to work more comfortably it would be great to have a tool beside the CLI to interact with the database.

SQuirreL is just such a tool: an open source SQL workbench with support for several database systems, including MySQL, Oracle, DB2 and SQLite. In this GUI tool you can navigate the database, run queries, display graphs of how tables are related, and more.

Find the installation files here. Select the build appropriate for your operating system, run the downloaded file and go through the installation.

Connect SQuirreL with your database

Eu Wern Te has written a short guide on how you can install a JDBC driver for SQLite and connect SQuirreL to the SQLite database. To summarize: download the JDBC driver and add it to SQuirreL’s lib folder. When this is done, open SQuirreL, add the driver and then create an alias where you insert the path to the SQLite database in the URL field. You will need to fill out several fields, so check out the guide.
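As an illustration, with the commonly used Xerial sqlite-jdbc driver the URL field typically ends up looking something like jdbc:sqlite:/path/to/Chinook_Sqlite.sqlite (the exact path is of course wherever you placed the downloaded database).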

Once you have connected SQuirreL to the database, you can communicate with it through this tool.

 

SQuirreL and SQLite.

Booking a train journey in India

Booking a train journey in India can be a bit tricky, but it is absolutely worth doing. There is an enormous rail network between the different cities and regions, and a wide range in how much space and service you get and how long the trains take. Not least, it is a great way to see the country between the cities.

During our stay in India we took the Rajdhani (meaning capital) between Mumbai and New Delhi, and in first class we got our own compartment, tea, coffee and a three-course dinner. Between New Delhi and Agra we travelled in 2nd class and had no private compartment, but we got our own berths and could screen ourselves off with a curtain. When deciding on a departure and class, there are two things worth thinking about:

  1. Priority in the rail network. Some departures get higher priority than others, and therefore travel quickly and smoothly through the world’s most complex network of routes and stations. For example, the Rajdhani departures always run on time, with few delays and problems.
  2. Comfort level. There are many compartment and ticket classes on Indian trains. If you are travelling for a couple of hours it can be fun and cheap to travel in a simpler ticket category, but if you are staying overnight or travelling far it can be good to have more space in an air-conditioned carriage.

Booking procedure

When we booked train journeys in India we did it through Cleartrip. There is often great interest in particular departures and demand is high, so it can be wise to book tickets well in advance, but with a search engine such as Cleartrip you can find alternative routes if the main connection is sold out. It can take a couple of days to book your first journey through Cleartrip, since you need to verify your e-mail address and phone number to complete a purchase through IRCTC, the travel agency of the Indian railway company IR. To complete the purchase of our tickets I followed this good guide at Indiamike. The good thing about this walkthrough and Cleartrip is that once you have linked your Cleartrip account to IRCTC, you can book new train tickets without much trouble. The annoying part of this approach is that it requires you to have a passport and be able to scan it.

If you run into any problems, get in touch.

A short script for testing writing many files to a folder

The challenge: We want to see when the number of files in a folder starts to decrease the performance of adding new files to the same folder. Two examples of where we may need to do this are: to get an overview of the performance of the file system node structure, or to test the Windows function for 8dot3 name compatibility.

The solution: We create a script that writes a large number of files to the folder in question and logs the time taken at specific milestones. The records logged during the execution of this script tell us how long it takes to write files up to each milestone, and from this we can infer how efficiently the file system writes files between the different milestones.

Example of output: a graph representing the number of files created over time. The X axis conveys the number of seconds elapsed, and the Y axis the number of files created. What does your function look like?

The implementation: I’ve chosen to put the creation of new files in a for loop which runs N times based on user input. The loop opens a new file with an incremental file name, writes the payload to the file, and finally closes the file and increments the loop counter.

Wrapped around this core functionality, we need to define which folder the files will be created in and what data is to be read and written. We need to read the defined data into a variable (we don’t want to add overhead by reading the data-to-write on each iteration) and create a test folder if it does not already exist. In addition we need a function to write the timestamp and the iteration number to a file.

To allow for multi-process testing I’ve also added a loop that spawns new processes and passes on the number of files, and to cover more scenarios, e.g. renaming and deleting files, more actions have been added.

The actions, the test folder path, the input file, and the number of files and processes are things the user will most likely change frequently, so instead of hard-coding them, they are provided as command line arguments. As always when dealing with command line arguments: provide good defaults, since the user is unlikely to set every available parameter.

From description to code this will look something like this:

import time
import os
import string
import random
from multiprocessing import Process
import multiprocessing
import optparse
import os.path

def main(files_each=100, processes=10, actions="a", log_interval=100, temp_path="temp_files", infile="infile.txt"):
  path = temp_path
  check_and_create_folder_path(path)
  for i in range(processes):
    p = Process(target=spawnTask, args=(path, files_each, actions, log_interval, infile))
    p.start()

def print_time_delta(start_time, comment, outfile=False):
  if not outfile:
    print(comment," | ",time.time() - start_time, " seconds")
  else:
    with open(outfile, 'a+') as out:
      out.write("{0} | {1} \n".format(time.time() - start_time, comment))

def spawnTask(path,files_each, actions,log_interval, infile):
  start_time = time.time()
  content = read_file_data(infile)

  print_time_delta(start_time,"creating files for process: "+str(os.getpid()))
  created_files = createfiles(files_each, content,path,start_time, log_interval)
  if(actions == 'a' or actions == 'cr'):
    print_time_delta(start_time,"renaming files for process: " +str(os.getpid()))
    renamed_files = rename_files(created_files,path,start_time, log_interval)
  if(actions == 'a'):
    print_time_delta(start_time,"deleting files for process: "+str(os.getpid()))
    delete_files(renamed_files,path,start_time, log_interval)

  print_time_delta(start_time,"operations have ended. Terminating process:"+str(os.getpid()))

def createfiles(number_of_files, content,path,start_time, log_interval):
  own_pid = str(os.getpid())
  created_files = []
  for i in range(number_of_files):
    # Log a milestone every log_interval files
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"create","prod_log.txt")
    filename = "wordfile_test_"+"_"+own_pid+"_"+str(i)+".docx"
    created_files.append(filename)
    with open(path+"\\"+filename,"wb") as print_file:
      print_file.write(content)

  print_time_delta(start_time, str(number_of_files) +" | "+own_pid+" | "+"create","prod_log.txt")

  return created_files

def rename_files(filenames,path,start_time, log_interval):
  new_filenames = []
  own_pid = str(os.getpid())
  i = 0
  for file in filenames:
    # Log a milestone every log_interval files
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"rename","prod_log.txt")
    # Rename the file to a random 30-character name
    lst = [random.choice(string.ascii_letters + string.digits) for n in range(30)]
    text = "".join(lst)
    os.rename(path+"\\"+file,path+"\\"+text+".docx")
    new_filenames.append(text+".docx")
    i += 1

  print_time_delta(start_time, str(len(new_filenames))+" | "+own_pid+" | "+"rename","prod_log.txt")

  return new_filenames

def delete_files(filenames,path,start_time, log_interval):
  num_files = len(filenames)
  own_pid = str(os.getpid())
  i = 0
  for file in filenames:
    # Log a milestone every log_interval files
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"delete","prod_log.txt")
    os.remove(path+"\\"+file)
    i += 1

  print_time_delta(start_time, str(num_files)+" | "+own_pid+" | "+"delete","prod_log.txt")

def check_and_create_folder_path(path):
  if not os.path.exists(path):
    os.makedirs(path)

def read_file_data(infile):
  with open(infile,"rb") as content_file:
    content = content_file.read()
  return content

if __name__ == "__main__":
  multiprocessing.freeze_support()
  parser = optparse.OptionParser()
  parser.add_option('-f', '--files', default=100, help="The number of files each process should create. Default is 100")
  parser.add_option('-p', '--processes', default=10, help="The number of processes the program should create. Default is 10")
  parser.add_option('-a', '--action', default='a', help="The action which the program should perform. The default is a.\n Options include a (all), c (create), cr (create and rename)")
  parser.add_option('-l', '--log_interval', default=100, help="The interval between when a process is logging files created. Default is 100")
  parser.add_option('-t', '--temp_path', default="temp_files", help="Path where the file processes will be done")
  parser.add_option('-i', '--infile', default="infile.txt", help="The file which will be used in the test")

  options, args = parser.parse_args()
  main(int(options.files), int(options.processes), options.action, int(options.log_interval), options.temp_path, options.infile)
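
As a hypothetical example, if the script is saved as file_writer_test.py, running python file_writer_test.py -f 500 -p 4 -l 50 would spawn four processes that each create (and, since the default action is 'a', also rename and delete) 500 files, logging a milestone every 50 files.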

 

 

Sample from the output log.

The output from running this script is a pipe-separated (‘|’) list with seconds elapsed, number of files, the process ID (since the program spawns and runs several similar processes simultaneously, we need a way to identify them) and the action. This looks like the sample image, and from these numbers you can create statistics on performance at different folder sizes.
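To turn prod_log.txt into per-milestone statistics, you can parse it with a few lines of Python. A minimal sketch, assuming the four pipe-separated fields written by the script above (elapsed seconds, file count, process ID, action):

from collections import defaultdict

timings = defaultdict(list)
with open("prod_log.txt") as log:
    for line in log:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # skip lines that do not have the four expected fields
        elapsed, count, pid, action = float(parts[0]), int(parts[1]), parts[2], parts[3]
        timings[(pid, action)].append((count, elapsed))

# Print the seconds spent between consecutive milestones for each process/action pair
for key, points in sorted(timings.items()):
    points.sort()
    deltas = [round(b[1] - a[1], 2) for a, b in zip(points, points[1:])]
    print(key, deltas)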

The idea of performing this analysis, and valuable feedback in the process, came from great colleagues at Steria AS. Any issues or problems with the code or text are solely my own. Whatever you use this information for is solely your own responsibility.

The folder image is by Erik Yeoh and is released under a Creative Commons Attribution-NonCommercial-ShareAlike License. The image can be found on Flickr.

Two Good Tools for Peeking Inside Windows

Over the last couple of weeks I have explored the inner mechanics of Microsoft Windows, and the processes that run in this context. In this work, two tools have proved especially useful: Xperf logs with WPA, and Sysinternals’ Process Monitor.

Xperf/WPA

During execution, Windows continuously surveys its internal processes through Event Tracing for Windows (ETW). We may harness this by extracting the data through the program Xperf. From specific points in the execution (which we decide when we initially start the logging) we can sample what is happening within a program and in the system as a whole. Two good features I have utilized from the Xperf/WPA combo are:

  • Process images: we can see which ‘.dll’ images are loaded by each process, and when they are loaded.
  • Resource usage: we can view, over time, what system resources such as memory, CPU, and I/O are used by each process or by the system as a whole.

Both these tools are included in the Windows Performance Toolkit, which in turn is part of the Windows Assessment and Deployment Kit. During the Assessment and Deployment Kit installation you can choose to install only the Performance Toolkit.

To record a session you need to call Xperf twice from the command shell: first to start the logging with specific flags pointing out which internal events should be sampled, and then to stop the logging and print the results to an .etl file.

A typical start command could be:

xperf -on latency -stackwalk profile

In this example xperf is called with the latency group in kernel mode, which covers the following flags: PROC_THREAD+LOADER+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PROFILE. The stackwalk option provides a stack for the flags or group provided. For a complete list of kernel flags you can use the “xperf -providers k” command.

Once you have started Xperf and performed the action you wanted to record, you may stop xperf with this command

xperf -d output.etl

The -d option explicitly tells xperf to append the logged session to the file output.etl (or create the file if it does not exist). The command also implicitly stops the logging session.

For full overview over the commands accepted by Xperf, please refer to the Xperf options documentation at MSDN.

To analyze an .etl file, and the data that has been collected in the logging session, Microsoft has made available a good tool: Windows Performance Analyzer.

Windows Performance Analyzer is a part of the Windows Performance Toolkit.

This neat tool provides small views of the general KPIs for the resources on the left, and all of the main resources have expandable menus for more detailed views. Double-clicking, or right-clicking and selecting to open the section in the main window, opens a more detailed overview in the right-hand view of the application. Here the user can go into detail on the applications. In the screenshot you can see the images loaded by the relatively simple command line application Ping.

Process Monitor

The Sysinternals toolkit contains many useful tools for various Windows-related tasks, among others the ability to see the activities of a process over time. The latter is squarely in the domain of Process Monitor. With this convenient tool you can get an overview of what operations a process is doing, including registry queries, use of file resources and loading of images.

With Process Monitor you can survey the system as a whole, and also filter for a specific process. The program traces many of the calls a program makes to the system, and you can use this trace to see in what sequence a program executes and which kinds of system resources it relies on. The combination of Xperf and WPA gives a good overview of the images loaded by a process, and with Process Monitor you may expand this knowledge with registry queries and network calls; you can also look at when different profiling actions are called.

Process Monitor from the Sysinternals Suite is a good tool to scrutinize what is happening with one or more processes.

Process Monitor is used both for recording a trace and for analyzing it afterwards. The traces can be saved to file. They can also be conveniently filtered, either on the specific types of actions performed by a process (registry, file system, network resources, process and thread activity, and profiling) using the symbols to the right of the menu, or through the filter dialog displayed in the image by the overlying window. Here a good rule is to exclude all actions not associated with the process you want to survey.

Be advised that Process Monitor records a huge number of actions. It can be a good idea to turn off recording when you do not intend to record; this can be achieved by toggling the magnifying glass in the menu.

An advantage of the programs in the Sysinternals toolkit, Xperf and WPA is that they do not need to be installed to work. All these tools can be put on a USB stick, and with some training you suddenly become a one-man army ready to examine Windows inside out.

The image used to illustrate this blog post is by Julian E…, found through Flickr and shared under a Creative Commons BY-NC-ND license.

Work programmatically with Google Spreadsheets Part 2

A while back I wrote a short post on how you can read from and write to Google Spreadsheets programmatically using Python and the package ‘gspread’.

Last time the reading was done by first creating arrays with the addresses where the values could be found in the spreadsheet, and then running through all the addresses and replacing them with the values. It worked fine, but it’s not best practice or very efficient, as it makes many single requests to the API. In part two, I will share a short tip on how to read the values in one go instead of iterating through a range of values.

Here is the excerpt dealing with retrieving values. (NB: see original blogpost for gspread initialization).

#Running through the values
get_val = []
set_name = []
set_country = []
for a in range(2,5124):
    v = "B%s" % a
    sn = "H%s" % a
    sc = "G%s" % a
    get_val.append(v)
    set_name.append(sn)
    set_country.append(sc)

# Iterate over the collected cell addresses
for a in range(len(get_val)):
    try:
        name = worksheet.acell(get_val[a]).value
        res = getCountry(name)
        if res:
            print res
            country, last_id, name = res
            worksheet.update_acell(set_name[a], name)
            worksheet.update_acell(set_country[a], country)
    except Exception as e:
        print e

In a recent script we only wanted to download values from a Google spreadsheet (yes, we could have exported the files to .csv with a similar result, but with a script we can expand and parse as needed), and this gave some time to refactor the code as well.

The gspread function worksheet.get_all_values() returns a list of lists with the values. The outer list contains the rows, and each row list contains the column values, indexed by column position. In this example num_streams is the second column, and the position is hence [1], since the list starts at zero.

Also note the nifty way of writing UTF-8 formatted strings to the file. UTF-8 can often cause a headache, but put a “u” character before the string literal and open the stream with codecs.open("filename", "mode", "encoding").

The new way of retrieving data from a Google Docs Spreadsheet:

# -*- coding: UTF-8 -*-
import gspread
import codecs

# Global variables for accessing resource
G_USERNAME = 'user_email'
G_PASSWORD = 'password'
G_IDENTIFIER = 'spreadsheet_identifier'

# Connecting to the data source
gc = gspread.login(G_USERNAME, G_PASSWORD)
sht1 = gc.open_by_key(G_IDENTIFIER)
worksheet = sht1.get_worksheet(0)

all_val = worksheet.get_all_values()

output = codecs.open('output_norwegian_artists.csv', 'wb', "utf-8-sig")

for l in all_val:
    num_streams, artistid, name = (l[1], l[2], l[3])
    norwegian = l[4]
    if len(norwegian) < 3:
        norwegian = 'NULL'

    string = u"{0};{1};{2};{3}\n".format(num_streams, artistid, name, norwegian)
    output.write(string)

output.close()

Picture licensed under a Creative Commons Attribution license by the brilliant organization Open Knowledge Foundation. Picture retrieved through a CC search on Flickr.

API-scrape images from Instagram

An image can say more than a thousand words, especially when you add a retro filter and a score of hashtags to go with it. That is a basic explanation of the functionality of Instagram: the power app which revolutionised people’s creativity when it came to documenting dietary habits… and popularised images in social media.

Instagram brought social into photography in a way other, more desktop-oriented photo sharing applications like Picasa and Flickr never managed. It is social, and users can like and comment on others’ pictures. Instagram also enhances images by reducing their photographic clarity (let’s emulate far less technologically advanced cameras by adding a filter), but then again, this adds style to the images and makes some of them look cool. However, I will let the filters and pastulation (the digital emulation of an analogue past – coined by moi, but please let me know if there is a buzzword for this and I may conform) rest for now. Let us instead focus – pun intended – on something else: the tagging of images, and how to retrieve images with a certain tag.

Adding context and indices with tags

An Instagram picture may be tagged with several hashtags; these are words or concatenated words prepended with a hash sign #. The great thing with # is that it 1) provides a social signifier to users that this is a tag, so users can use it to organise their content and create a social context in which the photo exists, e.g. #instafood (picture of a meal), #selfie (a person taking a picture of him/herself, usually together with..), #duckface (quack quack pouting) and #onedirection (popular teenage idols). Tags can be of any kind, from current affairs to more general stuff. 2) It provides a token that declares something indexable for the Instagram server and other technical resources. Once the computer system knows it is a tag, it may group tags together, perform analysis on the tag and the users associated with it, aggregate statistics on the tag, and do other things to enhance the user experience. In our case the tagging is great, as we want to retrieve images with a given tag.

The #InstagramTagGetterScript

Below is a script which takes the tag name as an argument and downloads the images and the metadata associated with them. To get it to work you will need to obtain an API key from Instagram’s developer page. This key goes into the initial request sent to the server (stored in the next_url variable). We are using the tags endpoint to download the images.

The rough outline of the script is as follows: 

First we define a class to store each InstaEntry, and this class comes with the functionality to retrieve and store the image and metadata, as well as dump the data to disk and load it from disk. The class holds all the variables we are interested in collecting, and once instantiated these variables are set unless they do not exist for the image.

Once the structure is created some initial parameters are set: the tag and our initial URL request, and the folders where we will store data are created. When everything is set up we run a loop which continues as long as there is data available and we get responses with HTTP 200 status (OK). The loop instantiates an InstaEntry for each image, which then downloads the image as well as the metadata on the fly. The objects are retained until the program has fully executed, but all large data (i.e. the images) is written straight to disk and not kept in memory.

Please contact me if you want to use this script, tailor it, or have any questions related to it.

#!/usr/bin/ruby
# encoding: UTF-8

require 'active_support'
require 'restclient'
require 'csv'
require 'open-uri'
require 'fileutils'

class InstaEntry
  attr_accessor :id, :username, :picture_url, :likes, :filter, :location, :type, :caption, :tags, :fullname, :user_id, :created_time, :link

  def initialize(id)
    @id = id
    @@threads = []
  end

  def marshal_dump
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def marshal_load(variables)
    @id = variables[0]
    @username = variables[1]
    @picture_url = variables[2]
    @likes = variables[3]
    @filter = variables[4]
    @location = variables[5]
    @type = variables[6]
    @caption = variables[7]
    @tags = variables[8]
    @fullname = variables[9]
    @user_id = variables[10]
    @created_time = variables[11]
    @link = variables[12]
  end

  def to_arr
    [@id, @username, @picture_url, @likes, @filter, @location, @type, @caption, @tags, @fullname, @user_id, @created_time, @link]
  end

  def self.get_image(obj,tag)
    @@threads << Thread.new(obj,tag) {
      begin
        open("images_#{tag}/#{obj.id}_#{obj.username}_.#{obj.picture_url.match('\.(jpe?g|gif|png)')[1]}","wb") do |file|
          file << open("#{obj.picture_url}").read
        end
      rescue
        puts "ERROR: #{obj.id} triggered an Exception in get_image method"
      end
    }
  end

  def self.print_metadata(obj,tag)
    # Write the metadata for one entry to its own text file
    open("md_#{tag}/#{obj.id}_#{obj.username}.txt","wb") do |file|
      file.print(obj.to_arr)
    end
  end

end #end InstaEntry class

#
# This block sets the parameters and reads the tag name from the first command line argument
#

raise ArgumentError, "Missing name of tag to download" if ARGV.length < 1

$tag = ARGV[0]

output = open("output.json","wb")
next_url = URI::encode("https://api.instagram.com/v1/tags/#{$tag}/media/recent?access_token=51998418.d146264.e77441adc4a04399874a19b48bb91e71f&min_id=1")
# NB: The access token above is similar to a token, but obfuscated. Get your own by retrieving a developer account at Instagram.
puts next_url

unless File.directory?("md_#{$tag}")
  FileUtils.mkdir_p("md_#{$tag}")
end

unless File.directory?("images_#{$tag}")
  FileUtils.mkdir_p("images_#{$tag}")
end

count = 0
instas = {}

#
# This block runs through all the subsequent pagination pages. It stops when the HTTP status code is not 200 or when there is no next_url.
#
begin
  response = RestClient.get(next_url)
  json = ActiveSupport::JSON.decode(response)
  pretty_json = JSON.pretty_generate(json)
  puts "Status code #{json['meta']['code']} for URL #{next_url}.. Fetching"
  next_url = json['pagination']['next_url']
  sleep 2

  # loop through the data elements
  json['data'].each do |item|
    puts item['link']
    puts item['user']['full_name']
    ie = InstaEntry.new(item['id'])
    instas[item['id']] = ie

    ie.username = item['user']['username']
    ie.picture_url = item['images']['standard_resolution']['url']
    ie.likes = item['likes']['count']
    ie.filter = item['filter']
    ie.location = item['location']
    ie.type = item['type']
    ie.caption = item['caption']['text'] unless item['caption'].nil? or item['caption']['text'].nil?
    ie.tags = item['tags']
    ie.fullname = item['user']['full_name']
    ie.user_id = item['user']['id']
    ie.created_time = item['created_time']
    ie.link = item['link']

    InstaEntry.get_image(ie,$tag)
    InstaEntry.print_metadata(ie,$tag)
  end

  count += 1

  output << pretty_json

  puts "Now checked __ #{count} __ files and __#{instas.length}__ number of instas"
  puts "*****Ending with #{count} __ files and __#{instas.length}__ number of instas****" if next_url.nil?

end while not next_url.nil?

output.close

File.open("instadump_#{$tag}",'wb') do |f|
  f.write Marshal.dump(instas)
end

CSV.open("output_#{$tag}.csv", "wb", {:col_sep => "\t"}) do |csv|
  instas.each do |k,v|
    csv << instas[k].to_arr
  end
end

 

 

Disclaimer: Enabling you to download images associated with tags does not mean you can do whatever you want with them. First, please refer to the Instagram guidelines to confirm that you are actually allowed to download images. Second, respect the individual users’ privacy and intellectual property rights; do not use images in a publishing context without the user’s consent. Generally: be nice, and do good.

Prepare Your Data For Analysis: Load Data With Spoon

Note: The instructions described here were learned and utilized while working on the Clouds & Concerts project at the University of Oslo, in close collaboration with Xin Jian, Johannes Bjelland and Pål Roe Sundsøy under the supervision of Arnt Maasø. The credit for methodology and SQL ninja tricks therefore goes to all members of the fantastic stream team. Inconveniences or mistakes are solely my own.

Technical set up

We chose Amazon as the platform for doing our data analysis. Amazon Web Services provides virtual computers, databases and storage in the cloud. All services are scalable, so the computer used for data loading and manipulation can be put dormant when idle, and the database can be scaled up when doing complex and data-intensive operations.

DSV files – choosing delimiter

An excerpt from a CSV file; note the tab between the occurrence’s variables and the line shift between the occurrences.

When moving large database files between two servers in different organisations, the standard way of exchanging data is through flat text files. The flat text files may have a structure, e.g. XML or JSON, but in the case of relational databases, where all entries (rows) have the same number of attributes (columns), DSV is a convenient format. DSV is the abbreviation for delimiter-separated values, a format where attributes and entries are separated by an assigned delimiter token. The most frequent delimiters are a comma between attributes and a newline between entries; this format is known as CSV.

When choosing a delimiter, it is important to find a character that does not occur in the data values. We chose to use tab as the delimiter. To save space and transfer cost the files should be compressed (NB: the .zip container only supports files up to 4 GB; if you have larger files, use .gz or .tar.gz).
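Before loading, it can pay off to sanity-check a sample of the dump. The sketch below (the file name dump.tsv and the expected column count are placeholders) uses Python’s csv module to read a tab-separated file and flag rows that do not have the expected number of fields:

import csv

EXPECTED_COLUMNS = 12  # placeholder: the number of attributes in your dump

with open("dump.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if len(row) != EXPECTED_COLUMNS:
            print("Line {0} has {1} fields".format(line_number, len(row)))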

Data delivery S3 bucket

Amazon’s Simple Storage Service (S3) is an easy way to exchange data, either by using a client or the web interface accessible through the AWS console. All kinds of files may be uploaded to S3, including zipped or tar.gz-ed files. Files can easily be uploaded to and downloaded from S3 by those who have access, and the web uploader supports large files. You don’t have to predefine storage space either, since this is allocated on the fly.

Using Pentaho Spoon for loading

The software company Pentaho provides good tools for working with data analysis, and part of their suite is the ETL tool (Extract, Transform, Load) Spoon. This also comes in a community edition which can be downloaded from their homepage. The tool is built on Java, so it can run on different platforms. Once installed, use the terminal to navigate into the ‘data-integration’ folder, then set execution rights and run ‘spoon.sh’. This should open the Spoon ETL tool.

The Pentaho Spoon workbench lets you transfer data between “modules”. In this picture a CSV input module is passing data to a table output module.

In Pentaho you can make a transformation for each table you need to transfer; alternatively you can join several transformations together in jobs. To do a basic transfer of data from CSV, or other delimited files, to a database server you will need two modules: CSV file input and Table output.

Set up a super-easy transformation to load csv to table

In Pentaho, the convenient module CSV file input can be used for reading data from delimited files. 1: Use the file finder to select the file, and set the delimiter and enclosure (e.g. if the strings are wrapped in quote marks). If your file dump has a header row, tick that option. 2: The CSV file input module can try to determine the format of the various fields based on the data in the files. Use the ‘Get Fields’ option, and select a good sample size to determine the format.

The Pentaho CSV input options let you define the data type of each field; it can also detect this automatically.
Loading tip: represent values as strings.

If I have a file with a high risk of data errors, for example a file where columns overflow, I usually set the fields to text using the functionality for defining the field data type (3), and make sure the length is large enough to encompass the data. This lets us make changes and alterations to the data format on the database server once the data is loaded. Most data can be represented as a string. “12354”, “182,54” and “01/08/1986 00:00:00,000000” can all be represented as strings, but the first example can also be an integer, the second a real or float, and the third a timestamp. We can parse the data once it is loaded into our SQL database. You can adjust the type in the CSV file input module.

Use Table output to write the rows read to the SQL server

The other Pentaho module needed for this basic transformation is Table output. Once Pentaho has a valid database configuration, this module can access the database directly so you can select which table to write the data to. First drag a line from the CSV file input module to the Table output to establish the data connection. Then select the output database/target schema and target table. If you have set up the table structure through a DDL (Data Definition Language – the structure of the tables) you can map the fields from the CSV file input. In the database fields tab you first select ‘Get fields’ to get the fields found or created in the CSV input module. Then select either ‘Enter field mapping’ if you have the full DDL loaded on the database, or ‘SQL’ to create the DDL followed by ‘Enter field mapping’ (see tip below). Once the table is connected and the fields from the input data are mapped to the fields of the database, you are ready to run the transformation. Return to the workbench and press the play symbol at the top of the transformation tab.

Table loading tip: Let Pentaho deal with DDL
Use Pentaho to help you fill out the table structure

Once you have defined a connection by making a line between the CSV file input and the Table output, you can load the fields as they are stored in the flat CSV file using the ‘Get fields’ button (1). If you already have a table definition you can go straight to the mapping (3), but if you don’t have a DDL, or you want the database table structure to follow the field definitions you declared in the CSV file input, you can use the SQL button (2) to get Pentaho to create the DDL and execute it on the database. To be able to do this you first have to create a table so that the ‘target table’ field can be filled out. The trick is to know that you only have to define one column when you create the target table, e.g. execute the command “CREATE TABLE ALMOST_EMPTY_TABLE (ID_COLUMN INTEGER)” on the database, and let Pentaho deal with creating the rest. After running the SQL, you can map the fields (3) and then run the transformation.

Validate that the data is loaded into the database.

Check the number of loaded rows

In Pentaho Spoon, the running transformation has a step matrix where you can monitor the progress. During the load, and upon completion, the total number of rows is shown here. To be sure that all data is loaded, or to see how many rows were dropped, you can read this from the step matrix, but also double-check by comparing the number of lines in the CSV file with the number of rows in the database table. The number of lines in the CSV file can be found with the command “wc -l filename.csv” in the folder where the CSV files are stored (if the file has a header, subtract one from the result), and this can be compared to the number of rows in the loaded table using “SELECT count(*) FROM tablename”.

Your data is now loaded into the database.

Useful Terminal Commands for work with data

Please see my terminal-tools tag for more articles on nifty terminal tools.

Enclose values in quotes, and add comma

Say you work with a large data file where all values are printed on consecutive lines, and you want to copy these lines into an array, list or other data structure where strings need to be quoted and values need to be separated by commas. Here is a sed one-liner that may help you with that task.

sed 's/\(.*\)/"\1",/g' all_uuid.csv > all_ready.txt

This CLI snippet uses sed, the stream editor, and passes the s – substitute – command, followed first by a slash delimiter and then by a catch-all regular expression enclosed in a group. This matches the whole line and is finished by another slash delimiter. What comes between the first and second delimiter is the pattern, and then comes the replacement. We write the initial quote mark, then a \1, which refers to the regular expression group in the pattern, followed by a closing quote and a comma. We add the g flag to continue this operation for all matches, in other words all lines. Then we pass in the file we want to transform, and use the > operator to print the output of the command into the file all_ready.txt.
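For illustration (the value is a made-up UUID), a line in all_uuid.csv containing

123e4567-e89b-12d3-a456-426655440000

would come out in all_ready.txt as

"123e4567-e89b-12d3-a456-426655440000",

ready to be pasted into a list or array literal.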

Copy a specific number of lines from one file to another

While working with large data you may want to try something new. The problem is that running the process on too large a data selection takes too much time. How can you cut it down without opening a text editor? With the head command you can print the beginning of a file, and by adding a number as an argument you can define exactly how many lines to include. The lines are printed to stdout, but by using the greater-than operator you can route this output into a new file.

head -50000 oldfile > newfile

The head command reads the file from the beginning, by default only the first 10 lines. We pass in an argument asking for 50,000 lines, then the file we want to read, and use the greater-than operator to write the output into a new file.

Getting the total number of lines of a textfile

When working with data and transferring it from one system to another, you will often find that the data is saved in flat files. Flat files can be comma-separated (the well-known CSV files), tab-separated or separated by other means. The general pattern is that the attributes of the data (the columns) are separated by a predefined delimiter, and the occurrences (the rows) are separated by a newline. For this command to work you need the format to be of this type; use e.g. head to verify that occurrences are separated by newlines. When that is done, run the command below, and if you have a header line in your flat file subtract 1 from the number of lines. This can easily be compared with the result of count(*) on the SQL database after the data is imported.

wc -l filename

Getting the total number of files in a directory

In this example we use the command from above, but this time we pipe in the output of the directory listing (ls) in long format (-l), which puts each entry on its own line. With piping we can send the result of one command as the input to a second. This kind of chaining makes *nix systems very convenient, as you can do complex things by combining simple commands. Here wc -l receives the list of files and directories in the directory and prints the count (note that the long format adds a “total” header line, so subtract one for an exact count).

ls -l [directory-name] | wc -l

Illustrative image by Travis Isaacs, licensed under a Creative Commons Attribution licence. Found on Flickr.