
SQL: Setup a Learning Workbench with SQLite and SQuirreL

Install SQLite

To get SQLite you only need to download a file from the SQLite project homepage. Select Download, then the precompiled binaries for your preferred operating system.

Unzip the folder with the binary and place it in a folder that is accessible from your command line interface (the terminal on Mac/*nix systems, PowerShell or cmd on Windows). This means a folder that is included in the path variable. Here is how you can add a folder to the path on several operating systems.

Open the terminal and type sqlite3 (sqlite3.exe on Windows).

Once you have installed SQLite in a folder accessible from the path, you should get this prompt when starting the program.

Download sample dataset

The next step is to download the data we are going to use. Luckily for us, there are many sample databases available online. One of these is the Chinook database. Download the dataset from here. Unzip the archive and find the ‘Chinook_Sqlite.sqlite’ file. Move it to a folder, and start sqlite3 with the path to the database as an argument.

Validate the downloaded database by displaying the tables it contains (and then, if you want, run a query to see some content – just for fun).

Start SQLite with the database, display the tables in the database and run a query (text marked in yellow is input).
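In text form, the session shown in the screenshot looks roughly like this (the table and column names come from the Chinook sample database; adjust the path to wherever you placed the file):

sqlite3 Chinook_Sqlite.sqlite
sqlite> .tables
sqlite> SELECT Name FROM Artist LIMIT 5;
sqlite> .quit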

Download SQuirreL SQL, a SQL query tool

You have done great! We now have a functional database with SQLite and some data we can use. We could stop here and do what we want with SQLite, but it would be useful to have a tool besides the CLI to interact with the database.

SQuirreL is just such a tool: an open-source SQL workbench with support for several database systems, including MySQL, Oracle, DB2 and SQLite. In this GUI tool you can navigate the database, run queries, display graphs of how tables are related, and more.

Find the installation files here. Select the build appropriate for your operating system, run the downloaded file and step through the installation.

Connect SQuirreL with your database

Eu Wern Te has written a short guide on how to install a JDBC driver for SQLite and connect SQuirreL to the SQLite database. To summarize: download the JDBC driver and add it to SQuirreL's lib folder. When this is done, open SQuirreL, add the driver and then create an alias where you insert the path to the SQLite database in the URL field. You will need to fill out several fields, so check out the guide.

Once you have connected SQuirreL with the database, you can communicate with the database through this tool.
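As a rough idea of what goes in the URL field (the exact fields depend on the driver described in the guide), a SQLite JDBC URL typically has this form, with the path below being just an example:

jdbc:sqlite:/path/to/Chinook_Sqlite.sqlite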

 


Prepare Your Data For Analysis: Load Data With Spoon

Note: The instructions described here were learned and applied while working on the Clouds & Concerts project at the University of Oslo, in close collaboration with Xin Jian, Johannes Bjelland and Pål Roe Sundsøy, under the supervision of Arnt Maasø. The credit for methodology and SQL ninja tricks therefore goes to all members of the fantastic stream team. Inconveniences or mistakes are solely my own.

Technical setup

We chose Amazon as the platform for our data analysis. Amazon Web Services provides virtual computers, databases and storage in the cloud. All services are scalable, so the computer used for data loading and manipulation can be left dormant when idle, and the database can be scaled up when doing complex and data-intensive operations.

DSV files – choosing a delimiter

An excerpt from a CSV file; note the tab character between each occurrence’s variables and the line shift between occurrences.

When moving large database files between two servers in different organisations, the standard way of exchanging data is through flat text files. The flat text files may have a structure, e.g. XML or JSON, but in the case of relational databases, where all entries (rows) have the same number of attributes (columns), DSV is a convenient format. DSV is the abbreviation for delimiter-separated values: a format where attributes and entries are separated by assigned delimiter tokens. The most frequent delimiters are a comma between attributes and a newline between entries; this format is known as CSV.

When choosing a delimiter, it is important to find a character that is not found in the data values. We chose to use tab as the delimiter. To save space and transfer cost the files should be zipped (NB: the .zip container supports only up to 4 GB; if you have any larger files, use gzip or tar.gz instead).
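To illustrate, a made-up tab-separated excerpt could look like this (column names and values are purely hypothetical; each column is separated by a tab and each occurrence ends with a newline):

user_id	track_id	playdatetime
1001	42	2013-05-17 21:04:11
1002	7	2013-05-17 21:05:03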

Data delivery S3 bucket

Amazon’s Simple Storage Service (S3) is an easy way to exchange data, either through a client or through the web interface accessible from the AWS console. All kinds of files may be uploaded to S3, including zipped or tar.gz-ed files. Files can easily be uploaded and downloaded by those who have access, and the web uploader supports large files. You don’t have to allocate storage space in advance either, since this is done on the fly.

Using Pentaho Spoon for loading

The software company Pentaho provides good tools for working with data analysis, and part of their suite is the ETL (extract, transform, load) tool Spoon. It also comes in a community edition which can be downloaded from their homepage. The tool is built on Java, so it can run on different platforms. Once installed, use the terminal to navigate into the ‘data-integration’ folder, set execution rights and run ‘spoon.sh’. This should open the Spoon ETL tool.

The Pentaho Spoon workbench lets you transfer data between “modules”. In this picture a CSV input module is passing data to a table output module.

In Pentaho you can create a transformation for each table you need to transfer; alternatively, you can join several transformations together in jobs. To do a basic transfer of data from CSV, or other delimited files, to a database server you will need two modules: CSV file input and Table output.

Set up a super-easy transformation to load CSV into a table

In Pentaho, the convenient CSV file input module can be used for reading data from delimited files. 1: Use the file finder to select the file, and set the delimiter and enclosure (e.g. if the strings are prepended and appended by quote marks). If your file dump has a header row, tick off this option. 2: The CSV file input module can try to determine the format of the various fields based on the data in the files. Use the ‘Get Fields’ option, and select a good sample size to determine the format.

The Pentaho CSV input options let you define the data type of your fields; it can also detect this automatically.
Loading tip: represent values as strings.

If you have a file with a high risk of data errors, for example a file where columns overflow, I usually set the fields to text, using the functionality where I can define the data type per field (3), and make sure the length is large enough to encompass the data. This lets us do the changes and alterations of data format on the database server once the data is loaded. Most data can be represented as a string: “12354”, “182,54” and “01/08/1986 00:00:00,000000” can all be represented as strings, but the first example can also be an integer, the second a real or float, and the third a timestamp. We can parse the data once we have them loaded into the SQL database. You can adjust the type in the CSV file input module.
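As a sketch of what that post-load parsing can look like (assuming an Oracle-style database, as in the Oracle post below; the values are just the examples above):

SELECT
  TO_NUMBER('12354') AS as_integer,
  TO_NUMBER('182,54', '999D99', 'NLS_NUMERIC_CHARACTERS='',.''') AS as_decimal,
  TO_TIMESTAMP('01/08/1986 00:00:00,000000', 'DD/MM/YYYY HH24:MI:SS,FF6') AS as_timestamp
FROM dual;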

Use Table output to write the rows to the SQL server

The other Pentaho module needed for this basic transformation is the Table output. Once Pentaho is configured with a valid database configuration, this module can access the database directly so you can select which table to write the data to. First drag a line from the CSV file input module to the Table output to create the data connection. Then select the output database/target schema and target table. If you have set up the table structure through a DDL (Data Definition Language – the structure of the tables) you can map the fields from the CSV file input. In the database fields tab you first select ‘Get fields’ to fetch the fields found or created in the CSV input module. Then select either ‘Enter field mapping’ if you have the full DDL loaded on the database, or ‘SQL’ to create the DDL followed by ‘Enter field mapping’ (see tip below). Once the table is connected and the fields from the input data are mapped to the fields of the database, you are ready to run the transformation. Return to the workbench and press the play symbol at the top of the transformation tab.

Table loading tip: Let Pentaho deal with DDL
Use Pentaho to help you fill out the table structure

Once you have defined a connection by drawing a line between the CSV file input and the Table output, you can load the fields as they are stored in the flat CSV file using the ‘Get fields’ button (1). If you already have a table definition you can go straight to the mapping (3), but if you don’t have a DDL, or you want the database table structure to follow the field definitions you declared in the CSV file input, you can use the SQL button (2) to have Pentaho create the DDL and execute it on the database. To be able to do this you first have to create a table so the ‘target table’ field can be filled out. The trick is to know that you only have to fill in one variable when you create the target table, e.g. execute the command “CREATE TABLE ALMOST_EMPTY_TABLE (ID_COLUMN INTEGER)” on the database, and let Pentaho deal with creating the rest. After running the SQL, you can map the fields (3) and then run the transformation.
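In SQL, the placeholder step is as small as this (the table and column names are just the example from the paragraph above); the rest of the DDL is then generated and executed by Pentaho through the SQL button:

-- minimal placeholder so the 'target table' field can be filled out
CREATE TABLE ALMOST_EMPTY_TABLE (ID_COLUMN INTEGER);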

Validate that the data is loaded into the database.

Check the number of loaded rows

In Pentaho Spoon, the running transformation has a step matrix where you can monitor the progress. During the load and upon completion the total number of rows is shown here. To be sure that all data is loaded, or to get a number for how many rows were dropped, you can read this from the step matrix, but also by double-checking the number of lines in the CSV file against the number of rows in the database table. The number of lines in the CSV file can be found with the command “wc -l filename.csv” in the folder where the CSV files are stored (if the file has a header, subtract one from the result), and this can be compared to the number of rows in the loaded table using “SELECT count(*) FROM tablename”.
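For the comparison itself, the two counts can be fetched like this (the file name and table name are just the placeholders from the text):

-- number of lines in the flat file, run from the shell:
--   wc -l filename.csv   (subtract one if the file has a header row)
-- number of rows actually loaded:
SELECT count(*) FROM tablename;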

Your data is now loaded into the database.

Safe functions for casting numbers and timestamps in Oracle SQL

When importing data that can be irregular from a flat file, it is convenient to import all values as character strings, in this case Oracle’s varchar2, and then cast them into the proper format using the built-in datatype converters such as to_number. The problem occurs when a value that is not convertible is stored in the column. This can happen either because of column overflow in the import process, or because the value is optional and may be null or some other non-set value. In these cases the built-in converters raise errors and the import halts.

Here are two methods I find useful for solving these problems: two PL/SQL functions wrapping the built-in Oracle converters, but with a controlled exception where a null value is stored and the error is logged. Working on tables with several million rows, I find this approach more useful than halting the process.

Safe number converter

CREATE OR REPLACE FUNCTION safe_to_number (p varchar2)
RETURN number
IS
  retval number;
BEGIN
  IF p IS NULL THEN
    retval := null;
  ELSE
    retval := to_number(p);
  END IF;
  return retval;
EXCEPTION
  WHEN OTHERS THEN
    DBMS_OUTPUT.PUT_LINE('Could not change value to number');
    retval := null;
    return retval;
END;

Safe timestamp converter

create or replace function safe_to_timestamp
  ( p_str  in varchar2
  , p_mask in varchar2)
return timestamp
is
  -- use a timestamp variable; declaring it as date would silently drop fractional seconds
  d timestamp;
begin
  d := to_timestamp(p_str, p_mask);
  return d;
exception
  when others then
    DBMS_OUTPUT.PUT_LINE('Could not change value to timestamp');
    return null;
end;

I use these after the tables have been imported as plain string tables, to create a copy with the correct data formats.

CREATE TABLE formatted_table as
SELECT
safe_to_timestamp(playdatetime,'YYYY-MM-DD HH24.MI.SS.FF') playdatetime,
safe_to_timestamp(dateval,'YYYY-MM-DD HH24.MI.SS.FF') dateval,
safe_to_number(id_table) id_table
FROM unformatted_table;
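A quick way to sanity-check the functions (the values here are made up): the first call converts normally, while the second would have raised ORA-01722 with a plain to_number, but now returns null instead.

SELECT
  safe_to_number('12354') ok_value,
  safe_to_number('12,354x') bad_value
FROM dual;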