A short script for testing writing many files to a folder

The challenge: We want to see at what point the number of files in a folder starts to degrade the performance of adding new files to that folder. Two examples of where we may need to do this are getting an overview of the performance of the file system's node structure, and testing the Windows feature for 8dot3 (short file name) compatibility.

The solution: We create a script that writes a large number of files to the folder in question and logs the time taken at specific milestones. The logged records tell us how long it takes to write files up to each milestone, and from this we can infer how efficiently the file system writes files between the different milestones.

Example of output
A graph representing the number of files created over time. The X axis shows the number of seconds elapsed, and the Y axis the number of files created. What does your function look like?

The implementation: I’ve chosen to create the new files in a for loop that runs N times, where N is based on user input. Each iteration opens a new file with an incremental file name, writes the payload to the file, and closes the file before the loop counter is incremented.
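The core loop can be sketched on its own as follows. This is a minimal illustration and not the full script: the function name `write_files`, the `milestone` parameter and the `file_<i>.bin` naming scheme are assumptions made for this example.

```python
import os
import tempfile
import time


def write_files(folder, count, payload, milestone=100):
    """Write `count` files into `folder`; return (files_written, seconds) pairs.

    Hypothetical helper for illustration: a timing is recorded every
    `milestone` files and once more when the loop has finished.
    """
    timings = []
    start = time.perf_counter()
    for i in range(count):
        if i % milestone == 0:
            timings.append((i, time.perf_counter() - start))
        # Incremental file name, so every iteration creates a new file.
        with open(os.path.join(folder, f"file_{i}.bin"), "wb") as handle:
            handle.write(payload)
    timings.append((count, time.perf_counter() - start))
    return timings


if __name__ == "__main__":
    # Small demo in a throwaway folder; the payload is 1 KiB of filler bytes.
    with tempfile.TemporaryDirectory() as folder:
        for files_written, seconds in write_files(folder, 1000, b"x" * 1024):
            print(files_written, round(seconds, 4))
```

The milestone check only decides when to record a timing; the file is written on every iteration regardless.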

Wrapped around this core functionality, we need to define the folder in which the files will be created and the data that is to be read and written. We need to read the defined data into a variable once (we don’t want to add overhead by reading the data-to-write on each iteration) and create the test folder if it does not already exist. In addition, we need a function that writes the timestamp and the iteration number to a file.

To allow for multiprocess testing I’ve also added a loop that spawns new processes and passes on the number of files to create. To test more scenarios, such as renaming and deleting files, more actions have been added.

The actions, the test folder path, the input file, and the number of files and processes are things the user will most likely change frequently, so instead of hard-coding them they are provided by the user as command line arguments. As always when dealing with command line arguments: provide good defaults, since the user is unlikely to set every available parameter.
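As a sketch of what good defaults can look like, here are the same options expressed with `argparse`, the standard library's successor to `optparse`. This is an alternative to the parser used in the script in this post, not the author's code. The `build_parser` name and the `choices` restriction on the action are assumptions for this example, while the option names and default values mirror the script's.

```python
import argparse


def build_parser():
    """Build a parser where every option has a usable default (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Write many files to a folder and log the time taken.")
    parser.add_argument("-f", "--files", type=int, default=100,
                        help="files each process creates (default: %(default)s)")
    parser.add_argument("-p", "--processes", type=int, default=10,
                        help="number of processes to start (default: %(default)s)")
    parser.add_argument("-a", "--action", choices=["a", "c", "cr"], default="a",
                        help="a (all), c (create), cr (create and rename)")
    parser.add_argument("-l", "--log_interval", type=int, default=100,
                        help="files between log entries (default: %(default)s)")
    parser.add_argument("-t", "--temp_path", default="temp_files",
                        help="folder the files are written to")
    parser.add_argument("-i", "--infile", default="infile.txt",
                        help="file whose content is used as payload")
    return parser


if __name__ == "__main__":
    # An explicit argument list is passed here for demonstration only;
    # a real script would call parse_args() without arguments.
    options = build_parser().parse_args(["--files", "500", "-a", "cr"])
    print(options.files, options.processes, options.action)
```

With `type=int` the parser converts and validates the numeric options itself, so the calling code does not need to cast them.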

From description to code this will look something like this:

import multiprocessing
import optparse
import os
import random
import string
import time
from multiprocessing import Process

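def main(files_each=100, processes=10, actions="a", log_interval=100, temp_path="temp_files", infile="infile.txt"):
  # Make sure the test folder exists, then start one worker per requested process.
  path = temp_path
  check_and_create_folder_path(path)
  workers = []
  for i in range(processes):
    p = Process(target=spawnTask, args=(path, files_each, actions, log_interval, infile))
    p.start()
    workers.append(p)
  # Wait for every worker so the script returns only when all processes are done.
  for p in workers:
    p.join()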

def print_time_delta(start_time, comment, outfile=False):
  if not outfile:
    print(comment," | ",time.time() - start_time, " seconds")
  else:
    with open(outfile, 'a+') as out:
      out.write("{0} | {1} \n".format(time.time() - start_time, comment))

def spawnTask(path,files_each, actions,log_interval, infile):
  start_time = time.time()
  content = read_file_data(infile)

  print_time_delta(start_time,"creating files for process: "+str(os.getpid()))
  created_files = createfiles(files_each, content,path,start_time, log_interval)
  if(actions == 'a' or actions == 'cr'):
    print_time_delta(start_time,"renaming files for process: " +str(os.getpid()))
    renamed_files = rename_files(created_files,path,start_time, log_interval)
  if(actions == 'a'):
    print_time_delta(start_time,"deleting files for process: "+str(os.getpid()))
    delete_files(renamed_files,path,start_time, log_interval)

  print_time_delta(start_time,"operations have ended. Terminating process:"+str(os.getpid()))

def createfiles(number_of_files, content,path,start_time, log_interval):
  own_pid = str(os.getpid())
  created_files = []
  for i in range(number_of_files):
    # Log a milestone every log_interval files, but create a file on every iteration.
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"create","prod_log.txt")
    filename = "wordfile_test_"+"_"+own_pid+"_"+str(i)+".docx"
    created_files.append(filename)
    # os.path.join picks the right separator for the current platform.
    with open(os.path.join(path, filename),"wb") as print_file:
      print_file.write(content)

  print_time_delta(start_time, str(number_of_files) +" | "+own_pid+" | "+"create","prod_log.txt")

  return created_files

def rename_files(filenames,path,start_time, log_interval):
  new_filenames = []
  own_pid = str(os.getpid())
  i = 0
  for file in filenames:
    # Log a milestone every log_interval files, but rename every file.
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"rename","prod_log.txt")
    # The new name is 30 random letters and digits.
    lst =[random.choice(string.ascii_letters + string.digits) for n in range(30)]
    text = "".join(lst)
    os.rename(os.path.join(path, file), os.path.join(path, text+".docx"))
    new_filenames.append(text+".docx")
    i += 1

  print_time_delta(start_time, str(len(new_filenames))+" | "+own_pid+" | "+"rename","prod_log.txt")

  return new_filenames

def delete_files(filenames,path,start_time, log_interval):
  num_files = len(filenames)
  own_pid = str(os.getpid())
  i = 0
  for file in filenames:
    # Log a milestone every log_interval files, but delete every file.
    if (i % log_interval == 0):
      print_time_delta(start_time, str(i)+" | "+own_pid+" | "+"delete","prod_log.txt")
    os.remove(os.path.join(path, file))
    i += 1

  # Final entry once all files have been deleted.
  print_time_delta(start_time, str(num_files)+" | "+own_pid+" | "+"delete","prod_log.txt")

def check_and_create_folder_path(path):
  if not os.path.exists(path):
    os.makedirs(path)

def read_file_data(infile):
  with open(infile,"rb") as content_file:
    content = content_file.read()
  return content

if __name__ == "__main__":
  multiprocessing.freeze_support()
  parser = optparse.OptionParser()
  # type="int" lets optparse convert and validate the numeric options.
  parser.add_option('-f', '--files', type="int", default=100, help="The number of files each process should create. Default is 100")
  parser.add_option('-p', '--processes', type="int", default=10, help="The number of processes the program should create. Default is 10")
  parser.add_option('-a', '--action', default='a', help="The action which the program should perform. The default is a.\n Options include a (all), c (create), cr (create and rename)")
  parser.add_option('-l', '--log_interval', type="int", default=100, help="The number of files between each log entry written by a process. Default is 100")
  parser.add_option('-t', '--temp_path', default="temp_files", help="Path where the file operations will be done")
  parser.add_option('-i', '--infile', default="infile.txt", help="The file whose content will be used in the test")

  options, args = parser.parse_args()
  main(options.files, options.processes, options.action, options.log_interval, options.temp_path, options.infile)

 

 

Sample from output

The output from running this script is a pipe-separated (‘|’) list of the seconds elapsed, the number of files, the process ID, and the action. The process ID is needed because the program can spawn several similar processes that run simultaneously, and we need a way to tell them apart. The output looks like the sample image, and from these numbers you can create statistics on performance at different folder sizes.
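One way to turn the log into statistics is to compute the number of files handled per second between consecutive milestones. The sketch below assumes each log line has the four pipe-separated fields described above, in that order. The function names `parse_log` and `files_per_second` are hypothetical, and the sample lines are made up for illustration, not measured results.

```python
from collections import defaultdict


def parse_log(lines):
    """Group (seconds, files) samples by (process ID, action)."""
    samples = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        seconds, files, pid, action = fields
        samples[(pid, action)].append((float(seconds), int(files)))
    return samples


def files_per_second(samples):
    """Return (milestone, files per second) for each pair of consecutive samples."""
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        if t1 > t0:  # avoid dividing by zero on identical timestamps
            rates.append((n1, (n1 - n0) / (t1 - t0)))
    return rates


if __name__ == "__main__":
    # Made-up lines in the log format, used only to show the calculation.
    example = [
        "0.5 | 0 | 4242 | create \n",
        "1.5 | 100 | 4242 | create \n",
        "3.5 | 200 | 4242 | create \n",
    ]
    for key, values in parse_log(example).items():
        print(key, files_per_second(values))
```

Grouping by process ID matters because the elapsed time in each line is measured from the start of the process that wrote it, and lines from different processes are interleaved in the same log file.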

The idea of performing this analysis, and valuable feedback in the process, came from great colleagues at Steria AS. Any issues or problems with the code or text are solely my own responsibility. Whatever you use this information for is solely your own responsibility.

The folder image is by Erik Yeoh and is released under a Creative Commons Attribution-NonCommercial-ShareAlike License. The image can be found on Flickr.
