Improve data processing performance on AWS Redshift by 200%

So what are the challenges when your business has grown to such a level that your database infrastructure is no longer fit for purpose? We undertook a monster AWS Redshift database migration to help resolve exactly this kind of issue. What do I mean by monster? Well, the high-level numbers are 450 billion records and 20 TB of data!

Our client, an acclaimed consumer electronics manufacturer, had an infrastructure built on a sub-optimal database design. This was significantly impacting the performance of data loading and reporting, a crucial capability for any data-driven business. They were also experiencing rapid growth in the number of reporting users, the reporting workload and their application data.

One of the crucial tasks for us in this project was elevating the performance of the database and queries.

Improving database performance is an exciting task, but it can be pretty daunting if the right things are not looked into. In this particular case, we needed to optimise the queries and add nodes to the current cluster. For this we opted to use an AWS DC (dense compute) cluster.
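
The specific query and table-design changes are not reproduced in this post, but on Redshift this kind of tuning usually comes down to distribution and sort keys. The snippet below is a minimal, hypothetical sketch of such a change: the table, columns, DISTKEY and SORTKEY are illustrative assumptions, not the client's actual schema.

# Hypothetical DDL illustrating Redshift-style table tuning: DISTKEY co-locates
# rows that are joined together on the same slice, SORTKEY keeps date-filtered
# scans cheap. The table and column names are made up for this example.
create_sql = """
CREATE TABLE sales_fact (
    sale_id    BIGINT,
    device_id  BIGINT,
    sale_date  DATE,
    amount     DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (device_id)
SORTKEY (sale_date);
"""

# This would be run through whichever query helper or cursor the load scripts
# already use (see the copy scripts further down).
print(create_sql)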

When we applied various changes to the database configuration and queries, we saw improved performance but were still not able to achieve the required results. We needed to reduce the time taken to load and process data from SQS into the database from 24 hours to around 8 hours.

We tried many approaches, such as changing the AWS Redshift cluster type, changing node types and increasing the number of nodes, but were still not able to achieve the required performance. The best we could achieve via this route was around 25% of our targeted improvement.
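
For context, these cluster-level changes can be scripted rather than clicked through in the console. The sketch below uses boto3, which is not part of the original scripts; the cluster identifier, region, node type and node count are placeholders.

import boto3

# Hedged sketch: changing a Redshift cluster's type, node type and node count.
# 'reporting-cluster', the region and the dc2 values are placeholder values.
redshift = boto3.client('redshift', region_name='eu-west-1')

redshift.modify_cluster(
    ClusterIdentifier='reporting-cluster',
    ClusterType='multi-node',
    NodeType='dc2.8xlarge',
    NumberOfNodes=8
)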

We also compared performance between AWS dense storage (DS) and dense compute (DC) nodes, but that did not make much difference either. We made these checks using a log file generated by a Python script, recording the copy and processing time for each table.
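
The timing script itself is not listed in this post; a minimal sketch of the idea looks like this, where run_copy and run_process are stand-ins for the real copy and processing steps and the table names are assumptions for illustration.

import time
import logging

logging.basicConfig(filename='redshift_timings.log', level=logging.INFO)
logger = logging.getLogger(__name__)

def run_copy(tablename):
    ...   # stand-in for the real COPY-from-S3 step

def run_process(tablename):
    ...   # stand-in for the real downstream insert/transform step

# Hypothetical timing loop: log how long the copy and the downstream processing
# take for each table, so node-type comparisons can be made from the log file.
for tablename in ['sales_fact', 'device_dim', 'region_dim']:
    start = time.time()
    run_copy(tablename)
    copy_seconds = time.time() - start

    start = time.time()
    run_process(tablename)
    process_seconds = time.time() - start

    logger.info('table=%s copy=%.1fs process=%.1fs',
                tablename, copy_seconds, process_seconds)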

We also tried processing the data files in Python on an EC2 instance. In this approach we loaded each data file into memory using Python scripts, but the processing needs reference data from the Redshift database. Loading that reference data out of Redshift and mapping it against the file data held in memory made the overall processing even slower.
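
To illustrate why this route was slow: each batch of files needed reference data pulled out of Redshift before the rows held in memory could be mapped, roughly along these lines. The connection details, table, columns and file layout below are illustrative assumptions, not the client's actual setup.

import csv
import psycopg2   # Redshift accepts PostgreSQL-protocol connections

# Hypothetical sketch of the in-memory mapping approach described above.
conn = psycopg2.connect(host='example-cluster.redshift.amazonaws.com',
                        port=5439, dbname='analytics',
                        user='loader', password='placeholder')
cur = conn.cursor()

# Pull the reference data out of Redshift first; repeating this round trip for
# every batch of files is where much of the time went.
cur.execute('SELECT device_code, device_id FROM device_dim;')
device_lookup = dict(cur.fetchall())

# Map each file row against the reference data held in memory.
with open('sales_extract.dat') as f:
    for row in csv.reader(f, delimiter='|'):
        device_id = device_lookup.get(row[1])
        # ... further processing of the enriched row ...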

Finally, we decided to go for Python threading. We split our processes by major table and ran them in parallel using Python threads, and what we got was a 200% improvement in performance.

We were able to load and process one day's worth of data in less than 8 hours.

When splitting the process by major table, we considered the dependencies between tables and any locking issues that could arise from running them in parallel. We separated the copy, insert, backup and delete scripts for each table so that they could run in parallel, which gave us the best performance.

Old, slow process without threading:

 


for key in keys:
    if(tablename in key.name and filetype in key.name):
        keyname = key.name
        filename = keyname[keyname.index('/')+1:]
        logger.info(('Start Copying data into table {0} from s3').format(tablename))

        # Copy query statement
        sql = ("COPY {0} \nFROM '{1}{2}' \nCREDENTIALS "
               "\n'aws_access_key_id={3};aws_secret_access_key={4}' "
               "\nmanifest \nDELIMITER '|' \nGZIP \nESCAPE \nNULL AS '\\N';").format(
                   tablename, s3bucketURLUnload, filename,
                   connkey['access_key'], connkey['secret_key'])
        sqlcount = ('\nselect PG_LAST_COPY_COUNT();')
        sql = sql + sqlcount
        print(sql)

        # Call execute_query method to run the query
        copycount = common_function.execute_query(sql, client_redshift, tablename, 'copy')
        logger.info(('copycount: {0} for key:{1}').format(copycount, key.name))

        if(copycount > 0):
            insert_data(tablename, client_redshift, scriptfile, sqs_insert)
        else:
            logger.info(('copycount: 0 for key:{0}').format(key.name))

        key.delete()

 


New, optimised process with threading:


for tablename in tablelist:
    # test = (bucketname+"/"+tablename+"/")
    test = (tablename+"/")
    # Preparing bucket list for single table
    keys = list(bucket.list(prefix=test))
    # Sorting bucket list based on last modified date
    keys.sort(key=lambda k: k.last_modified)

    if len(keys) == 0:
        pass
    else:
        if(tablename == 'xxx'):
            keys = keys[0:60]
        elif(tablename == 'yyy'):
            keys = keys[0:20]
        else:
            keys = keys[0:30]

        # Starting thread execution
        t = threading.Thread(target=comman_func_copy, args=(client_redshift, tablename, keys))
        t.daemon = True
        threads.append(t)

# Start all threads
for x in threads:
    x.start()

# Wait for all of them to complete
for x in threads:
    x.join()
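
The thread target itself (comman_func_copy above) is not listed in this post. In outline, each worker handles one table end to end, keeping that table's copy, insert, backup and delete steps serialised inside a single thread so that threads working on different tables never contend for the same locks. Below is a hedged sketch of what such a worker could look like; the run_copy, run_insert, run_backup and run_delete helpers are assumptions standing in for the separated per-table scripts described earlier.

def comman_func_copy(client_redshift, tablename, keys):
    # Per-table worker: copy -> insert -> backup -> delete stay in order within
    # this thread, while other tables run in their own threads in parallel.
    # The helper functions called here are illustrative, not the original code.
    for key in keys:
        filename = key.name[key.name.index('/')+1:]

        # COPY this S3 file into the table (same COPY statement as in the
        # single-threaded script shown earlier).
        copycount = run_copy(client_redshift, tablename, filename)

        if copycount > 0:
            run_insert(client_redshift, tablename)   # downstream inserts
            run_backup(client_redshift, tablename)   # back up processed rows
            run_delete(client_redshift, tablename)   # clear the staging rows

        key.delete()                                 # remove the processed file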


By: Laxman Amrale

About Ardent

Established in 2008 and boasting a software and data engineering team of over 120 professionals, Ardent is one of the UK’s first established Amazon AWS partners and has considerable expertise in delivering complex data migration, ETL and AWS management and monitoring services.

We work very closely with the AWS technical team to ensure we develop the best solution for your business needs.

 

For more information on our big data or cloud management services, or if you are facing similar database performance issues and would like further information from Laxman, please call us on +44 (0) 1582 870 180 or contact us by email.