Redshift is one of the great Amazon AWS services, well suited to big data, BI, and analytics workloads. It can take some getting used to, though, because of limitations it has compared even to its parent, Postgres. Once you adjust, you will find its benefits considerable. In this post, I demonstrate how to increase insert performance in a Redshift database using Java JDBC.
As the Redshift documentation mentions, the best way to insert big data in bulk is the COPY command: you dump your data to a file, upload it to an S3 bucket, and then use COPY to load it into Redshift. I tried this myself and the speed is super fast, around 12K records inserted in a couple of seconds. But if that is not a suitable option, there is no way around inserting the data through a program. For testing's sake, I chose JDBC so as not to bother with the convoluted configuration of other drivers. Additionally, I would like to show how proper programming practice with the simplest tool (in this case JDBC) can increase performance significantly.
Now let’s get started.
I have listed a few programming practices for inserting data into Redshift when the data is held in a list of objects. I have categorized the options in the sections below.
Option-1 (Single insert, making a new connection for each insert)
Look at the following code.
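A minimal sketch of this anti-pattern (the cluster endpoint, credentials, and the `users(id, name)` table are placeholders I made up; the Redshift JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class Option1SingleInsert {
    // Placeholder endpoint and credentials -- substitute your own cluster's values.
    static final String URL =
        "jdbc:redshift://your-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
    static final String SQL = "INSERT INTO users (id, name) VALUES (?, ?)";

    // Anti-pattern: a brand-new connection is opened and torn down for every row.
    public static void insertAll(List<Object[]> rows) throws Exception {
        for (Object[] row : rows) {
            try (Connection conn = DriverManager.getConnection(URL, "awsuser", "secret");
                 PreparedStatement ps = conn.prepareStatement(SQL)) {
                ps.setInt(1, (Integer) row[0]);
                ps.setString(2, (String) row[1]);
                ps.executeUpdate(); // one connection handshake plus one round trip per record
            }
        }
    }
}
```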
In this code, each insert statement gets its own connection and its own statement preparation, which is a total disaster for speed. It wastes resources and increases system load. Because a new connection is made for every insertion, the speed based on performance testing is less than 2 records per second, which is approximately 100 records per minute.
Option-2 (Single insert, keeping one connection alive for all insertions)
In this approach the connection stays open for the entire process, which reduces system load, but it does not noticeably improve Redshift insert performance, since each record is still inserted in its own statement.
Based on performance testing, it inserts about 2 records per second, which equates to 120 to 150 records per minute.
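A sketch of this variant, reusing the same placeholder endpoint and `users(id, name)` table as above; the only change is that one connection and one PreparedStatement serve the whole list:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class Option2ReusedConnection {
    // Placeholder endpoint and credentials -- substitute your own cluster's values.
    static final String URL =
        "jdbc:redshift://your-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
    static final String SQL = "INSERT INTO users (id, name) VALUES (?, ?)";

    public static void insertAll(List<Object[]> rows) throws Exception {
        // The connection and the prepared statement are created once and reused.
        try (Connection conn = DriverManager.getConnection(URL, "awsuser", "secret");
             PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (Object[] row : rows) {
                ps.setInt(1, (Integer) row[0]);
                ps.setString(2, (String) row[1]);
                ps.executeUpdate(); // still one statement (one round trip) per record
            }
        }
    }
}
```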
Option-3 (Batch insertion)
In this option, we insert the data in batches, which improves performance, though it is still not as fast as the COPY command.
Bear in mind that in this option the insert statements are still separate; we simply batch them together and submit them under one transaction.
With this option, I managed to increase throughput to several hundred, up to a few thousand, records per minute, which is not bad. But if we have millions of records to insert, this approach will not be good enough either, which leads us to the next option.
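A sketch of the batching approach, again against the placeholder `users(id, name)` table; the batch size of 1000 is an assumption for illustration and should be tuned for your workload:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class Option3BatchInsert {
    // Placeholder endpoint and credentials -- substitute your own cluster's values.
    static final String URL =
        "jdbc:redshift://your-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
    static final String SQL = "INSERT INTO users (id, name) VALUES (?, ?)";
    static final int BATCH_SIZE = 1000; // assumed value; tune via your own tests

    public static void insertAll(List<Object[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "awsuser", "secret");
             PreparedStatement ps = conn.prepareStatement(SQL)) {
            conn.setAutoCommit(false); // submit the whole batch in one transaction
            int count = 0;
            for (Object[] row : rows) {
                ps.setInt(1, (Integer) row[0]);
                ps.setString(2, (String) row[1]);
                ps.addBatch();                  // queue the statement client-side
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();          // send the accumulated batch
                }
            }
            ps.executeBatch(); // flush any remainder
            conn.commit();
        }
    }
}
```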
Option-4 (Multi-row batch insertion)
As mentioned, Amazon recommends inserting multiple rows per statement, but this is quite difficult to manage in JDBC and requires some manual string building. You can construct one big SQL statement using Statement and then insert your data at high speed. But there is a problem here: using Statement instead of PreparedStatement leaves the program liable to SQL injection and decreases performance slightly. This approach is suitable only for non-interactive programs.
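A sketch of the multi-row approach against the same placeholder `users(id, name)` table. The chunk size of 500 rows per statement is an assumed illustration, not the tuned value; escaping here only doubles single quotes, which is a minimal guard, so proper sanitization of the input remains your responsibility:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class Option4MultiRowInsert {
    // Placeholder endpoint and credentials -- substitute your own cluster's values.
    static final String URL =
        "jdbc:redshift://your-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
    static final int ROWS_PER_STATEMENT = 500; // assumed value; tune via testing

    // Concatenates all rows into one multi-row INSERT statement.
    static String buildInsert(String table, List<Object[]> rows) {
        StringBuilder sb = new StringBuilder("INSERT INTO ")
            .append(table).append(" (id, name) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            if (i > 0) sb.append(", ");
            Object[] row = rows.get(i);
            sb.append('(').append((Integer) row[0]).append(", '")
              .append(((String) row[1]).replace("'", "''")) // minimal quote escaping
              .append("')");
        }
        return sb.toString();
    }

    public static void insertAll(List<Object[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "awsuser", "secret");
             Statement st = conn.createStatement()) {
            for (int from = 0; from < rows.size(); from += ROWS_PER_STATEMENT) {
                int to = Math.min(from + ROWS_PER_STATEMENT, rows.size());
                st.executeUpdate(buildInsert("users", rows.subList(from, to)));
            }
        }
    }
}
```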
As you can see, all the records are concatenated and inserted under one single INSERT statement. This is by far the best performance I managed to get, around 100K records per minute. The values for rows per statement and batch submission were obtained through performance testing, and I believe they are close to optimal. I must admit the query in this case looks scary and dirty.
Option-5 (Unnest insertion)
Postgres has a function called unnest which lets you insert an array in a single insert statement in a neat and clean way. The biggest disadvantage of this approach, though, is that you can only insert a single column of array values, which is the major limitation: it is only suitable for a single-column table. The following code demonstrates the approach.
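A sketch of the unnest pattern in the Postgres style, against a made-up single-column `tags(tag)` table; whether `createArrayOf` is supported depends on the driver you use, so treat this as an outline under that assumption:

```java
import java.sql.Array;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Option5UnnestInsert {
    // Placeholder endpoint and credentials -- substitute your own cluster's values.
    static final String URL =
        "jdbc:redshift://your-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
    // The whole array is bound as one parameter; unnest() expands it into rows
    // server-side. This only works for a single-column target table.
    static final String SQL = "INSERT INTO tags (tag) SELECT unnest(?)";

    public static void insertAll(String[] values) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "awsuser", "secret");
             PreparedStatement ps = conn.prepareStatement(SQL)) {
            Array arr = conn.createArrayOf("varchar", values); // driver-dependent
            ps.setArray(1, arr);
            ps.executeUpdate(); // one statement for the entire array
            arr.free();
        }
    }
}
```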
This approach performs better than option-4 because it uses a standard SQL construct and a prepared statement in Java. The speed was more than 100K records per minute in this case.
Based on my experience and needs, I recommend option-4, combined with careful data sanitization to avoid SQL injection and other vulnerabilities.
Some useful materials