I’ve spent some time over the last couple of weeks working on a Grails service that allows us to import new records into our MySQL database.
A number of interesting techniques have popped out of this that I think would be useful for others doing similar types of importing using GORM/Hibernate/MySQL.
Background
For background, the imports in our app happen a few times a week, and range in size from less than a hundred up to a few hundred thousand new records.
It is also possible for us to get updates on previous data that we’ve imported, if a record in the import matches a record in our database, we want to update the existing record rather than inserting a duplicate.
We had a desire to use a Grails service to do the importing as we could leverage our existing data model, plus, there are some special cases where the import logic can get a bit hairy and having access to all of our domain methods is nice.
If you’re doing much more frequent imports, with heavy performance requirements using millions to hundreds of millions of rows, you’ll likely want to look at another tool more specifically targeted for batch/ETL processes or writing something directly in Java. For smaller jobs, you might be surprised how far you can get with groovy and grails.
Example Naive Import Application
To demonstrate the techniques, I’ve created a small test application. This simple app only has Books. It imports 100,000 randomly generated books in a single job (importing a Library of Books). If our database already has a book with that ISBN and edition number, we want to update the title of previous book rather than creating a new book.
Here are the setup files that won’t change as we tweak our application.
The Domain class:
// domain/Book.groovy
class Book {
String title
String isbn
Integer edition
}
The BookController that calls the service:
// controller/BookController.groovy
class BookController {
def bookService
def index = {
def library = new Library()
def startTime = System.currentTimeMillis()
bookService.importBooksInLibrary(library)
render "time: ${(startTime - System.currentTimeMillis())/1000} seconds"
}
}
And a Library class that’s the source of our books (don’t worry too much about the details of the Library class, just know that it’s the source of the fields that we’ll be importing. For each book, it simply returns a map of the field values that should be saved or updated for the Book):
class Library implements Iterator {
def startTime = System.currentTimeMillis()
def lastBatchStarted = startTime
def BOOKS_IN_LIBRARY = 100000
def currentBook = 1
def random = new Random(12345) // random generator with defined seed
Iterator iterator() { return this as Iterator }
boolean hasNext() { currentBook <= BOOKS_IN_LIBRARY }
void remove() { }
def next() {
if (! (currentBook % 100) ) printStatus()
return [
title: "Book ${currentBook++}",
isbn: randomIsbn(),
edition: randomEdition()
]
}
def randomIsbn() {
// one of 50,000 random isbn numbers
return "isbn-${random.nextInt(50000)}"
}
def randomEdition() {
// first through tenth editions
return random.nextInt(9) + 1
}
def printStatus() {
def batchEnded = System.currentTimeMillis()
def seconds = (batchEnded-lastBatchStarted)/1000
def total = (batchEnded-startTime)/1000
println "Library Book $currentBook, last batch: ${seconds}s, total: ${total}s"
lastBatchStarted = batchEnded
}
}
The Library class is a proxy for whatever the source of your import info is (a .csv/excel file that you're parsing through, a webservice, something posted to your server, etc.).
Also, to help see how we affect timing, every 100 books the Library prints out how long it took to process the last 100 books.
Here is the naive BookService that iterates through the Library that the BookController gives it. For each book it receives a map of the books values. It does a findByIsbnAndEdition() to see if the book already exists. If it does, it updates it, otherwise it will create a new book:
class BookService {
def importBooksInLibrary(library) {
library.each { Map bookValueMap ->
updateOrInsertBook(bookValueMap)
}
}
def updateOrInsertBook(Map bookValueMap) {
def title = bookValueMap.title
def isbn = bookValueMap.isbn
def edition = bookValueMap.edition
def existingBook = Book.findByIsbnAndEdition(isbn, edition)
if (existingBook) { // just update title
existingBook.title = title
existingBook.save()
} else {
new Book(bookValueMap).save()
}
}
}
When you run this code, the first thing you notice is how slow it is. The first 1000 records aren't that bad. On my old laptop they take about 7 seconds to insert. After that things get slower and slower, books 10,000 to 11,000 take 20 seconds to insert. Importing gets progressively worse as time goes on.

Yikes! It takes 2 hours and 36 minutes to import 100k records? Over 3 minutes for the last thousand books to get imported? That can't be right.
As it runs, if you open up your activity monitor/task manager/top application, you'll also see that Grails is running very hot but also that MySQL's CPU usage is slowly going up over time.
MySQL Database Performance Tweaks
Start with the obvious: Database Indexes
If you've done much database work, you know how much indexes can improve the speed of a query. But how do you know what queries need indexes?
In some situations, like our example above, it can be fairly obvious. We're calling Book.findByIsbnAndEdition(). Chances are we need an index on the "book" table that includes the "isbn" and "edition" fields:
CREATE INDEX `idx_book_isbn_edition` ON `book`(`isbn`, `edition`);
What if it isn't so obvious which queries need an index?
There are a number of techniques available to us through grails to see what kind of SQL is getting executed.
Turn on grails SQL logging
Within grails, you can turn on SQL logging by putting this in your Config.groovy's log4j closure:
trace 'org.hibernate.SQL'
Install the p6spy plugin and JDBC SQL profiler
The p6spy grails plugin inserts a proxy between your database driver classes and Grails. Because of this, it knows the full query, including the filled in query parameters and some timing information.
This is useful by itself, but you can have the p6spy plugin log it's messages to a log4j socket appender that talks to the JDBC SQL Profiler which will aggregate all of the information and help you decide where you've got slow queries and need indexes.
Mike Hugo has a great post on using the p6spy plugin with the JDBC SQL profiler.
Turn on the MySQL Slow Query Log
If you're using a recent version of MySQL, there's an easy, built-in way to see which queries are slow and which queries aren't using indexes: the slow query log.
To turn it on, you simply need to add a few lines to your mysql cnf file (by default on OSX just create a file at /etc/my.cnf and it will get loaded):
[mysqld]
log-slow-queries
#Log all queries > long_query_time seconds, default 10 seconds
long_query_time=1
This will automatically create a log file in your mysql "data" directory called host_name-slow.log that contains all queries taking longer than 1 second.
You can optionally add
log-queries-not-using-indexes
And this will also log those queries that don't have an index (but don't take longer than 1 second to execute).
MySQL also comes with a perl script to parse through the slow log file and show you the number of times each query is slow with some additional statistics. To run it, you need to add basedir to your my.cnf file. The basedir value just holds the location of mysql's data directory.
basedir=/usr/local/mysql
Then run the script:
mysqldumpslow
These steps should get you a long way towards figuring out which tables need indexes.
Turn on the MySQL General Query log
MySQL has an additional query log called the general query log. It logs everything that comes into MySQL and is extremely noisy. One benefit of it though is that it gives you the full query with all parameters, so it's more useful than turning on grails SQL trace log and potentially a little quicker to set up than the p6spy plugin.
You can enable it by putting this in your my.cnf:
general_log=1
Suggested Reading for MySQL indexes
If you haven't created many indexes before, I highly suggest reading this post on indexes on the mysql high performance blog. The executive summary is that multi-key indexes tailored to your query have a performance edge over individual key indexes on each of the fields. If you're executing a query enough times, it might be worth creating a multi-key index like we have above.
MySQL Database Engines
If MySQL is still your performance bottleneck, or if you're finding that everything else comes to a stop when you're running your batch import, check which storage engine you're using.
For grails applications, I'd prefer InnoDB (or Falcon) to the default MyISAM. MyISAM is good for read-only situations where you don't care about enforced foreign key releationships. Switching to InnoDB has significantly increased the performance of my app in a number of situations. InnoDB also supports row level locking, whereas MyISAM will lock the entire table when doing an update or an insert.
If you haven't explicitly done anything, chances are that all of your grails-generated tables are MyISAM. You can change the default storage engine for mysql by specifying a different engine in your my.cnf:
default-storage-engine=innodb
To see the default is for your database execute:
show engines;
To see the engine for a particular table, use "show create table":
mysql> show create table book\G
*************************** 1. row ***************************
Table: book
Create Table: CREATE TABLE `book` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`version` bigint(20) NOT NULL,
`edition` int(11) NOT NULL,
`isbn` varchar(255) NOT NULL,
`title` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=84192 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
If you need to alter the storage engine for an existing table, you can execute the alter table SQL statement:
ALTER TABLE book ENGINE = INNODB;
Grails Performance Tweaks
Adding an index and modifying the storage engine of our example book table to InnoDB make our sample app 25-30% faster. A nice boost from where we were originally, but things are still way too slow:

Adding indexes to the database drops the database CPU utilization to almost nothing, but grails is still pegged and it still gets much slower over time.
Something is amiss.
At this point, some people might be inclined to believe the myth that "grails is slow" and go through a painful rewrite in another "faster" language/technology.
Don't go there yet! It's only slow if you don't understand what's actually going on under the covers.
After reading through a couple of very helpful posts by Burt Beckwith, and reading through the Hibernate chapter on batch processing, we have all of the information we need to speed things up significantly.
Use a Monitoring Tool to Watch Memory/CPU usage Over Time
Because the import gets significantly slower over time, even with the index applied, we know there is some sort of leak.
If you don't have your code instrumented with timing logic already in it, it's easy to pop open jconsole or jvisualvm (both already installed and likely in your classpath if you have Java 1.6), connect to your grails instance and watch performance/cpu utilization over time.
The Culprits: Hibernate First-Level Cache and Grails Validation Cache
I found on Burt Beckwith's blog, that there are 2 separate leaks, one of them is in the hibernate first-level cache, the other is a map that Grails uses for domain object validation errors.
Normally, a grails hibernate session executes something quickly and returns. During importing, we do a ton of processing, all with the same hibernate session. All of these objects that would normally be garbage collected when the session closed are piling up.
The easiest way to deal with this is to create a simple method to clear out these collections periodically.
We can modify our BookService to clean up GORM after every 100 books we insert:
class BookService {
def sessionFactory
def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP
def importBooksInLibrary(library) {
library.eachWithIndex { Map bookValueMap, index ->
updateOrInsertBook(bookValueMap)
if (index % 100 == 0) cleanUpGorm()
}
}
def cleanUpGorm() {
def session = sessionFactory.currentSession
session.flush()
session.clear()
propertyInstanceMap.get().clear()
}
def updateOrInsertBook(Map bookValueMap) {
// ... same as above
}
}
This change makes everything much better. All 100,000 books get inserted in 3 minutes, the same time that it took the naive example to insert the last 1,000 books.

Also, because MySQL is now properly using indexes, the load on it is very light. If we're looking for even more speed improvement, this would allow us to split our work up into batches and to make multiple BookService calls (potentially on different web servers). Because we've switched to using InnoDB, we're now working with row-level locking, which isn't hostile to this approach unlike MyISAM and it's full table locking on Insert/Update.
Conclusion
This post is just some of what can be done to improve batch performance with Grails and MySQL. Using these tips, I was able to speed up my own imports to a level that met the speed requirments for our problem space, but kept our code much more maintainable. Our domain object remain the canonical code representation of our database model.
There are a number of other things that could be looked at to further speed up batch performance and that could be important to your particular application/infrastructure.
- connectivity between app server and DB
- giving enough memory to your app and DB servers
- other load into database and app servers
- anemic hardware
- poor database design
Be sure to weigh the cost of making changes in each of these areas, and run benchmark tests before committing to them. You'll often find that what you think is the bottleneck, isn't. Speed is a feature, but you need to balance it with the costs to achieve it.
Fantastic blog entry. Not needed right now, but I’m sure, that some day it will be handy!
Keep it up!
Thanks! One bad thing about grails is you kind of forget to this stuff! but hey, everyone has their flaws! :)
[...] This post was mentioned on Twitter by groovyblogs.org and Java and Groovy Geek. Java and Groovy Geek said: Shared Article : Batch Import Performance with Grails and MySQL [Ted Naleid] http://bit.ly/IH8Lj Worth a look #grails #mysql [...]
Awesome post Ted! I could have used this a few weeks ago when I converted a batch import process from using GORM to a Spring based SimpleJdbcInsert for performance reasons.
Do you think using multiple threads might also improve performance of this type of job (provided your data is thread safe and your server has multiple cores)? I’ve done that with simple Java processing before and it helps – but am curious what it would be like with GORM.
Best post on JEE/MySQL performance and bottleneck detection I’ve ever seen.
Thank you for sharing with us.
Very nice article!
This is by far one of the best I’ve read in quite some time! Keep it coming!
Great post, Ted — thanks for sharing your pain and hard work to resolve it.
Good one, Ted! Thanks for sharing that.
Nice writeup, very useful.
Based what you wrote I’ve created the below closue, also below is a updated importBooksLibary. Really usefull as you can pass a list / iterator into the closure, i’m using it with xml slurper which provides a iterator instance to save domain objects.
def withSessionCleaner(def items, Closure c){ items.eachWithIndex { obj, index -> c(obj) if (index % 100 == 0) { def session = sessionFactory.currentSession session.flush() session.clear() propertyInstanceMap.get().clear() } } }def importBooksInLibrary(library) { withSessionCleaner(libary){ updateOrInsertBook(it) } }Thanks for the useful information.
However, this seems like a bit of a hack. Doesn’t hibernate have a correct way of performing batch operations? I can’t believe that Hibernate, being as mature of a library that it is, would still require these kinds of hacks to work correctly for common operations like this.
It’s always this kind of stuff that makes me skeptical of ORM.
Nice article. We recently went thru a performance enhancement fixes for MySQL. We are using InnoDB and ultimately got the biggest bang by doing two things:
1) increased both the innodb_buffer_pool_size and innodb_log_file_size
2) doing batched inserts/updates. Ours is a Java app and not a Grails app but we do use Hibernate. I would expect that there should be some way of doing batched statements in Grails but I don’t remember seeing anything on it yet – but I haven’t really looked either. :-)
Thanks everyone, I’m glad that many of you found the post useful.
@Mike (Hugo): I did actually have a partially working multi-threaded prototype using gpars (nee GParallelizer). I did this before I figured out the stuff with the hibernate/grails caches so it helped, but still got really slow.
The biggest hurdle with doing multiple threads is that the way that grails works is that it automatically ties a hibernate session to the current thread. If you spin off a bunch of extra threads, you need to manually create your own hibernate connection for each of them. It’s doable, but I didn’t actually need that level of performance once I had the cache breakthrough. I also had created a method where I could split my import file up and call the service N times and it would automatically parallelize that way without needing to worry about doing my own thread handling. For some applications that could be a good way to go.
@Adam Evans: Nice! Having a closure like that automatically do the work is a good enhancement that helps with maintenance/readability.
@Mike2 I guess it depends on what you mean by “hack” :). There are actually further methods beyond what I discuss in this article that hibernate provides to you if you don’t want to brute force clear/flush your entire cache. You can call “evict” on the hibernate session to remove specific instances. That gives you fine-grained control over what’s in the cache. Additionally hibernate/grails provide an executeUpdate command that uses HQL. You can also get access to a raw JDBC connection if you want.
My goal was to keep things in the grails world so that we can leverage our domain classes/logic with acceptable performance. Having a single method that clears and flushes the caches feels easy to understand and maintain to me without having to do any mental gymnastics or reimplementing of code. If that’s a hack, I’m OK with that :).
@Mike Miller (Mike3? :) Thanks! Nice tips that I wasn’t aware of on the innodb settings. As I mentioned just above to Mike2, grails gives you access to the hibernate session, and also provides an executeUpdate statement, so you can go to that level if your performance requirements dictate it.
I think that all of these comments point out that you want to find the right balance between “fast enough” and “easy enough to write and maintain”. There will definitely be some applications/problem spaces where you should not be doing your batch importing with grails. But, if you already have a grails application/domain model, you might be able to get satisfactory performance with very little work.
Tietokantakyselyjen optimoinnista…
Sain viime viikolla tehtäväkseni parantaa IC Insightille…
Ted,
Great post! I’ve been struggling with getting the p6spy logging working using 1.2M2.. I have left a comment in Mike post on that topic, but was itnerested in how you personally setup your logging setup since I suspect you are using the latest Grails.
What’s different about how you setup than what Mike has documented? Have no good leads to help other than to ask.
Cheers
J
Doesn’t hibernate have a correct way of performing batch operations? I can’t believe that Hibernate, being as mature of a library that it is, would still require these kinds of hacks to work correctly for common operations like this.
Hibernate does provide a way to avoid the manual clearing of sessions, with the “Stateless Session”:
http://docs.jboss.org/hibernate/stable/core/reference/en/html/batch.html
In grails you would need to create that session separately from the one Grails provides (see above link for an example).
If you’re just inserting, the hack would be to create an insert statement manually and ask Hibernate to execute it, bypassing all of Grails’ validation. I’ve found that improves the performance by about another 75%.
@Dave what are you considering a hack? The stuff in the post above is the same solution recommended in the hibernate docs that you linked to (above the stateless session info), clearing and flushing the hibernate cache on long running threads (plus the addition of clearing an additional grails cache that is a bit of a hack :).
You could create insert statements manually if you need that additional 75% performance boost, but I’d rather lose that performance and have all of the domain methods that grails gives me to manipulate my objects and the validation they provide. That gives me performance that meets the requirements for what I need, without creating my own insert statements.
If you’re doing that, you might as well just go to straight JDBC which would be even faster still.
The point isn’t the fastest solution possible, it’s acceptable performance without giving up all the extra benefits of the grails domain objects. If you need the fastest performance possible, using grails to do batch inserts is the wrong tool for the job.
Ted,
I couldn’t believe when I stumbled upon your post, as I have been struggling with this same problem for a couple days. I tried all the same initial steps you did and improved my load of ~60K domain objects from 20 hours to just under 12. Your suggestion, however, was the key. Now batch loading in about 5-6 minutes and all seems right with the grails world again.
Thanks!
brian
sorry for the multiple posts but some of the text does have gone missing. feel free to edit or delete the first one.
we find that although this solves speed issues it causes other problems ‘outside the box’
one problem is that after cleaning up the session every 100 inserts you have to do the clean up again. this is because the number of inserts is not exactly a multiple of 100
def xmldata = new XmlParser().parse(file) xmldata.bookList.book.eachWithIndex{ it, index-> Book book = new Book() book.save() if (index % 100 == 0) cleanUpGorm() } cleanUpGorm()the extra cleanup is needed so everything is persisted
another problem is that since the session is cleared entities outside are affected
Author author = Author.get(23); def xmldata = new XmlParser().parse(file) xmldata.bookList.book.eachWithIndex{ it, index-> Book book = new Book() book.author = author author.addToBooks(author) book.save() if (index % 100 == 0) cleanUpGorm() }You would get an org.hibernate.LazyInitializationException
The reason: failed to lazily initialize a collection of role: Author.books, no session or session was closed.
@Christopher
Thanks for the response (I deleted your first response and tweaked your post to have the code in pre tags). I haven’t seen your first problem, but I’m wondering if you’re running your code in a service where you’ve marked transactional = false. That might explain the behavior that you’re seeing where the last stuff isn’t committed.
If so, the extra cleanup would be necessary, or you’d want to handle your transactions slightly differently and pass off 100 (or whatever your batch size is) items off to another method that was transactional and then clean up after calling that method.
You’re second point about potentially causing LIEs is absolutely true. If you’re doing other things outside of importing with the current thread, you’ll want to make sure to reattach any objects to the session after you’re doing your clearing.
The process that we use to do all this doesn’t do anything else but import a bunch of stuff, so this isn’t an issue for us but is definitely something to keep in mind.
Seems in DomainClassGrailsPlugin the domain objects are getting set not only on the propertyInstanceMap but also on the RequestAttributes so those have to be cleared as well.
def attributes = rch.getRequestAttributes()
if(attributes) {
attributes.request.setAttribute(key,val)
}
else {
PROPERTY_INSTANCE_MAP.get().put(key,val)
}
Hi Ted,
And thank you for this great post ! I will definitely try all of your tips for my batch imports that were very slow and very memory-consuming !
Again than you for sharing this
BTW, I didn’t manage to have p6spy and sqlprofiler working together (no queries are shown in the SqlProfiler console) with Grails 1.2 under Windows Vista). Like j pimmel, I was curious to know if you did manage to have it work and if you change something from what is explained in Mike Hugo ‘s article?
Regards,
Fabien
@fabien7474 I haven’t used p6spy with grails 1.2 yet, that was on a previous grails 1.1.1 application that I’m no longer developing so I’m not much help right now about that. I’ll be sure to post if I get a chance to get p6spy going in 1.2 or greater.
create post and really helped me and my problem to import millions of records into a postgres database.
Only thing, it’s still gets too slow after a couple of 100k inserts, since the performance gets gradually worse over time. It starts at 50ms and is at 200ms after 200k inserts (nothing compared to before flushing the session, but still)
So there is still something amiss with grails + hibernate + DB
any more ideas?
@berlinguyinca who’s load is increasing after 100k inserts? Is the database load increasing or java/grails?
If it’s the database, it could be that you’re missing an index (or need to tune an index that you already have).
I’m less familiar with current postgres installs, but I also know that in some batch programs I created against postgres a few years ago that batching thousands of records together in a transaction was a pretty big help.
If you’re doing any deletion/modification during your import, it could also be that you need to vacuum your database (something that isn’t necessary in mysql).
hi,
great from you to respond. Well yes we do update/delete/select/insert (it’s a merging tool for several databases)
So all 10k inserts I tend to rebuild the indexes and the auto vacuum daemon is running and taking care of these issues (did not really tune the auto vacuum)
I do think it’s grails and not the db and yes I also use batch statements.
I checked for indexes and monitored the executed queries (p6sql spy + jdbc profiler, which I slightly rewrote to make it work with current version of log4j, I will upload a fork in a week or two) And it looks that I have all the required indexes.
thanks again,
any other ideas? So far there were good, but already taken care off. I really think it’s more of a grails issue. I ordered some SSD’s for the server to speed up index access, but they won’t arrive before mid/end next week. IO Doesn’t not seem to be the bottleneck anyway.
@berlinguyinca If you haven’t done it already, I’d fire up jconsole (or jvisualvm, either should be in your path already if you’re using java 1.6) and connect to the grails process. Let it run for quite a while and see if it looks like the memory is continuing to go up and up, or if it sawtooths up and then comes back down to the same level again.
I’d also suggest looking at using a profiler to see where your code is spending most of it’s time. Yourkit is good, but not free. I haven’t had a lot of luck with the free ones out there, but I think yourkit has a trial period.
If that doesn’t get it done, unfortunately, it’s possible that your particular use case is complicated and large enough that grails/GORM isn’t the right tool for the job.
Excellent post. I can’t believe you had the patience to gather so much data!
Cheers,
dan
[...] stumbled upon this post http://naleid.com/blog/2009/10/01/batch-import-performance-with-grails-and-mysql/. It’s an excellent read and gives some insight into some of the inner workings of both [...]
[...] Batch Import Performance with Grails and MySQL – Ted Naleid [...]
[...] información sobre Batch Import Performance with Grails and MySQL (traducido al [...]
it’s me again,
still working on the problem and managed to drop the curve a bit with the use of a couple of SSD’s in our db server, but still not enough to make it feasible.
basically grails seems to store some more data somewhere, i guess. Reason that I’m thinking this is that I can import 100k objects with there relations in a fairly quick time of 100-200ms an insert, till I hit a certain point and than the curve jumps up again.
this is my curve for about 1.2 Mio inserts.
http://dl.dropbox.com/u/5396494/curve.png
The problem is defiantly on the grails side or my rather simple algorithm, since there is now load on the DB during this progress or queries.
@berlinguyinca I could be wrong, but I think you need to take a look at the queries that are getting generated by your code. See how many times for each insert each query is getting called. There might be situations where you could use an executeQuery rather than pulling back and hydrating entire database objects.
You’ll want to beware of the n + 1 problem too, and might be better off doing eager loads of objects rather than lazy loads. It’s hard to get into much more detail without actually seeing code.
If you’re doing hundreds of queries for every insert (even if they’re all really quick to start), that’s going to bog down over time. Looking at (and minimizing) the number of queries might be your best bet.
I’d also look at connecting jconsole to see if memory is going up and then not being released too. That could point to a memory leak on the grails side.
Good luck.
Loading mass data in the database using Hibernate is something like transporting people on the highway using motorcycles (one record a time). Of course it is possible to scale between a low class bike and a Fat Boy, but this doesn’t mean this is the right way. The argument about validation of te data with the domain class is valid, but this is neither argument for Grails nor for Hibernate. It should be possible to reuse the validation closure from an external loading process as well.
But anyway this is a great tuning result that many will find useful.
thanks,
still debugging it. It’s getting better, but not perfect yet. One days I will have it perfect.
I wonder how GC gets impacted by massive creating of objects in Eden segment. Has anyone run into long GC process that blocks application for some seconds?
If so, can you please share GC configuration? Parallel? Old parallel? New concurrent mark sweep? Size for Eden and Mature segments?
Thanks.
Thanks! Very comprehensive …
Small correction: if ((index+1) % 100 == 0) cleanUpGorm()….to avoid unnecessary commit at 1st record
Thanks for this article. It really helped!
Thanks for this post! I ran across it quite a while ago, and filed it away for future use. I just remembered it and used it to reduce a batch job I occasionally have to run from 2+hours to < 4 minutes. This makes life easier.
[...] information on improving batch import performance, read Ted Naleid’s post on that topic. September 23rd, 2011 | | 2 views | Category: Groovy | Leave a [...]
[...] A couple of years ago, I put up a well-received blog post on tuning Batch Import Performance with Grails an MySQL. [...]
wow man………………… The most difficult thing explained in most simple way
[...] Ted Naleid [...]
This post is amazing. I love how you used a systematic approach that involved several incremental tweaks until the final result was achieved. I’ll definitely be spending some time looking over your other posts. Thanks!
Absolutely brilliant!
Just reduced something that was going into the ‘days’ to 20 minutes.
Cheers!
Most database systems have a dedicated batch insert mechanism, some of them can be accessed over JDBC. In postgresql you have COPY, which will happily load a million records in around a minute on my laptop, but don’t allow you do to do a record by record pre-exist check.
Ultimately a row by row existence check is at it’s core, in conflict with the relational model. You’re almost always far far better off inserting the records into a holding table, then using a sql statement to update the records in the main table based on inequality with the holding table, where the field used for inequality check is indexed. This will allow the database to use the indexes to optimize the write that will happen into a single IO operation that has no network overhead. This dramatically improve performance, and if you can use the bulk import commands, you can get more than two orders magnitude.
Every time you issue a select, that query has to be created by GORM, formed into a string, sent through the JDBC layer, down through the network stack, to TCP, over ethernet, across at least one switch, through to the database, which has to parse it, generate a query plan, populate the values for the select and execute the select. Now having results, it forms them back into a data stream, shunts them back over ethernet which GORM has to deserialize into Java values and return to you. Every single select you issue will result in a full round-trip to the database, application, stack, and network. Not only that, but it’s a fully synchronous round trip. You’d better hope that the data is cached on the other end, and your database server is very proximate on the network if you even want a semblance of performance.
When you issue a large batch insert without selects, that select round-trip time is completely eliminated. Not only that, but your driver will generate a prepared statement for the inserts and execute against it over and over and over. Once you put a select in-between each of those – you may lose your prepared statement performance benefits (depending on your database vendor and driver).
If you use the bulk load, you will negate the need for generating a query plan at all, the database doesn’t have to issue any acknowledgement until the entire stream is done, and it can often skip writing to the transaction logs, as a batch load will either fail in entirety or succeed in entirety. By using holding tables that don’t have to have keys, constraints or indexes, you can also significantly improve insert time. When you merge your holding table into the main table, the database will be smart about efficiency because it has all the pieces in one place. I might do some benchmarks here to validate what I’m talking about :).
If you are using MySQL there is a bulk loader you can use via JDBC. You essentially write your data to an in-memory tab-separated buffer and pipe that to a setLocalInfileInputStream on the statement. I got a 500x performance speedup by using this route for 20K rows- even after I had done Beckwith’s performance tricks on the grails version. The grails version is also nastily non-linear, every extra 1000 rows takes roughly 10% more than the last.
However, I couldn’t get hold of the JDBC Bulk loader from hibernate. I think you need to open a separate jdbc connection – you can find credentials in the datasource config. Keep an eye on your transactions – the controller is in a different transactional context from the bulk uploader
Also there is a nasty problem around dependencies between linked objects. Basically, the standard key that grails uses to link the belongsTo/hasMany doesn’t exist until you do the insert. The best way seems to be to do a bulk load into a denormalized table, and then create the entities you need by joining that table with entity tables. Typically you need to do an outer join with a group by and then test for null entries in the existing table’s fields. That allows you to insert only those entities that aren’t duplicates. Once you’ve done that for the branches of the dependency tree, you can move towards the root of the treee by using a query to re-find the entities you have already inserted, and insert their ids in the entities you are creating.
The whole thing is a bit brittle – if you change your domain classes it will break. Also
One final gotcha, if you are developing on Windows, table names are case insensitive. If you are deploying on Linux they are in lowercase.
Thanks for the post Ted. I was facing exactly same issues with Grails 1.x which were resolved using the solution provided by you. Around a week back be upgraded the System to Grails 2.1.0 and the same performance issue again cropped up. I read in other posts that this error has been fixed in Grails 2.x but I am facing still performance issues. Any help ?
@Lav – I’ve used the same techniques on 2.0.X, but haven’t tried them on 2.1.X yet. Some of the stuff above should be generic enough that it should help you identify where things are going on in any version of grails (mysql logging, looking at jconsole for memory consumption continuing to go up, etc), good luck!
Wooow! Amazing post.
With this solution I’ve been able to make a migration inside the db-migration plugin very-very fast.
Thank you for sharing this.
Regards, Iván.