Monthly Archives: September 2013

Fake it Till you Make it: Inserting Dummy Data for Database Query Scalability Testing

Recently I had the pleasure of performance testing a query on MySQL. The target data set had to be on the order of single-digit-millions of rows to simulate production scale data. The query had to run in a reasonable time, and the web application code had to process it without running out of memory. Using the web application itself to produce this much data would have been impossible, as a realistic scenario covered a broad time range and I wanted to test this within one day, not one year! Fortunately, inserting this scale of dummy data is not hard using straight SQL.

What the data looks like

  • let’s presume that we need 5 million records in a table
  • the primary key is a compound key defined by a DATETIME and another number
  • there are other data columns
  • data should be evenly distributed over both parts of the primary key

Things to keep in mind

We will create a stored procedure that runs in MySQL to create all this data. Before we do, there are some things to keep in mind.

  • we might want to remove foreign keys on that table (if, say, the second part of the primary key is also a foreign key to another table and we don’t have that other table also populated with corresponding data). This is only valid if you’re ONLY selecting against this table
  • we might want to include a dummy value unique to the test data so that it can easily be removed later. If no column can hold such a marker, another possibility is to create dummy data over a time range that doesn’t overlap our actual data, say, all simulated data is from the year 1981.
  • Insert a small subset of dummy data to make sure the data works with your query. There’s nothing like waiting hours for the insert to complete, only to find you accidentally swapped two columns.

Stored Procedure

Here is the stored procedure. Note that the time is evenly distributed so that there are 10 unique records every minute for an entire year. This adds up to over 5 million records (365 days × 1,440 minutes × 10 records = 5,256,000). Also note the dummy value “DELETEME”, which lets us remove this data later with a DELETE FROM ... WHERE.

delimiter $$
CREATE PROCEDURE dummy()
BEGIN

DECLARE d INT DEFAULT 1;
DECLARE k INT DEFAULT 1;
DECLARE t INT DEFAULT 1;

-- insert 10 somethings every minute for a full year
-- should create over 5 million records, took 6 hours to run
-- the column where k is inserted can be a number or a string
WHILE (d <= 365) DO
  WHILE (k<=10) DO
    WHILE (t <= 1440) DO
      INSERT INTO `TABLE_NAME` VALUES (DATE_ADD(MAKEDATE(2013, d), INTERVAL t MINUTE), k, '...', 'DELETEME');
      SET t=t+1;
    END WHILE;
    SET t=1;
    SET k=k+1;
  END WHILE;
  SET k=1;
  SET d=d+1;
END WHILE;

END$$
delimiter ;

To run the stored procedure from inside MySQL:

call dummy;

To verify your data has all been inserted, we can use the dummy identifying value:

SELECT COUNT(*) FROM TABLE_NAME WHERE COLUMN_NAME='DELETEME';
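
If the procedure ran to completion, the count should match the number of loop iterations. Here is a rough sketch in plain Java that works out that number and the first and last timestamps the loops produce. The class and method names are made up for illustration, and it assumes MAKEDATE(2013, 1) is January 1st, 2013.

```java
import java.time.LocalDateTime;

class ExpectedRowCount {
    static final int DAYS = 365;             // outer loop: d = 1..365
    static final int KEYS_PER_MINUTE = 10;   // middle loop: k = 1..10
    static final int MINUTES_PER_DAY = 1440; // inner loop: t = 1..1440

    // Total number of INSERT statements the three nested loops execute.
    static long expectedRows() {
        return (long) DAYS * KEYS_PER_MINUTE * MINUTES_PER_DAY;
    }

    // Mirrors DATE_ADD(MAKEDATE(2013, d), INTERVAL t MINUTE).
    static LocalDateTime timestampFor(int d, int t) {
        return LocalDateTime.of(2013, 1, 1, 0, 0).plusDays(d - 1).plusMinutes(t);
    }

    public static void main(String[] args) {
        System.out.println(expectedRows());          // 5256000
        System.out.println(timestampFor(1, 1));      // first timestamp inserted
        System.out.println(timestampFor(365, 1440)); // last timestamp inserted
    }
}
```

Because t runs from 1 to 1440, the last timestamp lands on midnight of January 1st, 2014, which matters if you plan to delete the data by time range instead of by the dummy value.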

Clean up after yourself

Finally, after all the data has been inserted and the tests have been run, we can remove all this data with a simple delete. It’s a good thing we have an identifying value for the dummy data!

DELETE FROM TABLE_NAME WHERE COLUMN_NAME='DELETEME';

Filed under Software Engineering

Containerless Applications with Spring Boot

A while back we explored containerless applications, and went so far as to build one with Spring and post it on GitHub.

A big announcement at Spring One in September 2013 was Spring.IO, a one stop shop for exploring Spring Framework that makes it easy to get started. The projects section is focused on the individual libraries and technology components that fall under the umbrella of the Spring Framework, while the guides section has many tutorials going through the use of a particular set of technologies to solve a specific problem. Both the guides and the projects have sample projects, but the guides section is more focused on solving the needs of the user. As such it is a better place to start.

Notably, the guide projects all make use of spring-boot, which (among other things) makes it ridiculously easy to start up a server because it uses an embedded server rather than requiring deployment via WAR file to an existing server. This follows the trend that I noted previously: that containerless web applications are becoming more popular in the Java world because of ease of use.

For instance: here is a ready-made containerless restful web application that works straight out of the box: https://github.com/spring-guides/gs-rest-service. You can just download, build, and run it, and hit the endpoint http://localhost:8080/greeting with your browser. To quote someone famous, “It just works!”

I used the guides as inspiration for a new containerless application, and will be recreating my old containerless application from scratch using spring-boot.

Editing Cells Without Unique Keys in a GWT DataGrid

Imagine that you have a GWT DataGrid to which you would like to add text cell editing capabilities. The actual code is not too hard: the GWT showcase sample code works as advertised out of the box. In this post we will look at a special situation where you might not be able to easily edit cells, and how to work around that.

In the sample, a ProvidesKey is specified to the DataGrid’s SelectionModel so that the DataGrid can tell which row is which for purposes of selecting and editing.

public static final ProvidesKey<ContactInfo> KEY_PROVIDER = new ProvidesKey<ContactInfo>() {
   @Override
   public Object getKey(ContactInfo item) {
      return item == null ? null : item.getId();
   }
};

If you don’t specify a ProvidesKey to your DataGrid, the default behavior is to use the row object itself (here, the ContactInfo for each row) as the key. In such a case, each ContactInfo would be compared for identity by using its .equals() method. If ContactInfo.equals() used the ID to determine equality, the behavior would be the same as using the ProvidesKey shown above, and it would be unnecessary to specify this key provider.

However, consider what would happen if ContactInfo didn’t have an ID, and its equals() method did not use reference equality but used data equality so that two rows with the same data would be considered the same. This could happen, for instance, if we added two empty rows and intended to edit them after both were added. For such objects, their .equals() methods would indicate that two row objects are the same when they are actually different. The result in that case is an editor rendered across multiple cells instead of a single cell. This could be resolved by modifying .equals() to use reference equality, but this is not possible if equals() is used by the backing object elsewhere for other intents, or the object is from a third party library.
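
To make the collision concrete, here is a minimal sketch in plain Java (no GWT involved). The Row class is a hypothetical stand-in for a backing object with data-based equals() and no ID, and the HashSet stands in for anything that uses the row object itself as its key.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class DataEqualityCollision {
    // Hypothetical row type: no ID, equality based purely on its data.
    static final class Row {
        String name = "";

        @Override
        public boolean equals(Object o) {
            return o instanceof Row && Objects.equals(name, ((Row) o).name);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(name);
        }
    }

    // Uses each row object as its own key and counts how many keys survive.
    static int distinctKeys(Row... rows) {
        Set<Object> keys = new HashSet<>();
        for (Row row : rows) {
            keys.add(row);
        }
        return keys.size();
    }

    public static void main(String[] args) {
        Row first = new Row();
        Row second = new Row();
        System.out.println(first == second);             // false: two separate objects
        System.out.println(first.equals(second));        // true: same (empty) data
        System.out.println(distinctKeys(first, second)); // 1: both rows share one key
    }
}
```

Two freshly added empty rows are separate objects, but anything keyed on equals() sees only one of them.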

To resolve this issue, we can wrap such an object in a shallow wrapper that does provide a unique key, so that when row objects are compared with .equals(), rows are correctly identified as distinct.

public class HasKey<T> {
   // shared counter that hands out a distinct key to every wrapper
   private static int nextKey = 0;
   private final T row;
   private final int key;

   public HasKey(T row) {
      this.row = row;
      key = nextKey;
      nextKey++;
   }

   public T getRow() {
      return row;
   }

   public int getKey() {
      return key;
   }

   @Override
   public int hashCode() {
      // trivial for brevity, based on .key
      return key;
   }

   @Override
   public boolean equals(Object obj) {
      // based on .key only; null and other types are never equal
      if (!(obj instanceof HasKey)) {
         return false;
      }
      return this.key == ((HasKey<?>) obj).key;
   }
}

If we use this HasKey class as the backing object instead of the original backing object of class T (and forward the appropriate rendering methods to the object it wraps), then .equals() tells backing objects apart even when their data is the same. For example, the DataGrid would no longer be of type DataGrid<T>; it would be of type DataGrid<HasKey<T>>.
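
As a quick sanity check of the idea, here is a sketch in plain Java that wraps two rows with identical data and confirms the wrappers are distinct. It carries a condensed copy of a HasKey-style wrapper so that it runs on its own; the WrapperDemo class and its helper methods are made up for illustration, and no GWT classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WrapperDemo {
    // Condensed wrapper: equality depends only on the generated key.
    static final class HasKey<T> {
        private static int nextKey = 0;
        private final T row;
        private final int key;

        HasKey(T row) {
            this.row = row;
            this.key = nextKey++;
        }

        T getRow() { return row; }

        @Override
        public int hashCode() { return key; }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof HasKey && key == ((HasKey<?>) obj).key;
        }
    }

    // Wraps every row so it can back something like a DataGrid<HasKey<T>>.
    static <T> List<HasKey<T>> wrapAll(List<T> rows) {
        List<HasKey<T>> wrapped = new ArrayList<>();
        for (T row : rows) {
            wrapped.add(new HasKey<>(row));
        }
        return wrapped;
    }

    // Counts how many wrappers remain distinct under equals()/hashCode().
    static int distinctKeys(List<? extends HasKey<?>> rows) {
        Set<Object> keys = new HashSet<>(rows);
        return keys.size();
    }

    public static void main(String[] args) {
        // Two "empty" rows whose data is equal.
        List<HasKey<String>> rows = wrapAll(Arrays.asList("", ""));
        System.out.println(rows.get(0).getRow().equals(rows.get(1).getRow())); // true
        System.out.println(rows.get(0).equals(rows.get(1)));                   // false
        System.out.println(distinctKeys(rows));                                // 2
    }
}
```

The wrapped data still compares as equal, but the wrappers do not, which is exactly what the grid needs to tell the rows apart.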

Happy editing!

Beautiful Code vs Profitable Code, Part II

As engineers, we usually have a drive to make code better, to make it beautiful. While making code better is a good thing, too much of a good thing can be, well, not so good. Beautiful code makes us happy, but usually happiness is not sustainable. There is horrific code out there that is being used and making money. On the other hand, ugly code is probably code that is incurring technical debt and holding the business back. Being able to see where your code is on this spectrum is an art that involves some hard questions.

Previously, we raised these questions: How do you decide where you are on this continuum of perfect code and profitable code? When should you spend more time cranking out code so you can make money now, and when should you spend more time beautifying your code so you can make money later? Or is this a false dichotomy?

To start, this is not a false dichotomy. True, you should be taking some reasonable effort to write with some level of quality while you are writing in the first place. But in the end time is a limited resource and you will always have choices about how to allocate it.

What follows are some rules of thumb, based on my own experience, for determining where you are on this continuum. Asking and answering the following questions should help you discover when to spend more of your time on one side or the other.

1) Where is the software in its life?

If the software is early in its life (say in a startup, or software in its first year) then making money in the short term takes priority so that the company can survive long enough to make money in the second year and so on in the long term. By all means be responsible during development and keep an eye on the quality of your code, but development effort should be focused on bringing the product to market and competing in the here and now.

Conversely, if the software is late in its life (say, 5-10 years old or even more) then it’s likely that long-term maintenance can be given equal or higher priority over regular feature work. Customers will expect stability from a mature product, and stability comes best with code that is easy to change without introducing new bugs.

When moving between these stages, the company needs to look at the long term viability of the software. There is a danger that the engineering culture gets stuck in “startup mode” and never shifts into “sustainability mode” – potentially resulting in the extreme case of progressively worsening code. Engineering and Management must be vigilant that the shift occurs properly or the company will pay the price down the road.

2) What are the trends on the estimates?

As people make estimates on their issues, if they see that their estimates are low (meaning their issues take longer to resolve than was estimated), they will automatically compensate for the next estimate. If the next round comes and they compensate but are still low again, that means the code is getting harder to maintain over time. If you are adjusting for your low estimates but the estimates are still low sprint after sprint, it is taking longer and longer to make changes. At this point it’s time to start allocating more time towards, say, refactoring work and code cleanup. Every piece of software has its dark corners, and we all know where they are!
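
One rough way to watch for this trend is to compare estimated and actual effort over the last few sprints. The sketch below is only an illustration: the class name, the window of three sprints, and the numbers in main are all made up, and "every sprint in the window ran over" is just one possible definition of "still low sprint after sprint".

```java
class EstimateTrend {
    // True when each of the last `window` sprints took longer than estimated.
    static boolean consistentlyUnderestimated(double[] estimated, double[] actual, int window) {
        if (estimated.length != actual.length || window <= 0 || estimated.length < window) {
            return false; // not enough (or mismatched) data to say anything
        }
        for (int i = estimated.length - window; i < estimated.length; i++) {
            if (actual[i] <= estimated[i]) {
                return false; // this sprint came in at or under its estimate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical hours per sprint; the estimates grow, yet the actuals still exceed them.
        double[] estimated = {40, 44, 50, 56};
        double[] actual = {48, 55, 61, 70};
        System.out.println(consistentlyUnderestimated(estimated, actual, 3)); // true
    }
}
```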

3) What are your code quality metrics?

Static analysis of the code can give you some direction as well. Any classes with a cyclomatic complexity metric of over 10 should be given a good hard look and scheduled for refactoring, and anything over 15 should be scheduled for refactoring with a high priority over other code work. The threshold of 10 goes back to McCabe’s original recommendation, and code above these levels tends to be harder to test and more bug-prone.
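
Expressed as code, the rule of thumb above might look like the following sketch. The class, enum, and method names are made up for illustration; the complexity number itself would come from whatever static analysis tool you already use.

```java
class ComplexityTriage {
    enum Priority { NONE, REFACTOR, REFACTOR_HIGH }

    // Maps a cyclomatic complexity value onto the thresholds described above.
    static Priority triage(int cyclomaticComplexity) {
        if (cyclomaticComplexity > 15) {
            return Priority.REFACTOR_HIGH; // over 15: refactor ahead of other code work
        }
        if (cyclomaticComplexity > 10) {
            return Priority.REFACTOR;      // over 10: take a hard look and schedule it
        }
        return Priority.NONE;              // 10 or under: no action from this rule
    }

    public static void main(String[] args) {
        System.out.println(triage(8));  // NONE
        System.out.println(triage(12)); // REFACTOR
        System.out.println(triage(22)); // REFACTOR_HIGH
    }
}
```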

To sum it up: these are just some ideas that come to mind that I feel would help to decide whether to spend more time cleaning up vs more time writing new code. If you have ideas and want to share what works for you, feel free to share!

Beautiful Code vs Profitable Code

We all love writing beautiful code. Sometimes a project is like a big puzzle where all the pieces have clear edges that fit perfectly together, where you can stand back and see the big picture that the puzzle creates when fully assembled. Sometimes a project can be a thing of beauty.

But beauty doesn’t pay the bills.

Imagine, if you will, a software shop where at every step of the way they dotted every I and crossed every T. This is a place where TDD and refactoring code into perfection takes priority over adding features or fixing bugs. Every time a new feature is discussed, developer response is along the lines of “If I can spend another week refactoring Module X, then implementing this feature would be much easier.” Taken to an extreme (and I have seen this happen) the system becomes more maintainable and beautiful but is not making any money. This situation leads to a rapid loss in feature parity with competitors, which in turn leads to the collapse of the business because it is literally impossible to add features or fix bugs as quickly as competitors.

Imagine another extreme. Imagine a software shop where things like modular design, software best practices, testing, etc, are not in management’s vocabulary. There is no developer bandwidth allocated to write tests, to refactor, to remove duplicate code. Every minute of developer time is pushed towards nothing but writing code and shipping it as quickly as possible. Taken to an extreme (and I have seen this happen) the system over time becomes less and less maintainable. This situation leads to a rapid rise in cost to make any changes at all, which in turn leads to the collapse of the business because it is literally impossible to add features or fix bugs at the same cost as competitors.

Somewhere between these two worlds there is a balance between perfection and profitability. How do you decide where you are on this continuum? When should you spend more time cranking out code so you can make money now, and when should you spend more time beautifying your code so you can make money later? Or is this a false dichotomy? We’ll delve into the answer in the next post…
