Problem Statement :-
While working on one of the project, we came across the following requirement:-
Given a list of unique IDs, check whether for each ID the data is exactly the same in the database tables of two different systems. For non-matching data dump the unique ID in a text file. There is read-only access to the database tables of both the systems. The list of IDs can have millions of IDs. Oracle is the database used in both the systems.
Solution Approach :-
1) Divide the unique IDs in chunks of configurable size(default 10000) and process one chunk at a time.
2) For all the unique IDs in the current chunk, fetch data from one system and put it in a java HashMap.
3) For all the unique IDs in the current chunk, fetch data from another system and compare the fetched data with the data present in the HashMap(populated in step 2) for any given unique ID.
4) Repeat 2-3 until all chunks of unique IDs are processed.
How to fetch Data for 10K unique IDs?
The main challenge is how to fetch data for 10000 unique IDs, present in a single chunk, using a single query. We can build an SQL query containing where clause as –
where ID=1 or ID=2 or ID=3..or ID=10000th ID
OR we can build an SQL query containing where clause as –
where ID IN(1,2,3.....10000th ID)
As oracle limits the size of values in one IN block to 1000, we have to use 10 IN blocks for 10000 IDs with each block containing 1000 IDs. The where clause becomes :-
where ID IN(1,2...1000) or ID IN(1001,1002...2000) ..... or ID IN(9001,9002,...10000)
Performance Tuning :-
To gain performance out of such queries, we have to take care of the following things:-
- Create Index on the column/columns used in the WHERE clause of SELECT query. This will avoid the full table scan. Full table scan is a time-consuming operation.
- Set FETCH SIZE using setFetchSize() method on Statement or ResultSet to reduce unnecessary network round trips.- With Oracle, the default value is 10. If you want to retrieve 1000 rows it requires 100 network round trips between your application and database server to transfer all data. This will impact your application response time. The JDBC drivers are designed to fetch small number of rows from database to avoid any out of memory issues. Both Statement and ResultSet have setFetchSize() method. setFetchSize() gives a hint about the number of rows that should be fetched from the database each time new rows are needed. The goal is to reduce unnecessary network round trip.
- Use proper getXXX() methods on ResultSet – ResultSet interface provides lot of getXXX() methods to get and convert database data types to java data types.
For example, getString(String columnName) returns java String object. columnName is recommended to be a VARCHAR OR CHAR type of database. But it can also be a NUMERIC, DATE etc in which case it needs to cast it to proper java data type which is expensive operation. Now consider the expense of casting on millions of rows of data.
- Avoid Mixed-Type Expressions in where clause. If your column ID is VARCHAR2 and you want to use an index on the column ID, but the WHERE clause looks like this:-
ID IN(10010, 10011.....)
Where 10010 is of number type, Oracle translates that expression into:-
TO_NUMBER(ID) IN (10010, 10011.....)
Any expression using a column, such as a function having the column as its argument, causes the optimizer to ignore the possibility of using an index on that column, even a unique index. If the system processes even a single row having ID as a string of characters that does not translate to a number, then an error is returned.
You can avoid this problem by replacing the above expression with the explicit conversion:-
ID IN( '10010', '10011'.....)