Thursday, 30 April 2020

Python & Redash data regression modeling

I was trying to do some performance data curve fitting on data related to a Java application heap size scaling against number of data objects used by application, and I thought to test the possibility of using python NumPy library to do some curve fitting using polynomial regression also thought to explore capabilities of Redash as a dash boarding tool able to pull the data from JDBC datasources and offers SQL query interface to obtain the data used for graphs from backend database.

Python proved to be quite handy and offers better regression capabilities.
Also used the Matplotlib library to do the curve plotting.
This is the sample code: https://github.com/sherif-abdelfattah/Pyplots_regression
And below is the output:


I also experimented with Postgres built-in aggregate functions that does slope and intercept calculation using least squares method, and plotted the data into Redash.
This did work too, though Redash didn't offer a decent way of doing curve fitting and I had to rely on doing the calculation work on Postgres side, mainly creating extra tables and views to hold the Linear regression information to be displayed on Redash dashboards.
On Postgres side, I created views similar to the one below:

create view reg_obj_heap as
select (SELECT regr_intercept("objcount", "max_heap") intercept FROM obj_heap) + (SELECT regr_slope("objcount", "max_heap") as slope FROM obj_heap) * "max_heap"
 as reg_obj, "max_heap" from obj_heap; 



On Redash side, I used another SQL query to build the graph:

WITH group1 AS (
  SELECT "objcount", "max_heap",
  case
 when
    "max_heap" >999 and "max_heap" <999999 then "max_heap"/1000||  ' kb'
 when
    "max_heap" >999999 and "max_heap" <999999999 then "max_heap"/1000000||  ' mb'
 when
    "max_heap" >999999999 then "max_heap"/1000000000||  ' gb'
end as "Heap"
    FROM obj_heap
),
group2 AS (
  SELECT reg_obj, "max_heap"
    FROM reg_obj_heap
)
SELECT Distinct group1.*,group2.reg_obj
  FROM group1
  JOIN group2 ON group1."max_heap" = group2."max_heap"
  order by "max_heap"; 


Complicated SQL grouping is needed to allow drawing 2 series on the Redash graph, the graph looks like this:



Python on the other hand offers good regression capabilities and still can do some decent graphics similar to what could be done in Redash, you can even export the Matplotlib graphs to html format using the library mpld3 and build your own dashboards.

The use of both Python and Redash for data analytics should help me to do effective capacity planning and performance modeling for my services in the future.


No comments:

Post a Comment