Summarized using AI

Optimising your database for analytics

Karen Jex • September 11, 2024 • Sarajevo, Bosnia and Herzegovina • Talk

In her talk at EuRuKo 2024, Karen Jex, a Senior Solutions Architect at Crunchy Data, discusses strategies for optimizing existing databases for analytics workloads without sacrificing the performance of day-to-day application activities. The presentation addresses the challenges posed by running analytics queries on operational databases, primarily OLTP (Online Transaction Processing) systems which are optimized for fast, transactional queries. Jex highlights the following key points:

  • Understanding Database Workloads: The talk differentiates between OLTP and OLAP (Online Analytical Processing), emphasizing that OLTP databases handle short transactions while OLAP queries are typically long-running and resource-intensive, requiring analysis of large datasets.
  • Challenges of Mixed Workloads: Integrating analytics into an OLTP database can degrade performance, leading to slow queries and impacting operational efficiency. Jex refers to research showing mixed workloads can cause significant performance challenges.
  • Benchmarking Analytics Queries: Using PostgreSQL and the benchmarking tool pgbench, Jex illustrates how to create a test environment to analyze the performance of queries. A sample analytics query run in the demo highlights the inefficiencies that arise when performing complex joins and calculations on large datasets directly from the operational database.
  • Configuration Optimizations: Key techniques include adjusting PostgreSQL configuration parameters to optimize performance for both OLTP and OLAP activities, such as managing the number of concurrent connections and memory allocated for operations.
  • Indexing Strategies: Jex advises on creating specific indexes for analytics queries while balancing their impact on operational activities. She discusses the use of generated columns and materialized views to pre-calculate data, which can improve performance by allowing quicker access to aggregated results.
  • Replication Strategies: Finally, the talk explores options for offloading analytics from the main operational database by using physical or logical replication. This allows for dedicated resources for analytics tasks, minimizing impact on the primary workload.

In summary, Jex provides a comprehensive guide to managing hybrid database workloads, emphasizing practical techniques to maintain database performance while enabling robust analytical capabilities. The talk concludes by encouraging a balance between query efficiencies and the operational stability of databases, with the possibility of leveraging replication for dedicated analytics environments.

Optimising your database for analytics
Karen Jex • Sarajevo, Bosnia and Herzegovina • Talk

Date: September 11, 2024
Published: January 13, 2025

Your database is configured for the needs of your day-to-day application activity, but what if you need to run complex analytics queries against your application data? Let's look at how you can optimise your database for an analytics workload without compromising the performance of your application.

Data analytics still isn't always done in a dedicated analytics database. The business wants to glean insights and value from the data that's generated over time by your OLTP applications, and the simplest way to do that is often just to run analytics queries directly on your application database.

Of course, this almost certainly involves running complex queries, joining data from multiple tables, and working on large data sets. If your database and code are optimised for performance of your day-to-day application activity, you're likely to slow down your application and find yourself with analytics queries that take far too long to run.

In this talk, we'll discuss the challenges associated with running data analytics on an existing application database. We'll look at some of the impacts this type of workload could have on the application, and why it could cause the analytics queries themselves to perform poorly.

We'll then look at a number of different strategies, tools and techniques that can prevent the two workloads from impacting each other. We will look at things such as architecture choices, configuration parameters, materialized views and external tools.

The focus will be on PostgreSQL, but most of the concepts are relevant to other database systems.

EuRuKo 2024

00:00:07.240 hi thank you very much so I'm Karen Jex
00:00:10.280 I'm a Senior Solutions Architect at
00:00:12.440 Crunchy Data and I'm really happy to be
00:00:15.240 here at my first ever EuRuKo and my first
00:00:17.920 time in Sarajevo so fantastic and I'm going
00:00:21.080 to talk to you about how to optimize
00:00:23.480 your database for
00:00:25.920 analytics uh this is even better than a
00:00:29.320 tldr this is a link to the slides from a
00:00:32.279 previous version of this talk so it's
00:00:34.120 not identical but almost the same so if
00:00:36.640 you don't even want to listen to me talk
00:00:38.600 you can just go to that link
00:00:44.000 there first of all whenever I'm talking
00:00:46.680 at a developer conference I have to get
00:00:48.800 the confession out of the way I'm not a
00:00:51.000 developer I'm really sorry but as you
00:00:53.760 can see from this diagram of my career
00:00:56.000 so far I do know about databases so I
00:00:58.760 was a database administrator for
00:01:00.480 about 20 years before I was a database
00:01:05.000 consultant and now a Senior Solutions
00:01:07.960 architect but I still work exclusively
00:01:10.479 with
00:01:12.920 databases I've worked with a lot of
00:01:15.000 developers along the way and I've
00:01:17.159 learned Lots from the developers that
00:01:18.880 I've worked with and I really like to be
00:01:21.360 able to give back some of what I know
00:01:23.479 about databases
00:01:25.680 so what am I actually going to
00:01:27.920 talk about today and why
00:01:33.720 to answer that question I'm going to do
00:01:35.799 a quick recap of some of the different
00:01:38.000 types of database
00:01:39.880 activity you've probably got an
00:01:42.240 application or many applications with a
00:01:45.360 database backend and that database
00:01:47.439 backend is probably constantly being
00:01:49.680 queried updated inserted into by your
00:01:53.119 application
00:01:54.520 connections so it could be a traditional
00:01:56.920 payroll app it could be a banking system
00:02:00.000 an online Commerce system or one of all
00:02:02.079 sorts of different
00:02:06.920 applications this type of activity is
00:02:09.360 usually called oltp or online
00:02:11.840 transaction processing or sometimes just
00:02:14.480 your operational database
00:02:17.840 activity and it's characterized by
00:02:20.280 multiple concurrent users usually
00:02:23.519 performing um lots of short queries and
00:02:26.760 they'll often be doing multiple selects
00:02:28.959 inserts and updates within a single
00:02:33.920 transaction and it's at this point
00:02:35.840 actually that I need to
00:02:37.599 apologize uh for my view of the world so
00:02:40.440 I see the
00:02:41.879 database and then everything else that
00:02:44.239 goes on outside the database so you'll
00:02:46.360 see the
00:02:47.319 users uh connected to the database
00:02:49.800 conveniently ignoring all of that
00:02:52.159 application layer so just imagine that
00:02:54.159 that's there in
00:02:57.519 between so most of these applications
00:02:59.959 will constantly be generating data in
00:03:03.000 your
00:03:03.840 database and that's going to gradually
00:03:06.040 build
00:03:07.000 up and at some point the business wants
00:03:10.080 to get value from that data it wants to
00:03:12.159 glean insights from that
00:03:14.519 data and for that you're going to need
00:03:16.680 to run analytics
00:03:20.120 queries so this analytics activity
00:03:23.200 activity can take various different
00:03:24.799 forms and it's got lots of different
00:03:26.280 names you might hear it called olap
00:03:28.080 online analytical processing reporting
00:03:30.879 decision support DS uh business
00:03:33.439 intelligence bi but it's basically any
00:03:37.760 queries any activities that serve to
00:03:40.560 answer real world business questions
00:03:43.400 such as uh how successful was this
00:03:45.840 marketing campaign or what would be the
00:03:48.319 impact on our carbon emissions of
00:03:50.200 Performing this action over this
00:03:56.879 one and this type of activity is
00:03:59.200 characterized
00:04:00.439 by complex queries often performing
00:04:03.599 joins on multiple tables working on
00:04:06.439 large data sets so typically it's going
00:04:09.599 to be using a lot of resource it's going
00:04:11.599 to be using potentially a lot of CPU
00:04:13.840 memory and
00:04:15.919 IO once upon a time we had the oltp
00:04:19.840 databases and we had the analytics
00:04:22.040 databases and the two were very much
00:04:25.600 separate data would somehow be extracted
00:04:28.680 from your oltp database and loaded into
00:04:32.160 the olap systems uh which is sometimes
00:04:34.759 known as your data warehouse your data
00:04:36.680 Lake um and often that was done by a
00:04:39.520 nightly batch or even a weekly
00:04:43.800 batch that does sometimes still happen
00:04:46.280 and there are very good reasons for
00:04:47.720 that still happening that way
00:04:49.199 sometimes but often businesses want to
00:04:51.880 get insights from the data instantly
00:04:55.320 they want it in real time directly from
00:04:58.039 the oltp system
00:05:01.600 and where you've got the oltp and the
00:05:04.960 olap activity
00:05:07.199 cohabiting um you'll often hear that
00:05:09.560 described as a mixed or hybrid
00:05:14.240 workload it's not a new idea Gartner
00:05:17.120 came up with the term htap hybrid
00:05:19.280 transaction analytical processing back
00:05:21.400 in 2010 uh and it's been described in
00:05:24.160 other places as translytical or hybrid
00:05:26.759 operational and analytical processing
00:05:31.000 so what's the problem with
00:05:35.120 this it's well established in academic
00:05:37.800 circles that this type of workload
00:05:39.840 presents Unique Performance challenges
00:05:42.240 there are papers from back in 2010 2011
00:05:45.639 uh where the researchers identified
00:05:48.319 and discussed the fact that running
00:05:50.360 mixed workloads causes performance
00:05:53.000 issues and there are ongoing discussions
00:05:55.479 about this and things like what kind of
00:05:58.479 benchmarks can be used um to
00:06:00.840 monitor um this type of activity but
00:06:03.440 you'll probably be pleased to hear that
00:06:05.319 I'm not planning an academic discussion
00:06:08.199 I'd rather focus on what you can do if
00:06:11.520 you're in this situation how can you
00:06:13.479 make sure that you run performant
00:06:15.880 analytics with minimal impact on your
00:06:18.759 operational database
00:06:22.240 activity so to recap your database is
00:06:25.720 probably optimized for your day-to-day
00:06:28.319 oltp application
00:06:31.080 activity adding analytics activity into
00:06:34.039 the mix so um complex long running um
00:06:38.560 queries that consume lots of resource
00:06:41.160 means you can end up with the worst of
00:06:43.120 Both Worlds you can have analytics
00:06:46.120 queries that are inefficient but that
00:06:48.560 also slow down your existing
00:06:54.039 application so now that we've identified
00:06:56.319 what the problem is that we're trying to
00:06:57.960 solve I set up an environment where I
00:07:00.759 can test things out as we start to look
00:07:02.759 at ways to fix those
00:07:05.400 issues I'm using Postgres for this
00:07:07.840 because it's the rdbms that I know and
00:07:09.759 love and the one that I work with on a
00:07:12.000 day-to-day basis uh and also it's quick
00:07:14.720 and easy to set up and work with it was
00:07:16.879 very heartening to hear the Postgres
00:07:18.599 for everything message from the previous speaker just now uh and
00:07:21.680 in particular I'm using pgbench which is
00:07:24.800 a simple benchmarking tool that's
00:07:26.879 available by default with postgress
00:07:32.720 so first of all we create and populate
00:07:36.360 the pgbench tables by running pgbench
00:07:39.639 with the -i initialize flag in
00:07:43.199 this case I'm doing it in a database
00:07:45.680 that I created called
00:07:47.360 euruko and I've asked it to give me a scale
00:07:50.319 factor -s of 100 just so that I've
00:07:52.919 got some kind of um size to my tables
00:07:56.599 I've also asked it to create foreign
00:07:58.360 keys between the tables so that it uh
00:08:00.759 simulates somewhat an oltp type
00:08:04.400 database you see it shows us the
00:08:06.400 progress towards generating 10 million
00:08:08.199 rows of test data and it takes about 25
00:08:13.639 seconds and this is what it creates for
00:08:15.879 us it creates four tables pgbench_branches
00:08:18.240 pgbench_accounts pgbench_tellers and pgbench_history
00:08:22.639 the number of rows listed here is what
00:08:24.759 you'd get if you left scale factor at
00:08:27.280 the default of one so because I set it
00:08:30.360 to 100 I ended up with 10 million rows
00:08:32.719 in the pgbench_accounts table the
00:08:35.560 pgbench_history table will start off
00:08:38.000 with zero rows at the beginning of each
00:08:39.919 pgbench run and they'll be inserted as
00:08:42.039 you go along and I've included a link to
00:08:44.440 the documentation there just in case
00:08:46.040 anybody wants to look up how that works
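
(For reference, the initialisation step described above would be invoked roughly like this from the command line; the database name is taken from the audio and may differ:)

    # initialise the pgbench tables at scale factor 100, with foreign keys
    pgbench -i -s 100 --foreign-keys euruko
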
00:08:50.320 by default pgbench simulates oltp
00:08:54.360 activity by repeatedly running this
00:08:56.920 transaction it's loosely based on the
00:08:59.440 TPC-B
00:09:00.880 benchmark each transaction is made up of
00:09:03.959 several SQL statements so you can see
00:09:06.240 there are three updates a select and an
00:09:08.720 insert and each one will take random
00:09:11.480 values of aid tid bid and delta you can
00:09:17.760 uh pass in any script you want if you
00:09:19.800 want it to simulate other kinds of
00:09:24.920 activity so then when you want to
00:09:27.120 actually simulate some activity on the
00:09:28.640 database you run pgbench without the
00:09:31.519 -i flag you tell it how many
00:09:34.320 concurrent clients you want so in this
00:09:36.120 case I've used -c 10 for 10
00:09:38.160 concurrent clients and how many seconds
00:09:41.000 -T you want it to run for so here I
00:09:43.560 run it for 1,200 seconds or 20
00:09:46.959 minutes the output that you get will
00:09:49.279 show you things like how many
00:09:50.480 transactions were processed uh what the
00:09:53.000 connection time and latency were and
00:09:56.399 um what TPS throughput you got so in
00:09:59.200 this case I got about 5,000 TPS
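
(A benchmark run like the one described would be invoked roughly like this; again the database name is assumed:)

    # 10 concurrent clients, run for 1,200 seconds (20 minutes)
    pgbench -c 10 -T 1200 euruko
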
00:10:02.240 this is just running on my MacBook so
00:10:04.800 the goal of this isn't to try and get
00:10:07.360 the best possible throughput it's just
00:10:09.120 to give you an idea of what you might
00:10:13.120 see Okay so we've got a demo environment
00:10:16.480 with some tables and some data in
00:10:19.560 it then I mocked up an annual sales
00:10:23.519 report type analytics query against that
00:10:26.200 pgbench_history table it manipulates the
00:10:29.000 data in the um mtime column to get
00:10:32.279 something that looks like year and month
00:10:35.200 it groups and orders by year and month
00:10:37.360 and it calculates some totals so you'll
00:10:40.800 probably notice that as analytics
00:10:42.680 queries go this one is extremely simple
00:10:45.560 but it's good enough for demo
00:10:49.440 purposes and this is what the output
00:10:51.560 looks like against those 10 million rows
00:10:53.600 that were inserted into my
00:10:55.360 pgbench_history table
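
(The captions don't show the query itself, so this is only a rough sketch of an "annual sales report" query of the kind described: it derives pseudo year and month values from the mtime column and sums the delta column. The exact expressions used in the demo may differ:)

    SELECT 2000 + extract(hour FROM mtime)              AS year,
           1 + mod(extract(minute FROM mtime)::int, 12) AS month,
           sum(delta)                                   AS total_sales
    FROM   pgbench_history
    GROUP  BY 1, 2
    ORDER  BY 1, 2;
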
00:11:00.519 okay so if we want to have a look at the
00:11:03.639 execution plan for that query um so we
00:11:06.399 can see what it's doing behind the
00:11:07.959 scenes we can use uh Postgres's explain
00:11:11.480 analyze
00:11:12.600 command I connect to my database via
00:11:15.320 psql I type explain analyze and then my
00:11:19.120 query so the analyze keyword here tells
00:11:23.079 it to actually run the query so you'll
00:11:26.240 get real stats rather than an estimation
00:11:28.399 of what it's going to do
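
(In psql this just means prefixing the statement with the explain command; using the query sketched earlier:)

    -- ANALYZE actually executes the query and reports real row counts and timings
    EXPLAIN ANALYZE
    SELECT 2000 + extract(hour FROM mtime)              AS year,
           1 + mod(extract(minute FROM mtime)::int, 12) AS month,
           sum(delta)                                   AS total_sales
    FROM   pgbench_history
    GROUP  BY 1, 2
    ORDER  BY 1, 2;
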
00:11:31.920 so we wait for that to complete in this
00:11:34.560 case we can see that it took about 6
00:11:36.880 seconds just under six seconds to run
00:11:38.800 our analytics
00:11:40.519 query it's doing a sequential scan on
00:11:43.800 the pgbench_history table so it's
00:11:45.360 reading the entire pgbench_history table
00:11:48.519 and we can see we've got a sort method
00:11:50.560 external merge disk so it's
00:11:54.360 actually writing some temporary files to
00:11:56.680 disk and doing the sorts there
00:12:00.160 just in comparison the select statement
00:12:03.519 in the pgbench uh transaction that we
00:12:06.320 looked at takes about a
00:12:10.760 millisecond Okay so we've got the
00:12:12.519 environment in place let's have a look
00:12:14.839 at some of the techniques that you can
00:12:16.800 use to optimize analytics activity
00:12:19.680 without slowing down your
00:12:21.760 application it's not going to be an
00:12:23.639 exhaustive list so the idea really is to
00:12:26.040 give you enough to get started and again
00:12:29.480 I'm using postgress for the details but
00:12:31.399 the ideas are applicable to other
00:12:37.279 RDBMS uh okay so configuration
00:12:39.680 parameters as I said your database
00:12:42.560 configuration parameters are probably or
00:12:44.839 at least hopefully optimized for your
00:12:47.000 day-to-day operational database activity
00:12:50.199 if not I highly recommend going to have
00:12:52.440 a look at
00:12:53.560 them and that gives me the perfect
00:12:55.920 excuse for a Shameless plug because this
00:12:58.199 is a link to a talk that I gave on
00:13:00.079 exactly that subject at DjangoCon Europe
00:13:02.480 last
00:13:08.079 year okay some configuration parameters
00:13:12.120 need to be set across the entire
00:13:14.199 Postgres instance for those you're
00:13:17.199 going to have to find some kind of
00:13:19.440 compromise to find the values that work
00:13:22.160 best for your hybrid workload but other
00:13:24.800 parameters can be set for specific users
00:13:28.320 um so you can set them to different
00:13:29.920 values for your application users and
00:13:32.160 for your analytics users and some can
00:13:34.519 even be set for individual sessions or
00:13:37.680 queries I'm just going to walk through a
00:13:40.160 few of the different parameters and I'm
00:13:42.560 going to encourage you to test things
00:13:44.040 out uh to read the documentation and
00:13:46.720 also take note of the comments in the
00:13:48.320 Postgres default configuration file
00:13:50.480 because they're actually quite
00:13:55.199 useful first of all database connections
00:13:58.440 and this is something that the previous speaker
00:14:00.360 referred to in his talk as
00:14:02.560 well the chances are you won't have too
00:14:05.920 many concurrent analytics connections
00:14:09.480 but those connections are likely to be
00:14:11.279 using a lot of resource so you probably
00:14:13.600 want to make sure there aren't too many
00:14:15.199 of
00:14:15.920 them the max_connections parameter is
00:14:19.560 set for the entire database instance uh
00:14:23.040 that tells you the maximum number of
00:14:25.920 concurrent connections that are allowed to
00:14:27.800 the entire Postgres instance
00:14:30.320 the default is 100 on most systems and
00:14:33.680 you probably don't want it much higher
00:14:35.639 than that definitely no more than a few
00:14:37.560 hundred if you can help
00:14:39.680 it and you probably want to limit the
00:14:42.440 number of analytics connections to much
00:14:44.360 less than that and to do that you might
00:14:47.240 want to look at connection pooling so for
00:14:48.959 example PgBouncer so that you could
00:14:51.240 create a separate small pool for
00:14:53.880 those analytics connections
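
(A minimal sketch of what that might look like; the role name is hypothetical, and a pooler such as PgBouncer would define its own small pool in its configuration rather than in SQL:)

    -- instance-wide cap on connections (changing it requires a restart)
    ALTER SYSTEM SET max_connections = 100;
    -- limit how many connections a dedicated analytics role can open
    ALTER ROLE analytics_user CONNECTION LIMIT 5;
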
00:14:57.800 well work_mem is the maximum
00:15:00.360 amount of memory that can be used by a
00:15:02.480 query operation before it spills to disk
00:15:04.959 and creates a temporary file so we saw
00:15:07.000 those temporary files created in the
00:15:08.560 execution plan for our
00:15:10.120 query by default it's 4 megabytes but if
00:15:14.079 you've got queries that are creating um
00:15:16.320 performing complex uh sorts or hash
00:15:19.639 operations they might need a bigger
00:15:22.000 value and sometimes a much bigger
00:15:24.519 value you can check if work_mem
00:15:27.120 potentially needs to be increased by
00:15:29.000 setting log_temp_files if you set that
00:15:32.120 to zero it will log in your Postgres logs
00:15:35.480 every time one of these temporary files
00:15:37.560 is created so you can see that you're
00:15:39.120 spilling to disk and potentially need
00:15:41.360 more
00:15:43.079 work_mem just watch out because if you've
00:15:45.399 got multiple concurrent users and you've
00:15:48.440 got lots of these complicated
00:15:51.199 queries um each of those could actually
00:15:53.720 be using many times work_mem so you can
00:15:55.680 very very quickly use up a lot of um a
00:15:58.759 lot of memory so to avoid using too
00:16:01.199 much you're probably better to set it to a
00:16:03.800 high value just for certain queries that
00:16:06.199 or certain sessions that really need it
00:16:08.120 rather than across the whole instance
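
(A sketch of the parameters just mentioned; the role name is hypothetical:)

    -- log every temporary file that gets created, so you can see spills to disk
    ALTER SYSTEM SET log_temp_files = 0;
    SELECT pg_reload_conf();
    -- raise work_mem only for the analytics role, or just for one session,
    -- rather than for the whole instance
    ALTER ROLE analytics_user SET work_mem = '100MB';
    SET work_mem = '100MB';
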
00:16:13.000 I tested out uh the
00:16:15.959 analytics query that I created with
00:16:17.639 different values of work_mem so we
00:16:19.959 already saw with the default value of
00:16:21.800 four Meg it was taking around 6 seconds
00:16:24.279 and it was doing that sequential scan um
00:16:27.319 and those disk sorts so it was creating
00:16:29.959 the um the files on disk here I changed
00:16:33.360 it to 100 megabytes to see what that did
00:16:35.959 and regenerated the execution
00:16:39.639 plan it'll just take a minute to do that I
00:16:42.000 might not wait for it to do that but it
00:16:44.480 took still nearly six seconds and it
00:16:47.560 still did an external disk sort so
00:16:51.120 setting it to a large value doesn't
00:16:53.600 always help all that much but it's very
00:16:55.839 much use case dependent
00:16:59.519 in the interests of time I'll just move
00:17:01.000 on from there and I even tried
00:17:02.800 increasing it right up to 1 Gigabyte to
00:17:04.720 see what happened that got rid of the
00:17:07.160 disk sort so we can now see a quite large
00:17:09.679 in-memory sort and it took the time down
00:17:12.160 to about 4 and a half seconds but that
00:17:13.880 to me doesn't seem like a fantastic
00:17:15.640 tradeoff for the sake of using a
00:17:17.520 gigabyte of memory just for that one
00:17:20.079 small analytics query obviously again it
00:17:22.760 will vary from case to case so try out
00:17:24.919 different values and see what impact it
00:17:26.640 has on your queries
00:17:31.000 if you're concerned that certain queries
00:17:32.919 might run for too long you can set
00:17:34.679 statement_timeout so that will abort
00:17:37.200 any statements that run for longer than
00:17:39.440 the specified amount of time by default
00:17:42.039 it's set to zero which is
00:17:44.120 disabled but you can set it either for
00:17:46.360 the whole instance or just for specific
00:17:48.360 sessions so it probably makes sense to
00:17:50.840 set it to an appropriate value for your
00:17:52.600 analytics sessions um so for example
00:17:55.400 your analytics query some of them it
00:17:57.039 might be okay if they run for many
00:17:58.440 minutes or even longer but you might
00:18:00.679 want your application statements to not
00:18:03.039 run for longer than a certain number of
00:18:07.640 milliseconds so as I said
00:18:09.760 statement_timeout by default is set to zero which
00:18:12.760 is
00:18:15.320 disabled I set it here to 2,000
00:18:17.840 milliseconds or 2 seconds
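
(Setting it per session or per role might look roughly like this; the role names are hypothetical:)

    -- for the current session only, as in the demo
    SET statement_timeout = '2s';
    -- or different limits for application and analytics roles
    ALTER ROLE app_user       SET statement_timeout = '500ms';
    ALTER ROLE analytics_user SET statement_timeout = '30min';
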
00:18:25.640 and then I tried again to run my
00:18:28.200 Analytics query just to see what effect
00:18:30.280 that has and check that it's working as
00:18:33.440 expected so now I can see that it
00:18:36.000 doesn't get to finish I get the error
00:18:37.760 cancelling statement due to statement
00:18:39.720 timeout after just over two
00:18:44.520 seconds if you want to make sure that
00:18:46.960 any statement that gets timed out in
00:18:48.919 this way is logged so you know that it's
00:18:50.640 happening make sure that
00:18:53.080 log_min_error_statement is set to error or lower so
00:18:56.039 you can see all the different valid
00:18:57.600 values here from the least severe
00:19:00.080 debug5 up to the most severe which is
00:19:02.240 panic so if you set it to error then
00:19:05.600 error log fatal and panic messages will
00:19:09.600 all be written to your Postgres logs
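
(For example:)

    -- log the text of any statement that fails at severity ERROR or above
    ALTER SYSTEM SET log_min_error_statement = 'error';
    SELECT pg_reload_conf();
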
00:19:14.480 indexing
00:19:17.400 strategy you'll probably need to create
00:19:19.960 specific indexes for your analytics
00:19:23.200 queries but obviously bear in mind that
00:19:25.840 query uh indexes have to be maintained
00:19:28.720 so they will have some impact on your
00:19:32.039 operational
00:19:34.000 activity so if you create too many
00:19:36.200 indexes on your application tables that
00:19:37.960 can start to slow down your inserts and
00:19:40.200 updates so you again need to get some
00:19:42.200 kind of balance
00:19:44.640 there as well as creating indexes on one
00:19:47.600 or more table columns you can also
00:19:49.559 create indexes on uh functions or scalar
00:19:53.039 Expressions computed from your table
00:19:54.960 columns so our analytics query for
00:19:57.720 example contains expressions to
00:20:00.320 calculate something that looks like year
00:20:02.400 and month from the mtime
00:20:04.960 column if we created an index on the mtime
00:20:08.080 column then the query planner
00:20:10.360 wouldn't be able to use it for that
00:20:11.880 particular
00:20:14.760 query so what we can do instead is that
00:20:18.280 we can create an index on the
00:20:19.799 Expressions that are used in the query
00:20:21.840 so in this case 2,000 plus hour and
00:20:25.200 modulo 12 of minutes
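
(A sketch of that expression index, using the same expressions as the earlier query sketch; the exact expressions in the demo may differ:)

    CREATE INDEX pgbench_history_year_month_idx
        ON pgbench_history ((2000 + extract(hour FROM mtime)),
                            (1 + mod(extract(minute FROM mtime)::int, 12)));
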
00:20:32.559 so that takes longer to create than it
00:20:34.799 would if I just created it on the column
00:20:36.559 because it's got to do those
00:20:37.559 calculations as it goes along if we then
00:20:40.320 analyze our query again then we should
00:20:43.120 see that it's actually using that index
00:20:45.240 that we
00:20:55.360 created okay so we can see that it's
00:20:57.400 doing an index scan using that index
00:20:59.679 that we just created takes the query
00:21:02.200 time down to just over 4 seconds in this
00:21:04.960 case uh and we don't have any of those
00:21:07.440 disk sorts uh obviously in this case
00:21:10.200 we're selecting all of the data so
00:21:11.799 potentially using an index doesn't help
00:21:13.440 as much as it
00:21:14.720 could um you know case-by case basis and
00:21:18.240 just be aware that it will have an
00:21:20.320 impact on your operational activity so
00:21:22.640 you need to um Benchmark that to see if
00:21:25.360 it's acceptable
00:21:28.919 most analytics queries aggregate sort
00:21:31.559 and calculate large amounts of data so
00:21:34.919 they do totals averages comparisons
00:21:37.279 groupings Etc and these obviously are
00:21:39.960 resource intensive operations so it can
00:21:42.720 be useful to pre-calculate some of that
00:21:44.880 data so pre-calculate pre-sort
00:21:47.600 pre-aggregate um to make those queries
00:21:49.840 less expensive at
00:21:52.799 runtime if we create columns in a table
00:21:55.559 with the generated always as keyword
00:21:58.279 then
00:21:59.039 the column value will be automatically
00:22:01.279 calculated based on the expression so in
00:22:03.919 this case on the slide uh the sum of
00:22:06.960 columns A and B the stored keyword says
00:22:10.640 that it should be actually physically
00:22:12.919 stored in the table rather than just
00:22:14.960 calculated at query time um so in that
00:22:18.640 select you can see that c has been
00:22:20.320 calculated automatically
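
(The slide's example presumably corresponds to something like this; the table name is arbitrary:)

    CREATE TABLE t (
        a int,
        b int,
        c int GENERATED ALWAYS AS (a + b) STORED
    );
    INSERT INTO t (a, b) VALUES (1, 2);
    SELECT a, b, c FROM t;   -- c is calculated automatically (here, 3)
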
00:22:21.440 again it will have some
00:22:24.480 impact on inserts into that table so you
00:22:27.039 need to Benchmark that and see see
00:22:28.919 whether that's an acceptable
00:22:33.360 tradeoff so in our case we can add year
00:22:36.240 and month columns to our pgbench_history
00:22:38.440 table telling Postgres to generate the
00:22:40.919 value automatically using the
00:22:42.679 definitions that we've got in our um
00:22:45.640 reporting query now in this case it's
00:22:48.320 going to take a while because I've
00:22:49.559 already got those um 10 million rows in
00:22:52.840 the table so it's got to calculate those
00:22:54.640 values and insert them into those
00:22:57.080 columns
00:23:00.320 um so now once that's in we can use
00:23:04.200 those calculated columns in our query
00:23:06.919 instead of actually putting the
00:23:09.039 expression so I can just select year and
00:23:11.320 month so that takes the query time
00:23:13.960 down to just over 3 seconds sorry I
00:23:16.000 should have left that to wait for longer
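
(A sketch of that change, again assuming the same year and month expressions as the earlier query sketch:)

    ALTER TABLE pgbench_history
        ADD COLUMN year  int GENERATED ALWAYS AS
            ((2000 + extract(hour FROM mtime))::int) STORED,
        ADD COLUMN month int GENERATED ALWAYS AS
            ((1 + mod(extract(minute FROM mtime)::int, 12))::int) STORED;

    -- the reporting query can then use the pre-calculated columns directly
    SELECT year, month, sum(delta) AS total_sales
    FROM   pgbench_history
    GROUP  BY year, month
    ORDER  BY year, month;
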
00:23:21.279 a materialized view is a physical
00:23:24.840 copy of the results of a query uh that
00:23:27.720 can be queried just like a normal
00:23:30.000 table or view materialized views
00:23:32.840 are useful um for pre-aggregating data
00:23:36.760 and they're particularly useful if you
00:23:39.200 don't need completely upto-date
00:23:41.480 data if um for example data from 10
00:23:46.240 minutes ago an hour ago even a day ago
00:23:48.320 is acceptable you choose how often you
00:23:50.880 want to refresh that materialized view
00:23:52.919 to keep the data fresh enough for your
00:23:54.960 reporting purposes and you can also
00:23:57.600 create indexes on materialized views to
00:23:59.679 make them even more
00:24:02.039 performant so to create a materialized
00:24:04.320 view that stores the results of our
00:24:06.000 analytics queries we can use create
00:24:08.080 materialized view as and then use our
00:24:11.640 select
00:24:13.679 statement and then we can just select
00:24:16.120 year month and total sales from the
00:24:18.080 materialized view instead of running the
00:24:20.039 analytics
00:24:21.039 query uh so this takes it down to a
00:24:23.640 couple of milliseconds because it's only
00:24:25.720 having to select from that 14 row
00:24:27.760 materialized view instead of from the
00:24:30.200 entire table
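
(A sketch, reusing the earlier query; the view name is an assumption:)

    CREATE MATERIALIZED VIEW monthly_sales AS
    SELECT 2000 + extract(hour FROM mtime)              AS year,
           1 + mod(extract(minute FROM mtime)::int, 12) AS month,
           sum(delta)                                   AS total_sales
    FROM   pgbench_history
    GROUP  BY 1, 2;

    -- reporting queries read the small pre-aggregated view instead of the base table
    SELECT year, month, total_sales FROM monthly_sales ORDER BY year, month;

    -- refresh as often as your reporting needs require
    REFRESH MATERIALIZED VIEW monthly_sales;
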
00:24:31.080 in the real world our analytics
00:24:33.960 query is probably just one small
00:24:35.679 building block that's potentially used
00:24:37.600 in many different reporting queries so
00:24:40.120 you could rewrite all of those reporting
00:24:41.799 queries to select from the materialized
00:24:44.000 view instead of from that base
00:24:49.200 table and of course the best way to um
00:24:53.399 improve performance of something is
00:24:55.120 usually by not doing it so if you can
00:24:59.360 don't run analytics queries on your main
00:25:02.559 database at
00:25:03.799 all even if you don't have a completely
00:25:06.360 separate analytics environment there are
00:25:08.559 probably some things that you can
00:25:10.799 do most database environments already
00:25:13.720 have a high availability architecture in
00:25:15.559 place so that means you've probably got
00:25:17.320 one or more standby databases that's
00:25:20.200 replicating from your primary so it's
00:25:23.120 kept in sync with your primary and
00:25:25.120 available for read-only transactions so you
00:25:27.960 could send your analytics workload
00:25:29.520 to one of those to take the load off
00:25:31.240 your primary database
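
(If you point your analytics connections at a standby, you can confirm from the session that it really is a read-only replica:)

    SELECT pg_is_in_recovery();   -- returns true when connected to a standby
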
00:25:32.960 just be aware that physical
00:25:35.279 replication gives you an exact copy of
00:25:37.360 your primary database you can't create
00:25:39.559 separate users you can't create indexes
00:25:41.840 you can't change the layout of the
00:25:43.039 schema or anything like that any changes
00:25:45.559 that you need for your analytic workload
00:25:47.559 would need to be made on your primary
00:25:49.360 database and replicated
00:25:52.320 across so another option would be to use
00:25:55.279 logical replication you can use that to
00:25:57.799 replicate
00:25:58.760 just certain objects so just certain
00:26:01.000 schemas or tables from a publisher
00:26:03.000 database to a subscriber
00:26:05.760 database and since Postgres 16 uh which
00:26:09.039 came out in 2023 you can do logical
00:26:11.760 replication from a standby which takes
00:26:13.760 additional pressure off your primary
00:26:18.039 database logical replication is much
00:26:20.520 more complicated to set up and maintain
00:26:22.440 Than Physical replication but it does
00:26:24.399 give you a lot of flexibility you can
00:26:26.799 replicate just a selection of
00:26:28.320 objects you can replicate to and from
00:26:30.520 multiple targets and
00:26:32.159 sources uh your subscriber database so
00:26:34.960 your replicated database is available
00:26:36.760 for read write so you can create indexes
00:26:39.240 users materialized views or anything
00:26:41.399 else that you need just on that
00:26:43.120 subscriber database
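
(A minimal sketch of logical replication for a single table; host, database and role names are hypothetical, and the table must already exist on the subscriber:)

    -- on the publisher (the primary or, from Postgres 16, a standby):
    CREATE PUBLICATION analytics_pub FOR TABLE pgbench_history;

    -- on the subscriber (analytics) database:
    CREATE SUBSCRIPTION analytics_sub
        CONNECTION 'host=primary.example.com dbname=euruko user=repl_user'
        PUBLICATION analytics_pub;
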
00:26:47.559 so in summary we've looked at
00:26:50.120 various different techniques that you
00:26:51.520 can use to make sure your database is
00:26:53.559 optimized for analytics without slowing
00:26:56.480 down your application or at least only
00:26:58.399 slowing it down an acceptable
00:27:00.679 amount if you can use physical or
00:27:03.240 logical replication to create a separate
00:27:05.240 database for your analytics workload
00:27:07.960 tune your configuration
00:27:09.880 parameters uh so that they are a good
00:27:13.360 compromise for your uh hybrid
00:27:16.200 workload consider your indexing
00:27:19.200 strategies create generated columns or
00:27:22.480 materialized views so that you
00:27:23.960 pre-aggregate and pre-calculate data
00:27:28.640 and don't forget that you can combine
00:27:30.600 all of those techniques you can create
00:27:32.279 indexes on your materialized views you
00:27:34.240 can change your configuration parameters
00:27:36.000 and do all of these other
00:27:40.559 things thank you so much this is a link
00:27:43.720 again to that previous version of the
00:27:45.240 slides I'm in the process of creating an
00:27:46.960 annotated version which will be
00:27:48.440 available on my
00:27:49.880 blog and I need to do a quick
00:27:52.480 congratulations to erat and jurri who
00:27:54.880 have won a signed copy of Andrew
00:27:56.399 Atkinson's book so come and see me
00:27:58.559 afterwards and I will hand that over to
00:28:00.640 you thank you very much