Wednesday, May 10, 2017

Re: [google-cloud-sql-discuss] Re: PostgreSQL: to increase shared_buffers

As I already wrote above, I can't optimize the regexes, because they will be set by the users, and the users will own this data. I can only change the order of the regexes: either the include first or the excludes first.
But we haven't tried changing the caches, buffers, etc., so we can't know whether that would help.

On Wednesday, May 10, 2017 at 22:21:30 UTC+4, paynen wrote:
Hey Aleksander,

I believe I have a pretty clear understanding of what's going on here, now. I've edited my prior post to remove speculation which didn't actually correspond to the likely cause.

It seems that in the case of your data, specifically, the condition you moved to the top of the chained "AND" statement (url ~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+$') is more likely to be false than the other conditionals, leading to a logical short-circuit of the chained "AND" statement sooner and more often than with another order. If you'd like to test this theory, you could do the following:

1. Determine how many rows evaluate as true for each of the regex conditionals alone
2. Run (and record the timing for) queries which put the regex conditionals in various orders

You could then note whether the runtime is shortest for the queries where the conditions evaluating as false for most rows are put closer to the first position, and whether the runtime is longest for those queries where the conditions evaluating as true for most rows are closer to first position.
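For step 1, a single pass over the candidate rows can count the matches for each conditional on its own. A sketch against the discovered_urls table from the query in this thread (the aggregate FILTER clause requires PostgreSQL 9.4+):

```sql
-- Sketch: per-conditional match counts over the same candidate set,
-- to estimate how selective each regex is in isolation.
SELECT
  count(*) AS candidates,
  count(*) FILTER (WHERE url ~  E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+$')                   AS include_true,
  count(*) FILTER (WHERE url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/index$')           AS excl_index_true,
  count(*) FILTER (WHERE url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/thema\\/[-/a-z0-9]+$')           AS excl_thema_true,
  count(*) FILTER (WHERE url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/seite-[0-9]$')     AS excl_seite_true,
  count(*) FILTER (WHERE url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/komplettansicht$') AS excl_komplett_true
FROM discovered_urls
WHERE domain_id = 5 AND NOT is_fetched AND NOT task_created;
```

For step 2, `\timing` in psql (or EXPLAIN ANALYZE) will record the runtime of each ordering.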

If this turns out to be a correct intuition, you should run this kind of analysis on your data every so often to make sure that the order in which you place the conditionals is optimized.

You could also pre-compute the conditional value for various useful and recurring regexes and store these as columns, making it much faster to query based on these conditionals.
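For example (a sketch only; the column and index names are illustrative, and the flag would need to be refreshed whenever rows are inserted or the pattern changes):

```sql
-- Sketch: materialize the include-regex result as a boolean column.
ALTER TABLE discovered_urls ADD COLUMN matches_include boolean;

UPDATE discovered_urls
SET matches_include = (url ~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+$');

-- A partial index then lets the planner skip non-matching rows entirely.
CREATE INDEX discovered_urls_include_idx
  ON discovered_urls (domain_id)
  WHERE matches_include AND NOT is_fetched AND NOT task_created;
```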

Cheers,

Nick
Cloud Platform Community Support

On Wednesday, May 10, 2017 at 1:20:59 AM UTC-4, Aleksander Efremov wrote:
```
select * from "discovered_urls" where domain_id=5 and is_fetched=false and task_created=false and
(url ~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+$') AND -- include regex should be in first position
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/index$' AND -- exclude regex
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/thema\\/[-/a-z0-9]+$' AND
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/seite-[0-9]$' AND
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/komplettansicht$'
limit 101
```


On Tuesday, May 9, 2017 at 23:35:31 UTC+4, paynen wrote:
Hey Aleksander,

Very interesting results. It seems the query was not being slowed down in any significant way by a lack of parallel CPUs (or could not take advantage of extra CPUs), and it also didn't benefit from increasing the amount of available memory. A CPU with a higher clock speed would likely have shown improvements, although at present that's not an option.

Intuitively, it makes sense that include -> exclude would be faster than exclude -> include: it's simpler to gather all matching rows and then remove some, rather than dropping a bunch of rows first and then trying to add rows back, re-checking the exclusion regexes for each one. But once again, I'm no expert on PostgreSQL query planning.

For the benefit of other users, could you clarify the new order of regexes that you used?

Cheers,

Nick
Cloud Platform Community Support

On Sunday, May 7, 2017 at 10:58:43 AM UTC-4, Aleksander Efremov wrote:
I got a speedup of roughly 4 times when I placed the include regex in the first position and the exclude regexes after it: 24 seconds instead of 1 minute 30 seconds.
But when I also tried increasing CPU and memory, I saw almost no improvement: 22 seconds instead of 23-24.

On Friday, May 5, 2017 at 00:39:29 UTC+4, paynen wrote:
Hey Aleksander,

The question was merely a diagnostic one - to see whether the change would affect the runtime. I find it very odd that enhancing the CPU and RAM simultaneously didn't change the runtime at all, but that's something to keep in mind as we try to gather more information. Testing what effect the number of regular expressions has could be part of that process. As I said above, I'm no expert on Postgres, and maybe the query planner is such that it doesn't matter, but we might as well find out.

Cheers,

Nick
Cloud Platform Community Support

On Thursday, May 4, 2017 at 1:13:37 AM UTC-4, Aleksander Efremov wrote:
The regular expressions will be set by users, so I don't have the opportunity to perform any optimizations on them.
I want to be able to change caches, buffers, etc., as I wrote above.

On Thursday, May 4, 2017 at 02:28:11 UTC+4, paynen wrote:
Hey Aleksander,

I'm not much of an expert on Postgres, but maybe you can see if combining some of the regular expressions affects the duration?

Cheers,

Nick
Cloud Platform Community Support

On Wednesday, May 3, 2017 at 5:21:38 AM UTC-4, Aleksander Efremov wrote:
I tried increasing RAM, SSD size (read/write rate), and the number of CPUs, but it didn't bring results. The query always takes approximately 1 minute 30 seconds to execute.
How can I speed up the query?

From http://pgtune.leopard.in.ua/ I received these settings, but how can I set them?
```
max_connections = 100
shared_buffers = 2559744MB
effective_cache_size = 7679232MB
work_mem = 26211778kB
maintenance_work_mem = 2GB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
```
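On a self-managed PostgreSQL instance, settings like these would go in postgresql.conf or be applied with ALTER SYSTEM (a sketch; on Cloud SQL most parameters are managed for you and only a subset is exposed as database flags, so check which ones you can actually change):

```sql
-- Sketch (self-managed PostgreSQL 9.4+): persist settings, then reload.
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
-- shared_buffers changes require a server restart, not just a reload.
ALTER SYSTEM SET shared_buffers = '2GB';
SELECT pg_reload_conf();
```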

Query:
```
select * from "discovered_urls" where domain_id=5 and is_fetched=false and task_created=false and
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/index$' AND
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/thema\\/[-/a-z0-9]+$' AND
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/seite-[0-9]$' AND
url !~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+\\/komplettansicht$' and
(url ~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+$') limit 101
```

Plan:
```
Limit  (cost=341.52..42790.50 rows=1 width=132)
  ->  Bitmap Heap Scan on discovered_urls  (cost=341.52..42790.50 rows=1 width=132)
        Recheck Cond: (domain_id = 5)
        Filter: ((NOT is_fetched) AND (NOT task_created) AND (url !~ '^https?:\/\/www\.zeit\.de\/[-/a-z0-9]+\/index$'::text) AND (url !~ '^https?:\/\/www\.zeit\.de\/thema\/[-/a-z0-9]+$'::text) AND (url !~ '^https?:\/\/www\.zeit\.de\/[-/a-z0-9]+\/seite-[0-9]$'::text) AND (url !~ '^https?:\/\/www\.zeit\.de\/[-/a-z0-9]+\/komplettansicht$'::text) AND (url ~ '^https?:\/\/www\.zeit\.de\/[-/a-z0-9]+$'::text))
        ->  Bitmap Index Scan on discovered_urls_domain_id_is_fetched_task_created_index  (cost=0.00..341.52 rows=14487 width=0)
              Index Cond: ((domain_id = 5) AND (is_fetched = false) AND (task_created = false))

```
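The plan above shows only the planner's cost estimates. Running the query under EXPLAIN (ANALYZE, BUFFERS) would additionally report actual row counts, timings, and buffer activity, which shows where the time is really spent (a sketch, simplified to a single regex for brevity; note that ANALYZE actually executes the query):

```sql
-- Sketch: actual timings and buffer hits, not just estimates.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM discovered_urls
WHERE domain_id = 5 AND NOT is_fetched AND NOT task_created
  AND url ~ E'^https?:\\/\\/www\\.zeit\\.de\\/[-/a-z0-9]+$'
LIMIT 101;
```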

On Tuesday, May 2, 2017 at 20:12:29 UTC+4, Aleksander Efremov wrote:
OK, thanks. I will try increasing memory, and then I will see whether it makes sense or not. Maybe I am mistaken and it won't help.

On Tuesday, May 2, 2017 at 19:58:11 UTC+4, Vladimir Rusinov wrote:
Hi Aleksander, and thanks for the feedback!

shared_buffers is a difficult parameter, as it involves performance vs. stability/recovery-time trade-offs, and we don't want to compromise those.
We did some testing and found that setting shared_buffers to around 30% of the instance memory yields the best performance for tpc-b/c-like workloads while not compromising other features of the system.
So that's what we set shared_buffers to on sufficiently large instances (with > 3.5 GB RAM, as far as I remember).

Note that PostgreSQL will use OS-level disk buffers as well, so if your dataset fits in memory you are likely to see very few read operations regardless of the value of shared_buffers.

All that said, we are certainly open to more tuning, and we are eager to learn about different workloads. Would you be able to elaborate on why you think increasing shared_buffers would benefit your workload?

Thanks!

--
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047

On Mon, May 1, 2017 at 10:10 PM, Aleksander Efremov <mr.e...@gmail.com> wrote:
I execute a SELECT query with regexes in the condition over 2 million or more records, and it takes a lot of time.

On Tuesday, May 2, 2017 at 9:05:13 UTC+4, Aleksander Efremov wrote:
When do you plan to introduce this feature?

--
You received this message because you are subscribed to the Google Groups "Google Cloud SQL discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-sql-discuss+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-sql-discuss/00e196c7-2710-438e-a0ce-69ddb9632c69%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

