
Prevent slow queries by indexing long URLs via hash column

Open ardabeyazoglu opened this issue 3 months ago • 5 comments

Shlink version

4.5.2

PHP version

8.3

How do you serve Shlink

Docker image

Database engine

MySQL

Database version

8.4.5

Current behavior

I have Shlink set up in a high-load environment, and I noticed that one particular query pattern bloats the database and eventually slows it down or crashes it.

The query is below. Since original_url is not indexed, the query performs a full table scan and creates significant load. If multiple such queries arrive at the same time, they saturate all CPU cores.

Here is the relevant file and line

SELECT
s0_.id AS id_0,
s0_.original_url AS original_url_1,
s0_.short_code AS short_code_2,
s0_.date_created AS date_created_3,
s0_.valid_since AS valid_since_4,
s0_.valid_until AS valid_until_5,
s0_.max_visits AS max_visits_6,
s0_.import_source AS import_source_7,
s0_.import_original_short_code AS import_original_short_code_8,
s0_.title AS title_9,
s0_.title_was_auto_resolved AS title_was_auto_resolved_10,
s0_.crawlable AS crawlable_11,
s0_.forward_query AS forward_query_12,
s0_.domain_id AS domain_id_13,
s0_.author_api_key_id AS author_api_key_id_14
FROM
short_urls s0_
INNER JOIN short_urls_in_tags s2_ ON s0_.id = s2_.short_url_id
INNER JOIN tags t1_ ON t1_.id = s2_.tag_id
AND (t1_.name = 'kvacc8280')
INNER JOIN short_urls_in_tags s4_ ON s0_.id = s4_.short_url_id
INNER JOIN tags t3_ ON t3_.id = s4_.tag_id
AND (t3_.name = 'kivacrm')
INNER JOIN short_urls_in_tags s6_ ON s0_.id = s6_.short_url_id
INNER JOIN tags t5_ ON t5_.id = s6_.tag_id
WHERE
s0_.original_url = 'https://myapp.com/?q=...'
GROUP BY
s0_.id,
s0_.original_url,
s0_.short_code,
s0_.date_created,
s0_.valid_since,
s0_.valid_until,
s0_.max_visits,
s0_.import_source,
s0_.import_original_short_code,
s0_.title,
s0_.title_was_auto_resolved,
s0_.crawlable,
s0_.forward_query,
s0_.domain_id,
s0_.author_api_key_id
HAVING
COUNT(t5_.id) = 2
LIMIT
1;

Expected behavior

Since the URL is a TEXT column and can be very long, it is not practical to index it directly. Instead, I would recommend creating and indexing a generated (virtual) column containing a hash of the URL, such as SHA2(original_url, 256), and querying against that.

-- create the column and the index
ALTER TABLE short_urls
ADD COLUMN original_url_hash BINARY(32) AS (UNHEX(SHA2(original_url, 256))) STORED,
ADD INDEX idx_original_url_hash (original_url_hash);

-- query by hash instead of the raw url
SELECT ... FROM short_urls WHERE original_url_hash = UNHEX(SHA2('a long url', 256));

In addition to this, I would recommend adding a configurable (or perhaps hardcoded) maximum execution time to these kinds of SELECT queries, e.g. via the /*+ MAX_EXECUTION_TIME(5000) */ optimizer hint, so that a runaway query cannot keep the database busy long enough to bring it down.
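To make the suggestion concrete, here is a minimal sketch of how the hint could be combined with the hash lookup. This is MySQL-specific syntax (the hint exists since MySQL 5.7 and applies to SELECT statements only), and the 5000 ms value and example URL are placeholders:

```sql
-- MySQL-only optimizer hint: abort this SELECT if it runs longer than 5 seconds.
-- The hint comment must appear immediately after the SELECT keyword.
SELECT /*+ MAX_EXECUTION_TIME(5000) */
    s0_.id, s0_.short_code
FROM short_urls s0_
WHERE s0_.original_url_hash = UNHEX(SHA2('https://example.com/some/long/url', 256))
LIMIT 1;
```

Other engines would need a different mechanism (e.g. statement_timeout in PostgreSQL), which is why making it configurable seems preferable to hardcoding the hint.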

Note that, some people (like me :)) might be using a shared database for shlink and other apps.

ardabeyazoglu avatar Sep 29 '25 10:09 ardabeyazoglu

I was convinced by someone at one point that there should be no limit on the long URL length, and hence that it should be a TEXT field, and I have regretted that decision multiple times since.

Sadly, it is now tricky to revert this decision, because changing a column's type is a very slow operation when there are many rows, and in some DB engines it even locks the table.

On top of that, anyone with very long values might end up with broken entries.

Your suggestion of having an additional indexed column with a hash of the long URL sounds good, and would work around those issues. However, it has a few considerations that I need to check:

  • Multiple database engines support
  • Database abstraction library support
  • Performance impact when creating new short URLs
  • Performance impact when editing the long URL of an existing short URL
  • Performance impact when listing/querying for short URLs by their long URL
  • Ability to filter lists by partial long URL
  • Redirect rules (the short_urls table is not the only one where long URLs are stored)
  • Ensuring the hash and the long URL don't accidentally end up desynced

But at first glance, I like the idea. I'll look into it.

acelaya avatar Sep 30 '25 07:09 acelaya

Thanks for the quick answer.

@acelaya I don't think the decision to use TEXT was wrong; we can't know in advance what kind of URLs we'll have to deal with. Either way, a computed hash sounds like the best option to me.

  • Multiple database engines support
  • Database abstraction library support

As for DB support, I don't know whether Doctrine supports generated columns, as I have never used it, but all major RDBMSes do support them.

EDIT: Doctrine supports it.

  • Performance impact when creating new short URLs
  • Performance impact when editing the long URL of an existing short URL

Since the average URL is quite short, the overhead of computing SHA-256 should be almost zero, as it is computed on the fly. If the database supports it, a VIRTUAL column might be even more efficient than a STORED one, since only the index entries materialize the computed value. Either way, the cost should be negligible.
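As a sketch, the VIRTUAL variant of the earlier ALTER would look like this. Note this is MySQL syntax; PostgreSQL's generated columns must be STORED, so the choice is engine-specific:

```sql
-- VIRTUAL: the hash is computed on read and materialized only in the index,
-- not in the row itself. (MySQL syntax; PostgreSQL requires STORED.)
ALTER TABLE short_urls
    ADD COLUMN original_url_hash BINARY(32)
        AS (UNHEX(SHA2(original_url, 256))) VIRTUAL,
    ADD INDEX idx_original_url_hash (original_url_hash);
```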

  • Performance impact when listing/querying for short URLs by their long URL

If I understand this correctly, this is literally what caused my headache. For equality filters, a lookup on the indexed hash should be effectively instant. In any case, I think no non-indexed column should be allowed in any filter on the short_urls/visits tables. These tables grow a lot, and a full table scan is not acceptable under these circumstances.

  • Ability to filter lists by partial long URL

I don't know how much flexibility you offer here, so I can't comment much. If there is a need to filter partially by URL segments, I think an auxiliary table could store the segments in a structured way with proper indexes. Though, I am not sure that would be worth the effort and resources.

EDIT: Either way, this doesn't actually block anything. Partial filtering can keep using the old approach, at least until a better method is implemented. It is not urgent.

  • Redirect rules (the short_urls table is not the only one where long URLs are stored)

As long as a table's long URL column is used in WHERE clauses, I think that table should get the same computed hash column.
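To illustrate: the same generated-hash pattern could be applied to any other table that filters on a long URL. The table and column names below are purely hypothetical, since I don't know the actual Shlink schema for redirect rules:

```sql
-- Hypothetical example only: the table/column names are made up, but the
-- pattern is identical to the short_urls one above.
ALTER TABLE redirect_rules_example
    ADD COLUMN target_url_hash BINARY(32)
        AS (UNHEX(SHA2(target_url, 256))) STORED,
    ADD INDEX idx_target_url_hash (target_url_hash);
```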

  • Ensuring the hash and the long URL don't accidentally end up desynced

Why would they get desynced? Generated columns guarantee consistency automatically, since the database computes them itself.

ardabeyazoglu avatar Sep 30 '25 10:09 ardabeyazoglu

@acelaya do you have a roadmap for this change? I have to decide whether to wait for it or proceed with a custom fork, as this is a serious problem for us. For now I am archiving the short_urls table and recreating it empty to keep queries fast, but that is not a permanent solution.

ardabeyazoglu avatar Oct 06 '25 08:10 ardabeyazoglu

I work on this project in my free time, so there is no roadmap.

If you are going to fork and try to tackle this, I'm willing to review a PR, if you want.

acelaya avatar Oct 06 '25 08:10 acelaya

I will check what I can do.

ardabeyazoglu avatar Oct 06 '25 08:10 ardabeyazoglu