influxdb icon indicating copy to clipboard operation
influxdb copied to clipboard

Empty prefix alias for wildcards aggregate functions

Open steverweber opened this issue 8 years ago • 56 comments

Moving and downsampling data between databases or retention policies should allow us to maintain the field names.

I tried

SELECT last(*) AS ""
INTO "mfcf_vmware"."autogen".:measurement
FROM /^vmware_.*/
WHERE time >= '2016-01-01T00:00:00Z'
GROUP BY time(30m), "host"

but sadly because the prefix key is empty it fallbacks to "last_" in this case.

https://github.com/influxdata/influxdb/pull/7009 @jsternberg would adding this be simple? @gunnaraasen should something different be done so it uses backreferences?

steverweber avatar Sep 20 '16 13:09 steverweber

Unfortunately, this wouldn't be easy. We needed some way to uniquely identify the result and there had to be a prefix of some kind to the AS. If you have any recommendations, we can try to make this kind of thing possible. What is the desired field name you want out of that query?

jsternberg avatar Sep 21 '16 20:09 jsternberg

What is the desired field name you want out of that query?

The original name, anything else it going to be an issue because tools like grafana would need to be updated to use the new field names... and this is an issue because I want to allow the users to jump between the different retention/rollups without worry about field name changing

The only workaround i see, would be to do an export then use sed to update the values and preform a restore. /not user-friendly/

untill this is resolved I have little use for continuous queries

please add this to a milestone. its kinda required for anything /enterprise/ level unless im missing something fundamental?

steverweber avatar Sep 21 '16 20:09 steverweber

The original name is likely not possible. It creates a conflict if you do something like SELECT first(*), last(*) FROM cpu since now you have two wildcards that create the same field name. From your mention of Grafana, I think your issue might be relevant to #7198. Does it look like that issue covers yours?

jsternberg avatar Sep 21 '16 20:09 jsternberg

i don't see a conflict here

SELECT first(*), last(*) FROM cpu
first_idle first_user last_idle last_user

this creates fields with prefix first_ and last_

if you did

SELECT first(*) AS "x", last(*) AS "x" FROM cpu
error

then yes you'll have a conflict

if you did

SELECT first(*) AS "x", last(*) AS "y" FROM cpu
x_idle x_user y_idle y_user

then its good

if you did

SELECT first(*) AS "", last(*) AS "y" FROM cpu
first_idle first_user y_idle y_user

WOULD PREFER: idle user y_idle y_user

it should also be OK... if the preferred outcome was the result.

steverweber avatar Sep 21 '16 21:09 steverweber

Given an example where cpu has the fields idle and user, what do you think the output column names should be for each of the above?

Just a warning. At the moment, it's unlikely we will physically be capable of changing this since I think it would be a backwards incompatible change.

jsternberg avatar Sep 21 '16 21:09 jsternberg

updated above.

also i don't see this breaking any backwards compatible because if you use an empty alias then you would not have used an alias in this current state.

steverweber avatar Sep 21 '16 21:09 steverweber

cc @jwilder since this is a breaking change. @steverweber we, unfortunately, will likely not be able to change the current behavior. If you have any proposal for a way to make your current use case work without changing the existing behavior, we can look into that. Can you look at #7198 and see if it's relevant to your problem? Thanks.

jsternberg avatar Sep 21 '16 21:09 jsternberg

Another workaround to add an empty prefrix. Add something like the CQ back reference to AS

Yes i looked at #7198 its on an interesting path. I was hoping to see something like this and glad its on the radar.. however it's not the solution i'm looking for (im trying to copy data between two different databases and roll up the data)... However I could just copy all the data without the group by and wait for a clean solution in another release.

SELECT *
INTO "mfcf_vmware"."autogen".:measurement
FROM /^vmware_.*/

steverweber avatar Sep 21 '16 21:09 steverweber

I think I may have misunderstood the issue. I think this might be possible and may not be a breaking change. I thought this would require a change in what the alias modified since, for whatever reason, I thought the alias modified the field name, but the alias modifies the function name.

I'll see if we can do this.

jsternberg avatar Sep 22 '16 19:09 jsternberg

+1 any news on this?

playground-io avatar Dec 08 '16 00:12 playground-io

Why I want this:

  • Reusing the same queries with multiple retention policies e.g. in Grafana

Full story on a simplified example:

  • I have e.g. three retention policies "raw", "store30d" and "store1y"
  • Telegraf pushes data to the "raw" RP; This RP only covers the past 7d
  • A continuous query aggregates data from the "raw" RP and stores it in "store30d" (e.g. SELECT mean(*), max(*) AS max, min(*) AS min INTO store30d.:MEASUREMENT FROM raw./.+/ GROUP BY time(1m),*)
  • More aggregated data goes into "store1y" SELECT mean(*),... INTO store30d.:MEASUREMENT FROM raw./.+/ GROUP BY time(1h),*)
  • If I don't give an alias, the function name is prefixed to the field name. That's as expected
  • If I do give an alias, the alias name is prefixed to the field name. That's nice, too
  • But I didn't find a way to keep the original field name other than by using explicit field names and aliases instead of the asterisk. Problem: This doesn't scale. I'd then have 50 different CQs that need to be maintained.

Current workaround:

In Grafana I use the retention policiy as a template parameter. Currently I have multiple queries with the different field names (usage_percentile, mean_usage_percentile), but that approach puts at least 50% more queries to influxdb that would be needed.

Potential Solution

I like @steverweber's initial approach of using an empty alias. IMHO, it's OK if the query would fail if the field names aren't unique anymore, but as long as the querier makes up queries that result in unique field names, this feature would be very helpful. Proposals for syntax

  • SELECT mean(*) AS "" FROM raw.mem GROUP BY time(1h)
  • SELECT mean(*) NOALIAS FROM raw.mem GROUP BY time(1h)
  • SELECT mean(*) AS FIELDNAME FROM raw.mem GROUP BY time(1h) Shall all return the same field names as in the raw.mem series.

I'm not sure if SELECT first(*) - last(*) FROM mem should work (returning differences per field (e.g. first_used_percent-last_used_percent, first_active-last_active,...)

bbczeuz avatar Feb 20 '17 11:02 bbczeuz

@bbczeuz thanks for spelling out the issue.

steverweber avatar Feb 20 '17 13:02 steverweber

https://github.com/influxdata/influxdb/issues/7858

another syntax that would be ok:

SELECT mean(*) AS :FIELD FROM measurentname

steverweber avatar Feb 20 '17 14:02 steverweber

If it is small change then releasing it sooner will be an high impact one for us :+1: . Any ETA would be great @jsternberg

skyrocknroll avatar Feb 27 '17 12:02 skyrocknroll

Like many others in this thread, we also feel limited by the use of CQ's and retention policies. Very nicely spelled out by @bbczeuz .

I'd like to add that the solution could be implemented in such a way that many other issues regarding alias names can be solved as well. When not dealing with CQ's at all, but just defining a SELECT mean(*) from "mymeasurement" in Grafana, gives a lot of unwanted pollution in the legend of the graph, and its tooltip.

I also saw some discussion about adding prefixes or postfixes to the column names. This only increases the pollution... column names will get longer and longer.

Sometimes we want to fan out a measurement, with shorter column names and a more descriptive new measurement name.

How about this for a solution?

Use a two-parameter AS phrase to enter regex replace alias mode

For example

SELECT mean(*) AS "^(mean_)(MyOld)(.*)$" "MyNew_$3" FROM measurementname

Cheers, TW

tw-bert avatar Mar 29 '17 09:03 tw-bert

@jsternberg will this issue be considered for implementation?

I guess this one is probably easier in implementation than #7198, but will help quite a lot to those who use RP-CQ pattern of downsampling to access raw and downsampled data in the uniform way.

And last proposed syntax from @steverweber: SELECT mean(*) AS :FIELD FROM measurentname looks elegant.

dreamon-dd avatar Apr 27 '17 10:04 dreamon-dd

While a @tw-bert -ish solution is clearly more powerful:

SELECT mean(*) WITH "^(mean_)(MyOld)(.*)$" AS "MyNew_$3" FROM measurementname

WITH <match> AS <replace> could appear multiple times. Is this SQL-ish? or is WITH confusing?

I think @steverweber 's solution would still fit most cases (and is easier to understand and probably easier to implement and faster to run).

bbczeuz avatar Apr 27 '17 10:04 bbczeuz

A community user ran into this reciently: https://community.influxdata.com/t/query-problem-removing-prefix-to-field-name-of-into-clause/1006

jackzampolin avatar May 22 '17 16:05 jackzampolin

+1

I'm setting up downsampled dashboards in Grafana and the only option I see is having one panel for the raw data, and one panel for the downsampled data to accomodate the sum_ or mean_ prefix.

I see the reasoning for having the prefixes:

SELECT sum(*), mean(*) INTO "metrics"."one_day".:MEASUREMENT FROM /.*/ GROUP BY time(1d),*

Selecting from one_day, there'd be no way to know if it was the sum or the mean field. Perhaps another way is having the aggregate in the retention policy. 'one_day_sum' and 'one_day_mean'.

This still requires a way to SELECT INTO the retention policy without the prefix. The create continuous query command would need to error if you tried creating a continuous query like the one I've got above; only allowing one aggregate function per retention policy if you specify the ideal 'no prefix' option.

eWilliams35 avatar Oct 09 '17 17:10 eWilliams35

Clear that select sum(*), mean(*) ... will not work in this case, however I would work with AS smth, like AS :FIELD or AS "MyNew_$3" as it was proposed before instead of prohibiting the syntax with multiple aggregate functions. Therefore if you would need to make both sum and mean you can add your own prefixes. Because in some cases there is a need of using different aggregate functions in one query for fields of different datatypes, like mean() will make no sense for a string data, probably you would use mode() instead, same for boolean values.

dreamon-dd avatar Oct 10 '17 06:10 dreamon-dd

Is there anything happening here? I am trying to find a way to properly implement downsampling/aggregation with is - in my opinion - a core feature of a time series database.

After quite some reading and research I implemented different retention policies and continuous queries to downsample the data. Just to find out that the is no option to keep the field name? Downsampling in influxdb is already quite painful to implement and there is no way to do an intelligent rollup (I know there is an open issue for that). But this has nothing to do with downsampling when the data looks completely different afterwards.

Since there has been no activity around that issue I assume it is not really important for the devs. For our projects, this issue renders influxdb basically useless since we HAVE to downsample the data because of the sheer amount. But building a separate dashboard for every level of aggregation is not an acceptable solution.

Crapworks avatar Feb 20 '18 14:02 Crapworks

@Crapworks We have similar requirements. We resorted to creating our own CQ generator, which gives us plenty of downsampling possibilities. So we took the long way of circumventing this issue, but in turn, it gave us greater flexibility. I can recommend doing it like that (but still: I'm not dismissing the need of a fix for this issue for anyone of course).

tw-bert avatar Feb 20 '18 15:02 tw-bert

@tw-bert So if I get you right you are creating different CQs for every table in a database and expanding the field like mean(field_name1) as field_name1, mean(field_name2) as field_name2? Just out of curiosity, how many CQs do you have running?

Crapworks avatar Feb 20 '18 15:02 Crapworks

I've been watching this as well. If you're using Grafana to view your data, you can use regular expressions in the field name to get around the raw data vs aggregated granularity issue.

SELECT last(/(sum_)?partition_records/) FROM "$granularity"."measurement" --rest of query ---

This assumes you create a template variable ($granularity) with 'show retention policies on ...'.

eWilliams35 avatar Feb 20 '18 16:02 eWilliams35

@ewillia1983 I didn't know that, thank you! This is indeed helping a lot in dashboard creation!

Still, this is a workaround for Grafana and the main issue should still be addressed by InfluxDB

Crapworks avatar Feb 20 '18 16:02 Crapworks

@ewillia1983 Interesting. I suppose it comes with a little bit of a performance hit though.

kerams avatar Feb 20 '18 18:02 kerams

@Crapworks Yes, that's the idea. But sometimes you need more specific CQ's, for example a max() on one field, and no function at all on an accompanying text field. We run about 80 CQ's atm. Some are for downsampling, and some are for precalculations like these: https://stackoverflow.com/questions/38016051/how-do-i-get-consistent-values-with-influxdb-non-negative-derivative/43200947#43200947

tw-bert avatar Feb 20 '18 18:02 tw-bert

@kerams - Good thought, I have not tested it. I just came up with this a few days ago, I've only got one panel switched over so far.

eWilliams35 avatar Feb 20 '18 18:02 eWilliams35

Any news on this long standing issue ? It is really painful to use CQs with this current behavior on field names.

Being forced to modify every metric name in every graph in every dashboard for Grafana is ... not enjoyable to say the least.

Kaelnor avatar Mar 30 '18 08:03 Kaelnor

Is it really possible that such basic functionality for timeDB is not after 2 years still implemented?

metf avatar Aug 23 '18 10:08 metf