handle documents with different fields

Open kaem2111 opened this issue 7 years ago • 1 comments

This is an enhancement proposal.

When retrieving documents with different fields from an Elasticsearch index (e.g. index="metricbeat-*" query="*"), the first document determines the column names of the whole table. The content of later documents with other fields is not shown, because there is no corresponding column name.

The following modification inserts an additional first document containing all fields of all documents (with a _time value < 0 so that it can be filtered out later). The header fields are determined depending on the scan option:

if scan=false, the columns are collected by looping through the full hits list
if scan=true, the columns are extracted from an esclient.indices.get_field_mapping call
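
To illustrate the scan=false case, here is a minimal standalone sketch of the idea, not the code of the app itself. It assumes the hits have already been flattened into plain dicts; the function name build_header_row and the sample rows are hypothetical.

```python
from collections import OrderedDict


def build_header_row(rows, marker=-1):
    """Return a synthetic first row that contains every field seen in rows.

    rows is an iterable of flat dicts (a hypothetical stand-in for parsed hits).
    The negative _time value marks the row so it can be filtered out later.
    """
    header = OrderedDict()
    header["_time"] = marker
    for row in rows:
        for field in row:
            if field != "_time":
                # Keep the first-seen position, use an empty placeholder value
                header.setdefault(field, "")
    return header


# Two documents with different fields (made-up sample data)
rows = [
    {"_time": 1527091500, "beat.name": "host1", "system.load.1": 0.5},
    {"_time": 1527091510, "beat.name": "host1", "system.memory.used.pct": 0.42},
]
header = build_header_row(rows)
```

Emitting header before the real rows gives the table all column names up front, and the row itself can be dropped later by checking for a negative _time.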

You can additionally control the display order of the columns with the fields parameter, e.g. fields="beat.name,system.load.*,beat.*" will show _time and beat.name first, then all system.load fields, and after that the remaining beat fields (without beat.name, of course).
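
The ordering rule can be sketched on its own as follows. This is only an illustration under the assumption that * is the sole wildcard and that a pattern has to match the whole field name; order_columns and the sample column names are hypothetical.

```python
import re


def order_columns(columns, patterns):
    """Order column names by the first pattern that matches them.

    _time always comes first. Within one pattern, names are sorted
    alphabetically. Columns that match no pattern are left out.
    """
    ordered = ["_time"]
    for pattern in patterns:
        # Escape regex metacharacters (such as the dots), then turn * into .*
        regex = re.compile(re.escape(pattern).replace(r"\*", ".*") + r"\Z")
        for col in sorted(columns):
            if col not in ordered and regex.match(col):
                ordered.append(col)
    return ordered


columns = ["system.load.5", "beat.hostname", "beat.name", "system.load.1", "beat.version"]
patterns = ["beat.name", "system.load.*", "beat.*"]
result = order_columns(columns, patterns)
# result == ["_time", "beat.name", "system.load.1", "system.load.5",
#            "beat.hostname", "beat.version"]
```

Because a column is only added once, beat.name keeps its early position even though beat.* matches it again later.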

Unfortunately I am not familiar with pull requests or GitHub development, so here is a code proposal (modify it as you like):

# KAEM BEGIN extension to get column names via get_field_mapping
# (assumes `import re` and `from collections import OrderedDict` at module level)
# `if self.scan:` does not work: the option is a string, so even "false" is truthy
        if self.scan in ["true", "True", 1]:
            head = OrderedDict()
            head["_time"] = -2
            f0 = config[KEY_CONFIG_FIELDS] or ['*']
            res = esclient.indices.get_field_mapping(index=config[KEY_CONFIG_INDEX], fields=f0)
            for nx in res:
                for ty in res[nx]["mappings"]:
                    for m0 in f0:
                        for fld in sorted(res[nx]["mappings"][ty]):
                            if fld in head or fld.endswith(".keyword"):
                                continue
                            # Escape dots and anchor the end so "beat.name" matches only that field
                            if re.match(re.escape(m0).replace('\\*', '.*') + '$', fld):
                                head[fld] = ""
            yield head
#KAEM END

            # Execute search
            res = helpers.scan(esclient, 
            ....
        else:
            res = esclient.search(index=config[KEY_CONFIG_INDEX],
                                  size=config[KEY_CONFIG_LIMIT],
                                  _source_include=config[KEY_CONFIG_FIELDS],
                                  doc_type=config[KEY_CONFIG_SOURCE_TYPE],
                                  body=body)

# KAEM BEGIN extension to get column names via hits scanning
            head = OrderedDict()
            head["_time"] = -1
            head0 = {}
            f0 = config[KEY_CONFIG_FIELDS] or ['*']
            for hit in res['hits']['hits']:
                for fld in self._parse_hit(config, hit): head0[fld] = ""
            for m0 in f0:
                for fld in sorted(head0):
                    if fld in head:
                        continue
                    # Escape dots and anchor the end so "beat.name" matches only that field
                    if re.match(re.escape(m0).replace('\\*', '.*') + '$', fld):
                        head[fld] = head0[fld]
            head["_time"] = -1  # set the marker again in case a hit's _time replaced it
            yield head
#KAEM END
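
The check `self.scan in ["true", "True", 1]` above works around the option arriving as a string. A slightly more tolerant variant could look like the sketch below; parse_bool is a hypothetical helper, and the set of accepted spellings is an assumption rather than anything the app defines.

```python
def parse_bool(value):
    """Interpret an option value that may arrive as a bool, an int or a string."""
    if isinstance(value, bool):
        return value
    # Normalise case and whitespace, then compare against accepted "true" spellings
    return str(value).strip().lower() in ("true", "t", "yes", "1")


flags = [parse_bool(v) for v in ("True", "false", 1, " TRUE ", "0", "")]
# flags == [True, False, True, True, False, False]
```

With such a helper the branch would read `if parse_bool(self.scan):`, which also treats "TRUE" or " true " as enabled.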

May 23 '18 18:05 kaem2111

Hi @kaem2111,

I wasn't aware of this issue. I'll look into testing and adding your changes.

Thanks for tracking this :)

May 26 '18 13:05 brunotm