dballe: help to analyze and optimize a very big explorer

Scanning some BUFR files imported in arkimet, I get a very big explorer.

du -ksh *.xapian
1,7G	report_fixed.xapian
4,0K	report_mobile.xapian
2,5M	sample_fixed.xapian
4,0K	sample_mobile.xapian

The BUFR data are big, but such a big summary is strange, and I also get some strange values from the summary. The data was imported in arkimet, so I expect no special problems in this kind of data.

Size of arkimet dataset with bufr:

du -ksh report_fixed
30G	report_fixed

To generate the explorer I use this:

#!/usr/bin/python3
import dballe

from pathlib import Path

dstypes=["report_fixed","report_mobile","sample_fixed","sample_mobile"]

for dstype in dstypes:
    for path in Path(dstype).rglob('*.bufr'):
        file=path.as_posix()
        print(file)

        # update from file
        with dballe.Explorer(dstype+".xapian") as explorer:
            with explorer.update() as updater:
                importer = dballe.Importer("BUFR")
                with importer.from_file(file) as message:
                    try:
                        updater.add_messages(message)
                    except Exception as e:
                        print (e)
                
            print ("updated from file")
            #print (explorer.all_reports)
            #print (explorer.all_levels)
            #print (explorer.all_tranges)
            #print (explorer.all_varcodes)
            print (explorer.stats)

I interrupted it after 24 hours of work. I then inspected the result with:

#!/usr/bin/python3
import dballe

from pathlib import Path

dstypes=["report_fixed","report_mobile","sample_fixed","sample_mobile"]

for dstype in dstypes:
    with dballe.Explorer(dstype+".xapian") as explorer:

        print (explorer.all_reports)
        print (explorer.all_levels)
        print (explorer.all_tranges)
        print (explorer.all_varcodes)
        print (explorer.stats)

explorer.stats says:

ExplorerStats(datetime_min=datetime.datetime(2000, 1, 1, 0, 0), datetime_max=datetime.datetime(2021, 6, 22, 23, 0), count=1117667617)
[]
[]
[]
[]
ExplorerStats(datetime_min=None, datetime_max=None, count=0)
['fixed', 'luftdaten']
[dballe.Level(103,1000,None,None), dballe.Level(103,2000,None,None), dballe.Level(265,1,None,None), None]
[dballe.Trange(0,0,60), dballe.Trange(254,0,0), None]
['B01011', 'B01019', 'B01194', 'B01213', 'B04001', 'B04002', 'B04003', 'B04004', 'B04005', 'B04006', 'B05001', 'B06001', 'B10004', 'B12101', 'B13003', 'B15195', 'B15198', 'B15202', 'B15203', 'B15242', 'B49193', 'B49194', 'B49195', 'B49196', 'B49197']
ExplorerStats(datetime_min=datetime.datetime(2021, 6, 23, 0, 0), datetime_max=datetime.datetime(2021, 6, 23, 23, 59, 59), count=8934327)
[]
[]
[]
[]
ExplorerStats(datetime_min=None, datetime_max=None, count=0)

I get the attached output: log.zip

Thanks in advance for any suggestions on how to analyze the problem.

Jun 30 '21 16:06 pat1

It seems that some of the problems come from strange BUFR messages. I attach an example.

dbamsg dump problema.bufr
#0 BUFR message: 262 bytes, origin 200:0, category 0 255:255:0, bufr edition 4, tables 14:1, subsets 1, values: 51/51:
Subset 0:
001194 Report mnemonic(CCITTIA5): fixed
004001 YEAR(YEAR): 2019
004002 MONTH(MONTH): 7
004003 DAY(DAY): 31
004004 HOUR(HOUR): 13
004005 MINUTE(MINUTE): 45
004006 SECOND(SECOND): 0
001011 SHIP OR MOBILE LAND STATION IDENTIFIER(CCITTIA5): cmricci
005001 LATITUDE (HIGH ACCURACY)(DEGREE): 44.00035
006001 LONGITUDE (HIGH ACCURACY)(DEGREE): 12.65516
007192 First level type(NUMERIC): 1
004192 Time range type(NUMERIC): 1
004193 Time range P1(NUMERIC): 0
004194 Time range P2(NUMERIC): 900
013011 TOTAL PRECIPITATION / TOTAL WATER EQUIVALENT(KG/M**2): 0.0
007192 First level type(NUMERIC): 103
007193 Level L1(NUMERIC): 2000
004192 Time range type(NUMERIC): 0
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 301.84
013003 RELATIVE HUMIDITY(%): 57
004192 Time range type(NUMERIC): 1
014198 Global radiation flux (downward)(W/M**2): -1
004192 Time range type(NUMERIC): 581
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 1.57
013003 RELATIVE HUMIDITY(%): 62
004192 Time range type(NUMERIC): 57
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 2.21
013003 RELATIVE HUMIDITY(%): 60
004192 Time range type(NUMERIC): 56
004194 Time range P2(NUMERIC): -3670016
012101 TEMPERATURE/DRY-BULB TEMPERATURE(K): 0.29
013003 RELATIVE HUMIDITY(%): 62
007193 Level L1(NUMERIC): -418906107
004192 Time range type(NUMERIC): 880
004194 Time range P2(NUMERIC): -536346624
011002 WIND SPEED(M/S): 360.0
004192 Time range type(NUMERIC): 0
011041 MAXIMUM WIND GUST SPEED(M/S): 0.8
011043 MAXIMUM WIND GUST DIRECTION(DEGREE TRUE): 0
004192 Time range type(NUMERIC): 0
011001 WIND DIRECTION(DEGREE TRUE): 100
011002 WIND SPEED(M/S): 0.0
004192 Time range type(NUMERIC): 0
004194 Time range P2(NUMERIC): -3670016
011001 WIND DIRECTION(DEGREE TRUE): 0
011002 WIND SPEED(M/S): 0.0
007192 First level type(NUMERIC): 0
007193 Level L1(NUMERIC): 19398656
025025 Battery voltage(V): 0.0
025192 Battery charge(%): 81
025193 Battery current(A): -6.584

problema.zip

It seems strange to me that BUFR messages with invalid metadata can be written through the API ...

Jul 07 '21 09:07 pat1

I would say that if the size of an explorer grows, it means there is high entropy in the metadata that the explorer aggregates, namely: station (lat, lon, ident, rep_memo), level, timerange, varcode.
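
As a rough illustration of this point, here is a toy model in which a summary keeps one entry per distinct (station, level, trange, varcode) combination. It is only a sketch: it is not how dballe stores an explorer internally, and `Key` and `summary_size` are hypothetical names made up for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for the metadata an explorer groups by.
Key = namedtuple("Key", "station level trange varcode")


def summary_size(observations):
    """Number of distinct metadata combinations a summary would have to keep."""
    return len(set(observations))


# Station metadata: lat, lon, ident, rep_memo
station = (44.00035, 12.65516, None, "fixed")

# 1000 observations sharing the same metadata collapse into a single entry.
clean = [Key(station, (103, 2000), (254, 0, 0), "B12101") for _ in range(1000)]

# The same observations with a different (noisy) level value each time
# cannot be merged: each one becomes its own entry.
noisy = [Key(station, (103, 2000 + i), (254, 0, 0), "B12101") for i in range(1000)]

print("clean:", summary_size(clean))
print("noisy:", summary_size(noisy))
```

In this model the size of the summary depends on how many distinct combinations appear, not on how many observations are imported.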

You can analyze the large explorer by printing how many elements there are in all_stations, all_levels, all_tranges, and all_varcodes, and see which of them are receiving high-entropy data.
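
A minimal sketch of that count could look like the following. The helper names are hypothetical, and the `fake` object is only a stand-in (with made-up station entries) so that the sketch runs without dballe installed; it assumes each `all_*` attribute can be iterated.

```python
from types import SimpleNamespace

# Explorer attributes to measure: all_stations is named in the comment above,
# the others are the ones printed by the scripts earlier in this thread.
ATTRS = ("all_stations", "all_reports", "all_levels", "all_tranges", "all_varcodes")


def cardinalities(explorer):
    """Return {attribute name: number of elements} for an explorer-like object."""
    return {name: len(list(getattr(explorer, name))) for name in ATTRS}


def print_cardinalities(explorer):
    """Print the counts, largest first, so the high-entropy attribute stands out."""
    counts = cardinalities(explorer)
    for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        print(f"{name:<13} {count}")


# Stand-in for a real explorer (assumption: illustrative values only).
fake = SimpleNamespace(
    all_stations=[("s1",), ("s2",)],
    all_reports=["fixed", "luftdaten"],
    all_levels=[(103, 1000), (103, 2000), (265, 1), None],
    all_tranges=[(0, 0, 60), (254, 0, 0), None],
    all_varcodes=["B12101", "B13003"],
)
print_cardinalities(fake)
```

With dballe available, the same function would be called inside `with dballe.Explorer("report_fixed.xapian") as explorer:`, as in the scripts above.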

Once you understand what makes the explorer blow up, the question is whether it is legitimate. For example, when importing aircraft data, latitude and longitude change continuously, so the stations acquire high entropy and the explorer cannot aggregate the data.

In your case, it looks to me like level and timerange unexpectedly contain noise, and if so I would recommend investigating the cause of that noise.
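
One possible way to start that investigation is to group the levels and timeranges by their type code and see which codes carry many distinct values. The sketch below uses namedtuple stand-ins so it runs without dballe; the field names (`ltype1`, `l1`, `ltype2`, `l2` and `pind`, `p1`, `p2`) are assumed to match `dballe.Level` and `dballe.Trange`, and `count_by_type` is a hypothetical helper. The sample values are those printed for sample_fixed earlier in this thread.

```python
from collections import Counter, namedtuple

# Stand-ins for dballe.Level and dballe.Trange (field names are an assumption).
Level = namedtuple("Level", "ltype1 l1 ltype2 l2")
Trange = namedtuple("Trange", "pind p1 p2")


def count_by_type(items, field):
    """Count entries per type code, skipping the None entries seen in the output."""
    return Counter(getattr(item, field) for item in items if item is not None)


levels = [
    Level(103, 1000, None, None),
    Level(103, 2000, None, None),
    Level(265, 1, None, None),
    None,
]
tranges = [Trange(0, 0, 60), Trange(254, 0, 0), None]

# Type codes with the most distinct values come first.
print(count_by_type(levels, "ltype1").most_common())
print(count_by_type(tranges, "pind").most_common())
```

Run against `explorer.all_levels` and `explorer.all_tranges` of the large explorer, unexpected type codes (such as the 581 and 880 in the dump above) or a code with a very large number of distinct values would point to where the noise enters.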

Nov 23 '21 14:11 spanezz