
Pull data from previous years (e.g. 2023, 2022, etc)

Open traycn opened this issue 6 months ago • 5 comments

Overview

We need to pull data from previous years (2023, 2022, etc.) into our application so that users can run searches that extend beyond the current year.

At this time, the site can only display data for the current year to date.

Action Items

For the Proof of Concept that we can query multiple files: DuckDB pull multiple parquets docs - https://duckdb.org/docs/data/multiple_files/overview.html

  • [ ] Register a new file in the newDb instance to pull data from another 311-data/[year searched here] repo loc: components/db/DbProvider.jsx
  • [ ] Query the new file using SQL loc: components/Map/index.js
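A rough sketch of what these two items could look like. Only `registerFileURL`, the protocol value `4`, and the `requests` table come from the existing code; the helper names, the `requests<year>.parquet` naming scheme, and the `urlsByYear` object are hypothetical. The list form of `read_parquet` is described in the DuckDB multiple-files doc linked above, but whether it works against files registered through `registerFileURL` in duckdb-wasm is exactly what the Proof of Concept would need to confirm.

```javascript
// Same protocol value that DbProvider.jsx passes to registerFileURL (HTTP = 4).
const HTTP_PROTOCOL = 4;

// Hypothetical naming scheme: 2023 -> 'requests2023.parquet'.
function fileNameForYear(year) {
  return `requests${year}.parquet`;
}

// Register one remote parquet per year and return the registered file names.
// `urlsByYear` is an assumed { year: url } object; the real URLs would
// presumably live next to hfYtd in `datasets.parquet`.
async function registerYearFiles(db, urlsByYear) {
  const names = [];
  for (const [year, url] of Object.entries(urlsByYear)) {
    const name = fileNameForYear(year);
    await db.registerFileURL(name, url, HTTP_PROTOCOL);
    names.push(name);
  }
  return names;
}

// Build a CREATE TABLE statement that reads every registered file at once
// using the list form of read_parquet.
function buildCreateRequestsSQL(fileNames) {
  const list = fileNames.map((name) => `'${name}'`).join(', ');
  return `CREATE TABLE requests AS SELECT * FROM read_parquet([${list}])`;
}
```

In `createRequestsTable`, the string returned by `buildCreateRequestsSQL` would take the place of the current `createSQL` value. If the yearly files do not share an identical schema, the query may need adjusting.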

For the Proof of Concept that we can make a query when a user makes a search: loc: components/Map/index.js

  • [ ] Pass values of the dates being searched to the query function
  • [ ] Write a SQL query using the search param values
  • [ ] Set the data to populate on the map: setData() //??
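A minimal sketch of the first two items, assuming the search dates arrive as `YYYY-MM-DD` strings and that the table has a date column named `CreatedDate` (the column name is an assumption and should be checked against the parquet schema). Both function names are hypothetical.

```javascript
// Accept only ISO dates (YYYY-MM-DD) so the values can be inlined safely.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// List every calendar year touched by the search range, e.g. to decide
// which yearly parquet files are needed.
function yearsInRange(startDate, endDate) {
  const first = Number(startDate.slice(0, 4));
  const last = Number(endDate.slice(0, 4));
  const years = [];
  for (let year = first; year <= last; year += 1) years.push(year);
  return years;
}

// Build a SELECT over the existing `requests` table for an inclusive date range.
// `dateColumn` must come from code, never from user input.
function buildDateRangeSQL(startDate, endDate, dateColumn = 'CreatedDate') {
  if (!ISO_DATE.test(startDate) || !ISO_DATE.test(endDate)) {
    throw new Error('Dates must be formatted as YYYY-MM-DD');
  }
  return (
    `SELECT * FROM requests ` +
    `WHERE ${dateColumn} >= DATE '${startDate}' ` +
    `AND ${dateColumn} < (DATE '${endDate}' + INTERVAL 1 DAY)`
  );
}
```

The resulting string could be passed to `conn.query(...)` in the same way as the existing `createSQL`. The upper bound uses "less than the day after the end date" so that rows time-stamped during the end date are still included.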

More Information:

The following is a rough run-through of the control flow for how data is currently populated.

Step 1: A Parquet file of the LA Open Data 311 calls is published in the HuggingFace repo

https://huggingface.co/datasets/311-data/2024

Step 2: The HuggingFace repo is defined in the datasets.parquet.hfYtd value

loc: components/db/DbProvider.jsx - line 7

// List of remote dataset locations used by db.registerFileURL
const datasets = {
  parquet: {
    // huggingface
    hfYtd:
      '[HUGGINGFACE REPO URL HERE]'
     ...
  },
 ...
};

Step 3: The datasets.parquet.hfYtd value is used to register a new File

loc: components/db/DbProvider.jsx - line 55

// register parquet
await newDb.registerFileURL(
  'requests.parquet',
  datasets.parquet.hfYtd,
  4    // HTTP = 4. For more options: https://tinyurl.com/DuckDBDataProtocol
);

Step 5: The DbContext (later used as this.context) is defined and passed to the application

loc: components/db/DbProvider.jsx - line 109

<DbContext.Provider value={{ db, conn, worker }}>
  {children}
</DbContext.Provider>

Step 6: The data is queried and set in the front-end application

loc: components/Map/index.js - line 66, 76
....
createRequestsTable = async () => {
  const { conn } = this.context;

  // Create the 'requests' table.
  const createSQL =
    'CREATE TABLE requests AS SELECT * FROM "requests.parquet"'; // parquet

  await conn.query(createSQL);
};

async componentDidMount(props) {
  this.isSubscribed = true;
  this.processSearchParams();
  await this.createRequestsTable();
  await this.setData();
}

Previous Notes

1 - The Parquet files are in separate HuggingFace repos, so I’m not sure if we can query multiple files as shown in the DuckDB multiple-files doc (see Resources). A potential solution would be putting the Parquet files in a single repo (but consider the limitations of HuggingFace repos, also linked in Resources).

2 - We may have to make a GET call in order for this to work, and I’m not sure if we have the capability to run a GET call after the application loads.

Note: My understanding is that the data is pulled once, at the beginning when the application loads, through a DuckDB initialize() (loc: components/db/DbProvider.jsx line: 88), and then set in the <DbContext.Provider value={{..}}> (loc: components/db/DbProvider.jsx line: 108).

2.5 - So, my question now is: can we make API calls without a backend? Can we run an Express server to make the call?

3 - If we can't run an Express server, we can potentially look into putting the 2024, 2023, 2022, etc. Parquet data into a single HuggingFace repo and use the DuckDB multiple-files doc (see Resources) to execute a query that gathers all the data on load.
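On the GET question in note 2: the existing `registerFileURL` call already fetches over HTTP from the browser, so it seems plausible (not yet verified) that the same call could be made later, for example when a search is submitted, without any backend. A sketch of that idea, with a hypothetical `createYearLoader` helper and a caller-supplied `urlForYear` function standing in for the real repo URLs:

```javascript
// Returns an ensureYear(year) function that registers each year's parquet
// at most once, even if several searches ask for the same year at the same time.
// `db` is expected to expose registerFileURL returning a Promise, as newDb does.
function createYearLoader(db, urlForYear, protocol = 4) {
  const pending = new Map(); // year -> Promise resolving to the registered file name

  return function ensureYear(year) {
    if (!pending.has(year)) {
      const name = `requests${year}.parquet`; // hypothetical naming scheme
      pending.set(
        year,
        db.registerFileURL(name, urlForYear(year), protocol).then(() => name),
      );
    }
    return pending.get(year);
  };
}
```

A search handler could then `await ensureYear(2023)` before querying. If this works in practice, an Express server would not be needed for this feature.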

Resources/Instructions

DuckDB docs - https://duckdb.org/docs/api/wasm/overview
DuckDB pull multiple parquets docs - https://duckdb.org/docs/data/multiple_files/overview.html
Huggingface repo limitations - https://huggingface.co/docs/hub/repositories-recommendations

traycn avatar Feb 02 '24 20:02 traycn