vscode-data-preview
Add data.world integration for displaying and exploring public data sets
Synopsis
Data Preview 🈸 is used today by thousands of developers and data scientists worldwide for quick previews of large flat-file data sets in Visual Studio Code.
data.world is an online data catalog platform used by many Fortune 500 companies and journalists for data collaboration, exchange and analysis by data teams and domain experts.
Visual Studio Code is a primary IDE (integrated development environment) favored by millions of developers, data scientists and data engineers due to its unmatched support for many programming languages, remote development environments and systems.
Online Data Challenges
One of the many challenges developers and data teams face today is easy access to online systems and to public and private data repositories for exploratory data analysis (EDA).
Many online REPL (read-evaluate-print loop) systems provide a means of exploring data in a notebook format that fosters data storytelling, interleaving text about the data sets with data manipulation and graphs.
Jupyter Notebooks is a dominant, industry-accepted platform for data analysis with support for the Python, R, Julia and Scala programming languages.
The VS Code IDE provides excellent extensions for those languages and for working with Jupyter Notebooks.
What's lacking is a consistent way of getting public and private data sets into VS Code for developers and data teams to easily access and analyze data with Jupyter Notebooks or via other means in that IDE.
VS Code Data Solution
Microsoft recently introduced the new vscode.workspace.fs API, which allows VS Code extension developers to connect to remote data sources and create data providers for different online data and file systems.
The next major iteration of Data Preview 🈸 will use the new remote data access APIs to retrieve data files over HTTP, FTP, WebDAV and other well-established URI-based data transfer protocols.
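As a rough illustration of URI-based access, the sketch below routes a data file URI to a provider by its scheme. It is not Data Preview's implementation: the provider names, the `dataworld` scheme, and the `getScheme` and `resolveProvider` helpers are all hypothetical and exist only for this example.

```typescript
// Hypothetical provider names; none of these exist in Data Preview today.
type ProviderName = "http" | "ftp" | "webdav" | "dataworld";

// Assumed mapping of URI schemes to the provider that would handle them.
const providersByScheme = new Map<string, ProviderName>([
  ["http", "http"],
  ["https", "http"],
  ["ftp", "ftp"],
  ["webdav", "webdav"],
  ["dataworld", "dataworld"], // assumed custom scheme, not a registered one
]);

// Extracts the lower-cased scheme from a URI using the RFC 3986 scheme grammar.
function getScheme(uri: string): string | undefined {
  const match = /^([A-Za-z][A-Za-z0-9+.-]*):/.exec(uri);
  return match?.[1]?.toLowerCase();
}

// Returns the provider for a URI, or undefined when the scheme is missing or unsupported.
function resolveProvider(uri: string): ProviderName | undefined {
  const scheme = getScheme(uri);
  return scheme === undefined ? undefined : providersByScheme.get(scheme);
}
```

A caller could, for example, pass a file URL to `resolveProvider` and fall back to local file handling when it returns `undefined`.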
Data Preview already supports many text and binary data formats for a quick preview and data manipulation or basic charting.
Data.World Integration Proposal
As part of the extended remote data support in Data Preview, we would like to add a data.world connector for navigating and exploring projects and data sets from this enterprise online data catalog.
Primary Use Case
Domain experts and business analysts can create data projects and upload artifacts to data.world to collaborate with developers and data analysts.
The latter group will be able to connect to those public or private data repositories hosted by data.world and retrieve or update data files from VS Code using the Data Preview extension.
Furthermore, they can filter and cleanse those data sets and save them in more suitable formats such as CSV, JSON, or Apache Arrow supported by Data Preview today for further data analysis with Jupyter notebooks or via custom JavaScript, Java, or Python scripts.
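The filter, cleanse and save step could look roughly like the sketch below. It is an illustration only and does not use Data Preview's own parsers: `parseSimpleCsv` and `dropIncompleteRows` are hypothetical helpers, and the CSV handling assumes simple input with no quoted fields or embedded commas.

```typescript
// Parses simple CSV text (assumption: no quoted fields or embedded commas) into row objects.
function parseSimpleCsv(text: string): Record<string, string>[] {
  // Split into lines and skip blank ones.
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const [headerLine, ...dataLines] = lines;
  if (headerLine === undefined) return [];
  const headers = headerLine.split(",").map((header) => header.trim());
  return dataLines.map((line) => {
    const cells = line.split(",").map((cell) => cell.trim());
    const row: Record<string, string> = {};
    // Missing trailing cells are stored as empty strings.
    headers.forEach((header, index) => {
      row[header] = cells[index] ?? "";
    });
    return row;
  });
}

// Cleansing step: keeps only rows where every column has a non-empty value.
function dropIncompleteRows(rows: Record<string, string>[]): Record<string, string>[] {
  return rows.filter((row) => Object.keys(row).every((key) => row[key] !== ""));
}

// Example: cleanse a small CSV sample and serialize the result as JSON.
const sampleCsv = "name,age\nAda,36\nBob,\nEve,29\n";
const cleanedJson = JSON.stringify(dropIncompleteRows(parseSimpleCsv(sampleCsv)));
```

Here the row for Bob is dropped because its age cell is empty; a real pipeline would likely need a proper CSV parser and typed columns.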
Proposed Implementation Details
The next major release of Data Preview will feature a new dedicated Data Preview Tree View Container UI for connecting to remote data servers and retrieving data. This includes FTP and HTTP data services that require credentials, and other similar remote and cloud file storage systems.
As part of this implementation, we would add a custom data.world Data Connector that displays projects, data sets and files in a Tree View, using the public OpenAPI (a.k.a. Swagger) dwapi-spec.
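One way to picture the projects, data sets and files hierarchy is as a small tree model that a Tree View could later render. The sketch below is only an assumption about how such a model might be built: `TreeNode`, `FileRecord` and `buildTree` are hypothetical names, and the flat record shape is invented for illustration rather than taken from dwapi-spec.

```typescript
// Hypothetical node kinds mirroring the proposed projects / data sets / files hierarchy.
type NodeKind = "project" | "dataset" | "file";

interface TreeNode {
  kind: NodeKind;
  label: string;
  children: TreeNode[];
}

// Flat record shape assumed for illustration; real API responses are not modeled here.
interface FileRecord {
  project: string;
  dataset: string;
  file: string;
}

// Groups flat file records into a project > dataset > file tree, preserving first-seen order.
function buildTree(records: FileRecord[]): TreeNode[] {
  const projects = new Map<string, TreeNode>();
  const datasets = new Map<string, TreeNode>();
  for (const record of records) {
    let project = projects.get(record.project);
    if (!project) {
      project = { kind: "project", label: record.project, children: [] };
      projects.set(record.project, project);
    }
    // Data set names are only unique within a project, so key on both.
    const datasetKey = JSON.stringify([record.project, record.dataset]);
    let dataset = datasets.get(datasetKey);
    if (!dataset) {
      dataset = { kind: "dataset", label: record.dataset, children: [] };
      datasets.set(datasetKey, dataset);
      project.children.push(dataset);
    }
    dataset.children.push({ kind: "file", label: record.file, children: [] });
  }
  return Array.from(projects.values());
}
```

Keeping the model separate from any UI code would let the same tree back a VS Code Tree View or be exercised in plain unit tests.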
Targeted Features
For the first iteration of this data.world Data Preview integration, we would like to target the following features:
- Authenticate a data.world user and save credentials for project and data set introspection
- List the projects a data.world user has access to
- List project data sets
- Download project data files
- Preview data files in Data Preview
- Save and upload modified data files
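The features listed above could be captured in a small connector contract. The following is a minimal sketch under stated assumptions, not the proposed implementation: the `DataConnector` interface and `InMemoryConnector` class are hypothetical, the methods are synchronous for brevity (a real connector would call the data.world API asynchronously), and an in-memory map stands in for the remote service.

```typescript
// Hypothetical connector contract covering the targeted features.
// All names are illustrative; none come from data.world or Data Preview.
interface DataConnector {
  authenticate(token: string): boolean;
  listProjects(): string[];
  listDatasets(project: string): string[];
  downloadFile(project: string, dataset: string, file: string): string | undefined;
  uploadFile(project: string, dataset: string, file: string, content: string): void;
}

// In-memory stand-in for a remote service so the flow can be exercised offline.
class InMemoryConnector implements DataConnector {
  private readonly validToken: string;
  private authenticated = false;
  // project -> data set -> file name -> file content
  private readonly store = new Map<string, Map<string, Map<string, string>>>();

  constructor(validToken: string) {
    this.validToken = validToken;
  }

  authenticate(token: string): boolean {
    this.authenticated = token === this.validToken;
    return this.authenticated;
  }

  // Guards every data operation behind a successful authenticate() call.
  private requireAuth(): void {
    if (!this.authenticated) throw new Error("Not authenticated");
  }

  listProjects(): string[] {
    this.requireAuth();
    return Array.from(this.store.keys());
  }

  listDatasets(project: string): string[] {
    this.requireAuth();
    const datasets = this.store.get(project);
    return datasets ? Array.from(datasets.keys()) : [];
  }

  downloadFile(project: string, dataset: string, file: string): string | undefined {
    this.requireAuth();
    return this.store.get(project)?.get(dataset)?.get(file);
  }

  // Creates the project and data set on demand, then stores or overwrites the file.
  uploadFile(project: string, dataset: string, file: string, content: string): void {
    this.requireAuth();
    let datasets = this.store.get(project);
    if (!datasets) {
      datasets = new Map<string, Map<string, string>>();
      this.store.set(project, datasets);
    }
    let files = datasets.get(dataset);
    if (!files) {
      files = new Map<string, string>();
      datasets.set(dataset, files);
    }
    files.set(file, content);
  }
}
```

A typical round trip would authenticate, download a file, modify its content locally, and upload it again under the same project and data set.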
Out of scope:
- Data Streaming
- SQL and SPARQL Queries
Note: data.world Streams support might be added in later iterations once Data Preview also provides those capabilities for larger data sets.