Data source: datasheets?

Open mm12 opened this issue 2 years ago • 10 comments

Could we get datasheets of a bunch of popular chips, MCUs, etc, and feed them into training data? Alternatively (or also) perhaps use libraries that the manufacturer provides.

Example use cases:

Q: "Why won't my code work [code] on my board?"
A: "You are trying to use WIFI, but should be using WIFI0, which is defined in the library for your board."

Q: "Why won't my sensor respond?"
A: "You need to send it the 0xAF command to turn off sleep mode."

mm12 avatar Feb 16 '23 22:02 mm12

I don't see why not, licenses permitting. Would you be interested in working on this?

olliestanley avatar Feb 16 '23 22:02 olliestanley

Possibly, what would I need to do, and how?

mm12 avatar Feb 16 '23 22:02 mm12

The general process for providing datasets is just to create a Jupyter notebook in notebooks/data-augmentation in the repo, which will download the data (you can upload it to Hugging Face or similar if it isn't already easily available). The notebook should convert the data to the simple Q-A format we need for training, e.g. JSONL where each line has a prompt and a response, and write it locally. Then you can make a PR with the notebook (but don't include the downloaded data itself).
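
A minimal sketch of the "write it locally" step, assuming the JSONL keys are literally named `prompt` and `response` (the thread names the fields but not the exact schema). The function name `write_qa_jsonl` and the output filename are made up for illustration.

```python
import json
from pathlib import Path


def write_qa_jsonl(pairs, path):
    """Write (prompt, response) pairs as JSONL: one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt, response in pairs:
            # Key names are an assumption based on the description above.
            record = {"prompt": prompt, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical example pair, taken from the use case in the issue text.
    example_pairs = [
        (
            "Why won't my sensor respond?",
            "You need to send it the 0xAF command to turn off sleep mode.",
        )
    ]
    write_qa_jsonl(example_pairs, "datasheet_qa.jsonl")
```

Writing one object per line keeps the file streamable, so a training script can read it record by record without loading everything at once.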

olliestanley avatar Feb 16 '23 22:02 olliestanley

I haven't used Jupyter a lot (or complicated Python, honestly), but if I'm understanding correctly, I think I could give it a try. Looking at other entries in notebooks, what you're saying is I need to:

  1. Use an API to get data from sources (perhaps Octopart or something?)
  2. Somehow make them all plaintext (they would typically be PDFs)
  3. Somehow split the data up into questions and answers (not sure if there is a good way to automate this)

mm12 avatar Feb 16 '23 22:02 mm12

Pretty much. See below an example notebook which grabs a bunch of Reddit comments and applies some logic to convert them to Q-A format. Not sure how easy it would be to do reliable conversion to plaintext if you can only get them in PDF format though. I also have no clue how licenses would work with something like Octopart.

https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/changemyview-builder/data_processor.ipynb

olliestanley avatar Feb 16 '23 23:02 olliestanley

I wonder if I could find ones that have a format like this:

/values for this thing/
Define keyword value; // first one

and could restructure it like this:

Q: when using this thing, how do you do first one?
A: for first one on thing, you can use keyword, which is value

and in datasheets:

Q: How do I do (description of thing)
A: Use (thing being described)
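
A rough sketch of that restructuring for C header files, assuming each useful line looks like `#define NAME VALUE` followed by a `//` or `/* ... */` comment. The regex, the function name `defines_to_qa`, and the question/answer wording are all illustrative choices, not an agreed format; real headers will have many lines this does not handle.

```python
import re

# Matches a #define that carries a trailing comment, e.g.
#   #define NAME VALUE // description
#   #define NAME VALUE /* description */
DEFINE_RE = re.compile(
    r"^\s*#\s*define\s+(?P<name>\w+)\s+(?P<value>.+?)\s*"
    r"(?:/\*\s*(?P<block>.*?)\s*\*/|//\s*(?P<line>.*?))\s*$"
)


def defines_to_qa(header_text, device):
    """Turn commented #define lines into prompt/response dicts."""
    pairs = []
    for raw in header_text.splitlines():
        match = DEFINE_RE.match(raw)
        if not match:
            continue  # not a #define, or it has no comment to build a question from
        description = match.group("block") or match.group("line")
        if not description:
            continue  # empty comment, nothing to ask about
        pairs.append(
            {
                "prompt": f"On the {device}, which definition corresponds to: {description}?",
                "response": f"Use {match.group('name')}, which is defined as {match.group('value')}.",
            }
        )
    return pairs
```

For a made-up line such as `#define SENSOR_WAKE 0xAF // turn off sleep mode`, this produces a prompt mentioning "turn off sleep mode" and a response naming `SENSOR_WAKE`. Definitions without a comment are skipped, since there is no description to build a question from.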

The problem is that datasheets and drivers are typically copyrighted, so I don't know how we could...

mm12 avatar Feb 17 '23 00:02 mm12

Stuff like this is ideally what I was thinking of: https://github.com/ArduCAM/Energia/blob/master/hardware/tools/msp430/msp430/include/msp430fr6989.h

mm12 avatar Feb 18 '23 01:02 mm12

I have assigned this to you, @mm12. Thank you!

huu4ontocord avatar Feb 20 '23 13:02 huu4ontocord

Do you know if there is anyone who can help me figure out the licensing?

mm12 avatar Feb 20 '23 16:02 mm12

I am not sure, @mm12. It depends on where you are located. If you don't feel comfortable with the license, I would recommend you not pursue it. We want to do the right thing :)

huu4ontocord avatar Feb 24 '23 06:02 huu4ontocord

Yeah, I don't have the knowledge to do this. If anyone else is interested in taking it on, feel free.

mm12 avatar Feb 26 '23 18:02 mm12