WizardLM icon indicating copy to clipboard operation
WizardLM copied to clipboard

Empty and orphaned instructions

Open sanderland opened this issue 11 months ago • 0 comments

The dataset on huggingface contains many empty instructions. You can find these using:

import datasets
ds = datasets.load_dataset("WizardLM/WizardLM_evol_instruct_V2_196k")
empty_ids = []
for r in ds['train']:
    if r['conversations'][0]['value']=='':
        print(r)
        empty_ids.append(r['idx'])
print(len(empty_ids),"empty:",empty_ids)

Outputs examples such as

{'idx': 'uwUskKh', 'conversations': [{'from': 'human', 'value': ''}, {'from': 'gpt', 'value': 'A pivot table is a summary table that lets you arrange data in various ways, helping you summarize and analyze large data sets. It allows you to manipulate data in a number of ways to see different perspectives of the data. Pivot tables can be used to create summaries, count data, calculate percentages, and more.'}]}

and 240 ids:

240 empty: ['mQlSpXY', '1XZ33Um', 'AUqQFGL', 'T13eJHG', 'fHPVoOX', 'Qe9TqOb', '203q5Y8', 'c0jEzjw', 'l2vmmzy', '6fpbyet', 'Dk2SPuA', '2okWfpl', 'gWdwZZx', '007U6yx', 'b5DOnro', '5yLZrrI', 'B06nXol', 'OF40x3L', '0cnh9bi', 'EALn0yj', 'J6w5qY3', 'ouIsOFc', '2eKl5ro', 'BqKXH7I', '8xZSOVr', 'zqQa6sO', 'MhYZB5W', 'YH3pfqH', 'ZLaVtrz', 'RYgWvFF', '8USkHok', '7kQEzRU', 'O9asDoO', 'qhzzcdy', 'MaAmkA5', 'EOf07xY', 'uRui3Tx', 'ygEB0tH', 'WM3NEhO', 'Nb80dUr', 'RlHQMEj', '4PPBWUU', 'Lfr6JTo', 'EF1wPgG', 'd686zAH', 'pJtPGtW', 'EHbZOMc', 'JJu53iW', 'e2GpwbK', 'iGJImXo', 'Z50pBKR', 'qXnxo8C', 'z9okO1B', 'TIQxyVj', 'FsRsQcH', '8lmGs77', 'HJhlusD', 'L1eXjpg', 'aMgb9Un', 'ofbwOJI', 'fINS4I1', 'er0EBLw', 'voH9CHc', 'WxeSqOE', 'my428lw', '7TdUJsD', 'ZgtoeGU', 'OI5TzX0', 'wkubp2M', '7Uc04ss', 'IXyr5Zb', '2mWmf2Y', 'MwNytnK', 'mQlSpXY', 'mi3kIZ7', 'ISzjrGW', 'z6p5BEr', 'MOQ9Vlh', 'cK4LV5U', 'vZWyKGa', 'eI9lQXQ', '05Q7yeX', '35WaL8B', 'KJwV5p9', '7Lfbo02', 'wLn0ZhY', 'cuCftWB', 'txtbFgw', 'rg1iI7C', '9cNucRt', 'ZdwWij9', '4Svqckd', 'nAqAZVV', 'lG64JuN', 'Wnbympx', 'eyT73S3', 'ICSRFHn', '6Yfj9A3', 'ITTcFaF', 'mQb5h5E', 'TxuQTou', 'G44amNd', 'uTJDqUf', '6cSHUWC', 'Zhj03gx', 'notYi1O', 'eV58lm8', 'v1HdHbk', 'QnpwpHS', 'bZKyiOE', '3BZA4Wx', 'HnMff0L', 'YbeBbhK', 'fAnhOXN', 'ZLuurGK', 'xSD3WjY', 'cS1hkob', 'yQLHHb6', 'VD1MyV9', 'KKGltOT', 'BS24vZA', 'e03vKhl', 'nvQMOfk', 'OY6zxGs', '457LXnv', 'WIn0uT4', 'gD3M4fo', 'F9IXDti', 'qxRmAlf', 'pag7cpt', 'S5LYQui', 'Bhgsb9K', 'e91eq3p', '91jKuNT', 'CuzJ2vX', '9P7QkHO', '06r3kDq', 'UblRF22', 'hdD6yKq', 'E2ZWd8u', 'MMHHECb', 'kj0gEKT', 'WI1lzcx', 'OFkW1dt', 'jUqlQcq', '5sYfOo5', 'kvDFJDV', 'fPoIUEa', '4S9vuE1', '3O3k7Ni', 'ZHU5XwD', 'jdr1XAI', 'GmFvyAB', 'aKZB5V8', 'Y8QOJL1', 'YtkxXVW', 'Vd91QY0', 'FiocSWZ', 'qay7Kc3', '97ygA9a', '1zkhOYq', '7Uw8u27', 'Evq1C5s', 'Cmku0Y3', 'HyOto5f', 'L8c4SeN', 'FKirFRg', 'YbVE3rL', 'tvXVQGI', 'kDVbcMQ', 'fij4plV', 'M2Pg6hK', 'uIg0EfD', 'nLgLIRp', '1naX7G6', 'lxpLo5L', '1YjlEli', 'ojG9AYj', 'asamoZf', 'Le6YUxn', 'H0FMtrb', 'mWsAQD2', 'oU4HkI8', 'ZyHt8il', 'Wl45lzd', 'kxFcxKk', '4UwEIgW', 'cBeGtBo', 'oTwSJkH', 'OFkW1dt', 'Xl6ZzUu', 'rAYYUit', 'NWzv7lW', 'lTPIHwG', '3DcClR6', 'uJinBUA', '4xxS8lo', '25JUXlt', 'pJtPGtW', 'tgutqyy', '9bDHAEm', 'Di18mFG', 'K7AsKHZ', 'YiKuJHN', 'dPWFLhL', 'tiybjs8', 'S2rJqCG', 'eV58lm8', 'FLrNuAf', 'VYSLTMt', 'I6ZawgG', 'Bs9iB2o', '9rNfDDB', '5lPpjGu', 'CSUXpGQ', 'wpKUEf5', 'aAdTjSD', 'yysfQk4', 'ig5G2QQ', 'lG64JuN', 'OQqrCWt', 'IbwGePu', '0IFEMjM', 'uZmHNrE', 'qLLtcO4', 'l2uDNVc', 'vQzLXb0', '0fXaWSr', 'dMu2FhP', '1a96pcp', '2yczPqP', 'OFkW1dt', 'VIQWgVZ', 'gB6KCVj', 'lGWJFZl', 'JwWmQ0d', 'Rk8fEVd', 'uwUskKh', '9BjORXr', 'O6Z88xm']

In addition there are instructions that appear to refer to previous conversational turns not included, such as:

{'idx': 'OrU7VFN', 'conversations': [{'from': 'human', 'value': 'I apologize for the inconvenience, but the input provided lacks essential information necessary for generating a detailed and comprehensive Excel table. In order to construct a table that meets your requirements, I request you to provide more specific details regarding the type and structure of data that you want included. Additionally, to avoid oversimplifying the process, could you please break down the information into smaller chunks or steps?\r\n\r\nFurthermore, I would like to pose a series of follow-up questions that require multiple-step reasoning to answer. Firstly, could you identify the project with the highest return on investment based on the specific data points presented in the table? Secondly, could you outline the factors that influenced your decision-making process? Finally, could you elaborate on the selection process you used to arrive at your conclusion? Your answers will provide me with a better understanding of your thought process and enable me to deliver more accurate results.'}, {'from': 'gpt', 'value': 'Additionally, could you provide some context about the purpose of the Excel table? Knowing the intended use of the table can help me better understand how to structure and format the data to best suit your needs. For example, if the table will be used for presenting to stakeholders, we may want to include additional details such as charts and graphs to enhance the visual impact. Alternatively, if the table will be used for data analysis, we may want to include additional calculations and formulas to allow for further analysis of the data.'}]}

sanderland avatar Jul 16 '23 14:07 sanderland