Data Scientist vs. Data Manager

So, what exactly is a data scientist? How does this role compare to a data manager?

The authors of a 2005 National Science Foundation (NSF) report defined five actors in data management: data users, authors, managers, scientists and funding agencies. Today, I will examine the data scientist vs. the data manager.

First, what are the shared goals of the five actors in data management?

  • ensure that all legal obligations and community expectations for protecting privacy, security, and intellectual property are fully met;
  • participate in the development of community standards for data collection, deposition, use, maintenance, and migration;
  • work towards interoperability between communities and encourage cross-disciplinary data integration;
  • ensure that community decisions about data collections take into account the needs of users outside the community;
  • encourage free and open access wherever feasible; and
  • provide incentives, rewards, and recognition for scientists who share and archive data (NSF, 2005).

To fulfill these goals, an organization will need one or more individuals who can fill the roles of data scientist and data manager. I say “one or more” simply because I believe that, at one time or another, a researcher may find him- or herself acting as the sole data user, author, manager, and scientist.

DATA SCIENTIST

So, how do the authors of the NSF report define data scientist?

The interests of data scientists – the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection – lie in having their creativity and intellectual contributions fully recognized. In pursuing these interests, they have the responsibility to:

  • conduct creative inquiry and analysis;
  • enhance through consultation, collaboration, and coordination the ability of others to conduct research and education using digital data collections;
  • be at the forefront in developing innovative concepts in database technology and information sciences, including methods for data visualization and information discovery, and applying these in the fields of science and education relevant to the collection;
  • implement best practices and technology;
  • serve as a mentor to beginning or transitioning investigators, students and others interested in pursuing data science; and
  • design and implement education and outreach programs that make the benefits of data collections and digital information science available to the broadest possible range of researchers, educators, students, and the general public (NSF, 2005).

The authors go on to state that the insights provided by data scientists to the research process have contributed to the research results and literature. A data scientist with a fundamental understanding of the data’s representation can and does complement the domain specialist’s knowledge. They recommend that these contributions be recognized by first authorship of peer-reviewed publications of the research results.

DATA MANAGER

Now, let’s examine how the authors of the NSF report defined a data manager.

Data managers – the organizations and data scientists responsible for database operation and maintenance – have the responsibility to:

  • be a reliable and competent partner in data archiving and preservation, while maintaining open and effective communication with the served community;
  • participate in the development of community standards including format, content (including metadata), and quality assessment and control;
  • ensure that the community standards referenced above are universally applied to data submissions and that updated standards are reflected back into the data in a timely way;
  • provide for the integrity, reliability, and preservation of the collection by developing and implementing plans for backup, migration, maintenance, and all aspects of change control;
  • implement community standards through processes such as curation, annotation, technical standards development and implementation, quality analysis, and peer-review (some of these functions, defined in this report as community-proxy functions, apply primarily to resource and reference collections and may not apply to many research collections);
  • provide for the security of the collection;
  • provide mechanisms for limiting access to protect property rights, confidentiality, privacy, and to enable other restrictions as necessary or appropriate;
  • encourage data deposition by authors by making it as easy as possible to submit data; and
  • provide appropriate contextual information including cross-references to other data sources.

To be successful, the data manager must gain the trust of the community that the collection serves. Thus, collections policy should emphasize the role of the community in working with data managers (NSF, 2005).

AND THE DIFFERENCE IS….

I see the difference as primarily between working with the data itself and providing the management and technical infrastructure around the data. That is, a data scientist is concerned with both using the data and making it available at a granular level, while the data manager is concerned with the infrastructure that surrounds it. For example, the data scientist would analyze the data representation, write code, and write the higher-level policies for the management of the data. The data manager would ensure those policies are enforced, at either the machine level or the human level, and ensure the technical infrastructure storing the data meets preservation and curation best practices and standards.
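
To make that split a little more concrete, here is a minimal Python sketch of one way it could look in practice. It is an illustration only, not something taken from the NSF report: the class name CollectionPolicy, the function check_submission, and the example metadata fields and file formats are all hypothetical. The policy object stands in for the higher-level rules a data scientist might write, and the checking function stands in for the machine-level enforcement a data manager might run when data is submitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionPolicy:
    """Higher-level policy a data scientist might write (hypothetical fields)."""
    required_metadata: frozenset
    allowed_formats: frozenset


def check_submission(policy, filename, metadata):
    """Machine-level enforcement a data manager's system might run.

    Returns a list of policy violations; an empty list means the
    submission satisfies the policy.
    """
    violations = []
    # Take the text after the last dot as the file format, if there is one.
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension not in policy.allowed_formats:
        violations.append(f"format '{extension}' is not allowed")
    # Report every required metadata field the author did not supply.
    for field in sorted(policy.required_metadata - set(metadata)):
        violations.append(f"missing metadata field '{field}'")
    return violations


# Hypothetical policy for an imaginary collection.
EXAMPLE_POLICY = CollectionPolicy(
    required_metadata=frozenset({"creator", "license"}),
    allowed_formats=frozenset({"csv", "txt"}),
)
```

In this sketch, the data scientist would own EXAMPLE_POLICY and change it as community standards evolve, while the data manager would own check_submission and everything around it, such as where it runs and what happens to a submission that fails.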

What similarities and differences do you see between a data manager and a data scientist?

Jewel Ward