Data Related Terms (Definitions)
Data
A collection of facts.
Analysis
- The process used to make sense of the data collected.
- The goal of analysis is to identify trends and relationships within data so that you can accurately answer the questions you are asking for.
- There are 4 phases of Analysis:
- Organise data
- Format and adjust data
- Get input from others
- Transform data
- Organise data
- Format and adjust data
- Get input from others
- Transform data
Dataset
A collection of data that can be manipulated or analysed as one unit.
Data Type
A specific kind of data attribute that tells what kind of value the data is.
Database
A collection of data stored in a computer system.
Data Analysis
The collection, transformation and organisation of data in order to draw conclusions, make predictions and drive informed decision making.
Data Analyst
Someone who collects, transforms and organises data in order to produce, manage, store, organise, analyse and share data.
Data Ecosystem
The various elements that interact with one another in order to produce, manage, store, organise, analyse and share data.
Cloud
A place to keep the data online, rather than a computer hard drive.
Metric
Single, quantifiable type of data that can be used for measurement.
Metric Goal
A measurable goal set by a company and evaluated using metrics.
Data Science
Creating new ways of modelling and understanding the unknown using raw data.
Data Analytics
The science of data.
Data Driven Decision Making
Using facts to guide business strategies.
Data-Inspired Decision Making
Explores different data sources to find out what they have in common.
People Analytics
Also known as human resource analytics/workforce analytics.
It is the practice of collecting and analysing data on the people who make up a company's workforce in order to gain insights to improve how the company operates.
Report
Static collection of data given to stakeholders periodically.
Pros:
- High-level historical data
- Easy to design
- Pre-cleaned and sorted data.
Cons:
- Continual maintenance
- Less visually appealing
- Static
Dashboard
Monitors live, incoming data.
A dashboard is a tool that organises information from multiple datasets into one central location for tracking, analysis and simple visualisation.
Pros:
- Dynamic, automatic and interactive
- More stakeholder access
- Low maintenance
Cons:
- Labour-intensive design
- Can be confusing
- Potentially uncleaned data
Structured Thinking
The process of recognising the current problem or situation, organising the available information, revealing gaps and opportunities, and identifying the options.
Quantitative Data
Specific and objective measures of numerical facts.
The numbers are visualised as charts or graphs.
- The What ?
- How Many ?
- How Often ?
Quantitative data tools:- Structured interviews
- Surveys
- Polls
- Structured interviews
- Surveys
- Polls
Qualitative Data
Subjective or explanatory measures of qualities and characteristics.
Explains WHY the numbers are the way they are.
- The Why ?
Qualitative data tools:- Focus groups
- Social media text analysis
- In-person interviews
- Focus groups
- Social media text analysis
- In-person interviews
Pivot Table
- A data summarising tool that is used in data processing.
- They are used to summarise, sort, reorganise, group content, total or average data stored in a database.
- It allows users to transform rows into columns and vice-a-versa.
Formula
A set of instructions that perform a specific calculation using the data in a spreadsheet.
Function
A preset command that automatically performs a specific process or task using the data in a spreadsheet.
Operator
A symbol that names the type of operation or calculation to be performed.
Cell Reference
A cell or a range of cells in a worksheet that can be used in a formula.
Range of cells
A collection of two or more cells.
Problem Domain
The specific area of analysis that encompasses every activity affecting or affected by the problem.
Structured Thinking
The process of recognising the current problem or situation, organising available information, revealing gaps and opportunities, and identifying the options.
Statement Of Work (SOW)
It is a document that clearly identifies the products and services a vendor or a contractor will provide to an organisation.
It includes objectives, guidelines, deliverables, schedule, and costs.
Scope Of Work (SOW)
An agreed upon outline of the work you are going to perform on a project.
A scope of work may be included in a statement of work to help define project outcomes.
Context
The condition in which something exists or happens.
Stakeholders
People that have invested time, interest and resources into the projects you will be working on as a data analyst.
First Party Data
Data collected by an individual or group by using their own resources.
Second Party Data
Data collected by a group directly from its audience and then sold.
Third Party Data
Data collected from outside sources who did not collect it directly.
Population
All possible data values in a certain dataset.
Sample
A part of a population that is a representative of the population.
Discrete Data
Data that is counted and has a limited number of values.
Continuous Data
Data that is measured and can have almost any numeric value.
Nominal Data
A type of qualitative data that is categorised without a set order.
Ordinal Data
A type of qualitative data with a set order or scale.
Internal Data
Data that lives within a company's own systems.
External Data
Data that lives and is generated outside of an organisation.
Structured Data
Data organised in a certain format such as rows and columns.
Unstructured Data
Data that is not organised in an easily identifiable manner.
Data Elements
Pieces of information, such as people's name, account numbers and addresses.
Data Models
A model that is used for organising data elements and how they relate to one another.
Text Or String Data Type
A sequence of characters and punctuation that contains textual information.
Boolean Data Type
A data type with only two possible values, such as TRUE or FALSE.
Wide Data
Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject.
Long Data
Data in which each row is one time point per subject, so each subject will have data in multiple rows.
Data Bias
A type of error that systematically skews results in a certain direction.
There are three types of data bias :
- Sampling Bias
- Observer Bias
- Interpretation Bias
- Confirmation Bias
Sampling Bias
When a sample isn't representative of the population as a whole.
Unbiased Sample
When a sample is representative of the population being measured.
Observer Bias
The tendency for different people to observe things differently.
Interpretation Bias
The tendency to always interpret ambiguous situations in a positive or negative way.
Confirmation Bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs.
Good Data
- R- Reliable
- O- Original
- C- Comprehensive
- C- Current
- C- Cited
Bad Data
R- ReliableO- OriginalC- ComprehensiveC- CurrentC- Cited
Data Ethics
Well-founded standards of right and wrong that dictate how data is collected, shared and used.
GDPR
General Data Protection Regulation Of The European Union
Aspects Of Data Ethics
- Ownership
- Individuals own the raw data they provide and they have primary control over its usage, how it's processed and how it's shared.
- Transaction transparency
- All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
- Consent
- An individual's right to know explicit details about how and why their data will be used before agreeing to provide it.
- Currency
- Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
- Privacy
- Preserving a data subject's information and activity anytime a data transaction occurs.
- Openness
- Free access, sharing and usage of data.
- Individuals own the raw data they provide and they have primary control over its usage, how it's processed and how it's shared.
- All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
- An individual's right to know explicit details about how and why their data will be used before agreeing to provide it.
- Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
- Preserving a data subject's information and activity anytime a data transaction occurs.
- Free access, sharing and usage of data.
Data Interoperability
The ability of data systems and services to openly connect and share data.
Metadata
Data about data.
Metadata is used in database management to help data analysts interpret the contents of the data within the database.
Metadata is stored in a single, central location and gives the company standardised information about all of its data.
There are 3 types of Metadata:
- Descriptive
- Metadata that describes a piece of data and can be used to identify at a later point of time.
- Structural
- Metadata that indicates how a piece of data is organised and whether it is part of one, or more than one, data collection.
- Administrative
- Metadata that indicates the technical source of a digital asset.
Metadata Repository
A database specifically created to store metadata.
Metadata repositories make it easier and faster to bring together multiple sources for data analysis.
Metadata Repositories :
- Describe the state and location of the metadata.
- Describe the structures of the tables inside.
- Describe how the data flows through the repository.
- Keep track of who accesses the metadata and when.
Relational Database
A database that contains a series of related tables that can be connected via their relationships.
Primary Key
- An identifier that references a column in which each value is unique.
- Used to ensure data in a specific column is unique.
- Uniquely identifies a record in a relational database table.
- Only one primary key is allowed in a table.
- Cannot contain null or blank values.
Foreign Key
- A field within a table that is a primary key in another table.
- A column or group of columns in a relational database table that provides a link between the data in two tables.
- Refers to the field in a table that's the primary key of another table.
- More than one foreign key is allowed to exist in a table.
Data Governance
A process to ensure the formal management of a company's data assets.
Sorting Data
Arranging data into a meaningful order to make it easier to understand, analyse, and visualise.
Filtering Data
Showing only the data that meets a specific criteria while hiding the rest.
Naming Conventions
Consistent guidelines that describe the content, date or version of a file in its name.
Basically it means to use logical and descriptive names for your files to make them easier to find and use.
Data Security
Protecting data from unauthorised access or corruption by adopting safety measures.
Data Integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle.
Data Replication
The process of storing data in multiple locations.
Data Transfer
The process of copying data from a storage device to memory, or from one computer to another.
Data Manipulation
The process of changing data to make it look more organised and easier to read.
Random Sampling
A way of selecting a sample from a population so that every possible type of sample has an equal chance of being chosen.
Statistical Power
The probability of getting meaningful results from a test.
Hypothetical Testing
A way to see if a survey or experiment has meaningful results.
Confidence Level
The probability that your sample size accurately reflects the greater population.
Margin Of Error
The maximum amount that the sample results are expected to differ from those of the actual population.
To calculate margin of error you need :
- Population Size
- Sample Size
- Confidence Level
Estimated Response Rate
If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received your survey.
Dirty Data
Data that is incomplete, incorrect or irrelevant to the problem you are trying to solve.
Clean Data
Data that is complete, correct and relevant to the problem you are trying to solve.
Data Engineers
Transform data into a useful format for analysis and give it a reliable infrastructure.
Data Warehousing Specialists
Develop processes and procedures to effectively store and organise data.
Null
An indication that a value doesn't exist in a dataset.
Data Validation
- A tool for checking the accuracy and quality of data before adding or importing it.
- It basically allows you to control what can and can't be entered in your worksheet.
- Adds dropdown lists with predetermined options.
- Can create custom check boxes.
- Protect structured data and formulas.
Data Merging
The process of combining two or more datasets into a single dataset.
Compatibility
How well two or more datasets are able to work together.
Conditional Formatting
A spreadsheet tool that changes how cells appear when values meet specific conditions.
Remove Duplicates
A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
Text String
A group of characters within a cell, most often composed of letters.
Split
A tool that divides text around a specific character and puts each fragment into a new, separate cell.
Concatenate
A function that joins multiple text strings into a single string.
COUNTIF
A function that returns the number of cells that match a specified value.
Syntax
A predetermined structure that includes all required information and its proper placement.
LEN
A function that tells you the length of a text string by counting the number of characters it contains.
LEFT
A function that gives you a set number of characters from the left side of a text string.
RIGHT
A function that gives you a set number of characters from the right side of a text string.
MID
A function that gives you a segment from the middle of a text string.
TRIM
A function that removes leading, trailing, and repeated spaces in data.
VLOOKUP
VERTICAL LOOKUP
A function that searches for a certain value in a column to return a corresponding piece of information.
=VLOOKUP(data to look up, 'where to look'!Range, column, false)
VLOOKUP only returns the first match it finds.
VLOOKUP can only return the value from the data to the right.
TRUE tells VLOOKUP to look for approximate matches.
FALSE tells VLOOKUP to look for exact matches.
Data Mapping
The process of matching fields from one data source to another.
Schema
A way of describing how something is organised.
CAST()
This function can be used to convert one data type to another.
FLOAT
A number that contains a decimal.
Typecasting
Converting data from one type to another.
CONCAT()
Adds strings together to create new text strings that can be used as unique keys.
COALESCE
Can be used to return non-null values in a list.
Verification
A process to confirm that a data cleaning effort was well executed and the resulting data is accurate and reliable.
Changelog
A file containing a chronologically ordered list of modifications made to a project.
COUNTA
A function that counts the total number of values within a specified range.
CASE statement
The CASE statement goes through one or more conditions and returns a value as soon as the condition is met.
Documentation
The process of tracking changes, additions, deletions and errors involved in your data cleaning effort.
Sort Sheet
All of the data in a spreadsheet is sorted by the ranking of a specific sorted column - data across rows is kept together.
Sort Range
Nothing else on the spreadsheet is rearranged besides the specified cells in a column.
Customised sort order
When you sort data in a spreadsheet using multiple conditions.
R
A programming language frequently used for statistical analysis, visualisation, and other data analysis.
Data Aggregation
The process of gathering data from multiple sources in order to combine it into a single summarised collection.
Data can also be aggregated over a given time period to provide statistics such as
- Averages
- Minimums
- Maximums
- Sums
Functions help make data aggregation possible.
Subquery
A query within another query.
A SQL query that is nested inside a larger query.
VALUE
A function that converts a text string that represents a number to a numeric value.
Absolute Reference
A reference that is locked so that rows and columns won't change when copied. ($)
MATCH
A function used to locate the position of a specific lookup value.
JOIN
A SQL clause that is used to combine rows from two or more tables based on a related column.
INNER JOIN
A function that returns records with matching values in both tables.
LEFT JOIN
A function that will return all the records from the left table and only the matching records from the right table.
The table mentioned first is left and the table mentioned second is right.
RIGHT JOIN
A function that will return all the records from the right table and only the matching records from the left table.
OUTER JOIN
A function that combines right and left join to return all matching records in both tables.
COUNT in spreadsheets
Can be used to count the total number of numerical values within a specific range in spreadsheets.
COUNT in SQL
A query that returns the number of rows within a specified range.
COUNT DISTINCT
A query that only returns the distinct values in a specified range.
Basically as a data analyst to answer "how many" we use COUNT and COUNT DISTINCT.
Aliasing
When you temporarily name a table or column in your query to make it easier to read and write.
Because these names are temporary so they only last for the particular query.
HAVING
Allows you to add a filter to your query instead of the underlying table that can only be used with aggregate functions.
CASE
Returns records with your condition by allowing you to include if/then statements in your query.
Summary Table
A table used to summarise statistical information about data.
SUMIF
A function that adds numeric data based on one condition.
Conditional Functions
These are the functions that perform a specific task, but only on cells that satisfy some defined criteria.
They are usually identified with an IF suffix adjoined to the desired operation.
They are usually used when constructing some complex queries that cannot be accomplished using more basic functions.
SUMPRODUCT
A function that multiplies arrays and returns the sum of those products.
Array
A collection of values in cells.
Profit Margin
A percentage that indicates how many cents of profit has been generated for each dollar of sale.
Calculated Field
A new field within a pivot table that carries out certain calculations based on the values of other fields.
Modulo Operator (%)
An operator that returns the remainder when one number is divided by another.
GROUP BY
A command that groups rows that have the same values from a table into summary rows.
EXTRACT
Lets us pull one part of a given date to use.
Temporary Tables (Temp Tables)
- A database table that is created and exists temporarily on a database server.
- They store subsets of tables from standard data tables for a certain period of time.
- They are automatically deleted when you end your SQL database session.
- They are useful when you only need tables for a short period of time for analysis like calculations.
- Can be create using
- WITH clause
- SELECT INTO statement
- CREATE TABLE statement
WITH clause in SQL
It is a type of temporary table that you can query from multiple times.
SELECT INTO in SQL
This statement copies data from one table to another table but it doesn't add the new table to the database.
Its useful when you want to make a copy of a table with some specific conditions like the WHERE clause.
BigQuery doesn't recognise the SELECT INTO command.
CREATE TABLE
This statement does add a new table to the database so that everyone can have access to the table.
Bar graphs
Use size contrasts to compare two or more values.
Line graphs
Help your audience understand changes or shifts in data.
Pie Charts
Show how much each part of something makes up the whole.
Maps
Help organise data geographically.
Histogram
A chart that shows how often data values fall into certain ranges.
Correlation charts
Show relationships among data.
Causation
Occurs when an action directly leads to an outcome.
Static Visualisations
Do not change overtime unless they are edited.
Dynamic Visualisations
Visualisations that are interactive or change over time.
Tableau
A business intelligence and analytics platform that helps people see, understand and make decisions with data.
The elements of Art
- Line
- Shape
- Color
- Space
- Movement
Data Composition
Combining the individual parts in a visualisation and displaying them together as a whole.
Elements For Effective Visualisation
- Clear meaning
- Sophisticated use of contrast
- Refined execution
Design thinking
A process used to solve complex problems in a user-centric way.
Diverging Color Palette
Displays two ranges of values using color intensity to show the magnitude of the number and the actual color to show which range the number is from.
Data Storytelling
Communicating the meaning of a dataset with visuals and a narrative that are customised for each particular audience.
The 3 data storytelling steps:
- Engage your audience
- Create compelling visuals
- Tell the story in an interesting narrative.
Spotlighting
Scanning through data to quickly identify the most important insights.
Coding
Coding is writing to the computer in the syntax of a programming language.
Functions (R)
A body of reusable code used to perform specific tasks in R.
Argument (R)
Information that a function in R needs in order to run.
Variable (R)
A representation of a value in R that can be stored for use later during programming.
Vector (R)
A group of data elements of the same type sorted in a sequence in R.
Pipe (R)
A tool in R for expressing a sequence of multiple operations, represented with "%>%" .
Operator (R)
A symbol that names the type of operation or type of calculation to be performed in a formula.
Assignment Operator (R)
They are used to assign values to variables and vectors.
Arithmetic Operator (R)
They are used to complete math calculations.
Packages (R)
They are units of reproducible R code.
Packages includes:
- Reusable R functions
- Documentation about the functions
- Sample datasets
- Tests for checking your code.
Tidyverse (R)
A system of packages in R with a common design philosophy for data manipulation, exploration, and visualisation.
The 8 core tidy verse packages:
- ggplot2
- tibble
- tidyr
- readr
- purrr
- dplyr
- stringr
- forcats
ggplot2 (R)
Create a variety of data viz by applying different visual properties to the data variables in R.
tidyr (R)
A package used for data cleaning to make tidy data.
readr (R)
This is used for importing data.
dplyr (R)
Offers a consistent set of functions that help you complete some common data manipulation tasks.
Factors (R)
Store categorical data in R where the data values are limited and usually based on a finite group like country or year.
Nested (R)
In programming, describes a code that performs a particular function and is contained within code that performs a broader function.
Nested Function (R)
A function that is completely contained within another function.
Data Frame (R)
A collection of columns.
Tibbles (R)
- They are like streamlined data frames.
- They never change the data types of the input.
- They never change the names of your variables.
- They never create row names.
- They make printing in R easier.
Tidy Data (R)
A way of standardising the organisation of data within R.
Anscombe's Quartet (R)
Four datasets that have nearly identical summary statistics.
Aesthetic (R)
A visual property of an object in your plot.
Geom (R)
The geometric object used to represent your data.
Facets (R)
Let you display smaller groups, or subsets of your data.
For inspiration on data analysis visit : tableau
For Dashboard Requirements Gathering Worksheet visit : Requirements worksheet
Comments
Post a Comment