What happens when a clever hack meets a difficult problem? Sometimes you get a clever solution, but most of the time you get an anti-pattern. The cop-out is to call an anti-pattern a "bad programming practice," but nobody sets out to program badly: they just have a problem, query Stack Overflow and GitHub with some well-crafted keywords, grab the snippet of code that resolves their issue, and keep at it until they have accumulated some real choice anti-patterns.
An anti-pattern is generally a solution that looks reasonable at first glance but causes issues over time. Data governance, in its best incarnation, is a set of practices for avoiding these anti-patterns.
To understand the nature of bad practices, let's first outline what good data practices look like.
Data governance covers the integrity, quality, and security of data throughout an organization. Data comes in, data gets processed, and data goes out, and the only time it gets screwed up is when it comes in, gets processed, and goes out.
Let's start with data coming in. A good practice consolidates data sources into a well-defined set of input stages and streams. At an organizational level, the presence or absence of data is visible. Aha, a chance for an anti-pattern. What about the clever data practitioner who uses spreadsheet functions or macros to import a source directly into a spreadsheet? Brilliant. It's a quick and easy hack, and it belongs to the individual, not the organization. Here's a great example:
Say my organization is tracking the 2020 election, and I'd like to import FiveThirtyEight's Senate poll data. My engineering team won't be able to get me the data in time, so I embed the following in my spreadsheet:
=ImportCSV("https://projects.fivethirtyeight.com/polls-page/senate_polls.csv")
Anti-pattern implemented. It solved my problem and I got the data, but the end result is that I've prevented anyone from checking or monitoring this entry point, and I've also precluded any tracking of how the data evolves over time.
How to fix it? For starters, you need visibility into the entry points of data in the organization. You need to use an organizational staging location: a place where an operations team can define when and how much data to expect, and one that can be monitored so that if the data stops or is incorrect, it can be addressed.
Build a data pipeline operation to gather data, as is, into a central location
Make it simple
If your data scientists feel they need to work around a slow engineering team, then your organization has a data governance problem
Invest in a good data pipeline application. Give your non-engineers the ability to add unprocessed data sources into a staging location and invest in monitoring your data staging.
For AWS, I recommend S3. It integrates with a wide array of security controls, encryption options, and tools, and it is extremely simple for anyone to use.
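To make that concrete, here is a minimal sketch of how the Snowflake stage used in the COPY statement below could be defined on top of an S3 staging bucket. The bucket name and storage integration are placeholders for whatever your organization has set up; only the stage name comes from the example that follows.
create stage if not exists owlmtn.stage.FIVE_THIRTY_EIGHT_POLLS
    -- Placeholder bucket name: point this at your organizational staging bucket.
    url = 's3://my-org-data-staging/'
    -- Placeholder storage integration that grants Snowflake read access to the bucket.
    storage_integration = my_s3_integration;
With a stage like this in place, anyone can drop raw files under stage/fivethirtyeight/polls/ and the operations team has a single place to watch.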
Invest in a good database and don't use spreadsheets to process data. SQL is an incredibly powerful language, and a good database will scale to billions of rows and terabytes of data.
For the above anti-pattern, replace the ImportCSV with a copy into a stage table. Copy every file you load into an archive, but keep all the data in your database. Data is cheap; processing data is expensive. Process data once, keep it forever.
Snowflake Best Practice
Copy the data into the stage table, capturing the source file name and row number so every row stays unique and traceable, and don't delete or overwrite this data.
copy into OWLMTN.STAGE.SENATE_POLLS (FILENAME,
                                     FILE_ROW_VARCHAR,
                                     question_id, ...
                                     candidate_party,
                                     pct
)
from (
    -- Capture the source file name and row number so every row stays
    -- traceable back to the file it was loaded from.
    select METADATA$FILENAME        filename,
           METADATA$FILE_ROW_NUMBER file_row_varchar,
           t.$1, ...
           t.$37,
           t.$38
    from @owlmtn.stage.FIVE_THIRTY_EIGHT_POLLS t
)
-- Only pick up the Senate poll files from the staging area.
pattern = 'stage/fivethirtyeight/polls/SENATE_POLLS_.*\.csv'
on_error = continue
-- force = false (the default) skips files that have already been loaded,
-- so rerunning the COPY does not duplicate rows.
force = false
file_format = (field_optionally_enclosed_by = '"'
               type = 'csv'
               field_delimiter = ','
               skip_header = 1
               encoding = 'utf-8');
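With the raw files landing in the stage table, both of the points above, monitoring the staging area and keeping the history, become straightforward queries. The sketch below is illustrative: the first statement uses Snowflake's COPY_HISTORY information-schema table function to confirm that files arrived in the last day, and the second simply reads the archived data back by source file (it assumes pct is stored as a numeric type).
-- Monitoring sketch: verify that Senate poll files actually loaded in
-- the last 24 hours (table name and time window are illustrative).
select file_name, last_load_time, row_count, status
from table(information_schema.copy_history(
        table_name => 'SENATE_POLLS',
        start_time => dateadd(hours, -24, current_timestamp())));

-- Because nothing is deleted or overwritten, the full history remains
-- queryable, e.g. the average reported percentage per party in each file.
select FILENAME, candidate_party, avg(pct) as avg_pct
from OWLMTN.STAGE.SENATE_POLLS
group by FILENAME, candidate_party
order by FILENAME;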
Conclusion
The anti-pattern often only becomes visible at scale. If I'm creating a one-off spreadsheet just for myself, then the ImportCSV is fine, but data governance is about governing data in an organization, and managing incoming data is a critical foundation of a good data strategy. Even for the individual practitioner, however, this pattern is worth the trouble to follow. The election is next week and the one-off data may be useless after November 3, but if you ever want to go back and validate your models over time, the historical record is critical.