Senior Data Engineer – Featured Merchant Algorithm
Amazon’s planet-scale Retail platform has created the largest marketplace in human history. Affording customers unprecedented product selection and merchants access to a global market, our team leverages sophisticated machine learning and big data technologies to allow customers to discover the right product at the right price from the most trusted merchant billions of times every day. We enable over 90% of all purchases on Amazon by choosing which offer wins the Detail Page Buy Box and our service directly impacts the business metrics of every Amazon channel (Retail, FBA and Marketplace) worldwide. If you’re looking for a career-defining opportunity on one of the most visible teams within Amazon, we’d love to hear from you.
Our success depends on our ability to manage and analyze the data that our customers generate. We are looking for an outstanding engineer with a great business sense who has the ability to analyze and understand large amounts of data and help make data-driven key strategic decisions that will drive several customer focused initiatives. Working with our science, engineering and multiple business teams, you will have the opportunity to impact customer experience, design, architecture, and implementation of multiple customer friendly features on Amazon.
In this position, you will be working in one of the world’s largest and most complex data warehouse environments. You should be passionate about working with huge datasets and be someone who loves to bring datasets together to answer business questions. You should have deep expertise in creation and management of datasets and the proven ability to translate the data into meaningful insights. In this role, you will have ownership of end-to-end development of solutions to complex questions, and you’ll play an integral role in strategic decision-making.
The right candidate will possess excellent business and communication skills, be able to work with business owners to develop and define key business questions, and be able to build analyses that answer those questions. You will interact regularly with, and present to, senior leaders at Amazon.
- Bachelor’s or Master’s degree in Computer Science, Mathematics, Statistics, Finance, or a related technical field.
- 4+ years of relevant employment experience.
- Knowledge and direct experience using at least one industry standard business intelligence reporting tool.
- Experience in gathering requirements and formulating business metrics for reporting.
- Excellent knowledge of Oracle SQL and Excel.
- Experience using SQL, ETL and databases in a business environment with large-scale, complex datasets.
- Strong verbal/written communication & data presentation skills, including an ability to effectively communicate with both business and technical teams.
- Previous e-commerce experience.
- Preferred: experience managing scorecards and metrics dashboards.
This is the word that comes to mind while reading the venerable Dr. R. Kimball. With his flamboyant style, he wouldn’t last a week at my job. It is too bad that OpenAmplify removed its free web app; it would be interesting to run his text through their API and see the scores.
It looks like good technical writing should be conducive to knowledge extraction into RDF and, consequently, to knowledge exploration via SPARQL or OWL. Which raises the question: should we write for humans or for machines? From my observations, if machines understand a piece of text, then humans certainly will.
It is a known practice to refactor code by its change velocity. Ideally, source code should be resilient to change, and volatile logic should go into a configuration layer (config files or, better, “convention over configuration”).
A similar pattern is known in the DW world. Separation of facts from dimensions is just a single use case of consolidating / grouping / separating data by their change velocity.
Slowly changing dimensions are another example. There are at least two classes of dimensions – static and slowly changing.
Are there fast changing dimensions? Do we call them facts?
The common enumeration of SCDs using types 0-7 encodes a single attribute: where the history is stored.
I think there is an obvious pattern for types 0-4 (and emerging mnemonics):
Type 0: history is stored nowhere.
Type 1: history is stored in the dimension itself, in the single current row (a history of length 1 means no history).
Type 2: history is stored in the dimension itself, in extra rows (history length is 2+).
Type 3: history is stored in the dimension itself, in extra columns (something about a 3rd dimension?).
Type 4: history is stored in a separate table.
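The row-based Type 2 pattern can be sketched with a toy customer dimension. This uses SQLite via Python purely for illustration; the table and column names (valid_from, is_current, etc.) are my assumptions, not a standard:

```python
import sqlite3

# Toy customer dimension; "city" is the attribute whose history we track.
# Type 1 would simply overwrite city in place (the single current row IS
# the history).  Type 2 instead expires the current row and adds a new one.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY,   -- surrogate key
        customer_id   TEXT,                  -- natural key
        city          TEXT,
        valid_from    TEXT,
        valid_to      TEXT,                  -- NULL = still current
        is_current    INTEGER
    );
    INSERT INTO dim_customer VALUES (1, 'C42', 'Seattle', '2010-01-01', NULL, 1);
""")

def scd2_update(con, customer_id, new_city, change_date):
    """Type 2: expire the current row, then insert a new current row."""
    con.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id))
    con.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date))

scd2_update(con, "C42", "Portland", "2012-06-01")
rows = con.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer "
    "ORDER BY valid_from").fetchall()
for r in rows:
    print(r)
```

After the update the dimension holds two rows: the expired Seattle row and the current Portland row, i.e. the history lives in extra rows, exactly the Type 2 trade.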
Summary: pivoted time series allow trade-offs among speed, space, convenience, and scalability.
Many scientists and analysts (a.k.a. humans) visualize time series horizontally: the time axis goes from left to right and the values run parallel to it. The series array is often sparse, e.g. there is no data point for January 2, but an array element must still be allocated so that January 1 and January 3 stay two days apart. In RDBMS terms: the time information is stored in columns (even in column names) and the values are stored in rows, which is worse, since columns are static and defined via DDL.
This approach is intuitive and friendly to humans, but not to databases.
Sparse data waste space, and in the world of databases wasted space = wasted time.
Dates as columns are rigid and require developers to hard-code dates in analytic SQL.
An alternative approach is to pivot the model: the time axis goes into a single column, and the values go into separate columns.
Every row corresponds to a single point in time and contains values from all the columns (e.g. forecast at P50, P80).
The data become dense: since dates are not part of the model (no fixed columns), there is no need to keep empty rows for missing data. The absence of hard-coded dates makes SQL simple and compact, and makes it easier to run analysis over a moving window of the date range.
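A minimal sketch of the pivoted layout, using SQLite via Python for illustration (the table and column names, forecast / p50 / p80, are my assumptions):

```python
import sqlite3

# Pivoted ("long") layout: one row per date, one column per series.
# Dates live in the data, not in the DDL, so a moving window is a plain
# WHERE clause and missing days simply have no row at all.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE forecast (ds TEXT PRIMARY KEY, p50 REAL, p80 REAL);
    -- Sparse input: no row for 2014-01-02 -- the gap costs nothing.
    INSERT INTO forecast VALUES ('2014-01-01', 10.0, 14.0);
    INSERT INTO forecast VALUES ('2014-01-03', 12.0, 16.0);
    INSERT INTO forecast VALUES ('2014-01-04', 11.0, 15.0);
""")

# Moving 3-day window ending on a parameterized date: no hard-coded
# dates or date-named columns anywhere in the query.
end = "2014-01-04"
window = con.execute(
    "SELECT ds, p50 FROM forecast "
    "WHERE ds > date(?, '-3 days') AND ds <= ? ORDER BY ds",
    (end, end)).fetchall()
print(window)
```

Sliding the window is just a change of the `end` parameter, which is exactly the "analysis in a moving window" convenience the pivoted model buys.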
Shouldn’t German tactics work better than Oriental methodologies?
At least in the case of “culturally American” developers?
In that case the next logical step would be to implement a Blitzkrieg methodology.
Since Hive 0.11, with its analytic functions, is not available for my current project, I have to resort to simple remedies.
Such as a massive union of top-N queries:
select * from (
select * from (select * from ( select * from v_diff where group_id = 1 ) aa order by abs(diff) desc limit 100) bb
union all
select * from (select * from ( select * from v_diff where group_id = 2 ) aa order by abs(diff) desc limit 100) bb
union all
select * from (select * from ( select * from v_diff where group_id = 3 ) aa order by abs(diff) desc limit 100) bb
union all
select * from (select * from ( select * from v_diff where group_id = 4 ) aa order by abs(diff) desc limit 100) bb
union all
select * from (select * from ( select * from v_diff where group_id = 5 ) aa order by abs(diff) desc limit 100) bb
union all
select * from (select * from ( select * from v_diff where group_id = 6 ) aa order by abs(diff) desc limit 100) bb
union all
select * from (select * from ( select * from v_diff where group_id = 7 ) aa order by abs(diff) desc limit 100) bb
) unioned;
Here v_diff is a simple view comparing two partitions of a single table.
Contrary to my expectations, the query took a while to complete.
According to the job tracker, all the individual select statements went into a single queue.
A small change in the Hive session settings made a difference.
With parallel execution turned on, Hadoop launched 16 queries in parallel and completed the whole select much faster.
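The session change was presumably along these lines (Hive's standard parallel-execution settings; the exact thread count is my assumption):

```sql
-- Allow independent stages of one query to run concurrently
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=16;
```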
Something I would take for granted in an Oracle database with parallel execution.
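For comparison, once analytic functions are available (Hive 0.11+), the whole massive union collapses into a single row_number() query. A runnable sketch against toy data, using SQLite (which supports the same window syntax since 3.25) via Python; the v_diff columns here are my assumptions:

```python
import sqlite3

# Toy stand-in for v_diff: (group_id, diff).  With window functions the
# per-group top-N is one query: rank rows within each group by abs(diff)
# descending, then keep the first N of every group.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE v_diff (group_id INTEGER, diff REAL)")
con.executemany("INSERT INTO v_diff VALUES (?, ?)",
                [(1, -9.0), (1, 2.0), (1, 5.0),
                 (2, 1.0), (2, -7.0), (2, 3.0)])

TOP_N = 2
top = con.execute("""
    SELECT group_id, diff FROM (
        SELECT group_id, diff,
               ROW_NUMBER() OVER (PARTITION BY group_id
                                  ORDER BY ABS(diff) DESC) AS rn
        FROM v_diff
    ) ranked
    WHERE rn <= ?
    ORDER BY group_id, ABS(diff) DESC
""", (TOP_N,)).fetchall()
print(top)
```

One pass over the data, no per-group subqueries, and the group count is no longer baked into the SQL.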
For an average human it is hard to fathom the volume of data they deal with.
The notion of “a lot of data” changes with Moore’s law and is highly subjective: from a stack of punch cards to a rack of hard drives.
A gigabyte used to mean “a lot of data”.
What an average human can fathom is their personal perception of how long it takes to process data.
Thus, while working with Hadoop-based technologies I couldn’t help noticing how long it takes to process small samples compared to relational databases.
This is a small (but annoying) price to pay for the overwhelming speed of processing “a lot of data”.
That is why it is critical to run Pig in local mode when going through tutorials:
pig -x local
This is not a new problem.
“I am having a problem starting my BI server (Red Hat Linux) – when I view the NQServer.log file after starting the server, I see that it’s loading each of my subject areas. It has a message saying Finished loading … for each subject area. However, it’s not getting anywhere past that. Usually it will say the server has been started, but I am not seeing that.”
There are few problems in life that cannot be solved using the KIWI principle (kill it with iron).
In our case a poor Linux host did not have enough memory to load a bloated RPD and start the server fast enough.
Going from 4 to 8 GB of RAM solved the problem.
The empirical evidence (and a KIWI workaround) are not enough, though. That system deserves some real profiling.
We had a team exercise to design a simplest-possible-but-still-useful BI dashboard / report / KPI.