How to Get Good Training Data in Machine Learning?

Data Annotation Service — From the Backstage to the Front Stage

ByteBridge.io

Dec 4, 2020

“Have you heard about the AI industry?”

9 out of 10 people will probably say yes.

“Do you know what data annotation is?”

This time, 9 out of 10 people will probably shake their heads.

Unlike the AI companies at the center of the spotlight, the data annotation industry has long kept a low profile, working out of public view.

However, as demand for more refined data grows, the data annotation industry is undergoing rapid changes, moving from the background to the foreground.

Annotation service

Data annotation makes the objects in raw data recognizable and understandable to machine learning models. It is critical to the development of machine learning (ML) applications such as face recognition, autonomous driving, aerial drones, and many other AI and robotics applications.

Behind the scenes: extensive and chaotic interweaving

There is a saying in the data labeling industry: “more artificial intelligence, more human workforce.”

In a way, this reflects the nature of artificial intelligence.

Supervised learning is still the most effective way for AI to improve its cognitive abilities, and almost all the training data that AI algorithms learn from is manually annotated, item by item.

Demand creates a market. The potentially lucrative market has attracted many people who want a piece of the action, and labeling teams both small and large have sprung up.

However, problems arise.

Unlike the high-tech AI sector it serves, data annotation is still a labor-intensive industry, and the work is usually outsourced.

Annotators do repetitive work every day, such as drawing bounding boxes and marking points. Uneven skill levels across the workforce lead to low-quality output, which cannot meet the needs of AI enterprises and slows the commercialization of AI products.

At the same time, because the work is low-end, the data labeling industry has almost no barriers to entry. A labeling team of just a few people can start a business after simple training.

As a result, the industry is in a constant state of chaos and competition. Most labeling teams sit at the bottom of the industrial chain and are under cost pressure as prices fall.

Foreground: AI’s reliance on high-quality data

There is an important consensus in the AI industry:

The quality of the training data directly determines the performance of the final AI model.

In other words, the more plentiful and accurate the training data, the better the model performs and the more robust the algorithm becomes.

As AI enterprises accelerate commercialization, more and more of them are beginning to realize the importance of training data.

Take autonomous driving as an example:

Many companies have produced prototypes of their driverless cars, which frequently appear in public. However, although these prototypes perform well in the laboratory, they are still far from commercial deployment. One important reason is that the gap between real road conditions and laboratory conditions is too large.

In the laboratory, only a small amount of road data is needed to meet the needs of an experiment. On real roads, however, driverless cars encounter many unpredictable situations. Without sufficient data to support it, the built-in AI model cannot make sound judgments, which leads to a dramatic increase in accident risk.

As a result, many autonomous driving enterprises have raised their expectations of data annotation, and the data annotation industry has stepped into the spotlight, moving from the background to the foreground.

Future: Intelligence, refinement, and scenario-based

As is widely known, the three basic elements of artificial intelligence are algorithms, computing power, and data, among which data is the cornerstone.

As commercialization accelerates, AI data services are evolving. In the future, development will move in three main directions: intelligence, refinement, and scenario-based services.

Intelligence

Intelligence refers to AI-assisted annotation tools. For example, AI pre-processing technology can automatically recognize and transcribe speech data, so the annotator only needs to make some corrections to the initial results. This not only improves efficiency but also reduces dependence on human resources.
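As a rough illustration of how an AI-assisted workflow might divide the work, here is a minimal Python sketch. The `PreLabel` record, the `split_for_review` helper, and the 0.9 confidence threshold are all hypothetical and are not taken from any particular annotation tool. The assumption is that the pre-labeling model reports a confidence score, and that low-confidence results are the ones sent to an annotator for correction.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PreLabel:
    """A machine-generated initial label (hypothetical record format)."""
    item_id: str
    label: str
    confidence: float  # assumed model score in [0, 1]


def split_for_review(
    pre_labels: List[PreLabel], threshold: float = 0.9
) -> Tuple[List[PreLabel], List[PreLabel]]:
    """Split pre-labels into (confident, needs_review) by confidence.

    The threshold is an illustrative assumption, not an industry standard.
    """
    confident = [p for p in pre_labels if p.confidence >= threshold]
    needs_review = [p for p in pre_labels if p.confidence < threshold]
    return confident, needs_review


batch = [
    PreLabel("clip-001", "hello world", 0.97),
    PreLabel("clip-002", "turn left ahead", 0.62),
    PreLabel("clip-003", "stop", 0.91),
]
confident, needs_review = split_for_review(batch)
print([p.item_id for p in needs_review])
```

In this sketch, only `clip-002` would be routed to a human annotator, which is the sense in which pre-labeling reduces manual effort.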

Refinement

Refinement means stricter, more detailed requirements. Previously, an accuracy level of 90% was enough to meet clients' requirements. Now the required rate is up to 95%, and in some cases more than 99%.
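One common way to check a delivery against such a requirement is to compare a sample of submitted labels with a trusted "gold" set. The Python sketch below assumes that approach; the function names and the sample data are hypothetical, and real acceptance criteria vary by client and task.

```python
from typing import Sequence


def accuracy(labels: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of submitted labels that match the gold labels."""
    if len(labels) != len(gold):
        raise ValueError("labels and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(1 for a, g in zip(labels, gold) if a == g)
    return correct / len(gold)


def meets_requirement(
    labels: Sequence[str], gold: Sequence[str], required: float
) -> bool:
    """True if measured accuracy reaches the client's required rate."""
    return accuracy(labels, gold) >= required


# Hypothetical sample: 20 items, one of which was labeled incorrectly.
gold = ["car"] * 10 + ["pedestrian"] * 10
submitted = ["car"] * 10 + ["pedestrian"] * 9 + ["car"]

print(accuracy(submitted, gold))                  # 0.95
print(meets_requirement(submitted, gold, 0.95))   # True
print(meets_requirement(submitted, gold, 0.99))   # False
```

The same batch passes a 95% requirement and fails a 99% one, which shows how much tighter the newer targets are.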

Scenario-based

Scenario-based means that the data annotation industry needs to meet the requirements of various application scenarios.

Take computer vision as an example. Currently, data annotation is applied in autonomous driving, drones, AI education, industrial robots, new retail, security, and other scenarios. Each application scenario has its own data types and specific labeling requirements, which tests the capabilities of data labeling enterprises.

The data annotation industry is likely to see major changes in the next few years. Service providers with more advanced technologies and more professional services will stand out in the new era.

ByteBridge: a Human-powered and ML-powered Labeling Platform

ByteBridge, a human-powered and ML-powered data labeling platform, provides high-quality services to collect and annotate different types of data, such as text, image, audio, and video, to accelerate the development of the machine learning industry.

Quality Guarantee

ML-assisted pre-labeling helps reduce human error
Real-time QA and QC are integrated into the labeling workflow, and a consensus mechanism is used to ensure accuracy
Consensus: the same task is assigned to several workers, and the answer returned by the majority is taken as correct
All work results are fully screened and inspected by both machines and human reviewers
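The consensus step described above can be sketched as a simple majority vote. This is a minimal Python illustration, not ByteBridge's actual implementation: the `consensus_label` function and its `min_agreement` parameter are hypothetical, and it assumes that a task with no strict majority is left unresolved so it can be escalated.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def consensus_label(
    answers: Sequence[Hashable], min_agreement: float = 0.5
) -> Optional[Hashable]:
    """Return the answer given by a majority of workers, or None.

    min_agreement is an assumed parameter: the winning answer must receive
    strictly more than this fraction of all votes.
    """
    if not answers:
        return None
    label, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) > min_agreement:
        return label
    return None  # no majority: e.g. send the task to a reviewer


print(consensus_label(["car", "car", "truck"]))  # car
print(consensus_label(["car", "truck"]))         # None
```

Returning `None` on a tie is a design choice made for this sketch; a real workflow might instead assign the task to additional workers until a majority emerges.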

With these quality controls in place, ByteBridge can affirm that its data acceptance and accuracy rates are over 98%.

Flexibility — More Engaged in the 2D Images Labeling Loop

ByteBridge, a human-powered and ML-powered data labeling tooling platform with real-time workflow management, provides training data for the machine learning industry.

On the dashboard, clients can set labeling rules, iterate on data features, attributes, and workflow, scale up or down, and make changes based on what they learn about the model’s performance at each step of testing and validation.

You can choose Bounding Box and Classification Template:

ByteBridge Data Labeling Platform Tutorial: Bounding Box and Classification Template Updated

These labeling tools are available on the dashboard: Image Classification, 2D Boxing, Polygon, Cuboid.

3D Point Cloud Annotation Service

ByteBridge's self-developed 3D point cloud labeling tool, quality inspection tool, and pre-labeling functions support high-quality, high-precision 3D point cloud annotation for 2D-3D fusion data or 3D images from different manufacturers and equipment, and provide a one-stop management service covering labeling, QA, and QC.

More info: ByteBridge Launches World’s First Mobile 3D Point Cloud Data Labeling Service

ByteBridge 3D Point Cloud Annotation Tool

3D Point Cloud Annotation Types:

Sensor Fusion Cuboids: 49 categories, including car, truck, heavy vehicle, two-wheeled vehicle, pedestrian, etc.
Sensor Fusion Segmentation: obstacle classification and differentiation of lane types
Sensor Fusion Cuboids Tracking

① The same object is tracked with the same ID across frames, and its leaving state is labeled;

② Point clouds or time-aligned images can be provided as input; the output is point clouds only.
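To make the tracking requirement concrete, the Python sketch below checks a set of tracking annotations for consistency. The `CuboidAnnotation` record and its fields are hypothetical and do not represent ByteBridge's output format. The checks assume that one track ID should always carry the same category, and that a "leaving" flag should only appear on the last frame in which the object is annotated.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CuboidAnnotation:
    """One cuboid in one frame (hypothetical record format)."""
    track_id: int
    frame: int
    category: str
    leaving: bool = False  # assumed flag for the "leaving state"


def find_inconsistent_tracks(annotations: List[CuboidAnnotation]) -> List[int]:
    """Return track IDs whose category changes or that 'leave' too early."""
    by_track: Dict[int, List[CuboidAnnotation]] = defaultdict(list)
    for ann in annotations:
        by_track[ann.track_id].append(ann)

    bad = []
    for track_id, anns in by_track.items():
        anns.sort(key=lambda a: a.frame)
        categories = {a.category for a in anns}
        # A leaving flag on any frame before the last one is suspicious.
        early_leave = any(a.leaving for a in anns[:-1])
        if len(categories) > 1 or early_leave:
            bad.append(track_id)
    return sorted(bad)


annotations = [
    CuboidAnnotation(1, 0, "car"),
    CuboidAnnotation(1, 1, "car", leaving=True),
    CuboidAnnotation(2, 0, "pedestrian"),
    CuboidAnnotation(2, 1, "truck"),  # category changed under the same ID
]
print(find_inconsistent_tracks(annotations))  # [2]
```

A check like this could run as part of an automated QC pass before human inspection.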

Advantages of Our 3D Point Cloud Annotation Service:

· Support 2D to 3D mapping, support multiple cameras

· Support scalable data annotation

· AI-assisted tool — Pre-labeling

· QA & QC Platform

Cost-effective

The collaboration of the human workforce and AI algorithms ensures a price 50% lower than the conventional market.

End

Thomas C. Redman sums up the current data quality challenge in this way: “Increasingly complex problems demand not just more data, but more diverse, comprehensive data. And with this comes more quality problems.”

ByteBridge is dedicated to empowering the machine learning revolution with unbiased training data. We can provide personalized annotation tools and services according to customer requirements.

If you need data labeling and collection services, please have a look at bytebridge.io, where clear pricing is available.

If you would like to have a look at the 3D point cloud live demo, please feel free to contact us: support@bytebridge.io