Migrating the User Group Data – Part 1: Extraction

The website of the .Net User Group Bern worked without much change for nearly 10 years. A refresh was necessary, and that gave us two options: just apply a shiny new design, or use the refresh as an opportunity to learn new skills and interesting technology. We went with the latter, and one of the learning opportunities I picked was the migration of all the data. Our small data set is great for trying out different approaches, and I hope this post can help you with much bigger migrations as well.

This post is part one of a small series on the data migration task:

  • Part 1: Extraction (this post)
  • Part 2: Transformation
  • Part 3: Load

As the names of the different parts suggest, the migration uses the classic ETL approach (Extract, Transform, Load). ETL was introduced in the 1970s and is still widely used. We have used this approach ourselves in various projects, but we were never allowed to talk about it in detail. This time there are no restrictions, and we are happy to share all the steps with you.

 
[Image: the old website]

Step 1: Identify the data to migrate

Figuring out what exactly had to be migrated was not as straightforward as I initially thought. The website itself integrated two different data sources, and the events of the last 3 years were only on Xing (a site like LinkedIn). Having all invitations for the user group events in Xing made counting them an easy task.

From those 80 events I estimated the number of talks (1.5 talks per event, ~120 talks) and the number of speakers (50% of the talks, ~60 speakers). That would be too many to enter manually into the new system. A (semi-)automated approach for the migration was therefore the way forward.
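In code, the back-of-the-envelope math looks like this (the per-event rates are my assumptions from above, nothing more):

    // Rough estimate based on the 80 events found on Xing.
    const int eventCount = 80;
    var estimatedTalks = (int)(eventCount * 1.5); // ~120 talks
    var estimatedSpeakers = estimatedTalks / 2;   // ~60 speakers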

This ballpark estimate was not far off compared to the numbers that were migrated in the end:

  • 80 events
  • 94 talks
  • 57 speakers

 

Step 2: Identify all the sources

The user group website is a great example of how applications evolve. It started with a single HTML page for the events in 2010, evolved to a much more structured approach using JSON and a dynamically generated navigation, up to the point where that solution no longer worked and was abandoned. This left us with three sources for the data:

  • 2010: HTML on the website
  • 2011 – 2015: JSON on the website
  • 2010 – 2019: HTML on Xing

 

Step 3: Unify into one source

If we only look at the sources identified in step 2, Xing would be a great place to take the data from. It is the only complete source, and we would just need to write a bit of code to parse HTML. However, parsing HTML is never a simple task, and Xing offers its own challenges. I expected a quick first success followed by an endless fight with edge cases that in the end could not be parsed at all.
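To illustrate why I expected trouble: we never built this scraper, but a minimal sketch using HtmlAgilityPack might look like the code below. The URL and the selector are made-up placeholders; the real Xing markup would have to be inspected first, and it changes over time.

    using System;
    using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

    var web = new HtmlWeb();
    var doc = web.Load("https://www.xing.com/events/example-event");

    // SelectNodes returns null (not an empty list) when nothing matches,
    // exactly the kind of edge case that makes scraping brittle.
    var titles = doc.DocumentNode.SelectNodes("//h1[contains(@class, 'event-title')]");
    if (titles == null)
    {
        Console.WriteLine("Selector found nothing, the markup probably changed.");
        return;
    }

    foreach (var title in titles)
    {
        Console.WriteLine(HtmlEntity.DeEntitize(title.InnerText.Trim()));
    }

Every event page that deviates from the expected markup silently falls into the null branch, which is why scraping 80 event pages rarely stays a "bit of code".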

The HTML page of the website was no help either: the data there was the same as on Xing, but only for 2010.

The JSON file offers structure but lacks data. However, adding the missing events, talks and speakers is a simple task. Therefore, I chose the JSON file as the source in which to unify the data. It took me 2 hours to add all the missing events to the JSON file. It was repetitive, but it worked without any surprises. A long list of entries was the first milestone of our migration:
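The original entry is not reproduced here, so as a stand-in, here is a rough sketch of what one unified entry could look like. The field names and values are illustrative assumptions, not the real schema:

    {
      "id": 42,
      "date": "2014-03-20",
      "title": "A .NET evening in Bern",
      "talks": [
        {
          "id": 58,
          "title": "Introduction to ETL",
          "speakerIds": [ 17 ]
        }
      ]
    }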

 

Step 4: Validate

With all the data in a format that could easily be worked with, I started to build a simple prototype. The Visual Studio feature Paste JSON As Classes generated all the plumbing code for me, so I could focus on playing with the data. That helped me to identify missing data and some swapped Ids.
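As a minimal sketch of such a prototype, assuming the entry format shown above and System.Text.Json for deserialization (the class and property names mirror my assumed schema, not our real model):

    using System;
    using System.IO;
    using System.Linq;
    using System.Text.Json;

    // Load the unified JSON file and let the generated classes do the work.
    var json = File.ReadAllText("events.json");
    var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
    Event[] events = JsonSerializer.Deserialize<Event[]>(json, options);

    // A cheap validation pass: duplicate or swapped Ids show up immediately.
    var duplicateEventIds = events
        .GroupBy(e => e.Id)
        .Where(g => g.Count() > 1)
        .Select(g => g.Key);
    Console.WriteLine($"Duplicate event Ids: {string.Join(", ", duplicateEventIds)}");

    // Classes roughly as Paste JSON As Classes would generate them
    // for the entry sketched above.
    public class Event
    {
        public int Id { get; set; }
        public string Date { get; set; }
        public string Title { get; set; }
        public Talk[] Talks { get; set; }
    }

    public class Talk
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public int[] SpeakerIds { get; set; }
    }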

Those prototypes helped us to experiment with ideas like the speaker wall and influenced the design of the objects in the new solution.
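Continuing the sketch above (it reuses the events variable and the generated classes), a first cut of the speaker wall could be as simple as one LINQ query:

    // How many talks did each speaker give?
    var talksPerSpeaker = events
        .SelectMany(e => e.Talks)
        .SelectMany(t => t.SpeakerIds)
        .GroupBy(speakerId => speakerId)
        .OrderByDescending(g => g.Count());

    foreach (var group in talksPerSpeaker)
    {
        Console.WriteLine($"Speaker {group.Key}: {group.Count()} talk(s)");
    }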

 

Next

The extraction phase ends with all data in a single JSON file. The next part of this series is about the transformation, in which the old format is cleaned up and reduced to the format we want to load into the new application.
